Provided by: slurm-llnl_2.6.5-1_amd64

NAME

       srun_cr - run parallel jobs with checkpoint/restart support

SYNOPSIS

       srun_cr [OPTIONS...]

DESCRIPTION

       The  design  of  srun_cr is inspired by mpiexec_cr from MVAPICH2 and cr_restart from BLCR.
       It is a wrapper around the srun command to enable  batch  job  checkpoint/restart  support
       when used with SLURM's checkpoint/blcr plugin.

OPTIONS

       The srun_cr command-line options are identical to those of the srun command.  See  "man
       srun" for details.

DETAILS

       After initialization, srun_cr registers a thread context callback function.  Then it forks
       a  process  and  executes  "cr_run --omit srun" with its arguments.  cr_run is employed to
       exclude the srun process from being dumped upon checkpoint.  All catchable signals  except
       SIGCHLD  sent  to  srun_cr  will  be forwarded to the child srun process.  SIGCHLD will be
       captured to mimic the exit status of srun when it exits.  Then srun_cr loops  waiting  for
       termination of tasks being launched from srun.
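
       The forward-and-mirror pattern described above can be sketched as follows.  This is a
       minimal illustration in Python, not the srun_cr implementation: the function name
       run_and_mirror is hypothetical, the sketch blocks in wait() instead of capturing
       SIGCHLD, and it omits the BLCR callback registration and the cr_run wrapper.

```python
import signal
import subprocess


def run_and_mirror(argv):
    """Run argv as a child, relay catchable signals to it, and return its exit status."""
    child = subprocess.Popen(argv)  # fork + exec under the hood

    def forward(signum, _frame):
        # send_signal() does nothing once the child has been reaped.
        child.send_signal(signum)

    previous = {}
    for sig in signal.valid_signals():
        # SIGKILL and SIGSTOP cannot be caught; SIGCHLD is not forwarded.
        if sig in (signal.SIGKILL, signal.SIGSTOP, signal.SIGCHLD):
            continue
        try:
            previous[sig] = signal.signal(sig, forward)
        except (OSError, ValueError):
            continue  # reserved signal, or not running in the main thread

    try:
        code = child.wait()
    finally:
        # Put the original handlers back so the caller is left unchanged.
        for sig, old in previous.items():
            signal.signal(sig, signal.SIG_DFL if old is None else old)

    # Popen reports death by signal N as -N; shells report it as 128 + N.
    return 128 - code if code < 0 else code
```

       A wrapper built this way would end with sys.exit(run_and_mirror(sys.argv[1:])) so that
       its caller sees the same exit status the child produced.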

       The step launch logic of SLURM is augmented to check if srun is running under srun_cr.  If
       so, the environment variable SLURM_SRUN_CR_SOCKET should be present; its value is the
       address of a Unix domain socket created and listened on by srun_cr.  After launching the
       tasks, srun tries to connect to the socket and sends the job ID, step ID, and the nodes
       allocated to the step to srun_cr.
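
       The socket handshake can be illustrated with the Python sketch below.  It is written
       under stated assumptions: the helper names (open_listener, report_step,
       receive_step_info) are hypothetical, the environment variable name is taken to be
       SLURM_SRUN_CR_SOCKET, and the one-line JSON message is an invented wire format.  The
       encoding that srun and srun_cr really use is not described here.

```python
import json
import os
import socket
import tempfile

ENV_VAR = "SLURM_SRUN_CR_SOCKET"  # assumed name of the variable holding the socket path


def open_listener(env):
    """Wrapper side: create a listening Unix domain socket and publish its path in env."""
    path = os.path.join(tempfile.mkdtemp(prefix="srun_cr_"), "sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    env[ENV_VAR] = path
    return server


def report_step(env, job_id, step_id, nodes):
    """Launcher side: if running under the wrapper, send the step identity to it."""
    path = env.get(ENV_VAR)
    if path is None:
        return False  # not running under the wrapper, nothing to report
    message = json.dumps({"job_id": job_id, "step_id": step_id, "nodes": nodes})
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall((message + "\n").encode("utf-8"))
    return True


def receive_step_info(server):
    """Wrapper side: accept one connection and decode the step identity."""
    conn, _ = server.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        return json.loads(reader.readline())
```

       Passing the address through the environment means the launcher needs no extra
       command-line option: the absence of the variable is itself the signal that no wrapper
       is present.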

       Upon checkpoint, srun_cr checks to see if the tasks have been launched.  If so, srun_cr
       first  forwards  the  checkpoint  request  to  the  tasks  by  calling   the   SLURM   API
       slurm_checkpoint_tasks() before dumping its process context.

       Upon  restart,  srun_cr  checks  to  see  if  the  tasks have been previously launched and
       checkpointed.  If true, the environment variable SLURM_RESTART_DIR is set to the directory
       of  the checkpoint image files of the tasks.  Then srun is forked and executed again.  The
       environment variable will be used by the srun command to restart execution  of  the  tasks
       from the previous checkpoint.
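
       The restart path amounts to preparing an environment and launching the child again.
       A minimal Python sketch, assuming hypothetical helper names (restart_environment,
       relaunch) and leaving out how the image directory is recorded at checkpoint time:

```python
import os
import subprocess


def restart_environment(base_env, image_dir):
    """Return a copy of base_env, adding SLURM_RESTART_DIR when images exist."""
    env = dict(base_env)
    if image_dir is not None:
        # Only set when the tasks were previously launched and checkpointed.
        env["SLURM_RESTART_DIR"] = image_dir
    return env


def relaunch(argv, image_dir):
    """Run argv again with the restart environment and return its exit status."""
    env = restart_environment(os.environ, image_dir)
    return subprocess.run(argv, env=env).returncode
```

       When image_dir is None the child is simply started afresh with the caller's
       environment.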

COPYING

       Copyright (C) 2009 National University of Defense Technology, China.  Produced at National
       University of Defense Technology, China (cf, DISCLAIMER).

       This  file  is  part  of  SLURM,  a  resource  management  program.   For   details,   see
       <http://slurm.schedmd.com/>.

       SLURM  is  free  software; you can redistribute it and/or modify it under the terms of the
       GNU General Public License as published by the Free Software Foundation; either version  2
       of the License, or (at your option) any later version.

       SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
       even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.

SEE ALSO

       srun(1)