Provided by: slurm-llnl_2.6.5-1_amd64

NAME

       srun_cr - run parallel jobs with checkpoint/restart support

SYNOPSIS

       srun_cr [OPTIONS...]

DESCRIPTION

       The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and cr_restart from BLCR.  It is a wrapper
       around the srun command that enables batch job checkpoint/restart support when used with SLURM's
       checkpoint/blcr plugin.

OPTIONS

       The srun_cr command-line options are identical to those of the srun command.  See "man srun" for details.

DETAILS

       After initialization, srun_cr registers a thread context callback function.  It then forks a process
       that executes "cr_run --omit srun" with srun_cr's arguments.  cr_run is used to exclude the srun process
       from being dumped upon checkpoint.  All catchable signals except SIGCHLD that are sent to srun_cr are
       forwarded to the child srun process.  SIGCHLD is caught so that srun_cr can mirror the exit status of
       srun when srun exits.  srun_cr then loops, waiting for the tasks launched by srun to terminate.
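
       The fork, signal-forwarding and exit-status steps above can be sketched in Python.  This is only an
       illustration of the pattern, not srun_cr's code: the names FORWARDED, exit_code and run_and_forward
       are hypothetical, only a representative subset of signals is relayed, and the sketch blocks in wait()
       rather than handling SIGCHLD.

```python
import signal
import subprocess

# Representative subset only; srun_cr forwards every catchable signal
# except SIGCHLD.
FORWARDED = (signal.SIGHUP, signal.SIGINT, signal.SIGTERM,
             signal.SIGUSR1, signal.SIGUSR2)


def exit_code(returncode):
    """Map a Popen return code to a shell-style exit code."""
    # Popen reports death by signal N as -N; shells report it as 128 + N.
    return 128 - returncode if returncode < 0 else returncode


def run_and_forward(argv):
    """Start argv as a child, relay signals to it, return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    previous = {}
    for sig in FORWARDED:
        try:
            previous[sig] = signal.signal(sig, forward)
        except (OSError, ValueError):
            pass  # cannot install a handler from this context; skip it
    try:
        # Mirror the child's exit status as our own result.
        return exit_code(child.wait())
    finally:
        # Put the original handlers back so the caller is unaffected.
        for sig, handler in previous.items():
            signal.signal(sig, signal.SIG_DFL if handler is None else handler)
```

       A wrapper built this way would typically finish with sys.exit(run_and_forward(argv)) so that its own
       exit status matches that of the wrapped command.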

       The step launch logic of SLURM is augmented to check whether srun is running under srun_cr.  If so, the
       environment variable SLURM_SRUN_CR_SOCKET should be present; its value is the address of a Unix
       domain socket created and listened on by srun_cr.  After launching the tasks, srun tries to connect to
       the socket and sends the job ID, step ID and the nodes allocated to the step to srun_cr.
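
       The socket handshake can be illustrated with a small Python sketch.  The one-line text message used
       here ("jobid stepid nodelist") is an assumption made for the example and is not SLURM's actual wire
       format; the function names listen, receive_step and report_step are likewise hypothetical.  The
       variable name follows the spelling used in SLURM's source.

```python
import os
import socket

ENV_NAME = "SLURM_SRUN_CR_SOCKET"  # spelling as used in SLURM's source


def listen(path):
    """Wrapper side: create the Unix domain socket and start listening."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    return srv


def receive_step(srv):
    """Wrapper side: accept one connection and parse the step details."""
    conn, _ = srv.accept()
    with conn:
        data = b""
        while not data.endswith(b"\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    # Assumed message layout: "<job id> <step id> <nodelist>\n".
    job_id, step_id, nodelist = data.decode().split()
    return int(job_id), int(step_id), nodelist


def report_step(job_id, step_id, nodelist, environ=os.environ):
    """Launcher side: report the step if running under the wrapper."""
    path = environ.get(ENV_NAME)
    if path is None:
        return False  # not running under the wrapper; nothing to report
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
        cli.connect(path)
        cli.sendall(f"{job_id} {step_id} {nodelist}\n".encode())
    return True
```

       Passing the socket address through the environment lets the launcher detect the wrapper without any
       extra command-line option: when the variable is absent, the launcher simply skips the report.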

       Upon checkpoint, srun_cr checks whether the tasks have been launched.  If so, srun_cr first forwards
       the checkpoint request to the tasks by calling the SLURM API slurm_checkpoint_tasks() before dumping its
       own process context.
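
       The ordering on checkpoint can be sketched as follows: the request is forwarded to the tasks only when
       a step is known to have been launched, and the wrapper's own context is dumped afterwards.  StepInfo,
       on_checkpoint and the two callables are hypothetical stand-ins; in srun_cr the forwarding is done by
       slurm_checkpoint_tasks() and the dump by BLCR.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepInfo:
    """Details the wrapper learned about a launched step (hypothetical)."""
    job_id: int
    step_id: int
    nodelist: str


def on_checkpoint(step: Optional[StepInfo],
                  forward_to_tasks: Callable[[StepInfo], None],
                  dump_self: Callable[[], None]) -> List[str]:
    """Run the checkpoint steps in order and return what was done."""
    done = []
    if step is not None:
        # Tasks exist, so they must be checkpointed before the wrapper.
        forward_to_tasks(step)
        done.append("forward")
    # The wrapper's own process context is always dumped last.
    dump_self()
    done.append("dump")
    return done
```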

       Upon restart, srun_cr checks to see if the tasks have been  previously  launched  and  checkpointed.   If
       true, the environment variable SLURM_RESTART_DIR is set to the directory of the checkpoint image files of
       the tasks.  Then srun is forked and executed again.  The environment variable will be used  by  the  srun
       command to restart execution of the tasks from the previous checkpoint.
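
       A minimal Python sketch of the restart step, assuming the wrapper already knows the checkpoint image
       directory: the variable is added to the child's environment only when a previous checkpoint exists,
       and the command is then launched again.  The helper names restart_env and relaunch are hypothetical.

```python
import os
import subprocess


def restart_env(checkpoint_dir, base_env=None):
    """Return the environment for the relaunched command."""
    env = dict(os.environ if base_env is None else base_env)
    if checkpoint_dir is not None:
        # Set only when the tasks were previously launched and checkpointed.
        env["SLURM_RESTART_DIR"] = checkpoint_dir
    return env


def relaunch(argv, checkpoint_dir):
    """Run argv again with the restart environment; return its exit code."""
    return subprocess.run(argv, env=restart_env(checkpoint_dir)).returncode
```

       When checkpoint_dir is None the environment is passed through unchanged, which corresponds to a fresh
       launch rather than a restart from checkpoint images.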

COPYING

       Copyright  (C) 2009 National University of Defense Technology, China.  Produced at National University of
       Defense Technology, China (cf, DISCLAIMER).

       This file is part of SLURM, a resource management program.  For details, see <http://slurm.schedmd.com/>.

       SLURM is free software; you can redistribute it and/or modify it under  the  terms  of  the  GNU  General
       Public License as published by the Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       SLURM is distributed in the hope that it will be useful, but  WITHOUT  ANY  WARRANTY;  without  even  the
       implied  warranty  of  MERCHANTABILITY  or  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
       License for more details.

SEE ALSO

       srun(1)