Provided by: blcr-util_0.8.5-2.2_amd64

NAME

       cr_checkpoint - checkpoints a process, process group, or session.

SYNOPSIS

       cr_checkpoint [options] ID

DESCRIPTION

       Invoking  cr_checkpoint causes a process (with or without all of its descendants), all processes within a
       process group, or all processes within a session, to be checkpointed.  The result is  a  checkpoint  file
       (or  a  directory with one checkpoint file per process) that contains all the state needed to restart the
       process(es) at a later time.  Checkpointed processes can be restarted via cr_restart(1).

       To be checkpointed by cr_checkpoint, a process must have the libcr.so library (or one of  its  relatives)
       loaded.  This can be achieved by starting the program with cr_run(1), or by linking your application with
       -lcr.  Or, the library may be loaded by other libraries you have linked with (such as a  checkpoint-ready
       MPI  library),  or  your  system's parallel job startup script, etc.  Check your system documentation for
       details.

   File creation/replacement
       By default (or if --atomic is passed) cr_checkpoint creates the new  context  file/directory  atomically:
       either  the checkpoint fails (and any existing context file/directory is unchanged), or it appears in the
       directory ready to be used by cr_restart.  If an existing checkpoint with the same file name  exists,  it
       will either be left unmodified (if the new checkpoint fails for any reason), or replaced atomically (via
       rename(2)).  If --backup[=NAME] is passed, any existing checkpoint will be backed up instead, either to
       NAME or with a numbered extension (.~1~, .~2~, etc., with more recent checkpoints having higher numbers).
       If --clobber is passed, the checkpoint will immediately remove any existing checkpoint  files,  and  will
       write  the checkpoint directly out into the target file/directory: this option uses less disk space if an
       existing checkpoint is present, since the old checkpoint is immediately discarded, but if the  checkpoint
       fails,  the pre-existing checkpoint is lost.  Finally, if --noclobber is passed, then the checkpoint will
       fail if the target file/directory exists.
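
       The default atomic policy follows the familiar write-then-rename pattern.  The shell sketch below only
       illustrates that pattern and is not BLCR's actual code; the function name atomic_write and the
       temporary-file naming are assumptions made for the example.

```shell
# Illustrative only: atomic_write is a hypothetical helper, not part of BLCR.
# It copies stdin to a temporary file next to TARGET, then renames it into place.
atomic_write() {
    target=$1
    # Create the temporary file in the same directory so that mv is a
    # rename(2) within one file system rather than a copy.
    tmp=$(mktemp "$target.tmp.XXXXXX") || return 1
    if cat > "$tmp"; then
        mv -f "$tmp" "$target"      # readers see either the old file or the new one
    else
        rm -f "$tmp"                # on failure any existing target is left untouched
        return 1
    fi
}
```

       With --clobber, by contrast, data is written straight into the target, so a failure part way through
       leaves no usable checkpoint behind.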

   File sync
       By default (or when --sync is passed), cr_checkpoint waits until the checkpoint is complete in memory,
       and additionally calls fsync(2) on all files and directories involved in the checkpoint (including
       backup files) to flush them to disk before exiting.  Passing --nosync causes these fsync calls to be
       skipped.

   Timeout
       A maximum timeout in seconds can be set for a checkpoint via the --time flag:  if  the  checkpoint  takes
       longer than this, cr_checkpoint will print an error message and exit with an error.  If a timeout occurs,
       the state of the process or processes that were being checkpointed is undefined.
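
       A script that uses --time should check the exit status of cr_checkpoint.  The wrapper below is a
       minimal sketch: the function name and the CR_CHECKPOINT variable (an override for the command name) are
       assumptions introduced for the example, not BLCR features.

```shell
# Hypothetical wrapper. CR_CHECKPOINT is an assumed override for the command
# name, used so the function can be exercised without BLCR installed.
checkpoint_with_timeout() {
    pid=$1
    secs=$2
    if "${CR_CHECKPOINT:-cr_checkpoint}" --tree --time "$secs" "$pid"; then
        echo "checkpoint of $pid completed"
    else
        status=$?
        echo "checkpoint of $pid failed (exit status $status)" >&2
        echo "if this was a timeout, the state of the target is undefined" >&2
        return "$status"
    fi
}
```

       For example, "checkpoint_with_timeout 23452 60" would allow the process tree rooted at 23452 one minute
       to complete its checkpoint.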

   Signals
       By default checkpointed processes continue to run after a checkpoint is complete.  Alternatively, you may
       specify that they be stopped (via --stop), or terminated/aborted/killed (via --term, --abort, or --kill).
       This is done by sending the appropriate signal to every process that is part of the checkpoint.   If  the
       processes  were stopped at the time the checkpoint was requested, then --cont may be used to send SIGCONT
       to all processes after the checkpoint is completed.

   Memory mapped files
       By default, checkpoints do not include any files that are mmap()ed into the process address space  unless
       they  are already unlinked at the time the checkpoint is taken.  This is a space/time saving optimization
       under the assumption that the files required will still be present (and  uncorrupted)  at  restart  time.
       Typically the largest savings come from not saving the executable file or dynamic (a.k.a. shared)
       libraries.  However, options exist to cause the checkpoint to save these files as well.  The flag
       --save-exe will cause the executable file to be included in the context file.  The flag --save-private
       will include in the context file any files that are mapped with the MAP_PRIVATE flag, which under Linux
       includes the executable and dynamic/shared libraries.  The flag --save-shared is for saving files that
       are mapped with the MAP_SHARED flag.  Note that this is not the flag you want for shared libraries.  At
       restart, any file saved by these flags will be mapped into the process regardless of whether any file
       exists at the original location.  If there is a file at the original location, it remains untouched by
       the restart.  Finally, --save-all and --save-none will cause all (or none) of these optional mmap()ed
       files to be saved.  The default is --save-none.  When several of these options are passed, they are
       processed from left to right with all options being additive, except for --save-none, which cancels the
       effects of any of these options appearing earlier.
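
       The left-to-right rule can be modelled in a few lines of shell.  This is a sketch of the documented
       behavior, not code taken from cr_checkpoint; the function name and its output format are assumptions.

```shell
# Hypothetical model of how the --save-* options combine.
# Prints which categories of mapped files would end up being saved.
effective_save_flags() {
    exe=0 private=0 shared=0            # the default is --save-none
    for opt in "$@"; do
        case "$opt" in
            --save-exe)     exe=1 ;;
            --save-private) private=1 ;;
            --save-shared)  shared=1 ;;
            --save-all)     exe=1 private=1 shared=1 ;;
            --save-none)    exe=0 private=0 shared=0 ;;   # cancels earlier options
        esac
    done
    echo "exe=$exe private=$private shared=$shared"
}
```

       Under this model, "--save-all --save-none --save-private" leaves only privately mapped files selected,
       because --save-none discards the earlier --save-all.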

   Checkpointing ptrace()ed processes
       There is (currently) no way to fully transparently deal with checkpoints  of  processes  that  are  being
       traced with ptrace(2).  Therefore, the default behavior (also available via --ptraced-error) is to return
       an error if any of the processes to be checkpointed are currently being ptraced.  However, there are  two
       other possible behaviors to choose among:

       --ptraced-skip
              Ptraced processes will be silently excluded from the checkpoint.  No error is generated unless
              this results in zero processes checkpointed.

       --ptraced-allow
              Ptraced processes will be checkpointed just  like  any  other  processes.   WARNING:  Because  the
              checkpointed  process and the BLCR kernel module must interact using signals and system calls, the
              debugger (or other tracer) may need to `continue' the target process(es), possibly more than once,
              to allow the checkpoint to complete.
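
       On Linux, whether a process is currently being traced can be checked in advance through the TracerPid
       field of /proc/PID/status, which is 0 when no tracer is attached.  The sketch below is an illustrative
       pre-check and not something cr_checkpoint itself provides; both function names are assumptions.

```shell
# Hypothetical helpers, not part of BLCR.
# Print the TracerPid value found in a /proc/PID/status style file.
tracer_pid_from_status() {
    awk '/^TracerPid:/ { print $2 }' "$1"
}

# Succeed if process $1 currently has a tracer attached (Linux only).
is_ptraced() {
    tp=$(tracer_pid_from_status "/proc/$1/status" 2>/dev/null)
    [ -n "$tp" ] && [ "$tp" -ne 0 ]
}
```

       A script could call is_ptraced on the target before deciding whether to pass --ptraced-skip or
       --ptraced-allow.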

   Checkpointing ptrace()ing processes
       There  is  (currently)  no way to fully transparently deal with checkpoints of processes that are tracing
       other processes using ptrace(2).  Therefore, the default behavior (also available via --ptracer-error) is
       to  return  an  error  if any of the processes to be checkpointed are currently ptracing other processes.
       However --ptracer-skip is available to cause cr_checkpoint to silently exclude such  processes  from  the
       checkpoint.  No error is generated in that case unless this would result in zero processes checkpointed.

OPTIONS

   General options:
       -v, --verbose
              print progress messages to stderr.

       -q, --quiet
              suppress error/warning messages to stderr.

       -?, --help
              print this message and exit.

       --version
              print version information and exit.

   Options for scope of the checkpoint:
       -T, --tree
              ID  identifies  a  process id.  It and all of its descendants are to be checkpointed.  This is the
              default.

       -p, --pid, --process
              ID identifies a single process id.

       -g, --pgid, --group
              ID identifies a process group id.

       -s, --sid, --session
              ID identifies a session id.

   Options for destination location of the checkpoint:
       -c, --cwd
              checkpoint saved as a single 'context.ID' file in cr_checkpoint's working directory (default).

       -d, --dir DIR
              checkpoint saved in new directory DIR, with one 'context.ID' file per process (unimplemented).

       -f, --file FILE
              checkpoint saved as FILE.

       -F, --fd FD
              checkpoint written to an open file descriptor.

   Options for creation/replacement policy for checkpoint files:
       --atomic
              checkpoint created/replaced atomically (default).

       --backup[=NAME]
              checkpoint created atomically, and any existing checkpoint backed up to NAME or *.~1~, *.~2~, etc.

       --clobber
              checkpoint written incrementally to target, overwriting any pre-existing checkpoint.

       --noclobber
              checkpoint will fail if the target file exists.

              These options are ignored if the destination is a file descriptor.

   Options for signal sent to process(es) after checkpoint:
       --run  no signal sent: continue execution (default).

       -S, --signal NUM
              signal NUM sent to all processes.

       --stop SIGSTOP sent to all processes.

       --term SIGTERM sent to all processes.

       --abort
              SIGABRT sent to all processes.

       --kill SIGKILL sent to all processes.

       --cont SIGCONT sent to all processes.

              Options in this group are mutually exclusive.  If more than one is given then only the  last  will
              be honored.

   Options for file system synchronization (default is --sync):
       --sync fsync checkpoint file(s) to disk (default).

       --nosync
              do not fsync checkpoint file(s) to disk.

   Options to save optional portions of memory:
       --save-exe
              save the executable file.

       --save-private
              save private mapped files (executables and libraries are mapped this way).

       --save-shared
              save shared mapped files (System V IPC is mapped this way).

       --save-all
              save all of the above.

       --save-none
              save none of the above (the default).

   Options for ptraced processes (default is --ptraced-error):
       --ptraced-error
              return an error if a checkpoint is requested of a process being ptraced.

       --ptraced-skip
              ptraced  processes  are silently excluded from the checkpoint request.  If the checkpoint scope is
              --tree, then this will also exclude any children of such processes.  No error is  produced  unless
              this results in zero processes checkpointed.

       --ptraced-allow
              checkpoint  ptraced  processes  normally.   WARNING: This may require the tracer to "continue" the
              target process(es), possibly more than once.

   Options for processes ptracing others (default is --ptracer-error):
       --ptracer-error
              return an error if a checkpoint is requested of a process which is ptracing others.

       --ptracer-skip
              processes ptracing others are silently excluded from the checkpoint request.   If  the  checkpoint
              scope is --tree, then this will also exclude any children of such processes.  No error is produced
              unless this results in zero processes checkpointed.

   Options for kernel log messages (default is --kmsg-error):
       --kmsg-none
              don't report any kernel messages.

       --kmsg-error
              on checkpoint failure, report on  stderr  any  kernel  messages  associated  with  the  checkpoint
              request.

       --kmsg-warning
              report on stderr any kernel messages associated with the checkpoint request, regardless of success
              or failure.  Messages generated in the absence of failure are considered to be warnings.

              Options in this group are mutually exclusive.  If more than one is given then only the  last  will
              be honored.  Note that --quiet suppresses all stderr output, including these messages.

   Miscellaneous options:
       -t, --time SEC
              allow only SEC seconds for target to complete checkpoint (default: wait indefinitely).

EXAMPLES

       To checkpoint the process with process ID 23452, saving its state to file context.23452:

              cr_checkpoint -p 23452

       To checkpoint all the processes in process group 68473, and save them to file groupie:

              cr_checkpoint -g -f groupie 68473

       To checkpoint all the processes in session 8362, and save separate 'context.PID' files for each process in
       directory 'my_checkpoints':

              cr_checkpoint -s -d my_checkpoints 8362

BUGS

       Some features in this manpage may be unimplemented.

AUTHORS

       Jason Duell, Paul Hargrove, and Eric Roman, Lawrence Berkeley National Laboratory.

REPORTING BUGS

       Bug reports may be filed on the web at http://mantis.lbl.gov/bugzilla.

SEE ALSO

       cr_restart(1), cr_run(1)