Provided by: openmpi-doc_1.10.2-8ubuntu1_all

NAME

       OPAL_CRS  -  Open  PAL  MCA  Checkpoint/Restart  Service (CRS): Overview of Open PAL's CRS framework, and
       selected modules.  Open MPI 1.10.2.

DESCRIPTION

       Open PAL can involuntarily checkpoint and restart sequential programs.  Doing so requires that  Open  PAL
       was compiled with thread support and that the back-end checkpointing systems are available at run-time.

   Phases of Checkpoint / Restart
       Open PAL defines three phases for checkpoint / restart support in a process:

       Checkpoint
           When the checkpoint request arrives, the process is notified of the request before the checkpoint is
           taken.

       Continue
           After a checkpoint has successfully completed, the process that was checkpointed is notified of its
           successful continuation of execution.

       Restart
           After  a  checkpoint  has  successfully  completed,  a  new  /  restarted  process is notified of its
           successful restart.

       The Continue and Restart phases are identical except for the process in which they are invoked. The
       Continue phase is invoked in the same process in which the Checkpoint phase was invoked. The Restart
       phase is invoked only in newly restarted processes.
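
       The relationship between the three phases can be sketched in C as follows. This is only an
       illustration: the type and function names (crs_phase_t, phase_after_checkpoint, phase_name) are
       hypothetical and are not Open PAL identifiers.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; these are not Open PAL identifiers. */
typedef enum {
    PHASE_CHECKPOINT, /* notified before the checkpoint is taken */
    PHASE_CONTINUE,   /* notified in the process that was checkpointed */
    PHASE_RESTART     /* notified in a newly restarted process */
} crs_phase_t;

/* After a successful checkpoint, the phase that follows depends only on
 * which process receives the notification. */
static crs_phase_t phase_after_checkpoint(bool is_restarted_process)
{
    return is_restarted_process ? PHASE_RESTART : PHASE_CONTINUE;
}

/* Human-readable phase name, e.g. for log messages. */
static const char *phase_name(crs_phase_t phase)
{
    switch (phase) {
    case PHASE_CHECKPOINT: return "Checkpoint";
    case PHASE_CONTINUE:   return "Continue";
    case PHASE_RESTART:    return "Restart";
    }
    return "Unknown";
}
```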

GENERAL PROCESS REQUIREMENTS

       In order for a process to use the Open PAL CRS components, it must adhere to a few programmatic
       requirements.

       First, the program must call OPAL_INIT early in its execution. This should only be called once, and it is
       not possible to checkpoint the process without it first having called this function.

       The program must call OPAL_FINALIZE before termination. This does a significant amount of cleanup. If it
       is not called, it is very likely that remnants will be left in the filesystem.

       To checkpoint and restart a process, you must use the Open PAL tools. Using the back-end
       checkpointer's own checkpoint and restart tools will lead to undefined behavior. To checkpoint a process,
       use opal_checkpoint (opal_checkpoint(1)). To restart a process, use opal_restart (opal_restart(1)).

AVAILABLE COMPONENTS

       Open PAL ships with two CRS components, self and blcr, plus a none component that selects no
       checkpointer.

       The following MCA parameters apply to all components:

       crs_base_verbose
           Set the verbosity level for all components. Default is 0, or silent except on error.

   self CRS Component
       The  self  component  invokes  user-defined  functions  to  save  and restore checkpoints. It is simply a
       mechanism for user-defined functions to be invoked  at  Open  PAL's  Checkpoint,  Continue,  and  Restart
       phases. Hence, the only data that is saved during the checkpoint is what is written in the user's
       checkpoint function. No library state is saved at all.

       As such, the model for the self component is slightly different from that of other components.
       Specifically, the Restart function is not invoked in the same process image as the process that was
       checkpointed. The Restart phase is invoked during OPAL_INIT of the new instance of the application
       (i.e., it starts over from main()).

       The self component has the following MCA parameters:

       crs_self_prefix
           Specify a string prefix for the name of the checkpoint, continue, and restart functions that Open PAL
           will invoke during the respective stages. That is, specifying "-mca crs_self_prefix foo" means
           that Open PAL expects to find three functions at run-time:

              int foo_checkpoint()

              int foo_continue()

              int foo_restart()

           By default, the prefix is set to "opal_crs_self_user".

       crs_self_priority
           Set the self component's default priority.

       crs_self_verbose
           Set the verbosity level. Default is 0, or silent except on error.

       crs_self_do_restart
           This is mostly used internally. A general user should never need to set this value. It is set to
           non-0 when the new process should invoke the restart callback in OPAL_INIT. Default is 0, or normal
           execution.
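
       The three functions that the self component looks up for "-mca crs_self_prefix foo" might look
       like the following minimal sketch. Only the function names and the int return type come from the
       description above. The saved state (an iteration counter), the file name foo_state.ckpt, and the
       convention of returning 0 on success and -1 on failure are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical application state and checkpoint file name (assumptions). */
static long iteration = 0;
static const char *STATE_FILE = "foo_state.ckpt";

/* Checkpoint phase: write out everything needed to resume later.
 * Nothing else is saved; the self component stores no library state. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(STATE_FILE, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1; /* assumed convention: 0 means success */
}

/* Continue phase: runs in the process that was checkpointed, whose
 * in-memory state is still intact, so there is nothing to reload. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new instance of the application, so the
 * saved state has to be read back from the file. */
int foo_restart(void)
{
    long saved = 0;
    FILE *fp = fopen(STATE_FILE, "r");
    if (fp == NULL) {
        return -1;
    }
    int ok = fscanf(fp, "%ld", &saved) == 1;
    fclose(fp);
    if (!ok) {
        return -1;
    }
    iteration = saved;
    return 0;
}
```

       Because the Restart phase runs in a fresh instance that starts over from main(), anything not
       reloaded in foo_restart() is back at its initial value.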

   blcr CRS Component
       The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpointer is a software system developed at
       Lawrence Berkeley National Laboratory. See the project website for more details:

           http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

       The blcr component has the following MCA parameters:

       crs_blcr_priority
           Set the blcr component's default priority.

       crs_blcr_verbose
           Set the verbosity level. Default is 0, or silent except on error.

   none CRS Component
       The none component simply selects no CRS component. All of the CRS function calls return immediately with
       OPAL_SUCCESS.

       This component is the last component to be selected by default. This means that if another component is
       available, and the none component was not explicitly requested, then OPAL will attempt to activate all
       of the available components before falling back to this component.
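
       The fallback behavior can be pictured with the following sketch. It is not Open PAL's selection
       code: the crs_candidate_t type, the select_crs function, and the rule that the highest-priority
       available component wins are assumptions used to illustrate why none is chosen only when nothing
       else is usable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one CRS component (not an Open PAL type). */
typedef struct {
    const char *name;
    int priority;   /* assumed: a larger value is preferred */
    bool available; /* whether the component could be activated */
} crs_candidate_t;

/* Return the available candidate with the highest priority, or NULL if
 * no candidate is available. Giving "none" the lowest priority makes it
 * the choice of last resort. */
static const crs_candidate_t *select_crs(const crs_candidate_t *candidates,
                                         size_t count)
{
    const crs_candidate_t *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (!candidates[i].available) {
            continue;
        }
        if (best == NULL || candidates[i].priority > best->priority) {
            best = &candidates[i];
        }
    }
    return best;
}
```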

SEE ALSO

         opal_checkpoint(1), opal_restart(1)