oracular (7) opal_crs.7.gz

Provided by: openmpi-doc_4.1.6-13.3ubuntu2_all bug

NAME

       OPAL_CRS  -  Open  PAL  MCA  Checkpoint/Restart  Service (CRS): Overview of Open PAL's CRS
       framework, and selected modules.  Open MPI 4.1.6.

DESCRIPTION

       Open PAL can involuntarily checkpoint and restart sequential programs.  Doing so  requires
       that Open PAL was compiled with thread support and that the back-end checkpointing systems
       are available at run-time.

   Phases of Checkpoint / Restart
       Open PAL defines three phases for checkpoint / restart support in a procress:

       Checkpoint
           When the checkpoint request arrives, the procress is notified of  the  request  before
           the checkpoint is taken.

       Continue
           After  a  checkpoint has successfully completed, the same process as the checkpoint is
           notified of its successful continuation of execution.

       Restart
           After a checkpoint has successfully completed, a new / restarted process  is  notified
           of its successful restart.

       The  Continue  and  Restart  phases are identical except for the process in which they are
       invoked. The Continue phase is invoked in the same process as  the  Checkpoint  phase  was
       invoked. The Restart phase is only invoked in newly restarted processes.

GENERAL PROCESS REQUIREMENTS

       In  order  for  a  process  to  use  the  Open  PAL CRS components it must adhear to a few
       programmatic requirements.

       First, the program must call OPAL_INIT early in its execution. This should only be  called
       once, and it is not possible to checkpoint the process without it first having called this
       function.

       The program must call OPAL_FINALIZE before termination. This does a significant amount  of
       cleanup.  If  it  is  not  called,  then  it  is very likely that remnants are left in the
       filesystem.

       To checkpoint and restart a process you must use the Open PAL tools to do  so.  Using  the
       backend  checkpointer's  checkpoint and restart tools will lead to undefined behavior.  To
       checkpoint a process use opal_checkpoint (opal_checkpoint(1)).  To restart a  process  use
       opal_restart (opal_restart(1)).

AVAILABLE COMPONENTS

       Open PAL ships with two CRS components: self and blcr.

       The following MCA parameters apply to all components:

       crs_base_verbose
           Set the verbosity level for all components. Default is 0, or silent except on error.

   self CRS Component
       The  self  component invokes user-defined functions to save and restore checkpoints. It is
       simply a mechanism for user-defined functions to be  invoked  at  Open  PAL's  Checkpoint,
       Continue,  and Restart phases. Hence, the only data that is saved during the checkpoint is
       what is written in the user's checkpoint function. No libary state is saved at all.

       As such, the model for the self component is slightly differnt than for other  components.
       Specifically, the Restart function is not invoked in the same process image of the process
       that was checkpointed. The Restart phase is invoked during OPAL_INIT of the  new  instance
       of the applicaiton (i.e., it starts over from main()).

       The self component has the following MCA parameters:

       crs_self_prefix
           Speficy  a  string  prefix  for  the  name  of  the  checkpoint, continue, and restart
           functions that Open PAL  will  invoke  during  the  respective  stages.  That  is,  by
           specifying  "-mca  crs_self_prefix  foo"  means  that  Open  PAL expects to find three
           functions at run-time:

              int foo_checkpoint()

              int foo_continue()

              int foo_restart()

           By default, the prefix is set to "opal_crs_self_user".

       crs_self_priority
           Set the self components default priority

       crs_self_verbose
           Set the verbosity level. Default is 0, or silent except on error.

       crs_self_do_restart
           This is mostly internally used. A general user should never need to  set  this  value.
           This  is  set  to  non-0  when a the new process should invoke the restart callback in
           OPAL_INIT. Default is 0, or normal execution.

   blcr CRS Component
       The Berkeley Lab Checkpoint/Restart (BLCR) single-process checkpoint is a software  system
       developed  at  Lawrence  Berkeley  National  Laboratory.  See the project website for more
       details:

           http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml

       The blcr component has the following MCA parameters:

       crs_blcr_priority
           Set the blcr components default priority.

       crs_blcr_verbose
           Set the verbosity level. Default is 0, or silent except on error.

   none CRS Component
       The none component simply selects no CRS component. All of the CRS function  calls  return
       immediately with OPAL_SUCCESS.

       This component is the last component to be selected by default. This means that if another
       component is available, and the none component was not explicity requested then OPAL  will
       attempt to activate all of the available components before falling back to this component.

SEE ALSO

         opal_checkpoint(1), opal_restart(1)