Provided by: lam-runtime_7.1.4-7.1build2_amd64

NAME

       lamssi_checkpoint_restart - overview of LAM's MPI checkpoint / restart SSI modules

DESCRIPTION

       The  "kind"  for  checkpoint  /  restart SSI modules is "cr".  Specifically, the string "cr" (without the
       quotes) is the prefix that should be used with the  mpirun  command  line  with  the  -ssi  switch.   For
       example:

       mpirun -ssi cr blcr C my_mpi_program

       LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.  Doing so requires that LAM/MPI was
       compiled with thread support and that back-end checkpointing systems are available at run-time.  MPI jobs
       must run with at least MPI_THREAD_SERIALIZED support.  If a job elects to run with checkpoint/restart
       support and an available cr module is found, the job's thread level is automatically promoted to
       MPI_THREAD_SERIALIZED.  See the User's Guide for more details.

   Checkpoint Phases
       LAM defines three phases for checkpoint / restart support in each MPI process:

       Checkpoint.
           When the checkpoint request arrives, before the actual checkpoint occurs.

       Continue.
           After a checkpoint has successfully completed, in the same process in which the checkpoint was invoked.

       Restart.
           After a checkpoint has successfully completed, in a new / restarted process.

       The Continue and Restart phases are identical except for the process in which they are invoked: the
       Continue phase is invoked in the same process in which the Checkpoint phase was invoked, whereas the
       Restart phase is invoked only in newly restarted processes.

AVAILABLE MODULES

       LAM currently has two cr modules: blcr and self.  For an MPI job to be checkpointed and restarted, all
       of its MPI SSI modules must support checkpoint/restart.  Currently, this means using the crtcp RPI
       module, or the gm RPI module when compiled with gm_get() support (see the User's Guide for more
       details).

   blcr CR Module
       The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a software system from Lawrence
       Berkeley National Laboratory.  See the project web page for more details:
       http://www.nersc.gov/research/ftg/checkpoint/.

       The blcr module has one SSI parameter:

       cr_blcr_priority
           blcr's default priority is 50.

   self CR Module
       The self module, when used with checkpoint/restart SSI modules, will invoke the user-defined functions to
       save and restore checkpoints. It is simply a mechanism for user-defined functions to be invoked at  LAM's
       Checkpoint,  Continue,  and  Restart  phases. Hence, the only data that is saved during the checkpoint is
       what is written in the user's checkpoint function. No MPI library state is saved at all.

       As such, the model for the self module is slightly different from that of, for example, the blcr module.
       Specifically, the Restart function is not invoked in the process image that was checkpointed.  Instead,
       the Restart phase is invoked during MPI_INIT of a new instance of the application (i.e., it starts over
       from main()).
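
       The model described above can be sketched with three user functions that save and reload a single
       piece of application state.  This is a minimal illustration, not code from LAM: the "demo" prefix,
       the state variable, the file name, and the convention of returning 0 on success are all assumptions
       made for the example.

```c
#include <stdio.h>

/* Hypothetical application state: the only data this sketch preserves. */
long demo_iteration = 0;

/* Hypothetical file name; a real job would likely need one file per rank. */
const char *demo_state_file = "demo_state.txt";

/* Checkpoint phase: write the application state to disk.
 * Returning 0 for success is an assumption of this sketch. */
int demo_checkpoint(void)
{
    FILE *fp = fopen(demo_state_file, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%ld\n", demo_iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Continue phase: runs in the process that took the checkpoint, whose
 * memory is still intact, so there is nothing to reload. */
int demo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new process that began at main(), so the
 * state must be read back from the file written at checkpoint time. */
int demo_restart(void)
{
    FILE *fp = fopen(demo_state_file, "r");
    if (fp == NULL)
        return -1;
    int ok = fscanf(fp, "%ld", &demo_iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

       Because no MPI library state is saved, anything the application needs after a restart has to pass
       through functions like these.  The SSI parameters in this section tell LAM which function names to
       look for.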

       Multiple SSI parameters are available:

       cr_self_user_prefix
           Specify  a  string prefix for the name of the checkpoint, continue, and restart functions that should
           be invoked by LAM.  That is, specifying "-ssi cr_self_user_prefix foo" means that LAM expects to find
           three  functions  at run-time: int foo_checkpoint(), int foo_continue(), and int foo_restart().  This
           is a convenience parameter that can be used instead of the three parameters listed below.

       cr_self_user_checkpoint
           Name of the user function to invoke during the Checkpoint phase.

       cr_self_user_continue
           Name of the user function to invoke during the Continue phase.

       cr_self_user_restart
           Name of the user function to invoke during the Restart phase.

       If none of these parameters are specified and the self module is selected, it will use the default prefix
       "lam_cr_self".
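
       With the default prefix, the three function names follow the same pattern as the foo example above.
       The stubs below are only a sketch of what an application might define in that case; the counter
       array, the messages, and the return value of 0 for success are assumptions added for illustration.

```c
#include <stdio.h>

/* Demonstration-only counters: [0] checkpoint, [1] continue, [2] restart. */
int phase_calls[3] = {0, 0, 0};

/* Invoked during the Checkpoint phase when the default prefix is in use. */
int lam_cr_self_checkpoint(void)
{
    phase_calls[0]++;
    fprintf(stderr, "Checkpoint phase\n");
    return 0;   /* assumed success value */
}

/* Invoked during the Continue phase, in the process that was checkpointed. */
int lam_cr_self_continue(void)
{
    phase_calls[1]++;
    fprintf(stderr, "Continue phase\n");
    return 0;
}

/* Invoked during the Restart phase, in a newly started process. */
int lam_cr_self_restart(void)
{
    phase_calls[2]++;
    fprintf(stderr, "Restart phase\n");
    return 0;
}
```

       A program that defines these names should then need only the module selection on the command line,
       for example "mpirun -ssi cr self C my_mpi_program", with no cr_self_user_* parameters.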

       Finally, the usual priority SSI parameter is also available:

       cr_self_priority
           self's default priority is 25.

SEE ALSO

       lamssi(7), mpirun(1), LAM User's Guide