Provided by: lam-runtime_7.1.4-3.1_amd64


       lamssi_checkpoint_restart - overview of LAM's MPI checkpoint / restart SSI modules


       The  "kind"  for  checkpoint / restart SSI modules is "cr".  Specifically, the string "cr"
       (without the quotes) is the prefix that should be used with the mpirun command  line  with
       the -ssi switch.  For example:

       mpirun -ssi cr blcr C my_mpi_program

       LAM/MPI  can  involuntarily  checkpoint  and restart parallel MPI jobs.  Doing so requires
       that LAM/MPI was compiled with thread support and that back-end checkpointing systems  are
       available  at  run-time.   MPI  jobs  will have to run with at least MPI_THREAD_SERIALIZED
       support.  If a job elects to run with  checkpoint/restart  support  and  an  available  cr
       module   is   found,   the   job's   thread   level  will  automatically  be  promoted  to
       MPI_THREAD_SERIALIZED.  See the User's Guide for more details.

   Checkpoint Phases
       LAM defines three phases for checkpoint / restart support in each MPI process:

       Checkpoint
           When the checkpoint request arrives, before the actual checkpoint occurs.

       Continue
           After a checkpoint has successfully completed, in the same process in which the
           checkpoint was invoked.

       Restart
           After a checkpoint has successfully completed, in a new / restarted process.

       The Continue and Restart phases are identical except for the process in which they are
       invoked: the Continue phase is invoked in the same process in which the Checkpoint phase
       was invoked, whereas the Restart phase is invoked only in newly restarted processes.


       LAM currently has two cr modules: blcr and self.  For an MPI job to be checkpointed and
       restarted, all of its MPI SSI modules must support checkpoint/restart.  Currently, this
       means using the crtcp RPI module, or the gm RPI module when compiled with gm_get()
       support (see the User's Guide for more details).

   blcr CR Module
       The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a  software  system
       from   Lawrence   Berkeley   Labs.    See   the   project   web  page  for  more  details:

       The blcr module has one SSI parameter:

           blcr's default priority is 50.

   self CR Module
       The self module, when used with checkpoint/restart SSI  modules,  will  invoke  the  user-
       defined  functions  to  save  and  restore checkpoints. It is simply a mechanism for user-
       defined functions to be invoked at LAM's Checkpoint, Continue, and Restart phases.  Hence,
       the  only  data  that  is  saved  during  the  checkpoint is what is written in the user's
       checkpoint function. No MPI library state is saved at all.

       As such, the model for the self module is slightly different from that of, for example,
       the blcr module. Specifically, the Restart function is not invoked in the process image
       of the process that was checkpointed. Instead, the Restart phase is invoked during
       MPI_INIT of a new instance of the application (i.e., it starts over from main()).

       Multiple SSI parameters are available:

           Specify  a  string  prefix  for  the  name  of  the  checkpoint, continue, and restart
           functions  that   should   be   invoked   by   LAM.    That   is,   specifying   "-ssi
           cr_self_user_prefix  foo"  means that LAM expects to find three functions at run-time:
           int  foo_checkpoint(),  int  foo_continue(),  and  int  foo_restart().   This   is   a
           convenience parameter that can be used instead of the three parameters listed below.

           Name of the user function to invoke during the Checkpoint phase.

           Name of the user function to invoke during the Continue phase.

           Name of the user function to invoke during the Restart phase.

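       If none of these parameters are specified and the self module is selected, it will use the
       default prefix lam_cr_self; following the naming pattern above, LAM then expects to find
       lam_cr_self_checkpoint(), lam_cr_self_continue(), and lam_cr_self_restart().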
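
       The way a prefix maps to the three function names can be illustrated with a small C
       sketch. The helper cr_self_function_name() is hypothetical and is not part of LAM; it
       only mirrors the naming rule described above (prefix, underscore, phase name) and falls
       back to the documented default prefix when none is given.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a LAM API): write "<prefix>_<phase>" into out.
 * A NULL prefix stands for "no prefix parameter given", in which case the
 * documented default prefix lam_cr_self is used.
 * Returns 0 on success, -1 if the name does not fit in out.
 */
int cr_self_function_name(const char *prefix, const char *phase,
                          char *out, size_t out_len)
{
    assert(phase != NULL && out != NULL);

    if (prefix == NULL) {
        prefix = "lam_cr_self";
    }

    int n = snprintf(out, out_len, "%s_%s", prefix, phase);

    /* snprintf reports the length it wanted; reject truncated names. */
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

       With the prefix "foo" and the phase "checkpoint" the helper produces "foo_checkpoint";
       with no prefix and the phase "restart" it produces "lam_cr_self_restart".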

       Finally, the usual priority SSI parameter is also available:

           self's default priority is 25.
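
       As a minimal sketch of what user-defined functions for the self module might look like,
       the following C code saves one piece of application state in the Checkpoint function and
       reloads it in the Restart function. The function names follow the "foo" prefix example
       above. The state variable, the file name foo_state.ckpt, and the convention of returning
       0 on success are assumptions made for illustration; this page does not specify them, nor
       how the functions must be exported for LAM to find them, so consult the User's Guide.

```c
#include <stdio.h>

/* Illustrative application state that should survive a restart. */
long foo_iteration = 0;

/* Hypothetical checkpoint file name; a real job would likely use one per rank. */
static const char *foo_state_file = "foo_state.ckpt";

/* Checkpoint phase: write the state out. Nothing else is saved by the self module. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(foo_state_file, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "%ld\n", foo_iteration) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;   /* assumption: 0 means success */
}

/* Continue phase: same process as the checkpoint, so the state is still in memory. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: runs in a new instance of the program, so reload the state. */
int foo_restart(void)
{
    FILE *fp = fopen(foo_state_file, "r");
    if (fp == NULL) {
        return -1;
    }
    int ok = fscanf(fp, "%ld", &foo_iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

       Combining the mpirun example at the top of this page with the prefix parameter, such a
       program would presumably be launched along the lines of:

       mpirun -ssi cr self -ssi cr_self_user_prefix foo C my_mpi_program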


       lamssi(7), mpirun(1), LAM User's Guide