Provided by: gridengine-common_6.2u5-7.4_all

NAME

       checkpoint - Sun Grid Engine checkpointing environment configuration file format

DESCRIPTION

       Checkpointing is a facility to save the complete status of an executing program or job and to restore and
       restart from this so-called checkpoint at a later point in time if the original program or job was
       halted, e.g. by a system crash.

       Sun Grid Engine provides various levels of checkpointing support (see  sge_ckpt(1)).   The  checkpointing
       environment  described  here is a means to configure the different types of checkpointing in use for your
       Sun Grid Engine cluster or parts thereof. For that purpose you can define the operations to be executed
       when initiating a checkpoint, when migrating a checkpoint to another host or when restarting a
       checkpointed application, as well as the list of queues which are eligible for a checkpointing
       method.

       Supporting different operating systems may force Sun Grid Engine to introduce operating system
       dependencies into the checkpointing configuration file, and updates of the supported operating system
       versions may lead to frequently changing implementation details. Please refer to the
       <sge_root>/ckpt directory for more information.

       Please  use  the  -ackpt,  -dckpt,  -mckpt  or  -sckpt  options  to  the  qconf(1)  command to manipulate
       checkpointing environments from the command-line or use the corresponding qmon(1) dialogue for  X-Windows
       based interactive configuration.

       Note that Sun Grid Engine allows backslashes (\) to be used to escape newline (\newline) characters. The
       backslash and the newline are replaced with a space (" ") character before any interpretation.
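
       The continuation rule above can be illustrated with a minimal Python sketch. It is not part of Sun
       Grid Engine. It assumes that each logical line holds an attribute name followed by whitespace and a
       value; the sample environment, its paths and the helper names join_continuations and
       parse_attributes are hypothetical.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


def parse_attributes(text: str) -> dict:
    """Split each logical line into an attribute name and its value.

    Assumption: a whitespace-separated 'name value' layout per line.
    """
    attributes = {}
    for line in join_continuations(text).splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = line.split(None, 1)
        attributes[parts[0]] = parts[1] if len(parts) > 1 else ""
    return attributes


# Hypothetical environment; the name, paths and command are made up.
SAMPLE = (
    "ckpt_name     my_ckpt\n"
    "interface     userdefined\n"
    "ckpt_command  /opt/site/ckpt.sh \\\n"
    "              --verbose\n"
    "ckpt_dir      /var/tmp/ckpt\n"
    "when          sx\n"
)

config = parse_attributes(SAMPLE)
```

       In this sketch the two physical lines of ckpt_command become one logical line before the attribute
       name and value are separated, mirroring the order described above.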

FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment as defined for ckpt_name in sge_types(1). It is used with the
       qsub(1) -ckpt switch or with the qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       cray-ckpt
              The Cray kernel level checkpointing is assumed.

       transparent
              Sun Grid Engine assumes that the jobs submitted with reference to this checkpointing interface use
              a checkpointing library such as provided by the public domain package Condor.

       userdefined
              Sun  Grid  Engine  assumes  that the jobs submitted with reference to this checkpointing interface
              perform their private checkpointing method.

       application-level
              Uses all of the interface commands configured in the checkpointing object, as with the
              kernel level checkpointing interfaces (cpr, cray-ckpt, etc.), except for the restart_command
              (see below). The restart_command is not used, even if it is configured; the job script is
              invoked on restart instead.

   ckpt_command
       A command-line type command string to be executed by Sun Grid Engine in order to initiate a checkpoint.

   migr_command
       A  command-line  type  command  string  to  be  executed  by  Sun  Grid  Engine  during  a migration of a
       checkpointing job from one host to another.

   restart_command
       A command-line type command string to be executed  by  Sun  Grid  Engine  when  restarting  a  previously
       checkpointed application.

   clean_command
       A command-line type command string to be executed by Sun Grid Engine in order to clean up after a
       checkpointed application has finished.

   ckpt_dir
       A file system location in which checkpoints of potentially considerable size should be stored.

   ckpt_signal
       A Unix signal to be sent to a job by Sun Grid Engine to initiate a checkpoint generation. The  value  for
       this  field  can either be a symbolic name from the list produced by the -l option of the kill(1) command
       or an integer number which must be a valid signal on the systems used for checkpointing.
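
       As an illustration only, the following Python sketch resolves such a value to a signal number. It
       checks the value against the signals known on the machine running the sketch, not on the execution
       hosts, and it assumes that symbolic names may be written with or without a SIG prefix. The function
       name resolve_signal is hypothetical.

```python
import signal


def resolve_signal(value: str) -> int:
    """Map a symbolic name such as 'USR2' or an integer string to a signal number.

    Raises ValueError if the local system does not know the signal.
    """
    value = value.strip()
    if value.isdigit():
        number = int(value)
        signal.Signals(number)  # raises ValueError for unknown numbers
        return number
    name = value.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name  # assumption: accept names without the prefix
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

       A pre-check of this kind only catches misspelled names or out-of-range numbers; whether the signal
       is valid on every checkpointing host still has to be verified separately.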

   when
       The points in time when checkpoints are expected to be generated. Valid values for this parameter are
       composed of the letters s, m, x and r and any combination thereof, without any separating character in
       between. The same letters are allowed for the -c option of the qsub(1) command, which will override the
       definitions in the checkpointing environment in use. The meaning of the letters is defined as follows:

       s      A  job is checkpointed, aborted and if possible migrated if the corresponding sge_execd(8) is shut
              down on the job's machine.

       m      Checkpoints are generated periodically at the min_cpu_interval interval defined by the queue  (see
              queue_conf(5)) in which a job executes.

       x      A  job  is  checkpointed,  aborted  and  if  possible  migrated  as soon as the job gets suspended
              (manually as well as automatically).

       r      A job will be rescheduled (not checkpointed) when the host on which the job currently runs goes
              into unknown state and the time interval reschedule_unknown (see sge_conf(5)) defined in the
              global/local cluster configuration is exceeded.
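
       The letter combinations described above can be checked with a short Python sketch. It is a
       simplified illustration, not the validation performed by Sun Grid Engine: rejecting an empty value
       and silently accepting repeated letters are assumptions, and the names WHEN_LETTERS and parse_when
       are hypothetical.

```python
# Short summaries of the four letters described in this section.
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host goes into unknown state",
}


def parse_when(value: str) -> set:
    """Return the set of letters in a 'when' value, or raise ValueError."""
    if not value:
        # Assumption: an empty value is treated as invalid here.
        raise ValueError("empty 'when' value")
    unknown = set(value) - set(WHEN_LETTERS)
    if unknown:
        raise ValueError(f"invalid letters in 'when' value: {sorted(unknown)}")
    return set(value)


for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```

       Because no separator is allowed between the letters, a value such as "s,x" is rejected by this
       sketch, while "sx" and "xs" yield the same set.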

RESTRICTIONS

       Note that the checkpointing, migration and restart procedures provided by default with the Sun Grid
       Engine distribution, as well as the way they are invoked in the ckpt_command, migr_command or
       restart_command parameters of any default checkpointing environment, should not be changed. Otherwise,
       their functionality becomes the full responsibility of the administrator configuring the checkpointing
       environment. Sun Grid Engine will just invoke these procedures and evaluate their exit status. If the
       procedures do not perform their tasks properly or are not invoked in a proper fashion, the
       checkpointing mechanism may behave unexpectedly, and Sun Grid Engine has no means to detect this.

SEE ALSO

       sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1), qsub(1), sge_execd(8).

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.

SGE 6.2u5                                            $Date$                                        CHECKPOINT(5)