Provided by: gridengine-common_6.2u5-7.3_all bug

NAME

       checkpoint - Sun Grid Engine checkpointing environment configuration file format

DESCRIPTION

       Checkpointing is a facility to save the complete status of an executing program or job and
       to restore and restart from this so called checkpoint at a later  point  of  time  if  the
       original program or job was halted, e.g.  through a system crash.

       Sun  Grid  Engine provides various levels of checkpointing support (see sge_ckpt(1)).  The
       checkpointing environment described here is a means to configure the  different  types  of
       checkpointing  in  use for your Sun Grid Engine cluster or parts thereof. For that purpose
       you can define the operations which  have  to  be  executed  in  initiating  a  checkpoint
       generation,  a  migration  of  a checkpoint to another host or a restart of a checkpointed
       application as well as the list of queues which are eligible for a checkpointing method.

       Supporting different operating systems may easily  force  Sun  Grid  Engine  to  introduce
       operating  system  dependencies  for  the configuration of the checkpointing configuration
       file and updates of the  supported  operating  system  versions  may  lead  to  frequently
       changing  implementation  details.  Please refer to the <sge_root>/ckpt directory for more
       information.

       Please use the -ackpt, -dckpt, -mckpt  or  -sckpt  options  to  the  qconf(1)  command  to
       manipulate  checkpointing  environments  from  the  command-line  or use the corresponding
       qmon(1) dialogue for X-Windows based interactive configuration.

       Note, Sun Grid Engine  allows  backslashes  (\)  be  used  to  escape  newline  (\newline)
       characters. The backslash and the newline are replaced with a space (" ") character before
       any interpretation.

FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment as defined for ckpt_name in sge_types(1).  To be
       used in the qsub(1) -ckpt switch or for the qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       cray-ckpt
              The Cray kernel level checkpointing is assumed.

       transparent
              Sun   Grid   Engine  assumes  that  the  jobs  submitted  with  reference  to  this
              checkpointing interface use a checkpointing library such as provided by the  public
              domain package Condor.

       userdefined
              Sun   Grid   Engine  assumes  that  the  jobs  submitted  with  reference  to  this
              checkpointing interface perform their private checkpointing method.

       application-level
              Uses all of the interface commands configured in the checkpointing object  like  in
              the case of one of the kernel level checkpointing interfaces (cpr, cray-ckpt, etc.)
              except for the restart_command (see below), which  is  not  used  (even  if  it  is
              configured) but the job script is invoked in case of a restart instead.

   ckpt_command
       A  command-line type command string to be executed by Sun Grid Engine in order to initiate
       a checkpoint.

   migr_command
       A command-line type command string to be executed by Sun Grid Engine during a migration of
       a checkpointing job from one host to another.

   restart_command
       A  command-line  type  command  string to be executed by Sun Grid Engine when restarting a
       previously checkpointed application.

   clean_command
       A command-line type command string to be executed by Sun Grid Engine in order  to  cleanup
       after a checkpointed application has finished.

   ckpt_dir
       A  file  system  location  to which checkpoints of potentially considerable size should be
       stored.

   ckpt_signal
       A Unix signal to be sent to a job by Sun Grid Engine to initiate a checkpoint  generation.
       The  value  for  this field can either be a symbolic name from the list produced by the -l
       option of the kill(1) command or an integer number which must be a  valid  signal  on  the
       systems used for checkpointing.

   when
       The  points  of time when checkpoints are expected to be generated.  Valid values for this
       parameter are composed by the letters s, m, x and r and any combinations  thereof  without
       any separating character in between. The same letters are allowed for the -c option of the
       qsub(1)  command  which  will  overwrite  the  definitions  in  the   used   checkpointing
       environment.  The meaning of the letters is defined as follows:

       s      A  job  is  checkpointed,  aborted  and  if  possible migrated if the corresponding
              sge_execd(8) is shut down on the job's machine.

       m      Checkpoints are generated periodically at the min_cpu_interval interval defined  by
              the queue (see queue_conf(5)) in which a job executes.

       x      A  job  is  checkpointed,  aborted and if possible migrated as soon as the job gets
              suspended (manually as well as automatically).

       r      A job will be rescheduled (not  checkpointed)  when  the  host  on  which  the  job
              currently  runs  went  into  unknown state and the time interval reschedule_unknown
              (see sge_conf(5))  defined  in  the  global/local  cluster  configuration  will  be
              exceeded.

RESTRICTIONS

       Note,  that  the  functionality  of  any  checkpointing,  migration  or restart procedures
       provided by default with the Sun Grid Engine distribution as well as the way how they  are
       invoked  in  the  ckpt_command,  migr_command or restart_command parameters of any default
       checkpointing environments should not be changed or otherwise  the  functionality  remains
       the  full  responsibility  of the administrator configuring the checkpointing environment.
       Sun Grid Engine will just invoke these procedures and evaluate their exit status.  If  the
       procedures do not perform their tasks properly or are not invoked in a proper fashion, the
       checkpointing mechanism may behave unexpectedly, Sun Grid Engine has no  means  to  detect
       this.

SEE ALSO

       sge_intro(1), sge_ckpt(1), sge__types(1), qconf(1), qmod(1), qsub(1), sge_execd(8).

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.