Provided by: gridengine-common_8.1.9+dfsg-7build1_all

NAME

       checkpoint - Grid Engine checkpointing environment configuration file format

DESCRIPTION

       Checkpointing is a facility to save the complete status of an executing program or job and
       to restore and restart from this so-called checkpoint at a later point in time if the
       original program or job was halted, e.g. through a system crash.

       Grid  Engine  provides  various  levels  of  checkpointing support (see sge_ckpt(1)).  The
       checkpointing environment described here is a means to configure the  different  types  of
       checkpointing  in  use for your Grid Engine cluster or parts thereof. For that purpose you
       can define the operations which have to be executed in initiating a checkpoint generation,
       a migration of a checkpoint to another host, or a restart of a checkpointed application.

       Supporting different operating systems may easily force Grid Engine to introduce operating
       system dependencies into the checkpointing configuration file, and updates of the
       supported operating system versions may lead to frequently changing implementation
       details. Please refer to the <sge_root>/ckpt directory for more information.

       Please  use  the  -ackpt,  -dckpt,  -mckpt  or  -sckpt  options to the qconf(1) command to
       manipulate checkpointing environments from  the  command-line  or  use  the  corresponding
       qmon(1) dialogue for X-Windows based interactive configuration.

       Note that Grid Engine allows backslashes (\) to be used to escape newline characters. The
       backslash and the newline are replaced with a space character before any interpretation.
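
       The continuation rule above can be illustrated with a minimal Python sketch. The function
       name join_continuations and the sample command path are hypothetical and only serve the
       illustration; this is not Grid Engine's own parser.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space,
    as described for checkpointing environment files."""
    return text.replace("\\\n", " ")


# Hypothetical two-line entry; the script path is an assumption.
sample = "ckpt_command /usr/local/bin/ckpt.sh \\\n    $job_pid $ckpt_dir\n"
joined = join_continuations(sample)

# After joining, the entry occupies one logical line.
fields = joined.split()
```

       Because the pair is replaced by a space rather than removed, the words on either side of
       the line break stay separated.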

FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment in the format for ckpt_name in sge_types(5).
       This name is used with the qsub(1) -ckpt switch and with the qconf(1) options mentioned
       above.

   interface
       The type of checkpointing to be used. Currently, the following types are valid:

       hibernator
              The Hibernator kernel level checkpointing is used.

       cpr    The SGI kernel level checkpointing is used.

       transparent
              Grid  Engine  assumes  that the jobs submitted with reference to this checkpointing
              interface use a checkpointing library such as provided by the free package Condor.

       userdefined
              Grid Engine assumes that the jobs submitted with reference  to  this  checkpointing
              interface perform their private checkpointing method.

       application-level
              Uses all of the interface commands configured in the checkpointing object, as in
              the case of the kernel level checkpointing interfaces (cpr, etc.), except for the
              restart_command (see below). The restart_command is not used, even if it is
              configured; the job script is invoked on restart instead.
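
       As a rough illustration of the interface types listed above, the following Python sketch
       checks an interface value and picks what would be run on restart. The function name
       restart_target and its arguments are hypothetical. The page does not state what happens on
       restart for the transparent and userdefined interfaces, so the sketch returns None there.

```python
from typing import Optional

# Interface names taken from the list above.
KERNEL_LEVEL = {"hibernator", "cpr"}
VALID_INTERFACES = KERNEL_LEVEL | {"transparent", "userdefined", "application-level"}


def restart_target(interface: str, restart_command: str, job_script: str) -> Optional[str]:
    """Return what is invoked on restart, as far as this page specifies it."""
    if interface not in VALID_INTERFACES:
        raise ValueError(f"unknown checkpointing interface: {interface!r}")
    if interface == "application-level":
        # restart_command is ignored even if configured; the job script runs.
        return job_script
    if interface in KERNEL_LEVEL:
        return restart_command
    # transparent / userdefined: not specified here (assumption: no answer).
    return None
```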

   ckpt_command
       A command-line type command string to be executed by Grid Engine in order  to  initiate  a
       checkpoint.  The following pseudo-variables are available to be substituted in the value:

       $host  The name of the host on which the command is executed.

       $ja_task_id
              The array job task index (0 if not an array job).

       $job_owner
              The user name of the job owner.

       $job_id
              Grid Engine's unique job identification number.

       $job_name
              The name of the job.

       $queue The  cluster  queue  name  of  the  master  queue instance, on which the command is
              started.

       $job_pid
              The process id of the job/task to checkpoint.

       $ckpt_dir
              See ckpt_dir below.

       $ckpt_signal
              See signal below.

       $sge_cell
              The SGE_CELL environment variable (useful for locating files).

       $sge_root
              The SGE_ROOT environment variable (useful for locating files).
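
       The effect of pseudo-variable substitution can be sketched in Python with string.Template,
       whose $name syntax happens to match the variables listed above. This is only an
       approximation under that assumption; how Grid Engine performs the substitution internally
       is not described here. The function name expand, the script path and the sample values
       are hypothetical.

```python
from string import Template

# Pseudo-variable names as listed for ckpt_command.
PSEUDO_VARIABLES = (
    "host", "ja_task_id", "job_owner", "job_id", "job_name", "queue",
    "job_pid", "ckpt_dir", "ckpt_signal", "sge_cell", "sge_root",
)


def expand(command: str, values: dict) -> str:
    """Substitute known pseudo-variables; leave anything else untouched."""
    unknown = set(values) - set(PSEUDO_VARIABLES)
    if unknown:
        raise ValueError(f"not a pseudo-variable: {sorted(unknown)}")
    return Template(command).safe_substitute(values)


# Hypothetical command string and job values.
example = expand(
    "/opt/ckpt/do_ckpt.sh $job_pid $ckpt_dir/$job_id.$ja_task_id",
    {"job_pid": "4711", "ckpt_dir": "/scratch/ckpt", "job_id": "42", "ja_task_id": "0"},
)
```

       With the sample values, example becomes
       "/opt/ckpt/do_ckpt.sh 4711 /scratch/ckpt/42.0".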

   migr_command
       A command-line type command string to be executed by Grid Engine during a migration  of  a
       checkpointing  job  from  one host to another.  The same pseudo-variables are available as
       for ckpt_command.  Note that the command is expected to create a checkpoint itself  -  the
       checkpointing command isn't called automatically on migration.

   restart_command
       A  command-line  type  command  string  to  be  executed  by Grid Engine when restarting a
       previously checkpointed application.  The  same  pseudo-variables  are  available  as  for
       ckpt_command.

   clean_command
       A command-line type command string to be executed by Grid Engine in order to clean up
       after a checkpointed application has finished. The same pseudo-variables are available as
       for ckpt_command.

   ckpt_dir
       A file system location in which checkpoints of potentially considerable size should be
       stored.

   signal
       A Unix signal to be sent to a job by Grid Engine to initiate  checkpoint  generation.  The
       value for this field can either be a symbolic name from the list produced by the -l option
       of the kill(1) command or an integer number which must be a valid signal  on  the  systems
       used for checkpointing.
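
       A value for this field could be checked along the following lines with Python's standard
       signal module. The function name resolve_signal is hypothetical. The sketch accepts names
       with or without the SIG prefix as a convenience, and it only checks against the signals
       known on the machine where it runs, not on every host used for checkpointing.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Map a symbolic name or an integer string to a local signal."""
    text = value.strip()
    if text.isdigit():
        # Raises ValueError if the number is not a known signal here.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

       For example, resolve_signal("USR1") and resolve_signal("SIGUSR1") both resolve to the
       local SIGUSR1, while an unknown name raises ValueError.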

   when
       The points in time at which checkpoints are expected to be generated. Valid values for
       this parameter are composed of the letters s, m, x, r, and any combinations thereof
       without any separating character in between. The same letters are allowed for the -c
       option of the qsub(1) command, which overrides the definitions in the checkpointing
       environment used. The meaning of the letters is as follows:

       s      A  job  is  checkpointed,  aborted  and, if possible, migrated if the corresponding
              sge_execd(8) is shut down on the job's host.  This  operation  is  handled  by  the
              specified migr_command.

       m      checkpoints  are generated periodically at the min_cpu_interval interval defined by
              the queue (see queue_conf(5)) in which a job executes.

       x      A job is checkpointed, aborted and, if possible, migrated as soon as the  job  gets
              suspended  (manually  as  well as automatically).  This operation is handled by the
              specified migr_command.

       r      A job will be rescheduled (not  checkpointed)  when  the  host  on  which  the  job
              currently   runs   goes   into   the   "unknown"   state   and  the  time  interval
              reschedule_unknown  (see  sge_conf(5))  defined   in   the   global/local   cluster
              configuration is exceeded.
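
       The rules for the when value can be summarized in a short Python sketch. The function
       names parse_when and effective_when are hypothetical, and the handling of repeated letters
       (silently collapsed) is an assumption. Only the letters documented above are accepted.

```python
from typing import Optional

# Short descriptions paraphrased from the list above.
WHEN_LETTERS = {
    "s": "checkpoint and migrate when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint and migrate when the job is suspended",
    "r": "reschedule when the host state becomes unknown",
}


def parse_when(value: str) -> frozenset:
    """Validate a when string such as 'sx' and return its letters."""
    if not value:
        raise ValueError("empty when value")
    bad = sorted(set(value) - set(WHEN_LETTERS))
    if bad:
        raise ValueError(f"invalid when letters: {bad}")
    return frozenset(value)


def effective_when(env_when: str, qsub_c: Optional[str] = None) -> frozenset:
    """A -c value given at submission replaces the environment's setting."""
    return parse_when(qsub_c if qsub_c is not None else env_when)
```

       For instance, effective_when("sx") yields the letters s and x, while
       effective_when("sx", "m") yields only m.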

ENVIRONMENT VARIABLES

       SGE_BINDING and SGE_CKPT_DIR may be specified on job submission.  See submit(1).

RESTRICTIONS

       Note that the functionality of any checkpointing, migration or restart procedures provided
       by default with the Grid Engine distribution, as well as the way they are invoked in
       the  ckpt_command, migr_command or restart_command parameters of any default checkpointing
       environments, should  not  be  changed;  otherwise  the  functionality  remains  the  full
       responsibility  of  the  administrator  configuring  the  checkpointing environment.  Grid
       Engine will just invoke these procedures and evaluate their exit status. If the procedures
       do  not  perform  their  tasks  properly,  or  are  not  invoked  in a proper fashion, the
       checkpointing mechanism may behave unexpectedly; Grid Engine has no means to detect this -
       all  exit  codes  are  treated  as  successful  operation  except  for  the case of kernel
       checkpointing.

       See also the restrictions in sge_ckpt(5).

SEE ALSO

       sge_intro(1), sge_ckpt(5), sge_types(5), qconf(1), qmod(1), qsub(1), sge_execd(8).

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.