Provided by: gridengine-common_8.1.9+dfsg-7build1_all

NAME

       checkpoint - Grid Engine checkpointing environment configuration file format

DESCRIPTION

       Checkpointing is a facility to save the complete status of an executing program or job and to restore and
       restart from this so-called checkpoint at a later point in time if the original program or job was
       halted, e.g. through a system crash.

       Grid Engine provides various levels of checkpointing support (see sge_ckpt(5)).  The checkpointing
       environment described here is a means to configure the different types of checkpointing in use  for  your
       Grid  Engine  cluster  or  parts thereof. For that purpose you can define the operations which have to be
       executed in initiating a checkpoint generation, a migration of a checkpoint to another host, or a restart
       of a checkpointed application.

       Supporting different operating systems may force Grid Engine to introduce operating system
       dependencies into the checkpointing configuration, and updates to the supported operating
       system versions may lead to frequently changing implementation details.  Please refer to the
       <sge_root>/ckpt directory for more information.

       Please use the  -ackpt,  -dckpt,  -mckpt  or  -sckpt  options  to  the  qconf(1)  command  to  manipulate
       checkpointing  environments from the command-line or use the corresponding qmon(1) dialogue for X-Windows
       based interactive configuration.

       Note that Grid Engine allows backslashes (\) to be used to escape newline characters.  The backslash
       and the newline are replaced with a space character before any interpretation.
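
       The continuation rule can be illustrated with a minimal Python sketch. The function name and the
       sample entry (including the script path) are hypothetical and only mimic the replacement
       described above; this is not Grid Engine's own parser.

```python
def join_continuations(text: str) -> str:
    """Replace each backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical entry split across two lines; the command path is made up.
raw = "ckpt_command /path/to/checkpoint.sh \\\n    $job_pid $ckpt_dir\n"
joined = join_continuations(raw)
print(joined)
```

       After the replacement the entry occupies a single line, which can then be interpreted as usual.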

FORMAT

       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment in the format for ckpt_name in sge_types(5).  To be used in the
       qsub(1) -ckpt switch or for the qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used. Currently, the following types are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       transparent
              Grid Engine assumes that the jobs submitted with reference to this checkpointing interface  use  a
              checkpointing library such as provided by the free package Condor.

       userdefined
              Grid Engine assumes that the jobs submitted with reference to this checkpointing interface perform
              their private checkpointing method.

       application-level
              Uses all of the interface commands configured in the checkpointing object, as with the
              kernel level checkpointing interfaces (cpr, etc.), except for the restart_command (see
              below).  The restart_command is not used, even if it is configured; the job script is
              invoked on restart instead.

   ckpt_command
       A  command-line type command string to be executed by Grid Engine in order to initiate a checkpoint.  The
       following pseudo-variables are available to be substituted in the value:

       $host  The name of the host on which the command is executed.

       $ja_task_id
              The array job task index (0 if not an array job).

       $job_owner
              The user name of the job owner.

       $job_id
              Grid Engine's unique job identification number.

       $job_name
              The name of the job.

       $queue The cluster queue name of the master queue instance, on which the command is started.

       $job_pid
              The process id of the job/task to checkpoint.

       $ckpt_dir
              See ckpt_dir below.

       $ckpt_signal
              See signal below.

       $sge_cell
              The SGE_CELL environment variable (useful for locating files).

       $sge_root
              The SGE_ROOT environment variable (useful for locating files).
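
       To show how a command string with pseudo-variables might expand, here is a minimal Python sketch.
       It assumes plain textual substitution of $name tokens, which may differ in detail from what Grid
       Engine does; the function name, script path and sample values are hypothetical.

```python
import re

# Pseudo-variable names listed for ckpt_command above.
PSEUDO_VARIABLES = {
    "host", "ja_task_id", "job_owner", "job_id", "job_name", "queue",
    "job_pid", "ckpt_dir", "ckpt_signal", "sge_cell", "sge_root",
}


def expand(command: str, values: dict) -> str:
    """Substitute $name for known pseudo-variables; leave anything else as is."""
    def replace(match):
        name = match.group(1)
        if name in PSEUDO_VARIABLES and name in values:
            return str(values[name])
        return match.group(0)

    return re.sub(r"\$([a-z_]+)", replace, command)


# Hypothetical command and values, for illustration only.
template = "/path/to/checkpoint.sh $job_id.$ja_task_id $job_pid $ckpt_dir"
values = {"job_id": 4711, "ja_task_id": 0, "job_pid": 12345, "ckpt_dir": "/scratch/ckpt"}
expanded = expand(template, values)
print(expanded)
```

       Matching whole names avoids confusing variables that share a prefix, such as $job_id and
       $job_name.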

   migr_command
       A command-line type command string to be executed by Grid Engine during a migration  of  a  checkpointing
       job  from  one  host to another.  The same pseudo-variables are available as for ckpt_command.  Note that
       the command is expected  to  create  a  checkpoint  itself  -  the  checkpointing  command  isn't  called
       automatically on migration.

   restart_command
       A  command-line  type  command  string  to  be  executed  by  Grid  Engine  when  restarting a previously
       checkpointed application.  The same pseudo-variables are available as for ckpt_command.

   clean_command
       A command-line type command string to be executed by Grid Engine in order to clean up after a
       checkpointed application has finished.  The same pseudo-variables are available as for ckpt_command.

   ckpt_dir
       A file system location in which checkpoints of potentially considerable size should be stored.

   signal
       A  Unix  signal  to be sent to a job by Grid Engine to initiate checkpoint generation. The value for this
       field can either be a symbolic name from the list produced by the -l option of the kill(1) command or  an
       integer number which must be a valid signal on the systems used for checkpointing.
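
       A value for this field can be checked ahead of time with a short Python sketch. It relies on
       Python's standard signal module and therefore only tests validity on the machine where it runs,
       which is assumed here to match the checkpointing hosts; the function name is hypothetical.

```python
import signal


def resolve_signal(value: str) -> signal.Signals:
    """Accept a symbolic name (with or without the SIG prefix) or an integer."""
    text = value.strip()
    try:
        if text.isdigit():
            return signal.Signals(int(text))
        name = text.upper()
        if not name.startswith("SIG"):
            name = "SIG" + name
        return signal.Signals[name]
    except (KeyError, ValueError):
        raise ValueError(f"not a valid signal on this system: {value!r}") from None


print(resolve_signal("USR1").name)
```

       Signal numbers differ between operating systems, so a symbolic name is usually the more portable
       choice in a mixed cluster.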

   when
       The points in time when checkpoints are expected to be generated.  Valid values for this parameter
       are composed of the letters s, m, x and r, in any combination and without any separating character
       in between.  The same letters are allowed for the -c option of the qsub(1) command, which overrides
       the definitions in the checkpointing environment used.  The meaning of the letters is as follows:

       s      A job is checkpointed, aborted and, if possible, migrated if  the  corresponding  sge_execd(8)  is
              shut down on the job's host.  This operation is handled by the specified migr_command.

       m      checkpoints  are generated periodically at the min_cpu_interval interval defined by the queue (see
              queue_conf(5)) in which a job executes.

       x      A job is checkpointed, aborted and, if possible, migrated  as  soon  as  the  job  gets  suspended
              (manually as well as automatically).  This operation is handled by the specified migr_command.

       r      A  job  will  be rescheduled (not checkpointed) when the host on which the job currently runs goes
              into the "unknown" state and the time interval reschedule_unknown (see sge_conf(5)) defined in the
              global/local cluster configuration is exceeded.
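
       The letter rules for the when parameter can be expressed as a small validator. This is a minimal
       Python sketch based only on the description above; the names are hypothetical, and collapsing
       repeated letters is an assumption, since the text does not say how repeats are treated.

```python
WHEN_LETTERS = {
    "s": "sge_execd on the job's host is shut down",
    "m": "periodically, at the queue's min_cpu_interval",
    "x": "the job is suspended",
    "r": "reschedule when the host stays in the unknown state",
}


def parse_when(value: str) -> list:
    """Return the descriptions selected by a `when` string, or raise ValueError."""
    if not value:
        raise ValueError("empty when value")
    unknown = sorted(set(value) - set(WHEN_LETTERS))
    if unknown:
        raise ValueError(f"invalid letters in when value: {''.join(unknown)}")
    # Repeated letters are collapsed (an assumption, see the lead-in).
    return [WHEN_LETTERS[c] for c in dict.fromkeys(value)]


print(parse_when("sx"))
```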

ENVIRONMENT VARIABLES

       SGE_BINDING and SGE_CKPT_DIR may be specified on job submission.  See submit(1).

RESTRICTIONS

       Note that the checkpointing, migration and restart procedures provided by default with the Grid
       Engine distribution, and the way they are invoked in the ckpt_command, migr_command or
       restart_command parameters of any default checkpointing environments, should not be changed;
       otherwise their functionality becomes the full responsibility of the administrator configuring the
       checkpointing environment.  Grid Engine just invokes these procedures and evaluates their exit
       status.  If the procedures do not perform their tasks properly, or are not invoked properly, the
       checkpointing mechanism may behave unexpectedly.  Grid Engine has no means to detect this: except
       in the case of kernel checkpointing, all exit codes are treated as indicating success.

       See also the restrictions in sge_ckpt(5).

SEE ALSO

       sge_intro(1), sge_ckpt(5), sge_types(5), qconf(1), qmod(1), qsub(1), sge_execd(8).

       See sge_intro(1) for a full statement of rights and permissions.