Provided by: gridengine-common_6.2u5-7.4_all bug

NAME

       sge_ckpt.1 - the Sun Grid Engine checkpointing mechanism and checkpointing support

DESCRIPTION

       Sun  Grid  Engine  supports  two  levels of checkpointing: the user level and a operating system provided
       transparent level. User level checkpointing refers to applications, which do their own  checkpointing  by
       writing  restart  files  at  certain  times or algorithmic steps and by properly processing these restart
       files when restarted.

       Transparent checkpointing has to be provided by the operating system and is  usually  integrated  in  the
       operating  system  kernel.  An  example  for a kernel integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs need to be identified to the Sun Grid Engine system by using the -ckpt option  of  the
       qsub1() command. The argument to this flag refers to a so called checkpointing environment, which defines
       the  attributes  of  the  checkpointing method to be used (see checkpoint5() for details).  Checkpointing
       environments are setup by the qconf1() options -ackpt, -dckpt, -mckpt and -sckpt. The qsub1()  option  -c
       can be used to overwrite the when attribute for the referenced checkpointing environment.

       If  a  queue is of the type CHECKPOINTING, jobs need to have the checkpointing attribute flagged (see the
       -ckpt option to qsub1()) to be permitted to run in such a queue. As opposed to the behavior  for  regular
       batch  jobs,  checkpointing  jobs  are  aborted under conditions, for which batch or interactive jobs are
       suspended or even stay unaffected. These conditions are:

       •  Explicit suspension of the queue or job via qmod1() by the cluster administration or a queue owner  if
          the x occasion specifier (see qsub1() -c and checkpoint5()) was assigned to the job.

       •  A  load  average value exceeding the suspend threshold as configured for the corresponding queues (see
          queue_conf5().)

       •  Shutdown of the Sun Grid Engine execution daemon sge_execd8() being responsible for the  checkpointing
          job.

       After abortion, the jobs will migrate to other queues unless they were submitted to one specific queue by
       an  explicit  user request.  The migration of jobs leads to a dynamic load balancing.  Note: The abortion
       of checkpointed jobs will free all resources (memory, swap space) which the job occupies  at  that  time.
       This is opposed to the situation for suspended regular jobs, which still cover swap space.

RESTRICTIONS

       When  a  job  migrates to a queue on another machine at present no files are transferred automatically to
       that machine. This means that all files which are used throughout the entire job including restart files,
       executables and scratch files must be visible or transferred explicitly (e.g. at the beginning of the job
       script).

       There are also some practical limitations regarding use of disk  space  for  transparently  checkpointing
       jobs.  Checkpoints of a transparently checkpointed application are usually stored in a checkpoint file or
       directory by the operating system. The file or directory contains all the text, data, and stack space for
       the process, along with some additional control information. This means  jobs  which  use  a  very  large
       virtual  address space will generate very large checkpoint files. Also the workstations on which the jobs
       will actually execute may have little free disk space. Thus it is  not  always  possible  to  transfer  a
       transparent  checkpointing job to a machine, even though that machine is idle. Since large virtual memory
       jobs must wait for a machine that is both idle, and has a sufficient amount of free disk space, such jobs
       may suffer long turnaround times.

SEE ALSO

       sge_intro1(,) qconf1(,) qmod1(,) qsub1(,) checkpoint5(,) Sun Grid Engine Installation and  Administration
       Guide, Sun Grid Engine User's Guide

COPYRIGHT

       See sge_intro1() for a full statement of rights and permissions.

SGE 6.2u5                                            $Date$                                          SGE_CKPT(1)