Provided by: gridengine-common_8.1.9+dfsg-10build1_all


       sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing support


       Grid  Engine supports two levels of checkpointing: the user level and an operating system-
       provided transparent level. User level checkpointing refers to applications which do their
       own  checkpointing  by  writing restart files at certain times or algorithmic steps and by
       properly processing these restart files when restarted.
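       As an illustration of user level checkpointing, the following sketch shows a job that
       records its progress in a restart file and resumes from it when started again. It is not
       taken from Grid Engine; the names RESTART_FILE, TOTAL_STEPS, STOP_AFTER and run_job are
       hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch of user level checkpointing: the job saves its own progress to a
# restart file and resumes from that file when it is started again.
# RESTART_FILE, TOTAL_STEPS and STOP_AFTER are hypothetical names.
RESTART_FILE=${RESTART_FILE:-./restart.dat}
TOTAL_STEPS=${TOTAL_STEPS:-5}

run_job() {
    step=0
    # Resume from the last completed step if a restart file exists.
    if [ -f "$RESTART_FILE" ]; then
        step=$(cat "$RESTART_FILE")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        # Write to a temporary file and rename it, so that an abort during
        # the write does not leave a truncated restart file behind.
        echo "$step" > "$RESTART_FILE.tmp" && mv "$RESTART_FILE.tmp" "$RESTART_FILE"
        # Hypothetical hook that lets the example simulate an abort.
        if [ "$step" = "${STOP_AFTER:-}" ]; then
            return 1
        fi
    done
    return 0
}
```

       A job script built this way would call run_job as its last step. If the job is aborted
       and later restarted, the loop continues after the last step recorded in the restart file
       instead of starting from the beginning.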

       Transparent checkpointing has to be provided by the operating system and is usually
       integrated into the operating system kernel. An example of a kernel-integrated
       checkpointing facility is the Hibernator package from Softway for SGI IRIX platforms.

       Checkpointing jobs need to be identified to the Grid Engine system by using the -ckpt
       option of the qsub(1) command. The argument to this option refers to a so-called
       checkpointing environment, which defines the attributes of the checkpointing method to be
       used (see checkpoint(5) for details). Checkpointing environments are set up with the
       qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c can be used to
       override the when attribute of the referenced checkpointing environment.
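       The submission described above can be sketched as a small wrapper. The function name
       submit_ckpt_job, the QSUB variable, the environment name my_ckpt and the script name
       job.sh are assumptions made for this example; only the -ckpt and -c options come from
       qsub(1).

```shell
# QSUB defaults to the real qsub; set QSUB=echo to print the arguments
# instead of submitting anything (dry run).
QSUB=${QSUB:-qsub}

# submit_ckpt_job CKPT_ENV WHEN SCRIPT [ARGS...]   (hypothetical helper)
submit_ckpt_job() {
    ckpt_env=$1   # name of an existing checkpointing environment
    when=$2       # occasion specifier that overrides the "when" attribute
    shift 2
    "$QSUB" -ckpt "$ckpt_env" -c "$when" "$@"
}
```

       For example, submit_ckpt_job my_ckpt x job.sh would submit job.sh under the checkpointing
       environment my_ckpt and request a checkpoint when the job is suspended. The environment
       itself has to exist beforehand; its definition can be inspected with the qconf(1) option
       -sckpt.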

       Unlike regular batch jobs, checkpointing jobs (see the -ckpt option to qsub(1)) are
       aborted under conditions in which batch or interactive jobs are suspended or even remain
       unaffected. These conditions are:

       •  Explicit  suspension of the queue or job via qmod(1) by the cluster administration or a
          queue owner if the x occasion specifier (see qsub(1) -c and checkpoint(5)) was assigned
          to the job.

       •  A   load   average  value  exceeding  the  suspend  threshold  as  configured  for  the
          corresponding queues (see queue_conf(5)).

       •  Shutdown of the Grid Engine execution daemon sge_execd(8) that is responsible for the
          checkpointing job.

       After they are aborted, jobs will migrate to other hosts, and possibly other cluster
       queues, unless they were submitted to a specific one by an explicit user request. This
       migration of jobs results in dynamic load balancing. Note: Aborting checkpointed jobs
       frees all resources (memory, swap space) which the job occupies at that time. This is in
       contrast to suspended regular jobs, which still use virtual memory and other consumable
       resources.


       When a job migrates to another machine, at present no files are transferred automatically
       to that machine. This means that all files which are used throughout the entire job,
       including restart files, executables, and scratch files, must be visible on that machine
       or transferred explicitly (e.g. at the beginning of the job script).
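       A minimal sketch of such an explicit transfer at the beginning of a job script follows.
       It assumes that the files live in a directory reachable from every execution host; the
       function name stage_in and all paths are hypothetical.

```shell
# stage_in SRC_DIR WORK_DIR FILE...   (hypothetical helper)
# Copies the listed files from SRC_DIR into WORK_DIR on the current host.
stage_in() {
    src_dir=$1
    work_dir=$2
    shift 2
    mkdir -p "$work_dir" || return 1
    for f in "$@"; do
        # Fail early if a required file is missing, rather than letting
        # the job run without its input or restart data.
        cp -p "$src_dir/$f" "$work_dir/$f" || return 1
    done
}
```

       A job script might call, for instance, stage_in /shared/myjob /tmp/myjob input.dat
       restart.dat before starting the application, and copy updated restart files back to the
       shared directory after each checkpoint.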

       There  are  also  some practical limitations regarding use of disk space for transparently
       checkpointing jobs. Checkpoints of a transparently checkpointed  application  are  usually
       stored  in  a  checkpoint file or directory by the operating system. The file or directory
       contains all the text, data, and stack space for the process, along with  some  additional
       control  information.  This  means  jobs which use a very large virtual address space will
       generate very large checkpoint files. Also, the workstations on which the jobs actually
       execute may have little free disk space. Thus it is not always possible to transfer a
       transparent checkpointing job to a machine, even though that machine is idle. Since
       large virtual memory jobs must wait for a machine that is both idle and has a sufficient
       amount of free disk space, such jobs may suffer long turnaround times.
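       To make the disk space constraint concrete, the following sketch checks whether a
       directory has a given amount of free space before a checkpoint is written there. It is
       an illustration only and not part of Grid Engine; the function name has_free_kb is
       hypothetical.

```shell
# has_free_kb DIR NEEDED_KB   (hypothetical helper)
# Succeeds if the filesystem holding DIR has at least NEEDED_KB kilobytes free.
has_free_kb() {
    # df -P selects the portable output format and -k reports 1024-byte
    # units; the fourth column of the second line is the available space.
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

       A restart script could call has_free_kb with the checkpoint directory and the expected
       checkpoint size, and exit with an error instead of starting on a host where the
       checkpoint would not fit.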

       There is currently no mechanism for restarting jobs with the same resources they were
       granted originally. That might be important if they were submitted with a choice or range
       of resources and adapted the way they run to the resources they were actually given.

       Similarly, with heterogeneous execution hosts, jobs may need to restart on a host which
       supports a superset of the instruction set of the host where the job originally ran if
       the checkpoint mechanism (e.g. BLCR or DMTCP) dumps an image of the running process.
       Runtime libraries, in particular, may initialize themselves depending on details of the
       architecture they start up on, for example to use a specific type of vector unit. They
       may then fail if moved to an older host of similar architecture which lacks that feature,
       even if they were compiled for a common instruction set.


       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5)


       See sge_intro(1) for a full statement of rights and permissions.