Provided by: gridengine-common_6.2u5-4_all

NAME

       sge_ckpt.1   -   the   Sun  Grid  Engine  checkpointing  mechanism  and
       checkpointing support

DESCRIPTION

       Sun Grid Engine supports two levels of checkpointing: the user level
       and a transparent level provided by the operating system. User-level
       checkpointing refers to applications that do their own checkpointing
       by writing restart files at certain times or algorithmic steps and by
       properly processing these restart files when restarted.
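
       As an illustration of user-level checkpointing, the following is a
       minimal sketch, not part of Sun Grid Engine. The function names, the
       JSON restart file format and the stop_after parameter (which only
       simulates an abort) are assumptions made for this example.

```python
import json
import os
import tempfile


def save_state(restart_path, state):
    # Write to a temporary file first, then rename it into place, so an
    # abort during the write never leaves a half-written restart file.
    directory = os.path.dirname(os.path.abspath(restart_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, restart_path)


def run_with_restart(restart_path, total_steps, checkpoint_every, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing along the way."""
    # On restart, continue from the last saved state instead of step 0.
    if os.path.exists(restart_path):
        with open(restart_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # hypothetical hook: simulate the job being aborted
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        # Write a restart file at fixed algorithmic steps.
        if state["step"] % checkpoint_every == 0:
            save_state(restart_path, state)

    save_state(restart_path, state)
    return state["total"]
```

       Work done after the last restart file was written is repeated when
       the job is restarted, so the checkpoint interval trades restart
       cost against the overhead of writing restart files.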

       Transparent checkpointing has to be provided by the operating system
       and is usually integrated in the operating system kernel. An example
       of a kernel-integrated checkpointing facility is the Hibernator
       package from Softway for SGI IRIX platforms.

       Checkpointing jobs need to be identified to the Sun Grid Engine system
       by using the -ckpt option of the qsub(1) command. The argument to this
       option refers to a so-called checkpointing environment, which defines
       the attributes of the checkpointing method to be used (see
       checkpoint(5) for details). Checkpointing environments are set up with
       the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1)
       option -c can be used to override the when attribute of the
       referenced checkpointing environment.
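
       The submission options described above can be sketched as a small
       helper that assembles a qsub(1) argument list. This is only an
       illustration: the helper name, the environment name "my_ckpt_env"
       and the script name "job.sh" are hypothetical, and the command is
       built but not executed.

```python
def build_qsub_command(script, ckpt_env, occasion=None):
    """Return an argument list for submitting a checkpointing job."""
    # -ckpt names the checkpointing environment the job refers to.
    argv = ["qsub", "-ckpt", ckpt_env]
    # -c optionally overrides the environment's when attribute.
    if occasion is not None:
        argv += ["-c", occasion]
    argv.append(script)
    return argv


# Hypothetical names; "x" is the occasion specifier for suspension.
command = build_qsub_command("job.sh", "my_ckpt_env", occasion="x")
```

       The resulting list could be passed to subprocess.run() on a host
       where qsub(1) is installed and the named environment exists.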

       If a queue is of the type CHECKPOINTING, jobs need to have the
       checkpointing attribute flagged (see the -ckpt option to qsub(1)) to
       be permitted to run in such a queue. Unlike regular batch jobs,
       checkpointing jobs are aborted under conditions in which batch or
       interactive jobs are suspended or even remain unaffected. These
       conditions are:

       o  Explicit suspension of the queue or job via qmod(1) by the cluster
          administration or a queue owner, if the x occasion specifier (see
          qsub(1) -c and checkpoint(5)) was assigned to the job.

       o  A load average value exceeding the suspend threshold configured
          for the corresponding queues (see queue_conf(5)).

       o  Shutdown of the Sun Grid Engine execution daemon sge_execd(8) that
          is responsible for the checkpointing job.

       After being aborted, the jobs migrate to other queues unless they
       were submitted to one specific queue by an explicit user request.
       The migration of jobs leads to dynamic load balancing. Note:
       Aborting a checkpointing job frees all resources (memory, swap
       space) that the job occupies at that time. This is in contrast to
       suspended regular jobs, which still occupy swap space.

RESTRICTIONS

       When a job migrates to a queue on another machine, at present no
       files are transferred automatically to that machine. This means that
       all files used at any point in the job, including restart files,
       executables and scratch files, must be visible on that machine or
       transferred explicitly (e.g. at the beginning of the job script).
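
       One way to catch missing files early is a check at the start of the
       job. The following is a minimal sketch under the assumption that
       the job knows the paths it needs; the function name and the example
       paths are hypothetical.

```python
import os


def missing_files(paths):
    """Return the paths that are not visible on the current machine."""
    return [path for path in paths if not os.path.exists(path)]


# Hypothetical file names for a job's restart file and executable.
required = ["restart.json", "bin/solver"]
not_visible = missing_files(required)
# A job script could stop here, or transfer the files in not_visible
# explicitly, before starting the actual computation.
```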

       There are also some practical limitations regarding the use of disk
       space for transparently checkpointed jobs. Checkpoints of a
       transparently checkpointed application are usually stored in a
       checkpoint file or directory by the operating system. The file or
       directory contains all the text, data, and stack space for the
       process, along with some additional control information. This means
       that jobs which use a very large virtual address space will generate
       very large checkpoint files. In addition, the workstations on which
       the jobs actually execute may have little free disk space. Thus it is
       not always possible to transfer a transparent checkpointing job to a
       machine, even though that machine is idle. Since large virtual memory
       jobs must wait for a machine that is both idle and has a sufficient
       amount of free disk space, such jobs may suffer long turnaround
       times.
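
       The disk space constraint can be illustrated with a small check,
       given here as a sketch only. It assumes that the expected checkpoint
       size is already known (for example, estimated from the virtual
       address space of the job); the function name and its parameters are
       hypothetical and are not part of Sun Grid Engine.

```python
import shutil


def has_room_for_checkpoint(directory, expected_bytes):
    """Report whether directory has at least expected_bytes free."""
    # shutil.disk_usage returns total, used and free space in bytes for
    # the file system that contains the given directory.
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= expected_bytes
```

       A machine for which this check fails for the checkpoint directory
       cannot accept the job, even if it is otherwise idle.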

SEE ALSO

       sge_intro(1), qconf(1), qmod(1), qsub(1), checkpoint(5), Sun Grid
       Engine Installation and Administration Guide, Sun Grid Engine User's
       Guide

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.