Ubuntu Manpage: sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing support

Provided by: gridengine-common_8.1.9+dfsg-11build3_all

NAME

       sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing support

DESCRIPTION

Grid Engine supports two levels of checkpointing: the user level and an operating system-provided
transparent level. User level checkpointing refers to applications which do their own checkpointing by
writing restart files at certain times or algorithmic steps and by properly processing these restart
files when restarted.
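
The following is a minimal sketch of what user level checkpointing can look like in a job script, assuming the only state worth saving is a step counter. The file name restart.dat and the functions load_step, save_step and run_steps are illustrative and not part of Grid Engine.

```shell
#!/bin/sh
# Sketch of user level checkpointing; the step counter is the only state.
# RESTART_FILE, load_step, save_step and run_steps are illustrative names.
RESTART_FILE="${RESTART_FILE:-./restart.dat}"

# Print the last completed step, or 0 when no restart file exists.
load_step() {
    if [ -f "$RESTART_FILE" ]; then
        cat "$RESTART_FILE"
    else
        echo 0
    fi
}

# Record a completed step. Writing to a temporary file and renaming it
# avoids leaving a truncated restart file if the job is aborted mid-write.
save_step() {
    echo "$1" > "$RESTART_FILE.tmp" && mv "$RESTART_FILE.tmp" "$RESTART_FILE"
}

# Run steps up to $1, resuming after the last step recorded on disk.
run_steps() {
    total=$1
    step=$(load_step)
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one algorithmic step of real work would go here ...
        save_step "$step"
    done
    echo "finished at step $step"
}
```

A job script built on this sketch would end with a call such as run_steps 100. If the job is aborted and restarted, the same call continues after the last step recorded in the restart file.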

Transparent checkpointing has to be provided by the operating system and is usually integrated into the
operating system kernel. An example of a kernel-integrated checkpointing facility is the Hibernator
package from Softway for SGI IRIX platforms.

Checkpointing jobs need to be identified to the Grid Engine system by using the -ckpt option of the
qsub(1) command. The argument to this flag refers to a so-called checkpointing environment, which defines
the attributes of the checkpointing method to be used (see checkpoint(5) for details). Checkpointing
environments are set up with the qconf(1) options -ackpt, -dckpt, -mckpt and -sckpt. The qsub(1) option -c
can be used to override the when attribute of the referenced checkpointing environment.
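
As a rough illustration of the options named above, the sketch below composes a qsub(1) command line for a checkpointing job and prints it rather than running it, so it can be tried without a cluster. The environment name my_ckpt, the script name job.sh and the helper ckpt_submit_cmd are placeholders invented for this example.

```shell
#!/bin/sh
# Sketch: compose a qsub command line for a checkpointing job.
# "my_ckpt" and "job.sh" are placeholder names; ckpt_submit_cmd is a
# hypothetical helper, not part of Grid Engine.
ckpt_submit_cmd() {
    env_name=$1   # checkpointing environment, e.g. one added with qconf -ackpt
    when=$2       # occasion specifier for -c; empty keeps the environment's "when"
    script=$3
    if [ -n "$when" ]; then
        echo "qsub -ckpt $env_name -c $when $script"
    else
        echo "qsub -ckpt $env_name $script"
    fi
}

# On a cluster the equivalent commands would be typed directly, e.g.:
#   qconf -sckpt my_ckpt            # show the environment's attributes
#   qsub -ckpt my_ckpt -c x job.sh  # override "when" with the x specifier
```

Calling ckpt_submit_cmd my_ckpt x job.sh prints the second command shown in the comments. Passing an empty second argument omits -c, so the when attribute of the environment applies.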

Unlike regular batch jobs, checkpointing jobs (see the -ckpt option to qsub(1)) are aborted under
conditions in which batch or interactive jobs are merely suspended or remain unaffected.
These conditions are:

• Explicit suspension of the queue or job via qmod(1) by the cluster administration or a queue owner if
the x occasion specifier (see qsub(1) -c and checkpoint(5)) was assigned to the job.

• A load average value exceeding the suspend threshold as configured for the corresponding queues (see
queue_conf(5)).

• Shutdown of the Grid Engine execution daemon sge_execd(8) responsible for the checkpointing job.

After they are aborted, jobs will migrate to other hosts, and possibly other cluster queues, unless they
were submitted to a specific one by an explicit user request. This migration of jobs results in dynamic
load balancing. Note: Aborting checkpointed jobs will free all resources (memory, swap space) which the
job occupies at that time. In contrast, suspended regular jobs still use virtual memory and other
consumable resources.

RESTRICTIONS

When a job migrates to another machine, at present no files are transferred automatically to that
machine. This means that all files which are used throughout the entire job, including restart files,
executables, and scratch files, must be visible or transferred explicitly (e.g. at the beginning of the
job script).
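
One way to handle this at the beginning of a job script is sketched below, assuming the needed files live in a directory that is visible from every execution host. The function stage_files and its directory arguments are illustrative and not something Grid Engine provides.

```shell
#!/bin/sh
# Sketch: make sure every file the job needs is present on the current host.
# stage_files and its directory arguments are illustrative names.
stage_files() {
    src_dir=$1      # directory visible from all hosts (assumption)
    work_dir=$2     # local working directory on this host
    shift 2
    mkdir -p "$work_dir" || return 1
    for name in "$@"; do
        # Keep files that are already present, e.g. on a shared filesystem.
        if [ -e "$work_dir/$name" ]; then
            continue
        fi
        if [ ! -e "$src_dir/$name" ]; then
            echo "missing required file: $name" >&2
            return 1
        fi
        cp -p "$src_dir/$name" "$work_dir/$name" || return 1
    done
}
```

A job script might call stage_files with the shared directory, a scratch directory and the names of its restart files and executables, and exit early if the call fails.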

There are also some practical limitations regarding use of disk space for transparently checkpointing
jobs. Checkpoints of a transparently checkpointed application are usually stored in a checkpoint file or
directory by the operating system. The file or directory contains all the text, data, and stack space for
the process, along with some additional control information. This means jobs which use a very large
virtual address space will generate very large checkpoint files. Also, the workstations on which the jobs
will actually execute may have little free disk space. Thus it is not always possible to transfer a
transparent checkpointing job to a machine, even though that machine is idle. Since large virtual memory
jobs must wait for a machine that is both idle and has a sufficient amount of free disk space, such jobs
may suffer long turnaround times.
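
The disk space constraint can be checked by hand with standard tools. The sketch below uses df(1) in its portable output format to test whether a directory has a given amount of free space. The helpers free_kb and has_space are illustrative, and nothing here is performed by Grid Engine itself.

```shell
#!/bin/sh
# Sketch: check free disk space before relying on a checkpoint directory.
# free_kb and has_space are illustrative helpers, not Grid Engine commands.

# Print the free space, in kilobytes, of the filesystem holding directory $1.
# With -P the "Available" value is the fourth field of the second line.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 kilobytes free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}
```

The required size has to be estimated by the user, for instance from the virtual memory size of the process, since the checkpoint holds its text, data and stack space.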

There is currently no mechanism for restarting jobs with the same resources they were granted originally.
That may matter for jobs which were submitted with a choice or range of resources and which configure
themselves at startup according to what they were given.
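
Since no such mechanism exists, a job can at least detect the situation itself. The sketch below records the slot count seen at the first start and compares it after a restart. It relies on the NSLOTS variable that Grid Engine sets in the job environment. The record file and the function check_slots are illustrative.

```shell
#!/bin/sh
# Sketch: remember how many slots the job first ran with and detect a
# change after a restart. NSLOTS is set by Grid Engine for running jobs;
# the record file and the function name are illustrative.
check_slots() {
    record=$1
    current=${NSLOTS:-1}
    if [ ! -f "$record" ]; then
        echo "$current" > "$record"     # first start: record the grant
        return 0
    fi
    saved=$(cat "$record")
    if [ "$saved" != "$current" ]; then
        echo "slot count changed: was $saved, now $current" >&2
        return 1
    fi
}
```

What to do when the check fails is up to the application, for example refusing to resume or repartitioning its work.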

Similarly, with heterogeneous execution hosts, jobs may need to restart on a host which supports a
superset of the instruction set where the job originally ran if the checkpoint mechanism (e.g. BLCR or
DMTCP) dumps an image of the running process. Runtime libraries, in particular, may initialize
themselves depending on details of the architecture they start up on, for example to use a specific type
of vector unit. They may then fail if moved to an older host of similar architecture which lacks that
feature, even if they were compiled for a common instruction set.
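
A job can guard against this by comparing CPU features before resuming. The sketch below is Linux-specific and assumes /proc/cpuinfo contains a flags line, which is not the case on every architecture. The functions cpu_flags and flags_subset are illustrative.

```shell
#!/bin/sh
# Sketch: check that every CPU feature seen on the original host is also
# present on the host where the job restarts. Linux-specific: it assumes
# /proc/cpuinfo has a "flags" line. All function names are illustrative.

# Print the feature flags of the current host, space separated.
cpu_flags() {
    sed -n 's/^flags[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo | head -n 1
}

# Succeed if every word in $1 also appears in $2.
flags_subset() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;
            *) echo "missing CPU feature: $flag" >&2; return 1 ;;
        esac
    done
}
```

At first start the job would save the output of cpu_flags next to its restart files. On restart it would pass the saved list and the current output of cpu_flags to flags_subset and decline to resume if the check fails.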

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.

SGE 8.1.3pre                                       2012-09-18                                        SGE_CKPT(5)
