
Provided by: gridengine-exec_8.1.9+dfsg-11.1_amd64

NAME

       sge_shepherd - Grid Engine single job-controlling agent

SYNOPSIS

       sge_shepherd

DESCRIPTION

       sge_shepherd  provides the parent process functionality for a single Grid Engine job.  The
       parent functionality is necessary on UNIX systems to retrieve resource  usage  information
       (see getrusage(2)) after a job has finished. In addition, sge_shepherd forwards
       signals to the job, such as those for suspension, enabling, termination, and the Grid
       Engine checkpointing signal (see sge_ckpt(1) and queue_conf(5) for details).
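
       The parent-process role described above can be illustrated with a minimal Python
       sketch. It is not how sge_shepherd is implemented; it only shows that a parent which
       waits for its child can read the child's accumulated resource usage afterwards. The
       helper name run_and_measure is hypothetical.

```python
import resource
import subprocess
import sys


def run_and_measure(argv):
    """Run argv as a child process, wait for it, and return
    (exit_status, cpu_seconds) as seen by the parent."""
    # RUSAGE_CHILDREN covers children that have terminated and been waited for.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    completed = subprocess.run(argv)  # blocks until the child has finished
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return completed.returncode, cpu


if __name__ == "__main__":
    status, cpu = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"exit status {status}, child CPU time {cpu:.3f}s")
```

       The usage figures only become available once the child has been waited for, which is
       why a dedicated parent process per job is needed.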

       The sge_shepherd receives information about the job to be started from sge_execd(8).
       During the execution of the job it starts up to five child processes. First a prolog
       script  is  run  if  this  feature  is  enabled  by  the  prolog  parameter in the cluster
       configuration. (See sge_conf(5).)  Next a parallel environment startup procedure is run if
       the  job  is  a  parallel  job. (See sge_pe(5) for more information.)  After that, the job
       itself is run, followed by a parallel environment shutdown procedure  for  parallel  jobs,
       and  finally  an  epilog  script  if  requested  by  the  epilog  parameter in the cluster
       configuration. The prolog and epilog scripts, as well as the parallel environment  startup
       and  shutdown  procedures,  are  to  be  provided by the Grid Engine administrator and are
       intended for site-specific actions to be taken before and after execution  of  the  actual
       user job.
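
       The order of the child processes described above can be summarised in a small Python
       sketch. The function and parameter names are hypothetical and serve only to
       illustrate the sequence; they do not correspond to any Grid Engine interface.

```python
def planned_methods(prolog_enabled, is_parallel, epilog_enabled):
    """Return, in order, the child processes started for one job."""
    steps = []
    if prolog_enabled:
        steps.append("prolog")
    if is_parallel:
        steps.append("parallel start")
    steps.append("job")  # the job itself is always run
    if is_parallel:
        steps.append("parallel stop")
    if epilog_enabled:
        steps.append("epilog")
    return steps


if __name__ == "__main__":
    print(planned_methods(prolog_enabled=True, is_parallel=True, epilog_enabled=True))
```

       With all three options enabled the list has five entries, matching the maximum of
       five child processes; a serial job without prolog or epilog yields only the job.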

       After  the  job  has  finished  and the epilog script is processed, sge_shepherd retrieves
       resource usage statistics about the job, places them in a job-specific subdirectory of the
       sge_execd(8) spool directory for reporting through sge_execd(8), and finishes.

       sge_shepherd  also places an exit status file in the spool directory. This exit status can
       be viewed with qacct -j JobId (see qacct(1)); it is not the exit  status  of  sge_shepherd
       itself  but  of  one  of  the methods executed by sge_shepherd.  This exit status can have
       several meanings, depending on the method in  which  an  error  occurred  (if  any).   The
       possible  methods  are:  prolog,  parallel  start,  job,  parallel  stop, epilog, suspend,
       restart, terminate, clean, migrate, and checkpoint.

       The following exit values are returned:

       0      All methods: Operation was executed successfully.

       99     Job  script,  prolog  and  epilog:  When  FORBID_RESCHEDULE  is  not  set  in   the
              configuration (see sge_conf(5)), the job gets re-queued.  Otherwise see "Other".

       100    Job script, prolog and epilog: When FORBID_APPERROR is not set in the configuration
              (see sge_conf(5)), the job gets re-queued.  Otherwise see "Other".

       Other  Job script: This is the exit status of the job itself. No action is taken upon this
              exit status because the meaning of this exit status is not known.
              Prolog,  epilog  and parallel start: The queue is set to error state and the job is
              re-queued.
              Parallel stop: The queue is set to error state, but the job is not re-queued. It is
              assumed that the job itself ran successfully and only the cleanup script failed.
              Suspend, restart, terminate, clean, and migrate: Always successful.
              Checkpoint: Success, except with kernel checkpointing, where it means that the
              checkpoint was not successful and did not happen (but migration will happen).
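
       The exit value rules above can be restated as a lookup, sketched here in Python. This
       is only a reading aid under the assumption that the list is complete; the function
       name, the returned strings, and the boolean parameters standing in for
       FORBID_RESCHEDULE and FORBID_APPERROR are all hypothetical.

```python
ALWAYS_SUCCESSFUL = ("suspend", "restart", "terminate", "clean", "migrate")


def interpret_exit_status(method, status,
                          forbid_reschedule=False, forbid_apperror=False):
    """Map a method name and its exit status to the documented outcome."""
    if status == 0:
        return "success"
    if method in ("job", "prolog", "epilog"):
        if status == 99 and not forbid_reschedule:
            return "job re-queued"
        if status == 100 and not forbid_apperror:
            return "job re-queued"
    # Everything below corresponds to the "Other" entry.
    if method == "job":
        return "no action (exit status of the job itself)"
    if method in ("prolog", "epilog", "parallel start"):
        return "queue set to error state, job re-queued"
    if method == "parallel stop":
        return "queue set to error state, job not re-queued"
    if method in ALWAYS_SUCCESSFUL:
        return "success"
    if method == "checkpoint":
        return "success (with kernel checkpointing: checkpoint did not happen)"
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    print(interpret_exit_status("job", 99))
    print(interpret_exit_status("job", 99, forbid_reschedule=True))
```

       Note that 99 and 100 fall through to the "Other" handling when the corresponding
       FORBID_ parameter is set, as the entries for those values state.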

       For the meaning of the return codes of the  shepherd  itself  (which  are  interpreted  by
       qacct(1)) see sge_status(5).

RESTRICTIONS

       sge_shepherd should not be invoked manually, but only by sge_execd(8).

ENVIRONMENT VARIABLES

       SGE_ROOT       Specifies the location of the Grid Engine standard configuration files.

       SGE_CELL       If  set,  specifies  the default Grid Engine cell. To address a Grid Engine
                      cell sge_execd uses (in the order of precedence):

                             The name of the cell specified in the environment variable SGE_CELL,
                             if it is set.

                             The name of the default cell, i.e. default.
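
                      The precedence above amounts to a single lookup with a fallback, as in
                      this Python sketch. The function name resolve_cell is hypothetical, and
                      an empty SGE_CELL value is treated as set here, which is an assumption.

```python
import os


def resolve_cell(environ=None):
    """Return the cell name: SGE_CELL if it is set, otherwise "default"."""
    env = os.environ if environ is None else environ
    return env.get("SGE_CELL", "default")


if __name__ == "__main__":
    print(resolve_cell())
```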

       SGE_ENABLE_COREDUMP
                      If  set, enable core dumps on Linux when the admin_user is not root.  Linux
                      normally disables core dumps when  the  daemon  has  changed  uid  or  gid.
                      Setting  SGE_ENABLE_COREDUMP  in  sge_execd's  environment  defeats that to
                      enable core dumps for debugging if they are  otherwise  allowed.   This  is
                      typically  not  a big hazard with SGE, since most information is exposed in
                      the spool area anyhow.  Dumps will appear in the qmaster  spool  directory,
                      which need not be world-readable.
                      On Solaris, coreadm(1) may be used to enable such dumps.

       SGE_CGROUP_DIR If Linux cgroups handling is enabled, this variable names a directory under
                      the cgroup mount point in which to create  job-specific  directories.   The
                      default is sge.SGE_CELL so, for instance, the cpuset cgroup for a job might
                      be /sys/fs/cgroup/cpuset/sge.default/123.
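
                      The example path above can be reproduced with a short Python sketch.
                      The function name and its parameters are hypothetical, and the mount
                      point /sys/fs/cgroup is taken from the example rather than from any
                      guaranteed default.

```python
import posixpath


def job_cgroup_path(job_id, controller="cpuset", cell="default",
                    cgroup_dir=None, mount_point="/sys/fs/cgroup"):
    """Build the job-specific cgroup directory for one controller.

    cgroup_dir plays the role of SGE_CGROUP_DIR; when it is not given,
    the documented default sge.<cell> is used.
    """
    directory = cgroup_dir if cgroup_dir is not None else f"sge.{cell}"
    return posixpath.join(mount_point, controller, directory, str(job_id))


if __name__ == "__main__":
    print(job_cgroup_path(123))
```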

FILES

       <execd_spool>/job_dir/<job_id>     job specific directory
       <sge_root>/<cell>/common/sgepasswd
                                          Password information used on Microsoft Windows hosts.
                                          See sgepasswd(5).

       sgepasswd contains a list of user names and their corresponding encrypted passwords. If
       available, the password file will be used by sge_shepherd. To change the contents of this
       file please use the sgepasswd command. It is not advised to change that file manually.

SEE ALSO

       sge_intro(1), sge_conf(5), sge_status(5), remote_startup(5), sgepasswd(5), sge_execd(8).

       See sge_intro(1) for a full statement of rights and permissions.