Provided by: gridengine-common_8.1.9+dfsg-10build1_all

NAME

       sge_pe - Grid Engine parallel environment configuration file format

DESCRIPTION

       Parallel  environments  are  parallel  programming and runtime environments supporting the
       execution of shared memory  or  distributed  memory  parallelized  applications.  Parallel
       environments usually require some kind of setup to be operational before starting parallel
       applications.  Examples of common  parallel  environments  are  OpenMP  on  shared  memory
       multiprocessor   systems,  and  Message  Passing  Interface  (MPI)  on  shared  memory  or
       distributed systems.

       sge_pe allows for the definition of interfaces to arbitrary parallel environments.  Once a
       parallel  environment  is  defined or modified with the -ap or -mp options to qconf(1) and
       linked with one or more queues  via  pe_list  in  queue_conf(5),  the  environment  can  be
       requested  for  a  job via the -pe switch to qsub(1) together with a request for a numeric
       range of parallel processes to be allocated by the job. Additional -l options may be  used
       to specify more detailed job requirements.

       Note that Grid Engine allows backslashes (\) to be used to  escape  newline  characters.  The
       backslash and the newline are replaced with a space character before any interpretation.
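
       The continuation rule above can be sketched in a few lines of Python. This is only an
       illustration of the stated behaviour, not Grid Engine's own parser; the function name
       join_continuations and the sample access list names are made up for the example.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space.

    Mirrors the rule stated above; a backslash that is not directly
    followed by a newline is left untouched.
    """
    return text.replace("\\\n", " ")


# Hypothetical configuration value split over two physical lines.
raw = "user_lists    arc_users,\\\nhpc_users"
print(join_continuations(raw))
```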

FORMAT

The format of a sge_pe file is defined as follows:

pe_name
The name of the parallel environment in the format for pe_name in sge_types(1). To be
used in the qsub(1) -pe switch.

slots
The total number of slots (normally one per parallel process or thread) allowed to be
filled concurrently under the parallel environment. Type is integer, valid values are 0
to 9999999.

user_lists
xuser_lists
A comma-separated list of user access list names (see access_list(5)).

Each user contained in at least one of the user_lists access lists has access to the
parallel environment. If the user_lists parameter is set to NONE (the default) any user
has access if not explicitly excluded via the xuser_lists parameter.

Each user contained in at least one of the xuser_lists access lists is not allowed to
access the parallel environment. If the xuser_lists parameter is set to NONE (the default)
any user has access.

If a user is contained both in an access list in xuser_lists and user_lists the user is
denied access to the parallel environment.
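
The access rules above amount to a short decision procedure, sketched here in Python. The
function name, the acls dictionary and the sample users are hypothetical; the sketch only
handles plain user names and represents a NONE setting as an empty list.

```python
def has_pe_access(user, user_lists, xuser_lists, acls):
    """Decide PE access following the user_lists / xuser_lists rules.

    acls maps an access list name to a set of user names (an assumed
    in-memory representation). An empty list stands for NONE.
    """
    # Membership in any xuser_lists entry always denies access,
    # even if the user also appears in user_lists.
    if any(user in acls.get(name, set()) for name in xuser_lists):
        return False
    # user_lists set to NONE: anyone not excluded above has access.
    if not user_lists:
        return True
    # Otherwise the user must be in at least one of the user_lists.
    return any(user in acls.get(name, set()) for name in user_lists)


# Made-up access lists for illustration.
acls = {"mpi_users": {"alice", "bob"}, "banned": {"bob"}}
print(has_pe_access("alice", ["mpi_users"], ["banned"], acls))  # True
print(has_pe_access("bob", ["mpi_users"], ["banned"], acls))    # False
```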

start_proc_args
stop_proc_args
The command line of the startup or shutdown procedure, respectively (an executable command,
plus possible arguments) for the parallel environment, or "none" for no procedure
(typically for tightly integrated PEs). The command line is started directly, not in a
shell. An optional prefix "user@" specifies the username under which the procedure is to
be started. In that case, see the SECURITY section below concerning the security issues of
running as a privileged user.

The startup procedure is invoked by sge_shepherd(8) on the master node of the job prior to
executing the job script. Its purpose is to set up the parallel environment according to
its needs. The shutdown procedure is invoked by sge_shepherd(8) after the job script has
finished. Its purpose is to stop the parallel environment and to remove it from all
participating systems. The standard output of the procedure is redirected to the file
REQUEST.poJID in the job's working directory (see qsub(1)), with REQUEST being the name of
the job as displayed by qstat(1), and JID being the job's identification number.
Likewise, the standard error output is redirected to REQUEST.peJID. If the -e or -o
options are given on job submission, the PE standard error and standard output are merged
into the paths specified.

The following special variables, expanded at runtime, can be used (besides any other
strings which have to be interpreted by the start and stop procedures) to constitute a
command line:

$pe_hostfile
The pathname of a file containing a detailed description of the layout of the
parallel environment to be set up by the start-up procedure. Each line of the file
refers to a host on which parallel processes are to be run. The first entry of each
line denotes the hostname, the second entry the number of parallel processes to be
run on the host, the third entry the name of the queue. The entries are separated
by spaces. If -binding pe is specified on job submission, the fourth column is the
core binding specification as colon-separated socket-core pairs, like "0,0:0,1",
meaning the first core on the first socket and the second core on the first socket
can be used for binding. Otherwise it will be "UNDEFINED". With the obsolete
queue processors specification the fourth entry could be a multi-processor
configuration (or "<NULL>").

$host The name of the host on which the startup or stop procedures are run.

$ja_task_id
The array job task index (0 if not an array job).

$job_owner
The user name of the job owner.

$job_id
Grid Engine's unique job identification number.

$job_name
The name of the job.

$pe The name of the parallel environment in use.

$pe_slots
Number of slots granted for the job.

$processors
The processors string as contained in the queue configuration (see queue_conf(5))
of the master queue (the queue in which the startup and stop procedures are run).

$queue The cluster queue of the master queue instance.

$sge_cell
The SGE_CELL environment variable (useful for locating files).

$sge_root
The SGE_ROOT environment variable (useful for locating files).

$stdin_path
The standard input path.

$stderr_path
The standard error path.

$stdout_path
The standard output path.

$merge_stderr

$fs_stdin_host

$fs_stdin_path

$fs_stdin_tmp_path

$fs_stdin_file_staging

$fs_stdout_host

$fs_stdout_path

$fs_stdout_tmp_path

$fs_stdout_file_staging

$fs_stderr_host

$fs_stderr_path

$fs_stderr_tmp_path

$fs_stderr_file_staging

The start and stop commands are run with the same environment setting as that of the job
to be started afterwards (see qsub(1)).
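
A start procedure typically begins by reading the file named by $pe_hostfile. The following
Python sketch parses the four-column layout described for $pe_hostfile above. It is an
illustration only: the names HostEntry, parse_binding and parse_pe_hostfile, and the sample
host and queue names, are invented for the example, and a fourth column that is not a list
of socket,core pairs (for instance "UNDEFINED" or "<NULL>") is simply reported as None.

```python
from typing import List, NamedTuple, Optional, Tuple


class HostEntry(NamedTuple):
    host: str
    slots: int
    queue: str
    binding: Optional[List[Tuple[int, int]]]  # (socket, core) pairs, or None


def parse_binding(field: str) -> Optional[List[Tuple[int, int]]]:
    """Turn "0,0:0,1" into [(0, 0), (0, 1)]; anything else yields None."""
    try:
        pairs = []
        for pair in field.split(":"):
            socket, core = pair.split(",")
            pairs.append((int(socket), int(core)))
        return pairs
    except ValueError:
        # "UNDEFINED", "<NULL>" or an old-style processors entry.
        return None


def parse_pe_hostfile(text: str) -> List[HostEntry]:
    """Parse space-separated lines: host, process count, queue, [binding]."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        binding = parse_binding(fields[3]) if len(fields) > 3 else None
        entries.append(HostEntry(fields[0], int(fields[1]), fields[2], binding))
    return entries


# Made-up hostfile contents for illustration.
sample = "node01 4 all.q@node01 0,0:0,1\nnode02 2 all.q@node02 UNDEFINED\n"
entries = parse_pe_hostfile(sample)
print(sum(e.slots for e in entries))  # 6
```

In a real start procedure the text would come from the file whose path is passed in place of
$pe_hostfile on the configured command line.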

allocation_rule
The allocation rule is interpreted by the scheduler thread and helps the scheduler to
decide how to distribute parallel processes among the available machines. If, for
instance, a parallel environment is built for shared memory applications only, all
parallel processes have to be assigned to a single machine, no matter how many suitable
machines are available. If, however, the parallel environment follows the distributed
memory paradigm, an even distribution of processes among machines may be favorable, as may
packing processes onto the minimum number of machines.

The current version of the scheduler only understands the following allocation rules:

int An integer, fixing the number of processes per host. If it is 1, all processes have
to reside on different hosts. If the special name $pe_slots is used, the full range
of processes as specified with the qsub(1) -pe switch has to be allocated on a
single host (no matter what value belonging to the range is finally chosen for the
job to be allocated).

$fill_up
Starting from the best suitable host/queue, all available slots are allocated.
Further hosts and queues are "filled up" as long as a job still requires slots for
parallel tasks.

$round_robin
From all suitable hosts, a single slot is allocated until all tasks requested by
the parallel job are dispatched. If more tasks are requested than suitable hosts
are found, allocation starts again from the first host. The allocation scheme
walks through suitable hosts in a most-suitable-first order.
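
The difference between the rules is easiest to see on a small example. The Python sketch
below imitates the rules as described above for a single, already chosen slot count. It is
not the scheduler's algorithm: the function name allocate and the host data are invented,
hosts are assumed to be unique and already sorted most-suitable-first, and treating a
request that is not a multiple of a fixed per-host count as unsatisfiable is an assumption
of the sketch.

```python
def allocate(rule, hosts, wanted):
    """Distribute `wanted` slots over `hosts` according to `rule`.

    hosts: list of (name, free_slots), most suitable first.
    Returns {host: granted_slots}, or None if the request cannot be met.
    """
    if rule == "$pe_slots":
        # Everything must fit on one host.
        for name, free in hosts:
            if free >= wanted:
                return {name: wanted}
        return None

    grant = {name: 0 for name, _ in hosts}
    remaining = wanted
    if rule == "$fill_up":
        # Take all free slots of a host before moving to the next one.
        for name, free in hosts:
            take = min(free, remaining)
            grant[name] = take
            remaining -= take
    elif rule == "$round_robin":
        # One slot per host per pass, restarting from the first host.
        progressed = True
        while remaining > 0 and progressed:
            progressed = False
            for name, free in hosts:
                if remaining > 0 and grant[name] < free:
                    grant[name] += 1
                    remaining -= 1
                    progressed = True
    else:
        # Fixed integer number of processes per host.
        per_host = int(rule)
        if per_host <= 0:
            raise ValueError("per-host count must be positive")
        if wanted % per_host:
            return None  # assumption of this sketch, see lead-in
        for name, free in hosts:
            if remaining > 0 and free >= per_host:
                grant[name] = per_host
                remaining -= per_host
    if remaining:
        return None
    return {name: n for name, n in grant.items() if n}


hosts = [("a", 4), ("b", 4), ("c", 2)]  # made-up hosts and free slots
print(allocate("$fill_up", hosts, 6))      # {'a': 4, 'b': 2}
print(allocate("$round_robin", hosts, 6))  # {'a': 2, 'b': 2, 'c': 2}
```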

control_slaves
This parameter can be set to TRUE or FALSE (the default). It indicates whether Grid Engine
is the creator of the slave tasks of a parallel application via sge_execd(8) and
sge_shepherd(8) and thus has full control over all processes in a parallel application
("tight integration"). This enables:

• resource limits are enforced for all tasks, even on slave hosts;

• resource consumption is properly accounted on all hosts;

• proper control of tasks, with no need to write a customized terminate method to
ensure that the whole job is finished on qdel and that tasks are properly reaped in
the case of abnormal job termination;

• all tasks are started with the appropriate nice value which was configured as
priority in the queue configuration;

• propagation of the job environment to slave hosts, e.g. so that they write into the
appropriate per-job temporary directory specified by TMPDIR, which is created on
each host and properly cleaned up.

To gain control over the slave tasks of a parallel application, a sophisticated PE
interface is required, which works closely together with Grid Engine facilities, typically
interpreting the Grid Engine hostfile and starting remote tasks with qrsh(1) and its
-inherit option. See, for instance, the $SGE_ROOT/mpi directory and the howto pages
⟨http://arc.liv.ac.uk/SGE/howto/#Tight%20Integration%20of%20Parallel%20Libraries⟩.

Please set the control_slaves parameter to false for all other PE interfaces.

job_is_first_task
The job_is_first_task parameter can be set to TRUE or FALSE. A value of TRUE indicates
that the Grid Engine job script already contains one of the tasks of the parallel
application (and the number of slots reserved for the job is the number of slots requested
with the -pe switch). FALSE indicates that the job script (and its child processes) is
not part of the parallel program, just being used to kick off the tasks that do the work;
then the number of slots reserved for the job in the master queue is increased by 1, as
indicated by qstat/qhost.

This should be TRUE for the common modern MPI implementations with tight integration.
Consider if the allocation rule is $fill_up, and a job is allocated only a single slot on
the master host; then one of the MPI processes actually runs in that slot, and should be
accounted as such, so the job is the first task.

If wallclock accounting is used (execd_params ACCT_RESERVED_USAGE and/or
SHARETREE_RESERVED_USAGE is TRUE) and control_slaves is set to FALSE, the
job_is_first_task parameter influences the accounting for the job: a value of TRUE means
that accounting for CPU and requested memory gets multiplied by the number of slots
requested with the -pe switch. FALSE means the accounting information gets multiplied by
the number of slots + 1. Otherwise, the only significant effect of the parameter is on the
display of the job.
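
The accounting rule in the preceding paragraph can be written down as a small Python
function. This is a sketch of the stated rule only; the function and parameter names are
invented, and None is returned for the cases in which the text says the parameter has no
accounting effect.

```python
def reserved_usage_multiplier(pe_slots, job_is_first_task,
                              control_slaves, reserved_usage):
    """Factor applied to CPU and requested-memory accounting.

    reserved_usage: True if ACCT_RESERVED_USAGE and/or
    SHARETREE_RESERVED_USAGE is TRUE (wallclock accounting).
    """
    if not reserved_usage or control_slaves:
        # job_is_first_task only affects how the job is displayed.
        return None
    return pe_slots if job_is_first_task else pe_slots + 1


print(reserved_usage_multiplier(8, True, False, True))   # 8
print(reserved_usage_multiplier(8, False, False, True))  # 9
```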

urgency_slots
For pending jobs with a slot range PE request with different minimum and maximum, the
number of slots they will actually use is not determined. This setting specifies the
method to be used by Grid Engine to assess the number of slots such jobs might finally
get.

The assumed slot allocation has a meaning when determining the resource-request-based
priority contribution for numeric resources as described in sge_priority(5) and is
displayed when qstat(1) is run without the -g t option.

The following methods are supported:

int The specified integer number is directly used as prospective slot amount.

min The slot range minimum is used as prospective slot amount. If no lower bound is
specified with the range, 1 is assumed.

max The slot range maximum is used as prospective slot amount. If no upper bound is
specified with the range, the absolute maximum possible due to the PE's slots
setting is assumed.

avg The average of all numbers occurring within the job's PE range request is assumed.
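
The four methods can be compared with a short Python sketch. It is illustrative only: the
function name and the representation of a range request as (low, high) pairs are
assumptions, stepped ranges are not handled, and reading "avg" as the arithmetic mean of
every integer covered by the request is one interpretation of the description above.

```python
def urgency_slot_amount(setting, ranges, pe_slots):
    """Prospective slot amount for a pending job with a PE range request.

    setting:  "min", "max", "avg", or an integer given as a string.
    ranges:   list of (low, high); None marks a missing bound.
    pe_slots: the PE's slots setting, used for a missing upper bound.
    """
    # A missing lower bound counts as 1, a missing upper bound as pe_slots.
    bounds = [(1 if lo is None else lo, pe_slots if hi is None else hi)
              for lo, hi in ranges]
    if setting == "min":
        return min(lo for lo, _ in bounds)
    if setting == "max":
        return max(hi for _, hi in bounds)
    if setting == "avg":
        values = set()
        for lo, hi in bounds:
            values.update(range(lo, hi + 1))
        return sum(values) / len(values)
    return int(setting)  # a fixed integer is used directly


# A request such as "-pe mype 2-8" against a PE with 16 slots (made-up values).
print(urgency_slot_amount("min", [(2, 8)], 16))  # 2
print(urgency_slot_amount("max", [(2, 8)], 16))  # 8
print(urgency_slot_amount("avg", [(2, 8)], 16))  # 5.0
```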

accounting_summary
This parameter is only checked if control_slaves (see above) is set to TRUE and thus Grid
Engine is the creator of the slave tasks of a parallel application via sge_execd(8) and
sge_shepherd(8). In this case, accounting information is available for every single slave
task started by Grid Engine.

The accounting_summary parameter can be set to TRUE or FALSE. A value of TRUE indicates
that only a single accounting record is written to the accounting(5) file, containing the
accounting summary of the whole job, including all slave tasks, while a value of FALSE
indicates an individual accounting(5) record is written for every slave task, as well as
for the master task.

Note: When running tightly integrated jobs with SHARETREE_RESERVED_USAGE set, and
accounting_summary enabled in the parallel environment, reserved usage will only be
reported by the master task of the parallel job. No per-parallel task usage records will
be sent from execd to qmaster, which can significantly reduce load on the qmaster when
running large, tightly integrated parallel jobs. However, this removes the only post-hoc
information about which hosts a job used.

qsort_args library qsort-function [arg1 ...]
Specifies a method for selecting the queues/hosts, and their order, that should be used to
schedule a parallel job. For details, and the API, consult the header file
$SGE_ROOT/include/sge_pqs_api.h. library is the path to the qsort dynamic library,
qsort-function is the name of the qsort function implemented by the library, and the args
are arguments passed to qsort. Substitutions from the hard requested resource list for the
job are made for any strings of the form $resource, where resource is the full name of the
resource as defined in the complex(5) list. If resource is not requested in the job, a
null string is substituted.

RESTRICTIONS

       Note  that  the  functionality  of  the  start  and  stop  procedures  remains  the   full
       responsibility  of  the  administrator  configuring the parallel environment.  Grid Engine
       will invoke these procedures and evaluate their exit status.  A non-zero exit status  will
       put the queue into an error state.  If the start procedure has a non-zero exit status, the
       job will be re-queued.

SECURITY

       If start_proc_args or stop_proc_args is specified with a user@ prefix, the same
       considerations apply as for the prolog and epilog, as described in the SECURITY section of
       sge_conf(5).

FILES

       $SGE_ROOT/include/sge_pqs_api.h

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.