Ubuntu Manpage: sched_conf - Sun Grid Engine default scheduler configuration file

Provided by: gridengine-common_6.2u5-7.3_all

NAME

       sched_conf - Sun Grid Engine default scheduler configuration file

DESCRIPTION

       sched_conf  defines  the  configuration  file format for Sun Grid Engine's  scheduler.  In
       order to modify the configuration, use the  graphical  user's  interface  qmon(1)  or  the
       -msconf  option of the qconf(1) command. A default configuration is provided together with
       the Sun Grid Engine distribution package.

       Note that Sun Grid Engine allows backslashes (\) to be used to escape newline (\newline)
       characters. The backslash and the newline are replaced with a space (" ") character before
       any interpretation.

FORMAT

The following parameters are recognized by the Sun Grid Engine scheduler if present in
sched_conf:

algorithm
Note: Deprecated, may be removed in a future release.
Allows for the selection of alternative scheduling algorithms.

Currently default is the only allowed setting.

load_formula
A simple algebraic expression used to derive a single weighted load value from all or part
of the load parameters reported by sge_execd(8) for each host and from all or part of the
consumable resources (see complex(5)) being maintained for each host. The load formula
expression syntax is that of a summation of weighted load values, that is:

{w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

Note, no blanks are allowed in the load formula.
The load values and consumable resources (load_val1, ...) are specified by the name
defined in the complex (see complex(5)).
Note: Administrator defined load values (see the load_sensor parameter in sge_conf(5) for
details) and consumable resources available for all hosts (see complex(5)) may be used as
well as Sun Grid Engine default load parameters.
The weighting factors (w1, ...) are positive integers. After the expression is evaluated
for each host the results are assigned to the hosts and are used to sort the hosts
corresponding to the weighted load. The sorted host list is used to sort queues
subsequently.
The default load formula is "np_load_avg".
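
As a rough illustration of the syntax above, the following Python sketch evaluates a load
formula for each host and sorts the hosts by the result. The function name
eval_load_formula, the host names and the load figures are made up for this example; it
is not the scheduler's actual implementation.

```python
import re

# One term of the documented grammar: an optional sign, then either a
# constant "w" or a load value name with an optional "*w" weight.
_TERM = re.compile(r"([+-]?)([A-Za-z_]\w*|\d+)(?:\*(\d+))?")

def eval_load_formula(formula, values):
    """Evaluate a formula such as 'np_load_avg*2+mem_used-1' for one host."""
    if not formula or " " in formula:
        raise ValueError("the load formula must be non-empty and contain no blanks")
    total, pos = 0.0, 0
    while pos < len(formula):
        m = _TERM.match(formula, pos)
        # Every term after the first one must carry an explicit + or -.
        if m is None or (pos > 0 and not m.group(1)):
            raise ValueError(f"cannot parse load formula at offset {pos}")
        sign = -1.0 if m.group(1) == "-" else 1.0
        name, weight = m.group(2), m.group(3)
        # A bare number is a constant; anything else is looked up by name.
        base = float(name) if name.isdigit() else float(values[name])
        total += sign * base * (int(weight) if weight else 1)
        pos = m.end()
    return total

# Hypothetical per-host load reports (illustrative values only).
hosts = {
    "node1": {"np_load_avg": 0.50, "mem_used": 2.0},
    "node2": {"np_load_avg": 0.10, "mem_used": 1.0},
}
# Hosts with the lowest weighted load come first.
ranked = sorted(hosts, key=lambda h: eval_load_formula("np_load_avg*2+mem_used", hosts[h]))
```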

job_load_adjustments
The load imposed by the Sun Grid Engine jobs running on a system varies over time and
often, e.g. for the CPU load, requires some amount of time to be reported in the
appropriate quantity by the operating system. Consequently, if a job was started very
recently, the reported load may not sufficiently represent the load which is already
imposed on that host by the job. The reported load will adapt to the real load over time,
but the period during which the reported load is too low may already lead to an
oversubscription of that host. Sun Grid Engine allows the administrator to specify
job_load_adjustments, which are used in the Sun Grid Engine scheduler to compensate for
this problem.
The job_load_adjustments are specified as a comma separated list of arbitrary load
parameters or consumable resources and (separated by an equal sign) an associated load
correction value. Whenever a job is dispatched to a host by the scheduler, the load
parameter and consumable value set of that host is increased by the values provided in the
job_load_adjustments list. These correction values are decayed linearly over time until
after load_adjustment_decay_time from the start the corrections reach the value 0. If the
job_load_adjustments list is assigned the special value NONE, no load corrections
are performed.
The adjusted load and consumable values are used to compute the combined and weighted load
of the hosts with the load_formula (see above) and to compare the load and consumable
values against the load threshold lists defined in the queue configurations (see
queue_conf(5)). If the load_formula consists simply of the default CPU load average
parameter np_load_avg, and if the jobs are very compute intensive, one might want to set
the job_load_adjustments list to np_load_avg=1.00, which means that every new job
dispatched to a host will require 100 % CPU time, and thus the machine's load is instantly
increased by 1.00.

load_adjustment_decay_time
The load corrections in the "job_load_adjustments" list above are decayed linearly over
time from the point of the job start, where the corresponding load or consumable parameter
is raised by the full correction value, until after a time period of
"load_adjustment_decay_time", where the correction becomes 0. Proper values for
"load_adjustment_decay_time" greatly depend upon the load or consumable parameters used
and the specific operating system(s). Therefore, they can only be determined on-site and
experimentally. For the default np_load_avg load parameter a "load_adjustment_decay_time"
of 7 minutes has proven to yield reasonable results.
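
The linear decay described above can be sketched in a few lines of Python. The function
names and the sample numbers are assumptions made for illustration; only the linear decay
from the full correction value down to 0 is taken from the text.

```python
def load_correction(value, elapsed, decay_time):
    """Correction still in effect `elapsed` seconds after the job start."""
    if decay_time <= 0 or elapsed >= decay_time:
        return 0.0                      # correction has fully decayed
    if elapsed <= 0:
        return float(value)             # full correction at job start
    return value * (1.0 - elapsed / decay_time)

def adjusted_load(reported, job_ages, value=1.0, decay_time=7 * 60):
    """Reported load plus the remaining corrections of recently started jobs."""
    return reported + sum(load_correction(value, age, decay_time) for age in job_ages)

# Two jobs on one host: one just started, one started 3.5 minutes ago.
example = adjusted_load(0.2, [0, 210])
```

With np_load_avg=1.00 and a decay time of 7 minutes, the job started 3.5 minutes ago still
contributes half of its correction.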

maxujobs
The maximum number of jobs any user may have running in a Sun Grid Engine cluster at the
same time. If set to 0 (default) the users may run an arbitrary number of jobs.

schedule_interval
At the time the scheduler thread initially registers at the event master thread in
sge_qmaster(8) process, schedule_interval is used to set the time interval in which the
event master thread sends scheduling event updates to the scheduler thread. A scheduling
event is a status change that has occurred within sge_qmaster(8) which may trigger or
affect scheduler decisions (e.g. a job has finished and thus the allocated resources are
available again).
In the Sun Grid Engine default scheduler the arrival of a scheduling event report triggers
a scheduler run. The scheduler waits for event reports otherwise.
The schedule_interval is a time value (see queue_conf(5) for a definition of the syntax of
time values).

queue_sort_method
This parameter determines the order in which several criteria are taken into account to
produce a sorted queue list. Currently, two settings are valid: seqno and load. In both
cases, however, Sun Grid Engine attempts to maximize the number of soft requests (see qsub(1)
-s option) being fulfilled by the queues for a particular job as the primary criterion.
Then, if the queue_sort_method parameter is set to seqno, Sun Grid Engine will use the
seq_no parameter as configured in the current queue configurations (see queue_conf(5)) as
the next criterion to sort the queue list. The load_formula (see above) is only relevant
if two queues have equal sequence numbers. If queue_sort_method is set to load, the load
according to the load_formula is the criterion after maximizing a job's soft requests, and
the sequence number is only used if two hosts have the same load. Sorting by sequence number
is most useful if you want to define a fixed order in which queues are to be filled (e.g.
the cheapest resource first).

The default for this parameter is load.
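
The ordering of criteria can be pictured as a composite sort key. The Python sketch below
is only an illustration: the dictionary fields soft_met, seq_no and load are hypothetical
names for the number of fulfilled soft requests, the queue's sequence number and the
result of the load_formula.

```python
def sort_queues(queues, method="load"):
    """Sort queues the way the two queue_sort_method settings are described."""
    def key(q):
        soft = -q["soft_met"]            # more fulfilled soft requests sort first
        if method == "seqno":
            return (soft, q["seq_no"], q["load"])
        if method == "load":
            return (soft, q["load"], q["seq_no"])
        raise ValueError(f"unknown queue_sort_method: {method}")
    return sorted(queues, key=key)

# Hypothetical queues (illustrative values only).
queues = [
    {"name": "a.q", "soft_met": 1, "seq_no": 10, "load": 0.8},
    {"name": "b.q", "soft_met": 1, "seq_no": 20, "load": 0.2},
    {"name": "c.q", "soft_met": 2, "seq_no": 30, "load": 0.9},
]
by_load = [q["name"] for q in sort_queues(queues, "load")]
by_seqno = [q["name"] for q in sort_queues(queues, "seqno")]
```

In both cases c.q comes first because it fulfills the most soft requests; the two
settings only differ in how a.q and b.q are ordered.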

halftime
When executing under a share based policy, the scheduler "ages" (i.e. decreases) usage to
implement a sliding window for achieving the share entitlements as defined by the share
tree. The halftime defines the time interval in which accumulated usage will have been
decayed to half its original value. Valid values are specified in hours or according to
the time format as specified in queue_conf(5).
If the value is set to 0, the usage is not decayed.
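
A minimal sketch of the effect of halftime, assuming the usual exponential half-life
model (the text only states that usage falls to half its value within one halftime
period). The function name and figures are illustrative.

```python
def decayed_usage(usage, elapsed_hours, halftime_hours):
    """Usage remaining after `elapsed_hours`, given a halftime in hours."""
    if halftime_hours == 0:
        return usage                     # a halftime of 0 disables decay
    return usage * 0.5 ** (elapsed_hours / halftime_hours)

# With a halftime of 168 hours (one week), usage halves every week.
after_one_week = decayed_usage(100.0, 168, 168)
after_two_weeks = decayed_usage(100.0, 336, 168)
```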

usage_weight_list
Sun Grid Engine accounts for the consumption of the resources CPU-time, memory and IO to
determine the usage which is imposed on a system by a job. A single usage value is
computed from these three input parameters by multiplying the individual values by weights
and adding them up. The weights are defined in the usage_weight_list. The format of the
list is

cpu=wcpu,mem=wmem,io=wio

where wcpu, wmem and wio are the configurable weights. The weights are real numbers. The
sum of all three weights should be 1.
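
The weighted sum can be sketched as follows. The helper names and the sample weights are
assumptions for illustration and do not reflect any shipped default.

```python
def parse_usage_weights(spec):
    """Parse a list such as 'cpu=0.5,mem=0.25,io=0.25' into a dict of floats."""
    weights = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        weights[name.strip()] = float(value)
    return weights

def combined_usage(weights, cpu, mem, io):
    """Single usage value: each input multiplied by its weight, then summed."""
    return (weights.get("cpu", 0.0) * cpu
            + weights.get("mem", 0.0) * mem
            + weights.get("io", 0.0) * io)

weights = parse_usage_weights("cpu=0.5,mem=0.25,io=0.25")
usage = combined_usage(weights, cpu=100.0, mem=40.0, io=20.0)
```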

compensation_factor
Determines how fast Sun Grid Engine should compensate for past usage below or above the
share entitlement defined in the share tree. Recommended values are between 2 and 10,
where 10 means faster compensation.

weight_user
The relative importance of the user shares in the functional policy. Values are of type
real.

weight_project
The relative importance of the project shares in the functional policy. Values are of
type real.

weight_department
The relative importance of the department shares in the functional policy. Values are of
type real.

weight_job
The relative importance of the job shares in the functional policy. Values are of type
real.

weight_tickets_functional
The maximum number of functional tickets available for distribution by Sun Grid Engine.
Determines the relative importance of the functional policy. See under sge_priority(5)
for an overview on job priorities.

weight_tickets_share
The maximum number of share based tickets available for distribution by Sun Grid Engine.
Determines the relative importance of the share tree policy. See under sge_priority(5) for
an overview on job priorities.

weight_deadline
The weight applied to the remaining time until a job's latest start time. Determines the
relative importance of the deadline. See under sge_priority(5) for an overview on job
priorities.

weight_waiting_time
The weight applied to a job's waiting time since submission. Determines the relative
importance of the waiting time. See under sge_priority(5) for an overview on job
priorities.

weight_urgency
The weight applied to a job's normalized urgency when determining the priority finally used.
Determines the relative importance of urgency. See under sge_priority(5) for an overview
on job priorities.

weight_priority
The weight applied to a job's normalized POSIX priority when determining the priority finally
used. Determines the relative importance of POSIX priority. See under sge_priority(5) for
an overview on job priorities.

weight_ticket
The weight applied to the normalized ticket amount when determining the priority finally used.
Determines the relative importance of the ticket policies. See under sge_priority(5) for
an overview on job priorities.

flush_finish_sec
This parameter is provided for tuning the system's scheduling behavior. By default, a
scheduler run is triggered in the schedule_interval. When this parameter is set to 1 or
larger, the scheduler will be triggered that number of seconds after a job has finished.
Setting this parameter to 0 disables the flush after a job has finished.

flush_submit_sec
This parameter is provided for tuning the system's scheduling behavior. By default, a
scheduler run is triggered in the schedule_interval. When this parameter is set to 1 or
larger, the scheduler will be triggered that number of seconds after a job was submitted
to the system. Setting this parameter to 0 disables the flush after a job was submitted.

schedd_job_info
The default scheduler can keep track of why jobs could not be scheduled during the last
scheduler run. This parameter enables or disables the observation. The value true enables
the monitoring; false turns it off.

It is also possible to activate the observation only for certain jobs. This will be done
if the parameter is set to job_list followed by a comma separated list of job ids.

The user can obtain the collected information with the command qstat -j.

params
This parameter is intended for passing additional parameters to the Sun Grid Engine scheduler. The
following values are recognized:

DURATION_OFFSET
If set, overrides the default value of 60 seconds. This parameter is used by the
Sun Grid Engine scheduler when planning resource utilization as the delta between
net job runtimes and total time until resources become available again. Net job
runtime as specified with -l h_rt=... or -l s_rt=... or default_duration always
differs from total job runtime due to delays before and after actual job start and
finish. Among the delays before job start is the time until the end of a
schedule_interval, the time it takes to deliver a job to sge_execd(8) and the
delays caused by prolog in queue_conf(5) , start_proc_args in sge_pe(5) and
starter_method in queue_conf(5) . The delays after job finish include delays due to
a forced job termination (notify, terminate_method or checkpointing), procedures
run after actual job finish, such as stop_proc_args in sge_pe(5) or epilog in
queue_conf(5) , and the delay until a new schedule_interval.
If the offset is too low, resource reservations (see max_reservation) can be
delayed repeatedly due to an overly optimistic job circulation time.

JC_FILTER
Note: Deprecated, may be removed in a future release.
If set to true, the scheduler limits the number of jobs it looks at during a
scheduling run. At the beginning of the scheduling run it assigns each job a
specific category, which is based on the job's requests, priority settings, and the
job owner. All scheduling policies will assign the same importance to each job in
one category. Therefore the jobs in each category have a FIFO order, and their number
can be limited to the number of free slots in the system.

An exception is made for jobs that request a resource reservation. They are included
regardless of the number of jobs in a category.

This setting is turned off per default, because in very rare cases, the scheduler
can make a wrong decision. It is also advised to turn report_pjob_tickets off.
Otherwise qstat -ext can report outdated ticket amounts. The information shown with
a qstat -j for a job, that was excluded in a scheduling run, is very limited.

PROFILE
If set equal to 1, the scheduler logs profiling information summarizing each
scheduling run.

MONITOR
If set equal to 1, the scheduler records information for each scheduling run in the
file <sge_root>/<cell>/common/schedule, allowing job resource utilization to be
reproduced.

PE_RANGE_ALG
This parameter sets the algorithm for the pe range computation. The default is
automatic, which means that the scheduler will select the best one, and it should
not be necessary to change it to a different setting in normal operation. If a
custom setting is needed, the following values are available:
auto : the scheduler selects the best algorithm
least : starts the resource matching with the lowest slot amount first
bin : starts the resource matching in the middle of the pe slot range
highest : starts the resource matching with the highest slot amount first

Changing params will take immediate effect. The default for params is none.

reprioritize_interval
Interval (HH:MM:SS) to reprioritize jobs on the execution hosts based on the current
ticket amount for the running jobs. If the interval is set to 00:00:00 the
reprioritization is turned off. The default value is 00:00:00. The reprioritization
tickets are calculated by the scheduler and update events for running jobs are only sent
after the scheduler has calculated new values. How often the scheduler should calculate the
tickets is defined by the reprioritize_interval. Because the scheduler is only triggered
in a specific interval (schedule_interval), the reprioritize_interval only has
a meaning if set greater than the schedule_interval. For example, if the
schedule_interval is 2 minutes and reprioritize_interval is set to 10 seconds, the
jobs get re-prioritized every 2 minutes.

report_pjob_tickets
This parameter allows tuning of the system's scheduling run time. It is used to enable or
disable the reporting of pending job tickets to the qmaster. It does not influence the
ticket calculation. When the reporting is turned off, the sort order of jobs in qstat and
qmon is based only on the submit time.
The reporting should be turned off in a system with a very large number of jobs by setting
this parameter to "false".

halflife_decay_list
The halflife_decay_list allows different decay rates to be configured for the
"finished_jobs" usage types, which are used in the pending job ticket calculation to
account for jobs which have just ended. This allows the pending job algorithm to count
finished jobs against a user or project for a configurable decayed time period. This
feature is turned off by default, and the halftime is used instead.
The halflife_decay_list also allows one to configure different decay rates for each usage
type being tracked (cpu, io, and mem). The list is specified in the following format:

<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

<USAGE_TYPE> can be one of the following: cpu, io, or mem.
<TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is -1, only the usage of
currently running jobs is used. 0 means that the usage is not decayed.
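
The list format can be checked with a small parser such as the Python sketch below. The
function name and the sample value are assumptions; only the format, the three usage
types and the meaning of -1 and 0 come from the description above.

```python
VALID_USAGE_TYPES = {"cpu", "io", "mem"}

def parse_halflife_decay_list(spec):
    """Parse '<USAGE_TYPE>=<TIME>[:...]' into a dict of minutes per usage type."""
    result = {}
    for item in spec.split(":"):
        usage_type, sep, time = item.partition("=")
        if not sep or usage_type not in VALID_USAGE_TYPES:
            raise ValueError(f"invalid entry: {item!r}")
        minutes = int(time)
        if minutes < -1:
            raise ValueError(f"invalid time for {usage_type}: {minutes}")
        # -1: only usage of currently running jobs counts; 0: no decay.
        result[usage_type] = minutes
    return result

decay = parse_halflife_decay_list("cpu=1440:io=-1:mem=0")
```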

policy_hierarchy
This parameter sets up a dependency chain of ticket based policies. Each ticket based
policy in the dependency chain is influenced by the previous policies and influences the
following policies. A typical scenario is to assign precedence for the override policy
over the share-based policy. The override policy determines in such a case how share-based
tickets are assigned among jobs of the same user or project. Note that all policies
contribute to the ticket amount assigned to a particular job regardless of the policy
hierarchy definition. Yet the tickets calculated in each of the policies can be different
depending on "POLICY_HIERARCHY".

The "POLICY_HIERARCHY" parameter can be a combination of up to 3 of the first letters
of the 3 ticket based policies S(hare-based), F(unctional) and O(verride). So a value of
"OFS" means that the override policy takes precedence over the functional policy, which
finally influences the share-based policy. Less than 3 letters mean that some of the
policies do not influence other policies and also are not influenced by other policies. So
a value of "FS" means that the functional policy influences the share-based policy and
that there is no interference with the other policies.

The special value "NONE" switches off policy hierarchies.
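
As an illustration, the following Python sketch turns a policy_hierarchy value into the
ordered chain of policies it describes. The function name is hypothetical, and rejecting
repeated letters is an assumption of this sketch that the text does not state explicitly.

```python
POLICY_NAMES = {"S": "share-based", "F": "functional", "O": "override"}

def parse_policy_hierarchy(value):
    """Return the dependency chain, earliest (most influential) policy first."""
    if value == "NONE":
        return []                        # policy hierarchies switched off
    if not 1 <= len(value) <= 3:
        raise ValueError("expected 1 to 3 letters or NONE")
    if len(set(value)) != len(value):
        raise ValueError("a policy letter may appear only once")
    unknown = [c for c in value if c not in POLICY_NAMES]
    if unknown:
        raise ValueError(f"unknown policy letters: {unknown}")
    return [POLICY_NAMES[c] for c in value]

chain = parse_policy_hierarchy("OFS")
```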

share_override_tickets
If set to "true" or "1", override tickets of any override object instance are shared
equally among all running jobs associated with the object. Pending jobs get as many
override tickets as they would have if they were running. If set to "false" or
"0", each job gets the full value of the override tickets associated with the object. The
default value is "true".

share_functional_shares
If set to "true" or "1", functional shares of any functional object instance are shared
among all the jobs associated with the object. If set to "false" or "0", each job
associated with a functional object gets the full functional shares of that object. The
default value is "true".

max_functional_jobs_to_schedule
The maximum number of pending jobs to schedule in the functional policy. The default
value is 200.

max_pending_tasks_per_job
The maximum number of subtasks per pending array job to schedule. This parameter exists in
order to reduce scheduling overhead. The default value is 50.

max_reservation
The maximum number of reservations scheduled within a schedule interval. When a runnable
job can not be started due to a shortage of resources a reservation can be scheduled
instead. A reservation can cover consumable resources with the global host, any execution
host and any queue. For parallel jobs, reservations are also made for the slots resource as
specified in sge_pe(5). As job runtime the maximum of the time specified with -l h_rt=...
or -l s_rt=... is assumed. For jobs that have neither of them the default_duration is
assumed. Reservations prevent jobs of lower priority as specified in sge_priority(5) from
utilizing the reserved resource quota during the time of reservation. Jobs of lower
priority are allowed to utilize those reserved resources only if their prospective job end
is before the start of the reservation (backfilling). Reservation is done only for non-
immediate jobs (-now no) that request reservation (-R y). If max_reservation is set to "0"
no job reservation is done.
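
The runtime assumption and the backfilling rule can be sketched as below. This is a
simplified illustration with hypothetical function names and times in seconds; it
ignores DURATION_OFFSET and treats a job ending exactly at the reservation start as
acceptable, which is an assumption of this sketch.

```python
def assumed_runtime(h_rt=None, s_rt=None, default_duration=0):
    """Maximum of the requested h_rt/s_rt limits, else the default_duration."""
    limits = [t for t in (h_rt, s_rt) if t is not None]
    return max(limits) if limits else default_duration

def may_backfill(now, runtime, reservation_start):
    """A lower priority job fits if it would end before the reservation starts."""
    return now + runtime <= reservation_start

# A job without runtime limits, evaluated against a reservation starting at t=900.
runtime = assumed_runtime(default_duration=600)
fits = may_backfill(now=0, runtime=runtime, reservation_start=900)
```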

Note that reservation scheduling can be costly in terms of performance, and hence it
is switched off by default. Since the performance cost of reservation scheduling
is known to grow with the number of pending jobs, the use of the -R y option is
recommended only for those jobs actually queuing for bottleneck resources. Together with
the max_reservation parameter, this technique can be used to limit the performance
impact.

default_duration
When job reservation is enabled through the max_reservation parameter in sched_conf(5), the
default duration is assumed as the runtime for jobs that have neither -l h_rt=... nor -l
s_rt=... specified. In contrast to an h_rt/s_rt time limit, the default_duration is not
enforced.

FILES

       <sge_root>/<cell>/common/sched_configuration
                  scheduler thread configuration

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.
