bionic (5) sge_sched_conf.5.gz

Provided by: gridengine-common_8.1.9+dfsg-7build1_all

NAME

       sched_conf - Grid Engine default scheduler configuration file

DESCRIPTION

       sched_conf defines the configuration file format for Grid Engine's scheduler. To modify the
       configuration, use the graphical user interface qmon(1) or the -msconf option of the qconf(1) command.
       A default configuration is provided with the Grid Engine distribution package.
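
       For example, the scheduler configuration can be displayed or edited from the command line (see
       qconf(1)):

              qconf -ssconf
              qconf -msconf

       The first command prints the current configuration in sched_conf format; the second opens it in an
       editor for modification.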

       Note that Grid Engine allows backslashes (\) to be used to escape newline characters. The backslash
       and the newline are replaced with a space character before any interpretation.

FORMAT

       The following parameters are recognized by the Grid Engine scheduler if present in sched_conf:

   algorithm
       Note: Deprecated, may be removed in future release.
       Allows for the selection of alternative scheduling algorithms.

       Currently default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load value from all or part  of  the  load
       parameters  reported  by sge_execd(8) for each host and from all or part of the consumable resources (see
       complex(5)) being maintained for each host.  The load formula expression syntax  is  that  of  a  sum  of
       weighted load values, that is:

              {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note, no blanks are allowed in the load formula.
       The  load  values  and  consumable  resources  (load_val1, ...)  are specified by the name defined in the
       complex (see complex(5)).
       Note: Administrator-defined load values (see the load_sensor parameter in sge_conf(5)  for  details)  and
       consumable  resources available for all hosts (see complex(5)) may be used as well as Grid Engine default
       load parameters.
       The weighting factors (w1, ...) are positive integers. After the expression is evaluated for each host,
       the results are assigned to the hosts and are used to sort the hosts according to the weighted load.
       The sorted host list is subsequently used to sort queues.
       The default load formula is np_load_avg.
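
       For example, a formula that weights the normalized load average twice as heavily as an
       administrator-defined load value could look as follows (io_wait is an illustrative name and must be
       defined in the complex, see complex(5)):

              load_formula                 np_load_avg*2+io_wait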

   job_load_adjustments
       The load which is imposed by the Grid Engine jobs running on a system varies in time, and often, e.g. for
       the  CPU  load,  requires some amount of time to be reported in the appropriate quantity by the operating
       system. Consequently, if a job was started very recently, the reported load may not provide a  sufficient
       representation of the load which is already imposed on that host by the job. The reported load will adapt
       to the real load over time, but the period of time in which the reported load is too low may already lead
       to an oversubscription of that host. Grid Engine allows the administrator to specify job_load_adjustments
       which are used in the Grid Engine scheduler to compensate for this problem.
       The job_load_adjustments are specified  as  a  comma-separated  list  of  arbitrary  load  parameters  or
       consumable resources and (separated by an equal sign) an associated load correction value. Whenever a job
       is dispatched to a host by the scheduler, the load parameter and consumable value set  of  that  host  is
       increased by the values provided in the job_load_adjustments list. These correction values are decayed
       linearly over time, reaching 0 when load_adjustment_decay_time has elapsed since the job start.  If the
       job_load_adjustments list is assigned the special keyword NONE, no load corrections are performed.
       The adjusted load and consumable values are used to compute the combined and weighted load of  the  hosts
       with  the  load_formula  (see  above)  and  to  compare  the  load and consumable values against the load
       threshold lists defined in the queue configurations (see queue_conf(5)).  If  the  load_formula  consists
       simply of the default CPU load average parameter np_load_avg, and if the jobs are very compute intensive,
       one might want to set the job_load_adjustments list to np_load_avg=1.00, which means that every  new  job
       dispatched  to  a  host will require 100% CPU time, and thus the machine's load is instantly increased by
       1.00.

   load_adjustment_decay_time
       The load corrections in the "job_load_adjustments" list above are decayed linearly  over  time  from  the
       point  of  the  job  start,  where  the  corresponding load or consumable parameter is raised by the full
       correction value, until after a time period of "load_adjustment_decay_time"  the  correction  becomes  0.
       Proper values for "load_adjustment_decay_time" greatly depend upon the load or consumable parameters used
       and the specific operating system(s). Therefore, they can only be determined on-site and  experimentally.
       For  the  default  np_load_avg  load  parameter a "load_adjustment_decay_time" of 7 minutes has proven to
       yield reasonable results.
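
       For example, to raise the adjusted CPU load of a host by 1.00 for each newly dispatched job and let
       that correction decay to zero over 7 minutes (the time value syntax is described in sge_types(5)), the
       two parameters could be set as follows:

              job_load_adjustments         np_load_avg=1.00
              load_adjustment_decay_time   0:7:0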

   maxujobs
       The maximum number of jobs any user may have running in a Grid Engine cluster at the same time. If set
       to 0 (the default), users may run an arbitrary number of jobs.

   schedule_interval
       At  the  time the scheduler thread initially registers with the event master thread in the sge_qmaster(8)
       process schedule_interval is used to set the time  interval  in  which  the  event  master  thread  sends
       scheduling  event  updates  to  the  scheduler  thread.   A  scheduling event is a status change that has
       occurred within sge_qmaster(8) which may trigger or affect scheduler decisions (e.g. a job  has  finished
       and thus the allocated resources are available again).
       In  the  Grid Engine default scheduler the arrival of a scheduling event report triggers a scheduler run.
       The scheduler waits for event reports otherwise.
       Schedule_interval is a time value (see sge_types(5) for a definition  of  the  syntax  of  time  values).
       Setting it to 0 disables scheduling.
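
       For example, to have the event master thread send scheduling event updates every 15 seconds:

              schedule_interval            0:0:15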

   queue_sort_method
       This  parameter  determines  in  which  order several criteria are taken into account to produce a sorted
       queue instance list which determines  the  preferred  order  for  scheduling  tasks  to  them  (typically
       determining  the  order  in  which  hosts  are used).  Currently, two settings are valid: seqno and load.
       However, in both cases Grid Engine attempts to maximize the number of soft requests (see the qsub(1)
       -soft option) being fulfilled by the queues for a particular job as the primary criterion.
       Then,  if  the  queue_sort_method parameter is set to seqno, Grid Engine will use the seq_no parameter as
       configured in the current queue configurations (see queue_conf(5)) as the  next  criterion  to  sort  the
       queue  list.  The  load_formula  (see  above) is only used as the next criterion if two queues have equal
       sequence numbers.  If queue_sort_method is set to load, the load according to the load_formula is the
       criterion after maximizing a job's soft requests, and the sequence number is only used if two hosts
       have the same load.  Sequence number sorting is most useful if you want to define a fixed order in
       which queues are to be filled (e.g. the cheapest resource first).

       The default for this parameter is load.
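
       For example, to fill queues in the fixed order given by their seq_no values (see queue_conf(5)) rather
       than by load:

              queue_sort_method            seqno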

   halftime
       When  executing  under  a  share based policy, the scheduler "ages" (i.e. decreases) usage to implement a
       sliding window for achieving the share entitlements as defined by the share tree.  The  halftime  defines
       the time interval in which accumulated usage will have been decayed to half its value at the start of the
       interval.  (This is a radioactive-type exponential decay, where the parameter is  usually  called  "half-
       life".)  Valid values are specified in hours, default 168.
       If the value is set to 0, the usage is not decayed.

   usage_weight_list
       Grid  Engine accounts for the consumption of the resources CPU-time, memory and IO to determine the usage
       which is imposed on a system by a job. A single usage value is computed from these three input parameters
       by  multiplying  the  individual  values  by  weights  and adding them up. The weights are defined in the
       usage_weight_list. The format of the list is

              cpu=wcpu,mem=wmem,io=wio

       where wcpu, wmem and wio are the configurable weights. The weights are real numbers. The sum of all three
       weights should be 1.  The default is cpu=1,mem=0,io=0.
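
       For example, to weight CPU usage and memory usage while ignoring IO (note the weights sum to 1):

              usage_weight_list            cpu=0.6,mem=0.4,io=0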

   compensation_factor
       Determines  how  fast  Grid  Engine should compensate for past usage below or above the share entitlement
       defined in the share tree. Recommended values are between 2 and 10, where 10 means  faster  compensation.
       The default is 5.

   weight_user
       The relative importance of the user shares in the functional policy.  Values are of type real.

   weight_project
       The relative importance of the project shares in the functional policy.  Values are of type real.

   weight_department
       The relative importance of the department shares in the functional policy. Values are of type real.

   weight_job
       The relative importance of the job shares in the functional policy. Values are of type real.

   weight_tickets_functional
       The  maximum  number  of  functional  tickets  available  for distribution by Grid Engine. Determines the
       relative importance of the  functional  policy.   See  under  sge_priority(5)  for  an  overview  on  job
       priorities.

   weight_tickets_share
       The  maximum  number  of  share  based  tickets available for distribution by Grid Engine. Determines the
       relative importance of the  share  tree  policy.  See  under  sge_priority(5)  for  an  overview  on  job
       priorities.

   weight_deadline
       The  weight  applied  on  the  remaining  time  until  a job's latest start time. Determines the relative
       importance of the deadline. See under sge_priority(5) for an overview on job priorities.

   weight_waiting_time
       The weight applied on the job's waiting time since submission. Determines the relative importance of  the
       waiting time.  See under sge_priority(5) for an overview on job priorities.

   weight_urgency
       The  weight  applied  on jobs' normalized urgency when determining the priority finally used.  Determines
       the relative importance of urgency.  See under sge_priority(5) for an overview on job priorities.

   weight_priority
       The weight applied on jobs' normalized  POSIX  priority  when  determining  the  priority  finally  used.
       Determines  the  relative importance of POSIX priority.  See under sge_priority(5) for an overview on job
       priorities.

   weight_ticket
       The weight applied  on  the  normalized  ticket  amount  when  determining  the  priority  finally  used.
       Determines  the  relative importance of the ticket policies. See under sge_priority(5) for an overview on
       job priorities.

   flush_finish_sec
       This parameter is provided for tuning the system's scheduling behavior.  By default, a scheduler run
       is triggered at each schedule_interval. When this parameter is set to 1 or larger, the scheduler will
       be triggered that number of seconds after a job has finished. Setting this parameter to 0 disables the
       flush after a job has finished.

   flush_submit_sec
       This parameter is provided for tuning the system's scheduling behavior.  By default, a scheduler run
       is triggered at each schedule_interval.  When this parameter is set to 1 or larger, the scheduler will
       be triggered that number of seconds after a job was submitted to the system. Setting this parameter to
       0 disables the flush after a job was submitted.

   schedd_job_info
       The default scheduler can keep track of why jobs could not be scheduled during the  last  scheduler  run.
       This parameter enables or disables the observation.  The value true enables the monitoring; false
       turns it off.

       It is also possible to activate the observation only for certain jobs. This will be done if the parameter
       is set to job_list followed by a comma-separated list of job ids.

       The user can obtain the collected information with the command qstat -j.
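
       For example, to collect scheduling information only for two specific jobs (the job ids here are
       illustrative):

              schedd_job_info              job_list 1031,1032

       The collected reasons can then be retrieved with qstat -j 1031.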

   params
       This  is  for  passing  additional  parameters  to  the  Grid  Engine scheduler. The following values are
       recognized:

       DURATION_OFFSET
               If set, overrides the default value of 60 seconds.  This parameter is  used  by  the  Grid  Engine
              scheduler  when planning resource utilization as the delta between net job runtimes and total time
              until resources become available again. Net job runtime as  specified  with  -l  h_rt=...   or  -l
              s_rt=...  or default_duration always differs from total job runtime due to delays before and after
              actual job start and finish. Among the delays before job start is the time  until  the  end  of  a
              schedule_interval,  the  time  it takes to deliver a job to sge_execd(8), and the delays caused by
              prolog in queue_conf(5), start_proc_args in sge_pe(5) and starter_method  in  queue_conf(5).   The
              delays after job finish include those due to a forced job termination (notify, terminate_method or
              checkpointing), procedures run after actual job finish, such as  stop_proc_args  in  sge_pe(5)  or
              epilog in queue_conf(5), and the delay until a new schedule_interval.
              If  the  offset  is too low, resource reservations (see max_reservation) can be delayed repeatedly
              due to an overly optimistic job circulation time.

       JC_FILTER
              Note: Deprecated, may be removed in future release.
              If set to true, the scheduler limits the number of jobs it looks at during a  scheduling  run.  At
              the beginning of the scheduling run it assigns each job a specific category, which is based on the
              job's requests, priority settings, and the job owner. All scheduling policies will assign the same
               importance to each job in one category. Therefore the jobs within a category are kept in  FIFO
               order, and the number considered can be limited to the number of free slots in the system.

              An exception is jobs which request a resource reservation. They are  included  regardless  of  the
              number of jobs in a category.

              This  setting is turned off by default, because in very rare cases, the scheduler can make a wrong
              decision. It is also advised to turn report_pjob_tickets off.  Otherwise  qstat  -ext  can  report
              outdated  ticket  amounts.  The information shown with a qstat -j for a job that was excluded in a
              scheduling run is very limited.

       PROFILE
              If set equal to 1, the scheduler logs profiling information summarizing each scheduling run.

       MONITOR
              If set  equal  to  1,  the  scheduler  records  information  for  each  scheduling  run,  enabling
              reproduction of job resource utilization in the file <sge_root>/<cell>/common/schedule.

       PE_RANGE_ALG
              This  parameter  sets  the algorithm for the PE range computation. The default is automatic, which
              means that the scheduler will select the best one, and it should not be necessary to change it  to
              a  different  setting in normal operation. If a custom setting is needed, the following values are
              available:
              auto: the scheduler selects the best algorithm
              least: starts the resource matching with the lowest slot amount first
              bin: starts the resource matching in the middle of the pe slot range
              highest: starts the resource matching with the highest slot amount first.

       Changing params will take immediate effect.  The default for params is none.
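
       For example, to make the scheduler record the information needed to reproduce its resource utilization
       decisions in <sge_root>/<cell>/common/schedule:

              params                       MONITOR=1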

   reprioritize_interval
       Interval (HH:MM:SS) to reprioritize jobs on the execution hosts based on the current  ticket  amount  for
       the  running  jobs.  If  the  interval is set to 00:00:00 the reprioritization is turned off. The default
        value is 00:00:00.  The reprioritization tickets are calculated by the scheduler,  and  update  events
        for running jobs are only sent after the scheduler has calculated new values. How often the  scheduler
        calculates the tickets is defined by the reprioritize_interval.  Because the scheduler is  only  trig-
        gered at a specific interval (schedule_interval), the reprioritize_interval is  only  meaningful  when
        set greater than the schedule_interval.  For example, if the schedule_interval is  2  minutes  and the
        reprioritize_interval is set to 10 seconds, the jobs get re-prioritized every 2 minutes.
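
       For example, to reprioritize running jobs every ten minutes:

              reprioritize_interval        00:10:00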

   report_pjob_tickets
       This parameter allows tuning the system's scheduling run time. It is used to enable/disable the reporting
       of pending job tickets to the qmaster.  It does not influence the tickets calculation. The sort order  of
       jobs in qstat and qmon is only based on the submit time when the reporting is turned off.
       The reporting should be turned off in a system with a very large amount of jobs by setting this parameter
       to "false".

   halflife_decay_list
        The halflife_decay_list allows configuring different decay rates for the finished-job usage types that
        are used in the pending job ticket calculation to account for jobs which have just ended. This  allows
        the pending jobs algorithm to count finished jobs against a user or  project  for  a  configurable
        decay time period. This feature is turned off by default, and the halftime is used instead.
       The  halflife_decay_list  also  allows  one  to configure different decay rates for each usage type being
       tracked (cpu, io, and mem). The list is specified in the following format:

              usage_type=time[:usage_type=time[:usage_type=time]]

       usage_type can be one of cpu, io, or mem.  time can be -1, 0 or a timespan specified in minutes. If  time
       is -1, only the usage of currently running jobs is used. 0 means that the usage is not decayed.
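
       For example, to count only the CPU usage of currently running jobs, decay memory usage over 60 minutes,
       and leave IO usage undecayed:

              halflife_decay_list          cpu=-1:mem=60:io=0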

   policy_hierarchy
       This  parameter  sets  up  a  dependency  chain of ticket-based policies. Each ticket-based policy in the
       dependency chain is influenced by the previous policies and influences the following policies. A  typical
       scenario is to assign precedence for the override policy over the share-based policy. The override policy
       determines in such a case how share-based tickets are assigned among jobs of the same  user  or  project.
       Note  that  all  policies  contribute to the ticket amount assigned to a particular job regardless of the
       policy hierarchy definition. Yet the tickets calculated  in  each  of  the  policies  can  be  different,
       depending on "POLICY_HIERARCHY".

       The  "POLICY_HIERARCHY"  parameter  can  be  an  up to 3 letter combination of the first letters of the 3
       ticket based policies S(hare-based), F(unctional) and  O(verride).  So  a  value  "OFS"  means  that  the
       override  policy  takes  precedence  over the functional policy, which finally influences the share-based
       policy.  Less than 3 letters means that some of the policies do not influence other policies and also are
       not  influenced  by  other  policies.  So a value of "FS" means that the functional policy influences the
       share-based policy and that there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.
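
       For example, to let the override policy take precedence over the functional policy, which in turn
       influences the share-based policy:

              policy_hierarchy             OFS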

   share_override_tickets
        If set to "true" or "1", override tickets of any override object instance are shared  equally  among
        all running jobs associated with the object. Pending jobs get as many override tickets as they  would
        have if they were running. If set to "false" or "0", each job gets the full value  of  the  override
        tickets associated with the object. The default value is "true".

   share_functional_shares
        If set to "true" or "1", functional shares of any functional object instance are shared among all the
        jobs associated with the object. If set to "false" or "0", each job associated  with  a  functional
        object gets the full functional shares of that object. The default value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in the functional policy.  The default value is 200.

   max_pending_tasks_per_job
       The  maximum  number  of  subtasks  per  pending array job to schedule. This parameter exists in order to
       reduce scheduling overhead. The default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule interval.

        When a runnable job cannot be started due to a shortage of resources, a reservation can be  scheduled
        instead. A reservation can cover consumable resources with the global host, any execution  host,  and
        any
       queue. For parallel jobs reservations are done also for the slots resource  as  specified  in  sge_pe(5).
       The  top  max_reservation  jobs  (in  priority  order) are considered, not individual resources.  The job
       runtime assumed is the maximum of the time specified with -l h_rt=...  or -l s_rt=...  For jobs that have
       neither of them, the default_duration (see below) is assumed.

       Reservations  prevent  jobs of lower priority as specified in sge_priority(5) from utilizing the reserved
       resource quota during the time of reservation.  Jobs of lower  priority  are  allowed  to  utilize  those
       reserved  resources  only  if  their  prospective  job  end  is  before  the  start  of  the  reservation
       ("backfilling").  Reservation is done only for non-immediate jobs (-now no) that request reservation  (-R
       y). If max_reservation is set to "0" no job reservation is done.

       max_reservation  actually  has a more general effect on scheduler look-ahead, and it is necessary to turn
       it on for correct backfilling into calendar windows (see calendar_conf(5)).

        Note that reservation scheduling can be expensive, and hence it is switched off by default. Since the
        cost of reservation scheduling is known to grow with the
       number of pending jobs, the use of the -R y option is recommended only for those  jobs  actually  queuing
       for  bottleneck  resources.   Together  with the max_reservation parameter, this technique can be used to
       narrow down performance impacts.  A JSV can be used to add reservation requests for particular resources,
       such as large parallel jobs.
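
       For example, to allow reservations for up to 20 of the highest-priority pending jobs:

              max_reservation              20

       A job then requests a reservation at submission time, e.g. qsub -R y -l h_rt=8:0:0 ... (the runtime
       limit here is illustrative).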

   default_duration
       When job reservation is enabled through the max_reservation sched_conf(5) parameter, the default_duration
       is assumed as runtime for jobs that have neither -l h_rt=...  nor -l s_rt=...  specified. In contrast  to
       an  h_rt/s_rt  time  limit,  the  default_duration  is  not enforced.  The default value is INFINITY, and
       reservation is not effective for jobs which get that value, i.e. the value must be finite, or  jobs  must
       specify a run time.
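
       For example, to assume an eight-hour runtime for reservation planning when jobs specify neither
       h_rt nor s_rt:

              default_duration             8:0:0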

FILES

       <sge_root>/<cell>/common/sched_configuration
                  scheduler thread configuration

SEE ALSO

       sge_intro(1),   qalter(1),   qconf(1),   qstat(1),   qsub(1),  complex(5),  queue_conf(5),  sge_execd(8),
       sge_qmaster(8)

       See sge_intro(1) for a full statement of rights and permissions.