Provided by: gridengine-common_6.2~beta2-2_all


       sched_conf - Grid Engine default scheduler configuration file


       sched_conf  defines  the  configuration  file  format for Grid Engine’s
       scheduler. In order to modify the configuration, use the graphical
       user interface qmon(1) or the -msconf option of the qconf(1) command.
       A default configuration is  provided  together  with  the  Grid  Engine
       distribution package.

       Note, Grid Engine allows backslashes (\) to be used to escape newline
       (\newline) characters. The backslash and the newline are replaced with
       a space (" ") character before any interpretation.
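       The continuation rule above can be sketched in a few lines of Python.
       This is only an illustration of the replacement described in the note,
       not Grid Engine's own parser; the function name and the sample line are
       hypothetical.

```python
def join_continued_lines(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical two-line entry that uses a backslash to continue the line.
raw = "key value_part1,\\\nvalue_part2\n"
joined = join_continued_lines(raw)
print(joined)
```

       Because the pair is replaced by a space, a continuation inside a value
       that must not contain blanks would introduce one.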


       The following parameters are recognized by the Grid Engine scheduler if
       present in sched_conf:

       Allows for the selection of alternative scheduling algorithms.

       Currently default is the only allowed setting.

       A simple algebraic expression used to derive  a  single  weighted  load
       value  from  all or part of the load parameters reported by ge_execd(8)
       for each host and from all or part of  the  consumable  resources  (see
       complex(5))   being   maintained  for  each  host.   The  load  formula
       expression syntax is that of a summation of weighted load values, that is:


       Note, no blanks are allowed in the load formula.
       The   load  values  and  consumable  resources  (load_val1,  ...)   are
       specified by the name defined in the complex (see complex(5)).
       Note: Administrator defined load values (see the load_sensor parameter
       in ge_conf(5) for details) and consumable resources available for all
       hosts (see complex(5)) may be used as well as Grid Engine default load
       parameters.
       The weighting factors (w1, ...) are positive integers. After the
       expression is evaluated for each host, the results are assigned to the
       hosts and are used to sort the hosts according to the weighted
       load. The sorted host list is used to sort queues subsequently.
       The default load formula is "np_load_avg".
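       As a rough illustration of how a weighted load value might be derived
       and used for sorting, consider the Python sketch below. It does not
       parse the load formula syntax; it assumes the formula has already been
       broken into (weight, load value name) pairs. The host names, the load
       figures and the ascending sort order (least loaded host first) are
       assumptions made for the example.

```python
def weighted_load(terms, load_values):
    """Sum weight * load value over all terms of an already-parsed formula."""
    return sum(weight * load_values[name] for weight, name in terms)


def sort_hosts(terms, hosts):
    """Return host names ordered by weighted load (ascending is assumed)."""
    return sorted(hosts, key=lambda host: weighted_load(terms, hosts[host]))


# Hypothetical hosts and reported load values.
hosts = {
    "hostA": {"np_load_avg": 0.80},
    "hostB": {"np_load_avg": 0.25},
    "hostC": {"np_load_avg": 0.50},
}

# Stand-in for the default load formula "np_load_avg" with weight 1.
terms = [(1, "np_load_avg")]
ranked = sort_hosts(terms, hosts)
print(ranked)
```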

       The load, which is imposed by the Grid Engine jobs running on a  system
       varies  in time, and often, e.g. for the CPU load, requires some amount
       of time to be reported in the appropriate  quantity  by  the  operating
       system.  Consequently, if a job was started very recently, the reported
       load may not provide a sufficient representation of the load  which  is
       already  imposed  on that host by the job. The reported load will adapt
       to the real load over time, but  the  period  of  time,  in  which  the
       reported  load  is  too low, may already lead to an oversubscription of
       that  host.  Grid  Engine   allows   the   administrator   to   specify
       job_load_adjustments  which  are  used  in the Grid Engine scheduler to
       compensate for this problem.
       The job_load_adjustments are specified as a  comma  separated  list  of
       arbitrary  load parameters or consumable resources and (separated by an
       equal sign) an associated load correction  value.  Whenever  a  job  is
       dispatched  to  a  host  by  the  scheduler,  the  load  parameter  and
       consumable value set of that host is increased by the  values  provided
       in  the  job_load_adjustments list. These correction values are decayed
       linearly over time  until  after  load_adjustment_decay_time  from  the
       start the corrections reach the value 0. If the job_load_adjustments
       list is assigned the special value NONE, no load corrections are
       performed.
       The adjusted load and consumable values are used to compute the
       combined and weighted load of the  hosts  with  the  load_formula  (see
       above)  and  to compare the load and consumable values against the load
       threshold   lists   defined   in   the   queue   configurations    (see
       queue_conf(5)).  If the load_formula consists simply of the default CPU
       load average parameter np_load_avg, and if the jobs  are  very  compute
       intensive,  one  might  want  to  set  the job_load_adjustments list to
       np_load_avg=1.00, which means that every new job dispatched to  a  host
       will  require  100 % CPU time, and thus the machine’s load is instantly
       increased by 1.00.

       The load corrections  in  the  "job_load_adjustments"  list  above  are
       decayed  linearly  over time from the point of the job start, where the
       corresponding load or  consumable  parameter  is  raised  by  the  full
       correction     value,     until     after     a    time    period    of
       "load_adjustment_decay_time", where the correction  becomes  0.  Proper
       values for "load_adjustment_decay_time" greatly depend upon the load or
       consumable  parameters  used  and  the  specific  operating  system(s).
       Therefore, they can only be determined on-site and experimentally.  For
       the default np_load_avg load parameter  a  "load_adjustment_decay_time"
       of 7 minutes has proven to yield reasonable results.
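       The linear decay described above can be sketched as follows. The
       function names and the use of seconds as the time unit are assumptions
       for illustration; the sketch only mirrors the behavior stated in the
       text (full correction at job start, 0 after the decay time).

```python
def remaining_correction(correction, elapsed, decay_time):
    """Linearly decay a load correction from its full value down to 0."""
    if decay_time <= 0 or elapsed >= decay_time:
        return 0.0
    if elapsed <= 0:
        return float(correction)
    return correction * (1.0 - elapsed / decay_time)


def adjusted_load(reported, correction, elapsed, decay_time):
    """Reported load plus whatever is left of the correction."""
    return reported + remaining_correction(correction, elapsed, decay_time)


# np_load_avg=1.00 with a decay time of 7 minutes (420 seconds).
decay = 7 * 60
at_start = adjusted_load(0.20, 1.00, 0, decay)
halfway = adjusted_load(0.20, 1.00, decay / 2, decay)
after = adjusted_load(0.20, 1.00, decay, decay)
print(at_start, halfway, after)
```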

       The  maximum  number of jobs any user may have running in a Grid Engine
       cluster at the same time. If set to 0 (default) the users  may  run  an
       arbitrary number of jobs.

       At the time the scheduler thread initially registers with the event
       master thread in the ge_qmaster(8) process, schedule_interval is used
       to set the time interval in which the event master thread sends
       scheduling event updates to the scheduler thread. A scheduling event
       is a status
       change  that  has  occurred  within  ge_qmaster(8) which may trigger or
       affect scheduler decisions (e.g.  a  job  has  finished  and  thus  the
       allocated resources are available again).
       In  the Grid Engine default scheduler the arrival of a scheduling event
       report triggers a scheduler run. The scheduler waits for event reports
       otherwise.
       The schedule_interval is a time value (see queue_conf(5) for a definition
       of the syntax of time values).

       This parameter determines in which order several criteria are taken
       into account to produce a sorted queue list. Currently, two settings
       are valid: seqno and load. However, in both cases Grid Engine attempts
       to maximize the number of soft requests (see qsub(1) -s option) being
       fulfilled by the queues for a particular job as the primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Grid Engine
       will use the seq_no parameter as configured in the current queue
       configurations (see queue_conf(5)) as the next criterion to sort the
       queue list. The load_formula (see above) only has a meaning if two
       queues have equal sequence numbers. If queue_sort_method is set to
       load, the load according to the load_formula is the criterion after
       maximizing a job’s soft requests, and the sequence number is only used
       if two hosts have the same load. The sequence number sorting is most
       useful if you want to define a fixed order in which queues are to be
       filled (e.g. the cheapest resource first).

       The default for this parameter is load.
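       The two orderings can be pictured as different sort keys. The Python
       sketch below is an illustration only: the queue records and their
       field names are hypothetical, and it assumes that lower sequence
       numbers and lower loads sort first.

```python
def sort_queues(queues, method="load"):
    """Order queues by soft requests fulfilled, then by the chosen method."""
    if method == "seqno":
        key = lambda q: (-q["soft_matched"], q["seq_no"], q["load"])
    elif method == "load":
        key = lambda q: (-q["soft_matched"], q["load"], q["seq_no"])
    else:
        raise ValueError("queue_sort_method must be 'seqno' or 'load'")
    return sorted(queues, key=key)


# Hypothetical queues: soft requests fulfilled, sequence number, load.
queues = [
    {"name": "a.q", "soft_matched": 1, "seq_no": 20, "load": 0.10},
    {"name": "b.q", "soft_matched": 1, "seq_no": 10, "load": 0.90},
    {"name": "c.q", "soft_matched": 2, "seq_no": 30, "load": 0.50},
]

by_seqno = [q["name"] for q in sort_queues(queues, "seqno")]
by_load = [q["name"] for q in sort_queues(queues, "load")]
print(by_seqno, by_load)
```

       In both orderings c.q comes first because it fulfills the most soft
       requests; the remaining two queues swap places depending on the method.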

       When  executing  under a share based policy, the scheduler "ages" (i.e.
       decreases) usage to implement a sliding window for achieving the  share
       entitlements  as  defined  by  the share tree. The halftime defines the
       time interval in which accumulated usage will have been decayed to half
       its original value. Valid values are specified in hours or according to
       the time format as specified in queue_conf(5).
       If the value is set to 0, the usage is not decayed.
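       A minimal sketch of the halftime idea, assuming exponential decay
       (the text only states that usage is halved after one halftime, so the
       exact curve is an assumption) and hours as the time unit:

```python
def decayed_usage(usage, elapsed_hours, halftime_hours):
    """Halve the usage once per halftime; a halftime of 0 disables decay."""
    if halftime_hours == 0:
        return usage
    return usage * 0.5 ** (elapsed_hours / halftime_hours)


# Hypothetical halftime of one week (168 hours).
after_one = decayed_usage(100.0, 168, 168)
after_two = decayed_usage(100.0, 336, 168)
no_decay = decayed_usage(100.0, 336, 0)
print(after_one, after_two, no_decay)
```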

       Grid Engine accounts for the consumption  of  the  resources  CPU-time,
       memory  and IO to determine the usage which is imposed on a system by a
       job. A single usage value is computed from these three input parameters
       by multiplying the individual values by weights and adding them up. The
       weights are defined in the usage_weight_list. The format of the list is


       where wcpu, wmem and wio are the configurable weights. The weights are
       real numbers. The sum of all three weights should be 1.
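       The computation of the single usage value can be sketched as a plain
       weighted sum. The function names and the sample figures below are
       hypothetical; units of the individual usage values are not addressed.

```python
import math


def combined_usage(cpu, mem, io, wcpu, wmem, wio):
    """Multiply each usage value by its weight and add the results."""
    return wcpu * cpu + wmem * mem + wio * io


def weights_sum_to_one(wcpu, wmem, wio):
    """Check the recommendation that the three weights add up to 1."""
    return math.isclose(wcpu + wmem + wio, 1.0)


# Hypothetical usage values and weights.
usage = combined_usage(cpu=100.0, mem=20.0, io=4.0,
                       wcpu=0.5, wmem=0.25, wio=0.25)
print(usage, weights_sum_to_one(0.5, 0.25, 0.25))
```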

       Determines how fast Grid Engine should compensate for past usage below
       or above the share entitlement defined in the share tree. Recommended
       values are between 2 and 10, where 10 means faster compensation.

       The relative importance of the user shares in  the  functional  policy.
       Values are of type real.

       The relative importance of the project shares in the functional policy.
       Values are of type real.

       The relative importance of the  department  shares  in  the  functional
       policy. Values are of type real.

       The  relative  importance  of  the job shares in the functional policy.
       Values are of type real.

       The maximum number of functional tickets available for distribution  by
       Grid  Engine.  Determines  the  relative  importance  of the functional
       policy.  See under sge_priority(5) for an overview on job priorities.

       The maximum number of share based tickets available for distribution by
       Grid  Engine.  Determines  the  relative  importance  of the share tree
       policy. See under sge_priority(5) for an overview on job priorities.

       The weight applied to the remaining time until a job’s latest start
       time.  Determines  the  relative  importance of the deadline. See under
       sge_priority(5) for an overview on job priorities.

       The weight applied to a job’s waiting time since submission.
       Determines  the  relative  importance  of  the waiting time.  See under
       sge_priority(5) for an overview on job priorities.

       The weight applied to a job’s normalized urgency when determining the
       final priority. Determines the relative importance of urgency. See
       under sge_priority(5) for an overview on job priorities.

       The weight applied to the normalized ticket amount when determining
       the final priority. Determines the relative importance of the ticket
       policies. See under sge_priority(5) for an overview on job priorities.

       This parameter is provided for tuning the system’s scheduling
       behavior. By default, a scheduler run is triggered at the scheduler
       interval. When this parameter is set to 1 or larger, the scheduler will
       be triggered that many seconds after a job has finished. Setting this
       parameter to 0 disables the flush after a job has finished.

       This parameter is provided for tuning the system’s scheduling
       behavior. By default, a scheduler run is triggered at the scheduler
       interval. When this parameter is set to 1 or larger, the scheduler
       will be triggered that many seconds after a job was submitted to the
       system. Setting this parameter to 0 disables the flush after a job was
       submitted.

       The default scheduler can keep track of why jobs could not be scheduled
       during the last scheduler run. This parameter enables or disables the
       observation. The value true enables the monitoring; false turns it off.

       It  is also possible to activate the observation only for certain jobs.
       This will be done if the parameter is set to  job_list  followed  by  a
       comma separated list of job ids.

       The  user  can  obtain the collected information with the command qstat

       This parameter is intended for passing additional parameters to the
       Grid Engine
       scheduler. The following values are recognized:

              If set, overrides the default value of 60 seconds. This
              parameter is used by the Grid  Engine  scheduler  when  planning
              resource  utilization  as the delta between net job runtimes and
              total time until  resources  become  available  again.  Net  job
              runtime  as  specified  with  -l  h_rt=...   or  -l  s_rt=... or
              default_duration always differs from total job  runtime  due  to
              delays  before  and after actual job start and finish. Among the
              delays before  job  start  is  the  time  until  the  end  of  a
              schedule_interval,  the  time  it  takes  to  deliver  a  job to
              sge_execd(8) and the delays caused by prolog in queue_conf(5),
              start_proc_args in sge_pe(5) and starter_method in
              queue_conf(5). The delays after job finish include delays due to
              a forced job termination (notify, terminate_method or
              checkpointing), procedures run after actual job finish, such as
              stop_proc_args in sge_pe(5) or epilog in queue_conf(5), and the
              delay until a
              new schedule_interval.
              If  the  offset  is  too   low,   resource   reservations   (see
              max_reservation)  can  be  delayed  repeatedly  due to an overly
              optimistic job circulation time.

              If set to true, the scheduler limits the number of jobs it looks
              at  during  a scheduling run. At the beginning of the scheduling
              run it assigns each job a specific category, which is  based  on
              the  job’s  requests,  priority settings, and the job owner. All
              scheduling policies will assign the same importance to each job
              in one category. Therefore the jobs within a category have a
              FIFO order, and their number can be limited to the number of
              free slots in the system.

              An exception is jobs that request a resource reservation. They
              are included regardless of the number of jobs in a category.

              This setting is turned off by default because, in very rare
              cases, the scheduler can make a wrong decision. It is also
              advised to turn report_pjob_tickets off. Otherwise qstat -ext
              can report outdated ticket amounts. The information shown with
              qstat -j for a job that was excluded in a scheduling run is
              very limited.

              If  set  equal  to  1,  the scheduler logs profiling information
              summarizing each scheduling run.

              If set equal to 1, the scheduler records information for each
              scheduling run that allows job resource utilization to be
              reproduced, in the file <ge_root>/<cell>/common/schedule.

              This parameter sets the algorithm for the pe range  computation.
              The  default  is  automatic, which means that the scheduler will
              select the best one, and it should not be necessary to change it
              to  a different setting in normal operation. If a custom setting
              is needed, the following values are available:
              auto    : the scheduler selects the best algorithm
              least   : starts the resource matching with the lowest slot
                        amount first
              bin     : starts the resource matching in the middle of the
                        pe slot range
              highest : starts the resource matching with the highest slot
                        amount first

       Changing  params will take immediate effect.  The default for params is

       Interval (HH:MM:SS) to reprioritize jobs on the execution  hosts  based
       on  the  current ticket amount for the running jobs. If the interval is
       set to 00:00:00 the reprioritization is turned off. The  default  value
       is 00:00:00.

       This parameter allows tuning of the system’s scheduling run time. It is
       used to enable or disable the reporting of pending job tickets to the
       qmaster. It does not influence the ticket calculation. When the
       reporting is turned off, the sort order of jobs in qstat and qmon is
       based only on the submit time.
       The reporting should be turned off in a system with a very large number
       of jobs by setting this parameter to "false".

       The halflife_decay_list allows different decay rates to be configured
       for the "finished_jobs" usage types, which are used in the pending job
       ticket calculation to account for jobs which have just ended. This
       allows the pending jobs algorithm to count finished jobs against a user
       or project for a configurable decayed time period. This feature is
       turned off by default, and the halftime is used instead.
       The  halflife_decay_list  also  allows one to configure different decay
       rates for each usage type being tracked (cpu, io, and mem). The list is
       specified in the following format:


       <Usage_TYPE> can be one of the following: cpu, io, or mem.
       <TIME>  can  be  -1, 0 or a timespan specified in minutes. If <TIME> is
       -1, only the usage of currently running jobs is used. 0 means that  the
       usage is not decayed.

       This  parameter  sets  up  a dependency chain of ticket based policies.
       Each ticket based policy in the dependency chain is influenced  by  the
       previous  policies  and  influences  the  following policies. A typical
       scenario is to assign precedence  for  the  override  policy  over  the
       share-based  policy.  The override policy determines in such a case how
       share-based tickets are  assigned  among  jobs  of  the  same  user  or
       project.   Note  that  all  policies  contribute  to  the ticket amount
       assigned to  a  particular  job  regardless  of  the  policy  hierarchy
       definition.  Yet  the tickets calculated in each of the policies can be
       different depending on "POLICY_HIERARCHY".

       The "POLICY_HIERARCHY" parameter can be a combination of up to 3 of
       the  first  letters  of  the  3  ticket  based  policies S(hare-based),
       F(unctional) and O(verride). So a value "OFS" means that  the  override
       policy  takes  precedence  over  the  functional  policy, which finally
       influences the share-based policy.  Less than 3 letters mean that  some
       of  the  policies  do  not  influence  other  policies and also are not
       influenced by other policies.  So  a  value  of  "FS"  means  that  the
       functional  policy  influences the share-based policy and that there is
       no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.
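       How such a value reads as a dependency chain can be sketched in
       Python. The function below is illustrative only: its name is
       hypothetical, and rejecting repeated letters is an assumption that the
       text does not state explicitly.

```python
POLICIES = {"S": "share-based", "F": "functional", "O": "override"}


def parse_policy_hierarchy(value):
    """Turn a value such as "OFS" into an ordered list of policy names."""
    if value == "NONE":
        return []  # policy hierarchies are switched off
    if not 1 <= len(value) <= 3:
        raise ValueError("expected 1 to 3 letters or NONE")
    if len(set(value)) != len(value):
        raise ValueError("repeated policy letter (assumed to be invalid)")
    chain = []
    for letter in value:
        if letter not in POLICIES:
            raise ValueError("unknown policy letter: " + letter)
        chain.append(POLICIES[letter])
    return chain


# Earlier entries in the list take precedence over later ones.
print(parse_policy_hierarchy("OFS"))
print(parse_policy_hierarchy("FS"))
print(parse_policy_hierarchy("NONE"))
```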

       If set to "true" or  "1",  override  tickets  of  any  override  object
       instance  are shared equally among all running jobs associated with the
       object. The pending jobs will get as many override tickets as they
       would have if they were running. If set to "false" or "0", each job
       gets the full value of the override tickets associated with the object.
       The default value is "true".
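       The difference between the two settings can be sketched as follows.
       The function name and ticket figures are hypothetical, and the sketch
       ignores everything else that contributes to a job's ticket amount.

```python
def override_tickets_per_job(object_tickets, job_count, share=True):
    """Tickets one job receives from a single override object."""
    if job_count <= 0:
        raise ValueError("job_count must be positive")
    if share:
        # Tickets of the object are split equally among its jobs.
        return object_tickets / job_count
    # Each job gets the full ticket value of the object.
    return float(object_tickets)


# Hypothetical object with 1000 override tickets and 4 associated jobs.
shared = override_tickets_per_job(1000, 4, share=True)
full = override_tickets_per_job(1000, 4, share=False)
print(shared, full)
```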

       If  set  to  "true"  or "1", functional shares of any functional object
       instance are shared among all the jobs associated with the  object.  If
       set to "false" or "0", each job associated with a functional object
       gets the full functional shares of that object. The  default  value  is

       The  maximum  number  of  pending  jobs  to  schedule in the functional
       policy.  The default value is 200.

       The maximum number of subtasks per pending array job to schedule.  This
       parameter  exists  in  order to reduce scheduling overhead. The default
       value is 50.

       The maximum number of reservations scheduled within a schedule
       interval. When a runnable job cannot be started due to a shortage of
       resources, a reservation can be scheduled instead. A reservation can
       cover consumable resources with the global host, any execution host and
       any queue. For parallel jobs, reservations are also done for the slots
       resource as specified in sge_pe(5). As job runtime, the maximum of the
       times specified with -l h_rt=... or -l s_rt=... is assumed. For jobs
       that have neither of them, the default_duration is assumed.
       Reservations prevent jobs of lower priority as specified in
       sge_priority(5) from utilizing the reserved resource quota for the
       duration of the reservation. Jobs of lower priority are allowed to
       utilize those reserved resources only if their prospective job end is
       before the start of the reservation (backfilling). Reservation is done
       only for non-immediate jobs (-now no) that request reservation (-R y).
       If max_reservation is set to "0", no job reservation is done.
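       The runtime assumption and the backfilling condition can be sketched
       in Python. The function names, the use of seconds, and the treatment
       of a job that ends exactly at the reservation start are assumptions
       made for the example.

```python
def assumed_runtime(h_rt, s_rt, default_duration):
    """Maximum of the specified limits, else the default duration."""
    limits = [t for t in (h_rt, s_rt) if t is not None]
    return max(limits) if limits else default_duration


def may_backfill(now, runtime, reservation_start):
    """True if the prospective job end does not pass the reservation start.

    Ending exactly at the reservation start is allowed here; the text
    does not say how that boundary case is handled.
    """
    return now + runtime <= reservation_start


# Hypothetical figures, all in seconds.
short_job = assumed_runtime(h_rt=600, s_rt=300, default_duration=3600)
unlimited = assumed_runtime(h_rt=None, s_rt=None, default_duration=3600)
print(may_backfill(0, short_job, 1800), may_backfill(0, unlimited, 1800))
```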

       Note that reservation scheduling can be costly in terms of performance,
       and hence reservation scheduling is switched off by default. Since
       the performance cost of reservation scheduling is known to grow with
       the number of pending jobs, use of the -R y option is recommended only
       for those jobs actually queuing for bottleneck resources. Together with
       the max_reservation parameter, this technique can be used to limit the
       performance impact.

       When job reservation is enabled through the max_reservation parameter
       of sched_conf(5), the default duration is assumed as runtime for jobs
       that have neither -l h_rt=... nor -l s_rt=... specified. In contrast to
       an h_rt/s_rt time limit, the default_duration is not enforced.


                 scheduler thread configuration


       ge_intro(1),   qalter(1),   qconf(1),  qstat(1),  qsub(1),  complex(5),
       queue_conf(5), ge_execd(8), ge_qmaster(8), Grid Engine Installation and
       Administration Guide


       See ge_intro(1) for a full statement of rights and permissions.