Provided by: gridengine-common_6.2u5-4_all bug

NAME

       sched_conf - Sun Grid Engine default scheduler configuration file

DESCRIPTION

       sched_conf  defines  the  configuration  file format for Sun Grid Engine's  scheduler.  In
       order to modify the configuration, use the  graphical  user's  interface  qmon(1)  or  the
       -msconf  option of the qconf(1) command. A default configuration is provided together with
       the Sun Grid Engine distribution package.

       Note, Sun Grid Engine  allows  backslashes  (\)  be  used  to  escape  newline  (\newline)
       characters. The backslash and the newline are replaced with a space (" ") character before
       any interpretation.

FORMAT

       The following parameters are recognized by the Sun Grid Engine  scheduler  if  present  in
       sched_conf:

   algorithm
       Note: Deprecated, may be removed in future release.
       Allows for the selection of alternative scheduling algorithms.

       Currently default is the only allowed setting.

   load_formula
       A simple algebraic expression used to derive a single weighted load value from all or part
       of the load parameters reported by sge_execd(8) for each host and from all or part of  the
       consumable  resources  (see  complex(5)) being maintained for each host.  The load formula
       expression syntax is that of a summation weighted load values, that is:

              {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

       Note, no blanks are allowed in the load formula.
       The load values and consumable resources (load_val1,  ...)   are  specified  by  the  name
       defined in the complex (see complex(5)).
       Note:  Administrator defined load values (see the load_sensor parameter in sge_conf(5) for
       details) and consumable resources available for all hosts (see complex(5)) may be used  as
       well as Sun Grid Engine default load parameters.
       The  weighting  factors (w1, ...) are positive integers. After the expression is evaluated
       for each host the results are assigned to the  hosts  and  are  used  to  sort  the  hosts
       corresponding  to  the  weighted  load.  The  sorted  host  list  is  used  to sort queues
       subsequently.
       The default load formula is "np_load_avg".

   job_load_adjustments
       The load, which is imposed by the Sun Grid Engine jobs running on a system varies in time,
       and  often,  e.g.  for  the  CPU  load, requires some amount of time to be reported in the
       appropriate quantity by the operating system. Consequently, if  a  job  was  started  very
       recently,  the reported load may not provide a sufficient representation of the load which
       is already imposed on that host by the job. The reported load will adapt to the real  load
       over time, but the period of time, in which the reported load is too low, may already lead
       to an oversubscription of that host. Sun Grid Engine allows the administrator  to  specify
       job_load_adjustments  which  are  used  in the Sun Grid Engine scheduler to compensate for
       this problem.
       The job_load_adjustments are specified  as  a  comma  separated  list  of  arbitrary  load
       parameters  or  consumable  resources  and (separated by an equal sign) an associated load
       correction value. Whenever a job is dispatched to  a  host  by  the  scheduler,  the  load
       parameter and consumable value set of that host is increased by the values provided in the
       job_load_adjustments list. These correction values are decayed linearly  over  time  until
       after load_adjustment_decay_time from the start the corrections reach the value 0.  If the
       job_load_adjustments list is assigned the special denominator NONE,  no  load  corrections
       are performed.
       The adjusted load and consumable values are used to compute the combined and weighted load
       of the hosts with the load_formula (see above) and to  compare  the  load  and  consumable
       values  against  the  load  threshold  lists  defined  in  the  queue  configurations (see
       queue_conf(5)).  If the load_formula consists simply  of  the  default  CPU  load  average
       parameter  np_load_avg,  and if the jobs are very compute intensive, one might want to set
       the job_load_adjustments  list  to  np_load_avg=1.00,  which  means  that  every  new  job
       dispatched to a host will require 100 % CPU time, and thus the machine's load is instantly
       increased by 1.00.

   load_adjustment_decay_time
       The load corrections in the "job_load_adjustments" list above are  decayed  linearly  over
       time from the point of the job start, where the corresponding load or consumable parameter
       is  raised  by   the   full   correction   value,   until   after   a   time   period   of
       "load_adjustment_decay_time",   where   the   correction  becomes  0.  Proper  values  for
       "load_adjustment_decay_time" greatly depend upon the load or  consumable  parameters  used
       and  the  specific operating system(s). Therefore, they can only be determined on-site and
       experimentally.  For the default np_load_avg load parameter a "load_adjustment_decay_time"
       of 7 minutes has proven to yield reasonable results.

   maxujobs
       The  maximum  number of jobs any user may have running in a Sun Grid Engine cluster at the
       same time. If set to 0 (default) the users may run an arbitrary number of jobs.

   schedule_interval
       At the time the scheduler thread  initially  registers  at  the  event  master  thread  in
       sge_qmaster(8)process  schedule_interval  is  used  to  set the time interval in which the
       event master thread sends scheduling event updates to the scheduler thread.  A  scheduling
       event  is  a  status  change  that has occurred within sge_qmaster(8) which may trigger or
       affect scheduler decisions (e.g. a job has finished and thus the allocated  resources  are
       available again).
       In the Sun Grid Engine default scheduler the arrival of a scheduling event report triggers
       a scheduler run. The scheduler waits for event reports otherwise.
       Schedule_interval is a time value (see queue_conf(5) for a definition  of  the  syntax  of
       time values).

   queue_sort_method
       This  parameter  determines  in  which  order  several  criteria are taken into account to
       product a sorted queue list. Currently, two settings are valid: seqno and load. However in
       both  cases, Sun Grid Engine attempts to maximize the number of soft requests (see qsub(1)
       -s option) being fulfilled by the queues for a particular as the primary criterion.
       Then, if the queue_sort_method parameter is set to seqno, Sun Grid  Engine  will  use  the
       seq_no  parameter as configured in the current queue configurations (see queue_conf(5)) as
       the next criterion to sort the queue list. The load_formula (see above) has only a meaning
       if  two  queues have equal sequence numbers.  If queue_sort_method is set to load the load
       according the load_formula is the criterion after maximizing a job's soft requests and the
       sequence number is only used if two hosts have the same load.  The sequence number sorting
       is most useful if you want to define a fixed order in which queues are to be filled  (e.g.
       the cheapest resource first).

       The default for this parameter is load.

   halftime
       When  executing under a share based policy, the scheduler "ages" (i.e. decreases) usage to
       implement a sliding window for achieving the share entitlements as defined  by  the  share
       tree.  The  halftime  defines  the time interval in which accumulated usage will have been
       decayed to half its original value. Valid values are specified in hours  or  according  to
       the time format as specified in queue_conf(5).
       If the value is set to 0, the usage is not decayed.

   usage_weight_list
       Sun  Grid  Engine accounts for the consumption of the resources CPU-time, memory and IO to
       determine the usage which is imposed on a system  by  a  job.  A  single  usage  value  is
       computed from these three input parameters by multiplying the individual values by weights
       and adding them up. The weights are defined in the usage_weight_list. The  format  of  the
       list is

              cpu=wcpu,mem=wmem,io=wio

       where  wcpu,  wmem  and wio are the configurable weights. The weights are real number. The
       sum of all tree weights should be 1.

   compensation_factor
       Determines how fast Sun Grid Engine should compensate for past usage below  of  above  the
       share  entitlement  defined  in  the  share tree. Recommended values are between 2 and 10,
       where 10 means faster compensation.

   weight_user
       The relative importance of the user shares in the functional policy.  Values are  of  type
       real.

   weight_project
       The  relative  importance  of  the project shares in the functional policy.  Values are of
       type real.

   weight_department
       The relative importance of the department shares in the functional policy. Values  are  of
       type real.

   weight_job
       The  relative  importance  of  the job shares in the functional policy. Values are of type
       real.

   weight_tickets_functional
       The maximum number of functional tickets available for distribution by  Sun  Grid  Engine.
       Determines  the  relative  importance of the functional policy.  See under sge_priority(5)
       for an overview on job priorities.

   weight_tickets_share
       The maximum number of share based tickets available for distribution by Sun  Grid  Engine.
       Determines the relative importance of the share tree policy. See under sge_priority(5) for
       an overview on job priorities.

   weight_deadline
       The weight applied on the remaining time until a jobs latest start  time.  Determines  the
       relative  importance  of  the  deadline.  See under sge_priority(5) for an overview on job
       priorities.

   weight_waiting_time
       The weight applied on the jobs waiting time  since  submission.  Determines  the  relative
       importance  of  the  waiting  time.   See  under  sge_priority(5)  for  an overview on job
       priorities.

   weight_urgency
       The weight applied on jobs normalized urgency  when  determining  priority  finally  used.
       Determines  the relative importance of urgency.  See under sge_priority(5) for an overview
       on job priorities.

   weight_priority
       The weight applied on jobs normalized POSIX priority  when  determining  priority  finally
       used. Determines the relative importance of POSIX priority.  See under sge_priority(5) for
       an overview on job priorities.

   weight_ticket
       The weight applied on normalized ticket amount when  determining  priority  finally  used.
       Determines  the  relative importance of the ticket policies. See under sge_priority(5) for
       an overview on job priorities.

   flush_finish_sec
       The parameters are provided for tuning the system's scheduling behavior.   By  default,  a
       scheduler  run  is triggered in the scheduler interval. When this parameter is set to 1 or
       larger, the scheduler will be triggered x seconds after a job has finished.  Setting  this
       parameter to 0 disables the flush after a job has finished.

   flush_submit_sec
       The  parameters  are  provided for tuning the system's scheduling behavior.  By default, a
       scheduler run is triggered in the scheduler interval.  When this parameter is set to 1  or
       larger,  the  scheduler  will  be  triggered   x  seconds after a job was submitted to the
       system. Setting this parameter to 0 disables the flush after a job was submitted.

   schedd_job_info
       The default scheduler can keep track why jobs could  not  be  scheduled  during  the  last
       scheduler run. This parameter enables or disables the observation.  The value true enables
       the monitoring false turns it off.

       It is also possible to activate the observation only for certain jobs. This will  be  done
       if the parameter is set to job_list followed by a comma separated list of job ids.

       The user can obtain the collected information with the command qstat -j.

   params
       This  is  foreseen for passing additional parameters to the Sun Grid Engine scheduler. The
       following values are recognized:

       DURATION_OFFSET
              If set, overrides the default of value 60 seconds.  This parameter is used  by  the
              Sun  Grid  Engine scheduler when planning resource utilization as the delta between
              net job runtimes and total time until resources become  available  again.  Net  job
              runtime  as  specified  with -l h_rt=...  or -l s_rt=... or default_duration always
              differs from total job runtime due to delays before and after actual job start  and
              finish.  Among  the  delays  before  job  start  is  the  time  until  the end of a
              schedule_interval, the time it takes to deliver  a  job  to  sge_execd(8)  and  the
              delays  caused  by  prolog  in  queue_conf(5)  ,  start_proc_args  in sge_pe(5) and
              starter_method in queue_conf(5) . The delays after job finish include delays due to
              a  forced  job  termination (notify, terminate_method or checkpointing), procedures
              run after actual job finish, such as  stop_proc_args  in  sge_pe(5)  or  epilog  in
              queue_conf(5) , and the delay until a new schedule_interval.
              If  the  offset  is  too  low,  resource  reservations (see max_reservation) can be
              delayed repeatedly due to an overly optimistic job circulation time.

       JC_FILTER
              Note: Deprecated, may be removed in future release.
              If set to true, the scheduler limits the number  of  jobs  it  looks  at  during  a
              scheduling  run.  At  the  beginning  of  the  scheduling run it assigns each job a
              specific category, which is based on the job's requests, priority settings, and the
              job  owner.  All scheduling policies will assign the same importance to each job in
              one category. Therefore the number of jobs per category have a FIFO order  and  can
              be limited to the number of free slots in the system.

              A  exception  are  jobs,  which  request  a resource reservation. They are included
              regardless of the number of jobs in a category.

              This setting is turned off per default, because in very rare cases,  the  scheduler
              can  make  a  wrong  decision.  It is also advised to turn report_pjob_tickets off.
              Otherwise qstat -ext can report outdated ticket amounts. The information shown with
              a qstat -j for a job, that was excluded in a scheduling run, is very limited.

       PROFILE
              If  set  equal  to  1,  the  scheduler  logs profiling information summarizing each
              scheduling run.

       MONITOR
              If set equal to 1, the  scheduler  records  information  for  each  scheduling  run
              allowing    to    reproduce    job    resources    utilization    in    the    file
              <sge_root>/<cell>/common/schedule.

       PE_RANGE_ALG
              This parameter sets the algorithm for the pe  range  computation.  The  default  is
              automatic,  which  means that the scheduler will select the best one, and it should
              not be necessary to change it to a different setting  in  normal  operation.  If  a
              custom setting is needed, the following values are available:
              auto       : the scheduler selects the best algorithm
              least      : starts the resource matching with the lowest slot amount first
              bin        : starts the resource matching in the middle of the pe slot range
              highest    : starts the resource matching with the highest slot amount first

       Changing params will take immediate effect.  The default for params is none.

   reprioritize_interval
       Interval  (HH:MM:SS)  to  reprioritize  jobs  on  the execution hosts based on the current
       ticket  amount  for  the  running  jobs.  If  the  interval  is  set   to   00:00:00   the
       reprioritization  is  turned  off.  The  default  value is 00:00:00.  The reprioritization
       tickets are calculated by the scheduler and update events for running jobs are  only  sent
       after  the  scheduler  calculated  new values. How often the schedule should calculate the
       tickets is defined by the reprioritize_interval.  Because the scheduler is only  triggered
       in  a specific interval (scheduler_interval) this means the reprioritize_interval has only
       a  meaning  if  set  greater  than  the   scheduler_interval.    For   example,   if   the
       scheduler_interval is 2 minutes and reprioritize_interval is set to 10 seconds, this means
       the jobs get re-prioritized every 2 minutes.

   report_pjob_tickets
       This parameter allows to tune the system's scheduling run time. It is  used  to  enable  /
       disable  the  reporting  of pending job tickets to the qmaster.  It does not influence the
       tickets calculation. The sort order of jobs in qstat and qmon is only based on the  submit
       time, when the reporting is turned off.
       The reporting should be turned off in a system with a very large amount of jobs by setting
       this parameter to "false".

   halflife_decay_list
       The halflife_decay_list allows to configure different decay rates for  the  "finished_jobs
       usage types, which is used in the pending job ticket calculation to account for jobs which
       have just ended. This allows the user the pending jobs algorithm to  count  finished  jobs
       against  a  user or project for a configurable decayed time period. This feature is turned
       off by default, and the halftime is used instead.
       The halflife_decay_list also allows one to configure different decay rates for each  usage
       type being tracked (cpu, io, and mem). The list is specified in the following format:

              <USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]

       <Usage_TYPE> can be one of the following: cpu, io, or mem.
       <TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is -1, only the usage of
       currently running jobs is used. 0 means that the usage is not decayed.

   policy_hierarchy
       This parameter sets up a dependency chain of ticket  based  policies.  Each  ticket  based
       policy  in  the dependency chain is influenced by the previous policies and influences the
       following policies. A typical scenario is to assign precedence  for  the  override  policy
       over the share-based policy. The override policy determines in such a case how share-based
       tickets are assigned among jobs of the same user  or  project.   Note  that  all  policies
       contribute  to  the  ticket  amount  assigned to a particular job regardless of the policy
       hierarchy definition. Yet the tickets calculated in each of the policies can be  different
       depending on "POLICY_HIERARCHY".

       The  "POLICY_HIERARCHY" parameter can be a up to 3 letter combination of the first letters
       of the 3 ticket based policies S(hare-based), F(unctional)  and  O(verride).  So  a  value
       "OFS"  means  that  the override policy takes precedence over the functional policy, which
       finally influences the share-based policy.  Less than 3 letters  mean  that  some  of  the
       policies do not influence other policies and also are not influenced by other policies. So
       a value of "FS" means that the functional policy influences  the  share-based  policy  and
       that there is no interference with the other policies.

       The special value "NONE" switches off policy hierarchies.

   share_override_tickets
       If  set  to  "true"  or  "1",  override tickets of any override object instance are shared
       equally among all running jobs associated with the object. The pending jobs  will  get  as
       many  override  tickets,  as they would have, when they were running. If set to "false" or
       "0", each job gets the full value of the override tickets associated with the object.  The
       default value is "true".

   share_functional_shares
       If  set  to  "true" or "1", functional shares of any functional object instance are shared
       among all the jobs associated with the  object.  If  set  to  "false"  or  "0",  each  job
       associated  with  a functional object, gets the full functional shares of that object. The
       default value is "true".

   max_functional_jobs_to_schedule
       The maximum number of pending jobs to schedule in  the  functional  policy.   The  default
       value is 200.

   max_pending_tasks_per_job
       The maximum number of subtasks per pending array job to schedule. This parameter exists in
       order to reduce scheduling overhead. The default value is 50.

   max_reservation
       The maximum number of reservations scheduled within a schedule interval.  When a  runnable
       job  can  not  be  started  due  to a shortage of resources a reservation can be scheduled
       instead. A reservation can cover consumable resources with the global host, any  execution
       host  and  any  queue.  For parallel jobs reservations are done also for slots resource as
       specified in sge_pe(5).  As job runtime the maximum of the time specified with -l h_rt=...
       or  -l  s_rt=...  is  assumed.  For jobs that have neither of them the default_duration is
       assumed.  Reservations prevent jobs of lower priority as specified in sge_priority(5) from
       utilizing  the  reserved  resource  quota  during  the time of reservation.  Jobs of lower
       priority are allowed to utilize those reserved resources only if their prospective job end
       is  before  the start of the reservation (backfilling).  Reservation is done only for non-
       immediate jobs (-now no) that request reservation (-R y). If max_reservation is set to "0"
       no job reservation is done.

       Note,  that  reservation  scheduling  can  be  performance consuming and hence reservation
       scheduling  is  switched  off  by  default.  Since  reservation   scheduling   performance
       consumption  is  known  to grow with the number of pending jobs, the use of -R y option is
       recommended only for those jobs actually queuing for bottleneck resources.  Together  with
       the  max_reservation  parameter  this  technique  can  be  used to narrow down performance
       impacts.

   default_duration
       When job reservation  is  enabled  through  max_reservation  sched_conf(5)  parameter  the
       default  duration  is  assumed  as  runtime  for jobs that have neither -l h_rt=... nor -l
       s_rt=... specified. In contrast to a h_rt/s_rt time  limit  the  default_duration  is  not
       enforced.

FILES

       <sge_root>/<cell>/common/sched_configuration
                  scheduler thread configuration

SEE ALSO

       sge_intro(1),   qalter(1),   qconf(1),   qstat(1),   qsub(1),  complex(5),  queue_conf(5),
       sge_execd(8), sge_qmaster(8), Sun Grid Engine Installation and Administration Guide

COPYRIGHT

       See sge_intro(1) for a full statement of rights and permissions.