Provided by: slurm-client_15.08.7-1build1_amd64

NAME

       sdiag - Scheduling diagnostic tool for Slurm

SYNOPSIS

       sdiag

DESCRIPTION

       sdiag shows information related to slurmctld execution: threads, agents, jobs, and scheduling
       algorithms. The goal is to obtain data on slurmctld behaviour that helps in adjusting configuration
       parameters or queue policies. The main motivation is to understand Slurm behaviour on
       high-throughput systems.

       It has two execution modes. The default mode, --all, shows the counters and statistics explained below;
       the other mode, --reset, resets those values.

       Values are reset at midnight UTC time by default.

       The first block of information is related to global slurmctld execution:

       Server thread count
              The number of currently active slurmctld threads. A high number indicates a heavy load of events
              such as job submissions, job dispatches, and job completions. If this is often close to
              MAX_SERVER_THREADS it could point to a potential bottleneck.

       Agent queue size
              Slurm is designed with scalability in mind, and sending messages to thousands of nodes is not a
              trivial task. The agent mechanism helps to control communication between the slurm daemons and
              the controller on a best-effort basis. If this value is close to MAX_AGENT_CNT there could be
              some delays affecting job management.

       Jobs submitted
              Number of jobs submitted since last reset.

       Jobs started
              Number of jobs started since last reset. This includes backfilled jobs.

       Jobs completed
              Number of jobs completed since last reset.

       Jobs canceled
              Number of jobs canceled since last reset.

       Jobs failed
              Number of jobs failed since last reset.
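
       As an illustration of how the counters in this first block might be consumed by a monitoring script,
       the sketch below parses "label: value" lines and flags a value that is close to a limit. The sample
       text is an assumed layout, not captured from a real run, and the limit passed to near_limit() is a
       placeholder: check the MAX_SERVER_THREADS value of your own build.

```python
import re

# Assumed sample of the first block; real sdiag output may be laid out differently.
SAMPLE = """\
Server thread count: 3
Agent queue size:    0

Jobs submitted: 120
Jobs started:   100
Jobs completed: 90
Jobs canceled:  5
Jobs failed:    2
"""

# A label, a colon, whitespace, then an integer at the end of the line.
COUNTER_LINE = re.compile(r"^\s*(.+?):\s+(\d+)\s*$")


def parse_counters(text):
    """Return a dict mapping each counter label to its integer value."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def near_limit(value, limit, fraction=0.9):
    """True when value has reached the given fraction of limit (hypothetical threshold)."""
    return value >= fraction * limit


counters = parse_counters(SAMPLE)
```

       A caller could then evaluate near_limit(counters["Server thread count"], limit) with the limit taken
       from the Slurm source for the installed version.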

       The second block of information is related to the main scheduling algorithm, which is based on job
       priorities. A scheduling cycle acquires the job_write_lock lock and then tries to allocate resources to
       pending jobs, starting with the highest-priority job and proceeding in descending priority order. Once a
       job cannot get resources, the loop keeps going, but only for jobs requesting other partitions. Jobs with
       dependencies or affected by account limits are not processed.
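
       The loop described above can be sketched as a toy model. This is not slurmctld code: the job
       fields, the "held" flag standing in for dependencies and account limits, and the assumption that
       each partition owns a separate pool of nodes are all simplifications made for illustration.

```python
def schedule_cycle(jobs, free_nodes):
    """Toy priority scheduling pass.

    jobs: list of dicts with keys name, priority, partition, nodes, and
          optionally held (True for jobs with dependencies or limits).
    free_nodes: dict mapping partition name to its free node count.
    Returns the names of the jobs started, in start order.
    """
    free = dict(free_nodes)      # work on a copy
    blocked = set()              # partitions where a job already failed to fit
    started = []
    for job in sorted(jobs, key=lambda j: j["priority"], reverse=True):
        if job.get("held"):
            continue             # dependencies or limits: not processed
        part = job["partition"]
        if part in blocked:
            continue             # a higher-priority job is waiting here
        if job["nodes"] <= free[part]:
            free[part] -= job["nodes"]
            started.append(job["name"])
        else:
            blocked.add(part)    # keep going, but only for other partitions
    return started


example_jobs = [
    {"name": "a", "priority": 10, "partition": "p1", "nodes": 2},
    {"name": "b", "priority": 9, "partition": "p1", "nodes": 4},
    {"name": "c", "priority": 8, "partition": "p1", "nodes": 1},
    {"name": "d", "priority": 7, "partition": "p2", "nodes": 1},
    {"name": "e", "priority": 6, "partition": "p2", "nodes": 1, "held": True},
]
started = schedule_cycle(example_jobs, {"p1": 3, "p2": 2})
```

       In this example job "c" would fit, but it is skipped because the higher-priority job "b" could not
       start in the same partition. Starting such jobs without delaying "b" is the role of backfilling.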

       Last cycle
              Time in microseconds for last scheduling cycle.

       Max cycle
              Time in microseconds for the maximum scheduling cycle since last reset.

       Total cycles
              Number of scheduling cycles since last reset. Scheduling is done periodically and whenever a job
              is submitted or completed.

       Mean cycle
              Mean time in microseconds of scheduling cycles since last reset.

       Mean depth cycle
              Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle.

       Cycles per minute
              Number of scheduling cycles executed per minute.

       Last queue length
              Length of the pending jobs queue.
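
       To make the relationship between these counters concrete, the following sketch accumulates last,
       maximum, total, and mean values the way such statistics are commonly kept. The class and its field
       names are hypothetical and are not taken from the Slurm source.

```python
class CycleStats:
    """Hypothetical accumulator for per-cycle scheduling statistics."""

    def __init__(self):
        self.reset()

    def reset(self):
        """Clear every counter, as a statistics reset would."""
        self.last_cycle = 0       # microseconds
        self.max_cycle = 0        # microseconds
        self.total_cycles = 0
        self._time_sum = 0        # microseconds over all cycles
        self._depth_sum = 0       # jobs processed over all cycles

    def record(self, cycle_usec, depth):
        """Register one finished scheduling cycle."""
        self.last_cycle = cycle_usec
        self.max_cycle = max(self.max_cycle, cycle_usec)
        self.total_cycles += 1
        self._time_sum += cycle_usec
        self._depth_sum += depth

    @property
    def mean_cycle(self):
        return self._time_sum / self.total_cycles if self.total_cycles else 0.0

    @property
    def mean_depth_cycle(self):
        return self._depth_sum / self.total_cycles if self.total_cycles else 0.0


stats = CycleStats()
stats.record(cycle_usec=100, depth=4)
stats.record(cycle_usec=300, depth=6)
```

       Because the means are derived from running sums, a reset discards all history: the first cycle after
       a reset determines the last, maximum, and mean values on its own.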

       The third block of information is related to the backfilling scheduling algorithm. A backfilling
       scheduling cycle acquires locks on the job, node, and partition objects and then tries to allocate
       resources to pending jobs. Jobs are processed in priority order. If a job cannot get resources, the
       algorithm calculates when it could get them, which yields a future start time for the job. The next job
       is then processed: the algorithm tries to allocate resources to it without affecting the previous jobs
       and, if resources are not currently available, again calculates a future start time. Each additional
       job takes longer to process, since no higher-priority job may be affected. The algorithm itself
       takes measures to avoid a long execution cycle and to avoid holding all the locks for too long.
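
       The idea can be illustrated with a small planning sketch. It assumes a single pool of identical
       nodes, exact job durations, and integer time units; the names below are hypothetical, and the real
       backfill plugin is considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    start: int   # first time unit in which the nodes are used
    end: int     # first time unit in which they are free again
    nodes: int


def nodes_in_use(reservations, t):
    """Nodes occupied at time t by the given reservations."""
    return sum(r.nodes for r in reservations if r.start <= t < r.end)


def fits(reservations, total_nodes, start, duration, nodes):
    """True if `nodes` are free during the whole window [start, start + duration)."""
    end = start + duration
    # Usage only increases at reservation starts, so checking those points is enough.
    points = [start] + [r.start for r in reservations if start < r.start < end]
    return all(nodes_in_use(reservations, p) + nodes <= total_nodes for p in points)


def plan_backfill(total_nodes, running, pending, now=0):
    """Assign each pending job the earliest start that delays no earlier job.

    pending: list of (name, nodes, duration) tuples already sorted by priority.
    Returns a dict mapping job name to planned start time.
    """
    reservations = list(running)
    starts = {}
    for name, nodes, duration in pending:
        # The earliest feasible start is `now` or the end of some reservation.
        candidates = sorted({now} | {r.end for r in reservations if r.end > now})
        for t in candidates:
            if fits(reservations, total_nodes, t, duration, nodes):
                starts[name] = t
                reservations.append(Reservation(t, t + duration, nodes))
                break
    return starts


plan = plan_backfill(
    total_nodes=4,
    running=[Reservation(start=0, end=10, nodes=3)],
    pending=[("A", 4, 5), ("B", 1, 10), ("C", 1, 20)],
)
```

       Here job "A" must wait for the running job, job "B" is backfilled immediately because it finishes
       before "A" is due to start, and job "C" is deferred because starting it now would delay "A". Each
       planned job adds a reservation that every later job must be checked against, which is why each
       additional job costs more to process.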

       Total backfilled jobs (since last slurm start)
              Number of jobs started thanks to backfilling since last slurm start.

       Total backfilled jobs (since last stats cycle start)
              Number of jobs started thanks to backfilling since the last time stats were reset. By default
              these values are reset at midnight UTC time.

       Total cycles
              Number of backfilling scheduling cycles since last reset.

       Last cycle when
              Time when the last execution cycle happened, in the format "weekday Month MonthDay
              hour:minute.seconds year".

       Last cycle
              Time in microseconds of the last backfilling cycle. Only execution time is counted; the time
              spent sleeping inside a scheduling cycle that takes too long is excluded. Note that locks are
              released during the sleep time so that other work can proceed.

       Max cycle
              Time in microseconds of the longest backfilling cycle since last reset. Only execution time is
              counted; the time spent sleeping inside a scheduling cycle that takes too long is excluded. Note
              that locks are released during the sleep time so that other work can proceed.

       Mean cycle
              Mean time in microseconds of backfilling scheduling cycles since last reset.

       Last depth cycle
              Number of jobs processed during the last backfilling scheduling cycle. Every job is counted, even
              if it has no chance to run due to dependencies or limits.

       Last depth cycle (try sched)
              Number of jobs processed during the last backfilling scheduling cycle. Only jobs with a chance to
              run that are waiting for available resources are counted. These are the jobs that make the
              backfilling algorithm expensive.

       Depth Mean
              Mean of processed jobs during backfilling scheduling cycles since last reset.

       Depth Mean (try sched)
              Mean number of jobs processed during backfilling scheduling cycles since last reset. Only jobs
              with a chance to run that are waiting for available resources are counted. These are the jobs
              that make the backfilling algorithm expensive.

       Last queue length
              Number of jobs pending to be processed by the backfilling algorithm. A job appears as many times
              as the number of partitions it requested.

       Queue length Mean
              Mean of jobs pending to be processed by backfilling algorithm.

       The fourth and fifth blocks of information report the most frequently issued remote procedure calls
       (RPCs), which are requests made to the slurmctld daemon to perform some action. The fourth block
       reports the RPCs issued by message type. To interpret the RPC codes, look them up in the Slurm source
       code in the file src/common/slurm_protocol_defs.h. The report includes the number of times each RPC
       was invoked, the total time consumed by all of those RPCs, and the average time consumed by each RPC,
       in microseconds. The fifth block reports the RPCs issued by user ID: the total number of RPCs each
       user has issued, the total time consumed by all of those RPCs, and the average time consumed by each
       RPC, in microseconds.
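
       The three figures reported per entry are related in a simple way, shown in the sketch below together
       with the two time-based orderings. The labels and numbers are made up for illustration, and sorting
       in descending order is an assumption about how the report is meant to be read.

```python
from typing import NamedTuple


class RpcStat(NamedTuple):
    label: str        # message type or user ID (placeholder values below)
    count: int        # number of times the RPC was issued
    total_usec: int   # total time consumed, in microseconds

    @property
    def average_usec(self):
        """Average time per call; zero when nothing was recorded."""
        return self.total_usec / self.count if self.count else 0.0


def by_total_time(stats):
    """Order entries by total run time, largest first."""
    return sorted(stats, key=lambda s: s.total_usec, reverse=True)


def by_average_time(stats):
    """Order entries by average run time, largest first."""
    return sorted(stats, key=lambda s: s.average_usec, reverse=True)


rpc_stats = [
    RpcStat("RPC_A", count=10, total_usec=1000),
    RpcStat("RPC_B", count=2, total_usec=800),
    RpcStat("RPC_C", count=100, total_usec=500),
]
```

       The two orderings answer different questions: total time highlights the RPCs that load the
       controller most overall, while average time highlights the RPCs that are individually expensive
       even when they are rare.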

OPTIONS

       -a, --all
              Get and report information. This is the default mode of operation.

       -h, --help
              Print description of options and exit.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and user ID.

       -r, --reset
              Reset counters. Only supported for Slurm operators and administrators.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

ENVIRONMENT VARIABLES

       Some sdiag options may be set via environment variables. These environment variables,  along  with  their
       corresponding options, are listed below. (Note: command-line options always override these settings.)

       SLURM_CONF          The location of the Slurm configuration file.
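
       A script that collects sdiag output might combine the options and the environment variable above as
       in the following sketch. The helper names are hypothetical, and run_sdiag() assumes that sdiag is
       installed and on the PATH; only the flags documented in OPTIONS are used.

```python
import os
import subprocess

# Long options documented in the OPTIONS section.
SORT_FLAGS = {
    "id": "--sort-by-id",
    "time": "--sort-by-time",
    "time2": "--sort-by-time2",
}


def sdiag_command(sort=None, reset=False):
    """Build the argument list for one sdiag invocation."""
    argv = ["sdiag"]
    if reset:
        argv.append("--reset")      # operators and administrators only
    elif sort is not None:
        argv.append(SORT_FLAGS[sort])
    return argv


def sdiag_environment(conf_path=None, base=None):
    """Copy the environment, optionally pointing SLURM_CONF at conf_path."""
    env = dict(os.environ if base is None else base)
    if conf_path is not None:
        env["SLURM_CONF"] = conf_path
    return env


def run_sdiag(sort=None, conf_path=None):
    """Run sdiag and return its standard output as text."""
    result = subprocess.run(
        sdiag_command(sort=sort),
        env=sdiag_environment(conf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

       With no arguments, sdiag_command() yields a plain "sdiag" invocation, which relies on --all being the
       default mode.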

COPYING

       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2014 SchedMD LLC.

       Slurm  is  free  software;  you  can  redistribute it and/or modify it under the terms of the GNU General
       Public License as published by the Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but  WITHOUT  ANY  WARRANTY;  without  even  the
       implied  warranty  of  MERCHANTABILITY  or  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
       License for more details.

SEE ALSO

       sinfo(1), squeue(1), scontrol(1), slurm.conf(5)

April 2015                                       Slurm Commands                                         sdiag(1)