Ubuntu Manpage: sdiag - Scheduling diagnostic tool for Slurm

Provided by: slurm-client_23.02.3-2ubuntu1_amd64

NAME

       sdiag - Scheduling diagnostic tool for Slurm

SYNOPSIS

       sdiag

DESCRIPTION

sdiag shows information related to slurmctld execution about: threads, agents, jobs, and
scheduling algorithms. The goal is to obtain data from slurmctld behavior helping to
adjust configuration parameters or queues policies. The main reason behind is to know
Slurm behavior under systems with a high throughput.

It has two execution modes. The default mode --all shows several counters and statistics
explained later, and there is another execution option --reset for resetting those values.

Values are reset at midnight UTC time by default.

The first block of information is related to global slurmctld execution:

Server thread count
The number of current active slurmctld threads. A high number would mean a high
load processing events like job submissions, jobs dispatching, jobs completing,
etc. If this is often close to MAX_SERVER_THREADS it could point to a potential
bottleneck.

Agent queue size
Slurm design has scalability in mind and sending messages to thousands of nodes is
not a trivial task. The agent mechanism helps to control communication between
slurmctld and the slurmd daemons for a best effort. This value denotes the count of
enqueued outgoing RPC requests in an internal retry list.

Agent count
Number of agent threads. Each of these agent threads can create in turn a group of
up to 2 + AGENT_THREAD_COUNT active threads at a time.

Agent thread count
Total count of active threads created by all the agent threads.

DBD Agent queue size
Slurm queues up the messages intended for the SlurmDBD and processes them in a
separate thread. If the SlurmDBD, or database, is down then this number will
increase.

The max queue size is configured in the slurm.conf with MaxDBDMsgs. If this number
begins to grow more than half of the max queue size, the slurmdbd and the database
should be investigated immediately.

Jobs submitted
Number of jobs submitted since last reset

Jobs started
Number of jobs started since last reset. This includes backfilled jobs.

Jobs completed
Number of jobs completed since last reset.

Jobs canceled
Number of jobs canceled since last reset.

Jobs failed
Number of jobs failed due to slurmd or other internal issues since last reset.

Job states ts:
Lists the timestamp of when the following job state counts were gathered.

Jobs pending:
Number of jobs pending at the given time of the time stamp above.

Jobs running:
Number of jobs running at the given time of the time stamp above.

Jobs running ts:
Time stamp of when the running job count was taken.

The next block of information is related to main scheduling algorithm based on jobs
priorities. A scheduling cycle implies to get the job_write_lock lock, then trying to get
resources for jobs pending, starting from the most priority one and going in descending
order. Once a job can not get the resources the loop keeps going but just for jobs
requesting other partitions. Jobs with dependencies or affected by accounts limits are
not processed.

Last cycle
Time in microseconds for last scheduling cycle.

Max cycle
Maximum time in microseconds for any scheduling cycle since last reset.

Total cycles
Total run time in microseconds for all scheduling cycles since last reset.
Scheduling is performed periodically and (depending upon configuration) when a job
is submitted or a job is completed.

Mean cycle
Mean time in microseconds for all scheduling cycles since last reset.

Mean depth cycle
Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle.

Cycles per minute
Counter of scheduling executions per minute.

Last queue length
Length of jobs pending queue.

The next block of information is related to backfilling scheduling algorithm. A
backfilling scheduling cycle implies to get locks for jobs, nodes and partitions objects
then trying to get resources for jobs pending. Jobs are processed based on priorities. If
a job can not get resources the algorithm calculates when it could get them obtaining a
future start time for the job. Then next job is processed and the algorithm tries to get
resources for that job but avoiding to affect the previous ones, and again it calculates
the future start time if not current resources available. The backfilling algorithm takes
more time for each new job to process since more priority jobs can not be affected. The
algorithm itself takes measures for avoiding a long execution cycle and for taking all the
locks for too long.

Total backfilled jobs (since last slurm start)
Number of jobs started thanks to backfilling since last slurm start.

Total backfilled jobs (since last stats cycle start)
Number of jobs started thanks to backfilling since last time stats where reset. By
default these values are reset at midnight UTC time.

Total backfilled heterogeneous job components
Number of heterogeneous job components started thanks to backfilling since last
Slurm start.

Total cycles
Number of backfill scheduling cycles since last reset

Last cycle when
Time when last backfill scheduling cycle happened in the format "weekday Month
MonthDay hour:minute.seconds year"

Last cycle
Time in microseconds of last backfill scheduling cycle. It counts only execution
time, removing sleep time inside a scheduling cycle when it executes for an
extended period time. Note that locks are released during the sleep time so that
other work can proceed.

Max cycle
Time in microseconds of maximum backfill scheduling cycle execution since last
reset. It counts only execution time, removing sleep time inside a scheduling
cycle when it executes for an extended period time. Note that locks are released
during the sleep time so that other work can proceed.

Mean cycle
Mean time in microseconds of backfilling scheduling cycles since last reset.

Last depth cycle
Number of processed jobs during last backfilling scheduling cycle. It counts every
job even if that job can not be started due to dependencies or limits.

Last depth cycle (try sched)
Number of processed jobs during last backfilling scheduling cycle. It counts only
jobs with a chance to start using available resources. These jobs consume more
scheduling time than jobs which are found can not be started due to dependencies or
limits.

Depth Mean
Mean count of jobs processed during all backfilling scheduling cycles since last
reset. Jobs which are found to be ineligible to run when examined by the backfill
scheduler are not counted (e.g. jobs submitted to multiple partitions and already
started, jobs which have reached a QOS or account limit such as maximum running
jobs for an account, etc).

Depth Mean (try sched)
The subset of Depth Mean that the backfill scheduler attempted to schedule.

Last queue length
Number of jobs pending to be processed by backfilling algorithm. A job is counted
once for each partition it is queued to use. A pending job array will normally be
counted as one job (tasks of a job array which have already been started/requeued
or individually modified will already have individual job records and are each
counted as a separate job).

Queue length Mean
Mean count of jobs pending to be processed by backfilling algorithm. A job is
counted once for each partition it requested. A pending job array will normally be
counted as one job (tasks of a job array which have already been started/requeued
or individually modified will already have individual job records and are each
counted as a separate job).

Last table size
Count of different time slots tested by the backfill scheduler in its last
iteration.

Mean table size
Mean count of different time slots tested by the backfill scheduler. Larger counts
increase the time required for the backfill operation. The table size is
influenced by many scheduling parameters, including: bf_min_age_reserve,
bf_min_prio_reserve, bf_resolution, and bf_window.

Latency for 1000 calls to gettimeofday()
Latency of 1000 calls to the gettimeofday() syscall in microseconds, as measured at
controller startup.

The next blocks of information report the most frequently issued remote procedure calls
(RPCs), calls made for the Slurmctld daemon to perform some action. The fourth block
reports the RPCs issued by message type. You will need to look up those RPC codes in the
Slurm source code by looking them up in the file src/common/slurm_protocol_defs.h. The
report includes the number of times each RPC is invoked, the total time consumed by all of
those RPCs plus the average time consumed by each RPC in microseconds. The fifth block
reports the RPCs issued by user ID, the total number of RPCs they have issued, the total
time consumed by all of those RPCs plus the average time consumed by each RPC in
microseconds. RPCs statistics are collected for the life of the slurmctld process unless
explicitly --reset.

The sixth block of information, labeled Pending RPC Statistics, shows information about
pending outgoing RPCs on the slurmctld agent queue. The first section of this block shows
types of RPCs on the queue and the count of each. The second section shows up to the first
25 individual RPCs pending on the agent queue, including the type and the destination host
list. This information is cached and only refreshed on 30 second intervals.

OPTIONS

       -a, --all
              Get and report information. This is the default mode of operation.

       -M, --cluster=<string>
              The  cluster  to  issue  commands to. Only one cluster name may be specified.  Note
              that the SlurmDBD must be up for this option to work properly.

       -h, --help
              Print description of options and exit.

       --json Dump job information as JSON. Sorting and formatting arguments will be ignored.

       -r, --reset
              Reset scheduler and RPC counters to 0.  Only  supported  for  Slurm  operators  and
              administrators.

       -i, --sort-by-id
              Sort Remote Procedure Call (RPC) data by message type ID and user ID.

       -t, --sort-by-time
              Sort Remote Procedure Call (RPC) data by total run time.

       -T, --sort-by-time2
              Sort Remote Procedure Call (RPC) data by average run time.

       --usage
              Print list of options and exit.

       -V, --version
              Print current version number and exit.

       --yaml Dump job information as YAML. Sorting and formatting arguments will be ignored.

PERFORMANCE

       Executing  sdiag sends a remote procedure call to slurmctld. If enough calls from sdiag or
       other Slurm client commands that send remote procedure calls to the slurmctld daemon  come
       in  at  once,  it  can  result  in  a  degradation of performance of the slurmctld daemon,
       possibly resulting in a denial of service.

       Do not run sdiag or other Slurm client  commands  that  send  remote  procedure  calls  to
       slurmctld  from loops in shell scripts or other programs. Ensure that programs limit calls
       to sdiag to the minimum necessary for the information you are trying to gather.

ENVIRONMENT VARIABLES

       Some sdiag options may be set via  environment  variables.  These  environment  variables,
       along  with  their  corresponding  options, are listed below.  (Note: Command line options
       will always override these settings.)

       SLURM_CLUSTERS      Same as --cluster

       SLURM_CONF          The location of the Slurm configuration file.

COPYING

       Copyright (C) 2010-2011 Barcelona Supercomputing Center.
       Copyright (C) 2010-2022 SchedMD LLC.

       Slurm is free software; you can redistribute it and/or modify it under the  terms  of  the
       GNU  General Public License as published by the Free Software Foundation; either version 2
       of the License, or (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
       even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.