Ubuntu Manpage: gres.conf - Slurm configuration file for Generic RESource (GRES) management.

Provided by: slurm-client_23.11.4-1.2ubuntu5_amd64

NAME

       gres.conf - Slurm configuration file for Generic RESource (GRES) management.

DESCRIPTION

gres.conf is an ASCII file which describes the configuration of Generic RESource(s) (GRES)
on each compute node. If the GRES information in the slurm.conf file does not fully
describe those resources, then a gres.conf file should be included on each compute node.
For cloud nodes, a gres.conf file that includes all the cloud nodes must be on all cloud
nodes and the controller. The file will always be located in the same directory as
slurm.conf.

If the GRES information in the slurm.conf file fully describes those resources (i.e. no
"Cores", "File" or "Links" specification is required for that GRES type or that
information is automatically detected), that information may be omitted from the gres.conf
file and only the configuration information in the slurm.conf file will be used. The
gres.conf file may be omitted completely if the configuration information in the
slurm.conf file fully describes all GRES.

If using the gres.conf file to describe the resources available to nodes, the first
parameter on the line should be NodeName. If configuring Generic Resources without
specifying nodes, the first parameter on the line should be Name.

Parameter names are case insensitive. Any text following a "#" in the configuration file
is treated as a comment through the end of that line. Changes to the configuration file
take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or
execution of the command "scontrol reconfigure" unless otherwise noted.

NOTE: Slurm support for gres/[mps|shard] requires the use of the select/cons_tres plugin.
For more information on how to configure MPS, see
https://slurm.schedmd.com/gres.html#MPS_Management. For more information on how to
configure Sharding, see https://slurm.schedmd.com/gres.html#Sharding.

For more information on GRES scheduling in general, see
https://slurm.schedmd.com/gres.html.

The overall configuration parameters available include:

AutoDetect
The hardware detection mechanisms to enable for automatic GRES configuration.
Currently, the options are:

nrt Automatically detect AWS Trainium/Inferentia devices.

nvml Automatically detect NVIDIA GPUs. Requires the NVIDIA Management Library
(NVML).

off Do not automatically detect any GPUs. Used to override other options.

oneapi Automatically detect Intel GPUs. Requires the Intel Graphics Compute Runtime
for oneAPI Level Zero and OpenCL Driver (oneapi).

rsmi Automatically detect AMD GPUs. Requires the ROCm System Management Interface
(ROCm SMI) Library.

AutoDetect can be on a line by itself, in which case it will globally apply to all
lines in gres.conf by default. In addition, AutoDetect can be combined with
NodeName to only apply to certain nodes. Node-specific AutoDetects will trump the
global AutoDetect. A node-specific AutoDetect only needs to be specified once per
node. If specified multiple times for the same nodes, they must all be the same
value. To unset AutoDetect for a node when a global AutoDetect is set, simply set
it to "off" in a node-specific GRES line. E.g.: NodeName=tux3 AutoDetect=off
Name=gpu File=/dev/nvidia[0-3]. AutoDetect cannot be used with cloud nodes.

AutoDetect will automatically detect files, cores, links, and any other hardware.
If a parameter such as File, Cores, or Links are specified when AutoDetect is used,
then the specified values are used to sanity check the auto detected values. If
there is a mismatch, then the node's state is set to invalid and the node is
drained.

Count Number of resources of this name/type available on this node. The default value is
set to the number of File values specified (if any), otherwise the default value is
one. A suffix of "K", "M", "G", "T" or "P" may be used to multiply the number by
1024, 1048576, 1073741824, etc. respectively. For example: "Count=10G".

Cores Optionally specify the core index numbers for the specific cores which can use this
resource. For example, it may be strongly preferable to use specific cores with
specific GRES devices (e.g. on a NUMA architecture). While Slurm can track and
assign resources at the CPU or thread level, its scheduling algorithms used to
co-allocate GRES devices with CPUs operates at a socket level (or NUMA level with
numa_node_as_socket) for job allocations. Therefore it is not possible to
preferentially assign GRES with different specific CPUs on the same socket (or NUMA
with numa_node_as_socket) and this option should generally be used to identify all
cores on some socket. Though, job step allocations that request n portion of the
job's resources with --exact and task binding through --gpu-d will both look at
cores directly for which more specific core identification may be useful.

Multiple cores may be specified using a comma-delimited list or a range may be
specified using a "-" separator (e.g. "0,1,2,3" or "0-3"). If a job specifies
--gres-flags=enforce-binding, then only the identified cores can be allocated with
each generic resource. This will tend to improve performance of jobs, but delay the
allocation of resources to them. If specified and a job is not submitted with the
--gres-flags=enforce-binding option the identified cores will be preferred for
scheduling with each generic resource.

If --gres-flags=disable-binding is specified, then any core can be used with the
resources, which also increases the speed of Slurm's scheduling algorithm but can
degrade the application performance. The --gres-flags=disable-binding option is
currently required to use more CPUs than are bound to a GRES (e.g. if a GPU is
bound to the CPUs on one socket, but resources on more than one socket are required
to run the job). If any core can be effectively used with the resources, then do
not specify the cores option for improved speed in the Slurm scheduling logic. A
restart of the slurmctld is needed for changes to the Cores option to take effect.

NOTE: Since Slurm must be able to perform resource management on heterogeneous
clusters having various processing unit numbering schemes, a logical core index
must be specified instead of the physical core index. That logical core index
might not correspond to your physical core index number. Core 0 will be the first
core on the first socket, while core 1 will be the second core on the first socket.
This numbering coincides with the logical core number (Core L#) seen in "lstopo -l"
command output.

File Fully qualified pathname of the device files associated with a resource. The name
can include a numeric range suffix to be interpreted by Slurm (e.g.
File=/dev/nvidia[0-3]).

This field is generally required if enforcement of generic resource allocations is
to be supported (i.e. prevents users from making use of resources allocated to a
different user). Enforcement of the file allocation relies upon Linux Control
Groups (cgroups) and Slurm's task/cgroup plugin, which will place the allocated
files into the job's cgroup and prevent use of other files. Please see Slurm's
Cgroups Guide for more information: https://slurm.schedmd.com/cgroups.html.

If File is specified then Count must be either set to the number of file names
specified or not set (the default value is the number of files specified). The
exception to this is MPS/Sharding. For either of these GRES, each GPU would be
identified by device file using the File parameter and Count would specify the
number of entries that would correspond to that GPU. For MPS, typically 100 or some
multiple of 100. For Sharding typically the maximum number of jobs that could
simultaneously share that GPU.

If using a card with Multi-Instance GPU functionality, use MultipleFiles instead.
File and MultipleFiles are mutually exclusive.

NOTE: File is required for all gpu typed GRES.

NOTE: If you specify the File parameter for a resource on some node, the option
must be specified on all nodes and Slurm will track the assignment of each specific
resource on each node. Otherwise Slurm will only track a count of allocated
resources rather than the state of each individual device file.

NOTE: Drain a node before changing the count of records with File parameters (e.g.
if you want to add or remove GPUs from a node's configuration). Failure to do so
will result in any job using those GRES being aborted.

NOTE: When specifying File, Count is limited in size (currently 1024) for each
node.

Flags Optional flags that can be specified to change configured behavior of the GRES.

Allowed values at present are:

CountOnly Do not attempt to load a plugin of the GRES type as this GRES
will only be used to track counts of GRES used. This avoids
attempting to load non-existent plugin which can affect
filesystems with high latency metadata operations for
non-existent files.

explicit If the flag is set, GRES is not allocated to the job as part of
whole node allocation (--exclusive or OverSubscribe=EXCLUSIVE
set on partition) unless it was explicitly requested by the
job.

one_sharing To be used on a shared gres. If using a shared gres (mps) on
top of a sharing gres (gpu) only allow one of the sharing gres
to be used by the shared gres. This is the default for MPS.

NOTE: If a gres has this flag configured it is global, so all
other nodes with that gres will have this flag implied. This
flag is not compatible with all_sharing for a specific gres.

all_sharing To be used on a shared gres. This is the opposite of
one_sharing and can be used to allow all sharing gres (gpu) on
a node to be used for shared gres (mps).

NOTE: If a gres has this flag configured it is global, so all
other nodes with that gres will have this flag implied. This
flag is not compatible with one_sharing for a specific gres.

nvidia_gpu_env Set environment variable CUDA_VISIBLE_DEVICES for all GPUs on
the specified node(s).

amd_gpu_env Set environment variable ROCR_VISIBLE_DEVICES for all GPUs on
the specified node(s).

intel_gpu_env Set environment variable ZE_AFFINITY_MASK for all GPUs on the
specified node(s).

opencl_env Set environment variable GPU_DEVICE_ORDINAL for all GPUs on the
specified node(s).

no_gpu_env Set no GPU-specific environment variables. This is mutually
exclusive to all other environment-related flags.

If no environment-related flags are specified, then nvidia_gpu_env, amd_gpu_env,
intel_gpu_env, and opencl_env will be implicitly set by default. If AutoDetect is
used and environment-related flags are not specified, then AutoDetect=nvml will set
nvidia_gpu_env, AutoDetect=rsmi will set amd_gpu_env, and AutoDetect=oneapi will
set intel_gpu_env. Conversely, specified environment-related flags will always
override AutoDetect.

Environment-related flags set on one GRES line will be inherited by the GRES line
directly below it if no environment-related flags are specified on that line and if
it is of the same node, name, and type. Environment-related flags must be the same
for GRES of the same node, name, and type.

Note that there is a known issue with the AMD ROCm runtime where
ROCR_VISIBLE_DEVICES is processed first, and then CUDA_VISIBLE_DEVICES is
processed. To avoid the issues caused by this, set Flags=amd_gpu_env for AMD GPUs
so only ROCR_VISIBLE_DEVICES is set.

Links A comma-delimited list of numbers identifying the number of connections between
this device and other devices to allow coscheduling of better connected devices.
This is an ordered list in which the number of connections this specific device has
to device number 0 would be in the first position, the number of connections it has
to device number 1 in the second position, etc. A -1 indicates the device itself
and a 0 indicates no connection. If specified, then this line can only contain a
single GRES device (i.e. can only contain a single file via File).

This is an optional value and is usually automatically determined if AutoDetect is
enabled. A typical use case would be to identify GPUs having NVLink connectivity.
Note that for GPUs, the minor number assigned by the OS and used in the device file
(i.e. the X in /dev/nvidiaX) is not necessarily the same as the device
number/index. The device number is created by sorting the GPUs by PCI bus ID and
then numbering them starting from the smallest bus ID. See
https://slurm.schedmd.com/gres.html#GPU_Management

MultipleFiles
Fully qualified pathname of the device files associated with a resource. Graphics
cards using Multi-Instance GPU (MIG) technology will present multiple device files
that should be managed as a single generic resource. The file names can be a comma
separated list or it can include a numeric range suffix (e.g.
MultipleFiles=/dev/nvidia[0-3]).

Drain a node before changing the count of records with the MultipleFiles parameter,
such as when adding or removing GPUs from a node's configuration. Failure to do so
will result in any job using those GRES being aborted.

When not using GPUs with MIG functionality, use File instead. MultipleFiles and
File are mutually exclusive.

Name Name of the generic resource. Any desired name may be used. The name must match a
value in GresTypes in slurm.conf. Each generic resource has an optional plugin
which can provide resource-specific functionality. Generic resources that
currently include an optional plugin are:

gpu Graphics Processing Unit

mps CUDA Multi-Process Service (MPS)

nic Network Interface Card

shard Shards of a gpu

NodeName
An optional NodeName specification can be used to permit one gres.conf file to be
used for all compute nodes in a cluster by specifying the node(s) that each line
should apply to. The NodeName specification can use a Slurm hostlist specification
as shown in the example below.

Type An optional arbitrary string identifying the type of generic resource. For
example, this might be used to identify a specific model of GPU, which users can
then specify in a job request. For changes to the Type option to take effect with
a scontrol reconfig all affected slurmd daemons must be responding to the
slurmctld. Otherwise a restart of the slurmctld and slurmd daemons is required.

NOTE: If using autodetect functionality and defining the Type in your gres.conf
file, the Type specified should match or be a substring of the value that is
detected, using an underscore in lieu of any spaces.

EXAMPLES

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define GPU devices with MPS support, with AutoDetect sanity checking
       ##################################################################
       AutoDetect=nvml
       Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
       Name=gpu Type=tesla  File=/dev/nvidia1 COREs=2,3
       Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
       Name=mps Count=100  File=/dev/nvidia1 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Overwrite system defaults and explicitly configure three GPUs
       ##################################################################
       Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
       # Name=gpu Type=tesla  File=/dev/nvidia[2-3] COREs=2,3
       # NOTE: nvidia2 device is out of service
       Name=gpu Type=tesla  File=/dev/nvidia3 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use a single gres.conf file for all compute nodes - positive method
       ##################################################################
       ## Explicitly specify devices on nodes tux0-tux15
       # NodeName=tux[0-15]  Name=gpu File=/dev/nvidia[0-3]
       # NOTE: tux3 nvidia1 device is out of service
       NodeName=tux[0-2]  Name=gpu File=/dev/nvidia[0-3]
       NodeName=tux3  Name=gpu File=/dev/nvidia[0,2-3]
       NodeName=tux[4-15]  Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use NVML to gather GPU configuration information
       # for all nodes except one
       ##################################################################
       AutoDetect=nvml
       NodeName=tux3 AutoDetect=off Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Specify some nodes with NVML, some with RSMI, and some with no AutoDetect
       ##################################################################
       NodeName=tux[0-7] AutoDetect=nvml
       NodeName=tux[8-11] AutoDetect=rsmi
       NodeName=tux[12-15] Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define 'bandwidth' GRES to use as a way to limit the
       # resource use on these nodes for workflow purposes
       ##################################################################
       NodeName=tux[0-7] Name=bandwidth Type=lustre Count=4G Flags=CountOnly

COPYING

       Copyright  (C)  2010  The  Regents  of the University of California.  Produced at Lawrence
       Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2010-2022 SchedMD LLC.

       This  file  is  part  of  Slurm,  a  resource  management  program.   For   details,   see
       <https://slurm.schedmd.com/>.

       Slurm  is  free  software; you can redistribute it and/or modify it under the terms of the
       GNU General Public License as published by the Free Software Foundation; either version  2
       of the License, or (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
       even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.

NAME

DESCRIPTION

EXAMPLES

COPYING

SEE ALSO