Provided by: slurm-client_19.05.5-1_amd64

NAME

       gres.conf - Slurm configuration file for Generic RESource (GRES) management.

DESCRIPTION

       gres.conf  is  an ASCII file which describes the configuration of Generic RESource (GRES) on each compute
       node.  If the GRES information in the slurm.conf file does not fully describe  those  resources,  then  a
       gres.conf  file  should  be  included  on each compute node.  The file location can be modified at system
       build time using the DEFAULT_SLURM_CONF  parameter  or  at  execution  time  by  setting  the  SLURM_CONF
       environment variable. The file will always be located in the same directory as the slurm.conf file.

       If  the  GRES information in the slurm.conf file fully describes those resources (i.e. no "Cores", "File"
       or "Links" specification is required for that GRES type or that information is  automatically  detected),
       that  information  may  be  omitted from the gres.conf file and only the configuration information in the
       slurm.conf file will be used.  The  gres.conf  file  may  be  omitted  completely  if  the  configuration
       information in the slurm.conf file fully describes all GRES.

       Parameter names are case insensitive.  Any text following a "#" in the configuration file is treated as a
       comment  through  the  end  of  that line.  Changes to the configuration file take effect upon restart of
       Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the  command  "scontrol  reconfigure"
       unless otherwise noted.

       NOTE: Slurm support for gres/mps requires the use of the select/cons_tres plugin. For more information on
       how to configure MPS, see https://slurm.schedmd.com/gres.html#MPS_Management.

       For more information on GRES scheduling in general, see https://slurm.schedmd.com/gres.html.

       The overall configuration parameters available include:

       AutoDetect
              The hardware detection mechanisms to enable for automatic GRES configuration.  This should be on a
              line  by  itself.   Currently,  the  only  valid  option  is  nvml, which allows for automatically
              detecting NVIDIA GPUs.

       Count  Number of resources of this type available on this node.  The default value is set to  the  number
              of  File values specified (if any), otherwise the default value is one. A suffix of "K", "M", "G",
              "T" or "P" may be used to multiply the number by 1024,  1048576,  1073741824,  etc.  respectively.
              For example: "Count=10G".
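
               For illustration, a gres.conf entry might let Count default from File or set it explicitly (the
               "bandwidth" GRES name below is a hypothetical example; any name used must also appear in
               GresTypes in slurm.conf):

                      # Count inferred from the number of File entries (here, 2):
                      Name=gpu File=/dev/nvidia[0-1]
                      # Count given explicitly with a multiplier suffix (4 x 1073741824):
                      Name=bandwidth Count=4G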

        Cores  Optionally specify the first-thread CPU index numbers of the specific cores which can use this
               resource.  For example, it may be strongly preferable to use specific cores with specific GRES
               devices (e.g. on a NUMA architecture).  While Slurm can track and assign resources at the CPU or
               thread level, the scheduling algorithms it uses to co-allocate GRES devices with CPUs operate at a
               socket or NUMA level.  Therefore it is not possible to preferentially assign GRES to different
               specific CPUs on the same NUMA node or socket, and this option should be used to identify all
               cores on some socket.

               Multiple cores may be specified using a comma-delimited list, or a range may be specified using a
               "-" separator (e.g. "0,1,2,3" or "0-3").  If a job specifies --gres-flags=enforce-binding, then
               only the identified cores can be allocated with each generic resource. This will tend to improve
               performance of jobs, but delay the allocation of resources to them.  If Cores is specified and a
               job is not submitted with the --gres-flags=enforce-binding option, the identified cores will be
               preferred for scheduling with each generic resource.

              If  --gres-flags=disable-binding is specified, then any core can be used with the resources, which
              also increases the  speed  of  Slurm's  scheduling  algorithm  but  can  degrade  the  application
              performance.   The --gres-flags=disable-binding option is currently required to use more CPUs than
              are bound to a GRES (i.e. if a GPU is bound to the CPUs on one socket, but resources on more  than
              one  socket are required to run the job).  If any core can be effectively used with the resources,
              then do not specify the cores option for improved speed in the Slurm scheduling logic.  A  restart
              of the slurmctld is needed for changes to the Cores option to take effect.

              NOTE:  If your cores contain multiple threads only the first thread (processing unit) of each core
              needs to be listed.  Also note that since Slurm must be able to  perform  resource  management  on
              heterogeneous clusters having various processing unit numbering schemes, a logical processing unit
              index  must  be  specified  instead  of  the physical processing unit index.  That processing unit
              logical index might not correspond to your physical index number.  Processing unit 0 will  be  the
              first  socket,  first  core  and  (if  configured)  first  thread.   If hyperthreading is enabled,
              processing  unit  1  will  always  be  the  first  socket,  first  core  and  second  thread.   If
              hyperthreading  is not enabled, processing unit 1 will always be the first socket and second core.
              This numbering coincides with the processing unit logical number  (PU  L#)  seen  in  "lstopo  -l"
              command output.
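
               As a sketch, on a hypothetical two-socket node where "lstopo -l" reports PU L# 0-7 as the first
               threads of the cores on socket 0 and PU L# 8-15 as those on socket 1, each GPU could be bound to
               the cores of its local socket:

                      Name=gpu File=/dev/nvidia0 Cores=0-7
                      Name=gpu File=/dev/nvidia1 Cores=8-15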

       File   Fully  qualified  pathname of the device files associated with a resource.  The name can include a
              numeric range suffix to be interpreted by Slurm (e.g. File=/dev/nvidia[0-3]).

              This field is generally required if enforcement of generic resource allocations is to be supported
              (i.e. prevents users from making use of resources allocated to a different user).  Enforcement  of
              the  file  allocation  relies  upon Linux Control Groups (cgroups) and Slurm's task/cgroup plugin,
              which will place the allocated files into the job's cgroup and prevent use of other files.  Please
              see Slurm's Cgroups Guide for more information: https://slurm.schedmd.com/cgroups.html.

              If File is specified then Count must be either set to the number of file names  specified  or  not
              set  (the default value is the number of files specified).  The exception to this is MPS. For MPS,
              each GPU would be identified by device file using the File parameter and Count would  specify  the
              number of MPS entries that would correspond to that GPU (typically 100 or some multiple of 100).

              NOTE:  If you specify the File parameter for a resource on some node, the option must be specified
              on all nodes and Slurm will track the assignment of each specific resource on each node. Otherwise
              Slurm will only track a count of allocated resources rather than  the  state  of  each  individual
              device file.

              NOTE:  Drain a node before changing the count of records with File parameters (i.e. if you want to
              add or remove GPUs from a node's configuration).  Failure to do so will result in  any  job  using
              those GRES being aborted.
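
               For example, the following two hypothetical lines describe the same four GPUs; in the first,
               Count defaults to the number of device files in the expanded range:

                      Name=gpu File=/dev/nvidia[0-3]
                      Name=gpu Count=4 File=/dev/nvidia[0-3]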

        Links  A comma-delimited list of numbers identifying the number of connections between this device and
               other devices, to allow co-scheduling of better-connected devices.  This is an ordered list in
               which the number of connections this specific device has to device number 0 would be in the first
               position, the number of connections it has to device number 1 in the second position, etc.  A -1
               indicates the device itself and a 0 indicates no connection.  If specified, then this line can
               only contain a single GRES device (i.e. can only contain a single file via File).

              This  is  an  optional  value and is usually automatically determined if AutoDetect is enabled.  A
              typical use case would be to identify GPUs having NVLink connectivity.  Note that  for  GPUs,  the
              minor  number  assigned  by the OS and used in the device file (i.e. the X in /dev/nvidiaX) is not
              necessarily the same as the device number/index. The device number is created by sorting the  GPUs
               by PCI bus ID and then numbering them starting from the smallest bus ID.  See
               https://slurm.schedmd.com/gres.html#GPU_Management.
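
               As an illustrative sketch, two GPUs connected to each other by two NVLink lanes could be
               described as below (remember that each Links line may name only a single device file):

                      Name=gpu File=/dev/nvidia0 Links=-1,2
                      Name=gpu File=/dev/nvidia1 Links=2,-1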

       Name   Name of the generic resource. Any desired name may be used.   The  name  must  match  a  value  in
              GresTypes  in  slurm.conf.   Each  generic  resource  has  an  optional  plugin  which can provide
              resource-specific functionality.  Generic resources that currently include an optional plugin are:

              gpu    Graphics Processing Unit

              mps    CUDA Multi-Process Service (MPS)

              nic    Network Interface Card

              mic    Intel Many Integrated Core (MIC) processor
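
               For example, if gres.conf on a hypothetical cluster defines "gpu" and "mps" resources, the
               slurm.conf for that cluster would need to list both names:

                      # In slurm.conf, not gres.conf:
                      GresTypes=gpu,mps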

       NodeName
              An optional NodeName specification can be used to permit one gres.conf file to  be  used  for  all
              compute nodes in a cluster by specifying the node(s) that each line should apply to.  The NodeName
              specification can use a Slurm hostlist specification as shown in the example below.

       Type   An  optional  arbitrary string identifying the type of device.  For example, this might be used to
              identify a specific model of GPU, which users can then specify in  a  job  request.   If  Type  is
              specified, then Count is limited in size (currently 1024).
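
               For example, given the hypothetical line below, a user could request that specific GPU model with
               a job option such as "--gres=gpu:tesla:1":

                      Name=gpu Type=tesla File=/dev/nvidia0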

EXAMPLES

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define GPU devices with MPS support
       ##################################################################
       AutoDetect=nvml
       Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
       Name=gpu Type=tesla  File=/dev/nvidia1 COREs=2,3
       Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
       Name=mps Count=100  File=/dev/nvidia1 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Overwrite system defaults and explicitly configure three GPUs
       ##################################################################
       Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
       # Name=gpu Type=tesla  File=/dev/nvidia[2-3] COREs=2,3
       # NOTE: nvidia2 device is out of service
       Name=gpu Type=tesla  File=/dev/nvidia3 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use a single gres.conf file for all compute nodes - positive method
       ##################################################################
       ## Explicitly specify devices on nodes tux0-tux15
       # NodeName=tux[0-15]  Name=gpu File=/dev/nvidia[0-3]
       # NOTE: tux3 nvidia1 device is out of service
       NodeName=tux[0-2]  Name=gpu File=/dev/nvidia[0-3]
       NodeName=tux3  Name=gpu File=/dev/nvidia[0,2-3]
       NodeName=tux[4-15]  Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use NVML to gather GPU configuration information
       # Information about all other GRES gathered from slurm.conf
       ##################################################################
       AutoDetect=nvml

COPYING

       Copyright  (C) 2010 The Regents of the University of California.  Produced at Lawrence Livermore National
       Laboratory (cf, DISCLAIMER).
       Copyright (C) 2010-2019 SchedMD LLC.

       This   file   is   part   of   Slurm,   a   resource    management    program.     For    details,    see
       <https://slurm.schedmd.com/>.

       Slurm  is  free  software;  you  can  redistribute it and/or modify it under the terms of the GNU General
       Public License as published by the Free Software Foundation; either version 2 of the License, or (at your
       option) any later version.

       Slurm is distributed in the hope that it will be useful, but  WITHOUT  ANY  WARRANTY;  without  even  the
       implied  warranty  of  MERCHANTABILITY  or  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
       License for more details.

SEE ALSO

       slurm.conf(5)

September 2019                              Slurm Configuration File                                gres.conf(5)