Provided by: slurm-client_19.05.5-1_amd64

NAME

       gres.conf - Slurm configuration file for Generic RESource (GRES) management.

DESCRIPTION

       gres.conf is an ASCII file which describes the configuration of Generic RESource (GRES) on
       each compute node.  If the GRES information in the slurm.conf file does not fully describe
       those  resources, then a gres.conf file should be included on each compute node.  The file
       location can be modified at system build time using the DEFAULT_SLURM_CONF parameter or at
       execution  time  by  setting  the SLURM_CONF environment variable. The file will always be
       located in the same directory as the slurm.conf file.

       If the GRES information in the slurm.conf file fully describes those  resources  (i.e.  no
       "Cores",  "File"  or  "Links"  specification  is  required  for  that  GRES  type  or that
       information is automatically detected), that information may be omitted from the gres.conf
       file  and  only  the  configuration  information in the slurm.conf file will be used.  The
       gres.conf file  may  be  omitted  completely  if  the  configuration  information  in  the
       slurm.conf file fully describes all GRES.
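
        For example, if the slurm.conf entry for a node fully describes its GRES by
        count alone (the node and GRES names below are illustrative), no gres.conf
        entry is required for that node:

               # slurm.conf (excerpt)
               GresTypes=gpu
               NodeName=tux[1-16] Gres=gpu:2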

       Parameter  names are case insensitive.  Any text following a "#" in the configuration file
       is treated as a comment through the end of that line.  Changes to the  configuration  file
       take  effect  upon  restart  of  Slurm  daemons,  daemon  receipt of the SIGHUP signal, or
       execution of the command "scontrol reconfigure" unless otherwise noted.

       NOTE: Slurm support for gres/mps requires the use of the select/cons_tres plugin. For more
       information          on          how         to         configure         MPS,         see
       https://slurm.schedmd.com/gres.html#MPS_Management.
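
        For example, a minimal slurm.conf excerpt enabling gres/mps scheduling might
        look like:

               # slurm.conf (excerpt)
               SelectType=select/cons_tres
               GresTypes=gpu,mps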

       For     more     information     on     GRES     scheduling      in      general,      see
       https://slurm.schedmd.com/gres.html.

       The overall configuration parameters available include:

       AutoDetect
              The hardware detection mechanisms to enable for automatic GRES configuration.  This
              should be on a line by itself.  Currently, the only valid  option  is  nvml,  which
              allows for automatically detecting NVIDIA GPUs.
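
               For example, placing the following line in gres.conf lets Slurm
               detect the node's NVIDIA GPUs (typically including their File,
               Cores, Links and Type values) without listing each device
               explicitly:

                      AutoDetect=nvml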

       Count  Number  of resources of this type available on this node.  The default value is set
              to the number of File values specified (if any), otherwise  the  default  value  is
              one.  A  suffix  of "K", "M", "G", "T" or "P" may be used to multiply the number by
              1024, 1048576, 1073741824, etc. respectively.  For example: "Count=10G".
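
               For example, a hypothetical "bandwidth" GRES (the name and type
               are illustrative, and the name would also need to appear in
               GresTypes in slurm.conf) with a count of 4G (i.e. 4294967296
               units) could be defined as:

                      Name=bandwidth Type=lustre Count=4G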

        Cores  Optionally specify the first-thread CPU index numbers of the specific cores which
               can use this resource.  For example, it may be strongly preferable to use specific
               cores with specific GRES devices (e.g. on a NUMA architecture).  While Slurm can
               track and assign resources at the CPU or thread level, the scheduling algorithms
               it uses to co-allocate GRES devices with CPUs operate at a socket or NUMA level.
               Therefore it is not possible to preferentially assign GRES to different specific
               CPUs on the same NUMA node or socket, and this option should be used to identify
               all of the cores on some socket.

               Multiple cores may be specified using a comma-delimited list, or a range may be
               specified using a "-" separator (e.g. "0,1,2,3" or "0-3").  If a job specifies
               --gres-flags=enforce-binding, then only the identified cores can be allocated
               with each generic resource.  This will tend to improve the performance of jobs,
               but delay the allocation of resources to them.  If Cores is specified and a job
               is not submitted with the --gres-flags=enforce-binding option, the identified
               cores will be preferred for scheduling with each generic resource.

              If  --gres-flags=disable-binding  is  specified, then any core can be used with the
              resources, which also increases the speed of Slurm's scheduling algorithm  but  can
              degrade  the  application  performance.  The --gres-flags=disable-binding option is
              currently required to use more CPUs than are bound to a GRES  (i.e.  if  a  GPU  is
              bound to the CPUs on one socket, but resources on more than one socket are required
              to run the job).  If any core can be effectively used with the resources,  then  do
              not  specify  the cores option for improved speed in the Slurm scheduling logic.  A
              restart of the slurmctld is needed for changes to the Cores option to take effect.

               NOTE: If your cores contain multiple threads, only the first thread (processing
               unit) of each core needs to be listed.  Also note that since Slurm must be able to
              perform resource management on heterogeneous  clusters  having  various  processing
              unit  numbering  schemes, a logical processing unit index must be specified instead
              of the physical processing unit index.  That processing unit  logical  index  might
              not  correspond to your physical index number.  Processing unit 0 will be the first
              socket, first core and (if configured) first thread.  If hyperthreading is enabled,
              processing  unit  1  will always be the first socket, first core and second thread.
              If hyperthreading is not enabled, processing unit 1 will always be the first socket
              and  second core.  This numbering coincides with the processing unit logical number
              (PU L#) seen in "lstopo -l" command output.
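
               For example, on a hypothetical two-socket node with eight cores
               per socket (logical processing units 0-7 on the first socket and
               8-15 on the second), one GPU attached to each socket could be
               described as:

                      Name=gpu File=/dev/nvidia0 Cores=0-7
                      Name=gpu File=/dev/nvidia1 Cores=8-15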

       File   Fully qualified pathname of the device files associated with a resource.  The  name
              can   include   a   numeric   range   suffix  to  be  interpreted  by  Slurm  (e.g.
              File=/dev/nvidia[0-3]).

              This field is generally required if enforcement of generic resource allocations  is
              to  be  supported  (i.e. prevents users from making use of resources allocated to a
              different user).  Enforcement of the file  allocation  relies  upon  Linux  Control
              Groups  (cgroups)  and  Slurm's  task/cgroup plugin, which will place the allocated
              files into the job's cgroup and prevent use of other  files.   Please  see  Slurm's
              Cgroups Guide for more information: https://slurm.schedmd.com/cgroups.html.

              If  File  is  specified  then  Count must be either set to the number of file names
              specified or not set (the default value is the number  of  files  specified).   The
              exception  to  this  is  MPS.  For MPS, each GPU would be identified by device file
              using the File parameter and Count would specify the number  of  MPS  entries  that
              would correspond to that GPU (typically 100 or some multiple of 100).

              NOTE:  If  you  specify  the File parameter for a resource on some node, the option
              must be specified on all nodes and Slurm will track the assignment of each specific
              resource  on  each  node.  Otherwise  Slurm  will  only  track a count of allocated
              resources rather than the state of each individual device file.

              NOTE: Drain a node before changing the count of records with File parameters  (i.e.
              if  you  want to add or remove GPUs from a node's configuration).  Failure to do so
              will result in any job using those GRES being aborted.
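
               For example, the single line below (device names are illustrative)
               is equivalent to listing four separate gpu lines, with Count
               defaulting to four:

                      Name=gpu File=/dev/nvidia[0-3]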

       Links  A comma-delimited list of numbers identifying the  number  of  connections  between
              this  device  and  other devices to allow coscheduling of better connected devices.
              This is an ordered list in which the number of connections this specific device has
              to device number 0 would be in the first position, the number of connections it has
              to device number 1 in the second position, etc.  A -1 indicates the  device  itself
              and  a  0 indicates no connection.  If specified, then this line can only contain a
              single GRES device (i.e. can only contain a single file via File).

              This is an optional value and is usually automatically determined if AutoDetect  is
              enabled.   A typical use case would be to identify GPUs having NVLink connectivity.
              Note that for GPUs, the minor number assigned by the OS and used in the device file
              (i.e.   the  X  in  /dev/nvidiaX)  is  not  necessarily  the  same  as  the  device
              number/index. The device number is created by sorting the GPUs by PCI  bus  ID  and
              then    numbering    them    starting    from    the    smallest   bus   ID.    See
              https://slurm.schedmd.com/gres.html#GPU_Management
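
               For example, on a hypothetical four-GPU node where devices 0 and 1,
               and likewise devices 2 and 3, are each joined by two NVLink
               connections, the topology could be expressed as:

                      Name=gpu File=/dev/nvidia0 Links=-1,2,0,0
                      Name=gpu File=/dev/nvidia1 Links=2,-1,0,0
                      Name=gpu File=/dev/nvidia2 Links=0,0,-1,2
                      Name=gpu File=/dev/nvidia3 Links=0,0,2,-1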

       Name   Name of the generic resource. Any desired name may be used.  The name must match  a
              value  in  GresTypes  in  slurm.conf.  Each generic resource has an optional plugin
              which  can  provide  resource-specific  functionality.   Generic   resources   that
              currently include an optional plugin are:

              gpu    Graphics Processing Unit

              mps    CUDA Multi-Process Service (MPS)

              nic    Network Interface Card

              mic    Intel Many Integrated Core (MIC) processor
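
               For example, each Name used in gres.conf must have a matching entry
               in the GresTypes list in slurm.conf (the device paths below are
               illustrative):

                      # slurm.conf (excerpt)
                      GresTypes=gpu,nic

                      # gres.conf
                      Name=gpu File=/dev/nvidia0
                      Name=nic File=/dev/infiniband/uverbs0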

       NodeName
              An  optional  NodeName specification can be used to permit one gres.conf file to be
              used for all compute nodes in a cluster by specifying the node(s)  that  each  line
              should apply to.  The NodeName specification can use a Slurm hostlist specification
              as shown in the example below.
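
               For example, one shared gres.conf could describe two classes of
               nodes with different GPU counts (node names are illustrative):

                      NodeName=tux[0-7]   Name=gpu File=/dev/nvidia[0-1]
                      NodeName=tux[8-15]  Name=gpu File=/dev/nvidia[0-3]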

       Type   An optional arbitrary string identifying the type of  device.   For  example,  this
              might  be used to identify a specific model of GPU, which users can then specify in
              a job request.  If Type is specified, then Count  is  limited  in  size  (currently
              1024).
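
               For example, defining a GPU with a Type (the model name is
               illustrative) lets users request that specific model:

                      Name=gpu Type=tesla File=/dev/nvidia0

               A job could then request this model with a line such as "srun
               --gres=gpu:tesla:1 ...".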

EXAMPLES

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Define GPU devices with MPS support
       ##################################################################
       AutoDetect=nvml
       Name=gpu Type=gtx560 File=/dev/nvidia0 COREs=0,1
       Name=gpu Type=tesla  File=/dev/nvidia1 COREs=2,3
       Name=mps Count=100 File=/dev/nvidia0 COREs=0,1
        Name=mps Count=100 File=/dev/nvidia1 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Overwrite system defaults and explicitly configure three GPUs
       ##################################################################
        Name=gpu Type=tesla File=/dev/nvidia[0-1] COREs=0,1
        # Name=gpu Type=tesla File=/dev/nvidia[2-3] COREs=2,3
        # NOTE: nvidia2 device is out of service
        Name=gpu Type=tesla File=/dev/nvidia3 COREs=2,3

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use a single gres.conf file for all compute nodes - positive method
       ##################################################################
       ## Explicitly specify devices on nodes tux0-tux15
       # NodeName=tux[0-15]  Name=gpu File=/dev/nvidia[0-3]
       # NOTE: tux3 nvidia1 device is out of service
       NodeName=tux[0-2]  Name=gpu File=/dev/nvidia[0-3]
       NodeName=tux3  Name=gpu File=/dev/nvidia[0,2-3]
       NodeName=tux[4-15]  Name=gpu File=/dev/nvidia[0-3]

       ##################################################################
       # Slurm's Generic Resource (GRES) configuration file
       # Use NVML to gather GPU configuration information
       # Information about all other GRES gathered from slurm.conf
       ##################################################################
       AutoDetect=nvml

COPYING

       Copyright  (C)  2010  The  Regents  of the University of California.  Produced at Lawrence
       Livermore National Laboratory (cf, DISCLAIMER).
       Copyright (C) 2010-2019 SchedMD LLC.

       This  file  is  part  of  Slurm,  a  resource  management  program.   For   details,   see
       <https://slurm.schedmd.com/>.

       Slurm  is  free  software; you can redistribute it and/or modify it under the terms of the
       GNU General Public License as published by the Free Software Foundation; either version  2
       of the License, or (at your option) any later version.

       Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
       even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.

SEE ALSO

       slurm.conf(5)