Provided by:
slurm-llnl_1.2.20-1_i386 
NAME
slurm.conf - Slurm configuration file
DESCRIPTION
/etc/slurm.conf is an ASCII file which describes general SLURM
configuration information, the nodes to be managed, information about
how those nodes are grouped into partitions, and various scheduling
parameters associated with those partitions.
The file location can be modified at system build time using the
DEFAULT_SLURM_CONF parameter. In addition, you can use the SLURM_CONF
environment variable to override the built-in location of this file.
The SLURM daemons also allow you to override both the built-in and
environment-provided location using the "-f" option on the command
line.
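The lookup order described above can be sketched as follows. This is
an illustration only, not SLURM source code: the function name and the
BUILT_IN_CONF constant (standing in for the build-time
DEFAULT_SLURM_CONF value) are assumptions made for the example.

```python
import os

# Hypothetical stand-in for the compile-time DEFAULT_SLURM_CONF value.
BUILT_IN_CONF = "/etc/slurm.conf"


def resolve_conf_path(cli_f_option=None, environ=None):
    """Return the configuration path under the precedence described above."""
    env = os.environ if environ is None else environ
    if cli_f_option:               # "-f" on a daemon command line wins
        return cli_f_option
    if env.get("SLURM_CONF"):      # then the SLURM_CONF environment variable
        return env["SLURM_CONF"]
    return BUILT_IN_CONF           # finally the built-in location
```

Note that only the daemons are described as accepting "-f"; other
commands would pass nothing for cli_f_option in this sketch.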
The contents of the file are case insensitive except for the names of
nodes and partitions. Any text following a "#" in the configuration
file is treated as a comment through the end of that line. The size of
each line in the file is limited to 1024 characters. Changes to the
configuration file take effect upon restart of SLURM daemons, daemon
receipt of the SIGHUP signal, or execution of the command "scontrol
reconfigure" unless otherwise noted.
If a line begins with the word "Include" followed by whitespace and
then a file name, that file will be included inline with the current
configuration file.
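A minimal sketch of how a single line might be classified under these
rules is shown below. It assumes whitespace-separated "Keyword=value"
tokens (the form used in examples such as "NodeName=DEFAULT") and does
not handle values containing whitespace or the 1024 character limit.
The function name is hypothetical and this is not the parser SLURM
itself uses.

```python
def parse_line(line):
    """Classify one line as ("include", path), ("params", dict) or None."""
    text = line.split("#", 1)[0].strip()    # drop comment text after "#"
    if not text:
        return None
    words = text.split(None, 1)
    if words[0].lower() == "include" and len(words) == 2:
        return ("include", words[1].strip())
    params = {}
    for token in text.split():
        key, _, value = token.partition("=")
        # Keywords are case insensitive; values keep their case because
        # node and partition names are case sensitive.
        params[key.lower()] = value
    return ("params", params)
```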
The overall configuration parameters available include:
AuthType
Define the authentication method for communications between
SLURM components. Acceptable values at present include
"auth/none", "auth/authd", and "auth/munge". The default value
is "auth/none", which means the UID included in communication
messages is not verified. This may be fine for testing
purposes, but do not use "auth/none" if you desire any security.
"auth/authd" indicates that Brent Chun’s authd is to be used
(see "http://www.theether.org/authd/" for more information).
"auth/munge" indicates that Chris Dunlap’s munge is to be used
(this is the best supported authentication mechanism for SLURM,
see "http://www.llnl.gov/linux/munge/" for more information).
All SLURM daemons and commands must be terminated prior to
changing the value of AuthType and later restarted (SLURM jobs
can be preserved).
BackupAddr
Name that BackupController should be referred to in establishing
a communications path. This name will be used as an argument to
the gethostbyname() function for identification. For example,
"elx0000" might be used to designate the ethernet address for
node "lx0000". By default the BackupAddr will be identical in
value to BackupController.
BackupController
The name of the machine where SLURM control functions are to be
executed in the event that ControlMachine fails. This node may
also be used as a compute server if so desired. It will come
into service as a controller only upon the failure of
ControlMachine and will revert to a "standby" mode when the
ControlMachine becomes available once again. This should be a
node name without the full domain name (e.g. "lx0002"). While
not essential, it is recommended that you specify a backup
controller. See the RELOCATING CONTROLLERS section if you
change this.
CacheGroups
If set to 1, the slurmd daemon will cache /etc/group entries.
This can improve performance for highly parallel jobs if NIS
servers are used and unable to respond very quickly. The
default value is 0 to disable caching group data.
CheckpointType
Define the system-initiated checkpoint method to be used for
user jobs. The slurmctld daemon must be restarted for a change
in CheckpointType to take effect. Acceptable values at present
include "checkpoint/aix" (only on AIX systems),
"checkpoint/ompi" (requires OpenMPI version 1.3 or higher), and
"checkpoint/none". The default value is "checkpoint/none".
ControlAddr
Name that ControlMachine should be referred to in establishing a
communications path. This name will be used as an argument to
the gethostbyname() function for identification. For example,
"elx0000" might be used to designate the ethernet address for
node "lx0000". By default the ControlAddr will be identical in
value to ControlMachine.
ControlMachine
The name of the machine where SLURM control functions are
executed. This should be a node name without the full domain
name (e.g. "lx0001"). This value must be specified. See the
RELOCATING CONTROLLERS section if you change this.
Epilog
Fully qualified pathname of a script to execute as user root on
every node when a user’s job completes (e.g.
"/usr/local/slurm/epilog"). This may be used to purge files,
disable user login, etc. By default there is no epilog.
FastSchedule
Controls how a node’s configuration specifications in slurm.conf
are used. If the number of node configuration entries in the
configuration file is significantly lower than the number of
nodes, setting FastSchedule to 1 will permit much faster
scheduling decisions to be made. (The scheduler can just check
the values in a few configuration records instead of possibly
thousands of node records. If a job can’t be initiated
immediately, the scheduler may execute these tests repeatedly.)
Note that on systems with hyper-threading, the processor count
reported by the node will be twice the actual processor count.
Consider which value you want to be used for scheduling
purposes.
1 (default)
Consider the configuration of each node to be that
specified in the configuration file and any node with fewer
than the configured resources will be set DOWN.
0 Base scheduling decisions upon the actual configuration of
each individual node.
2 Consider the configuration of each node to be that
specified in the slurm.conf configuration file and any node
with fewer resources than configured will not be set DOWN.
This can be useful for testing purposes.
FirstJobId
The job id to be used for the first job submitted to SLURM without a
specific requested value. Job id values generated will be
incremented by 1 for each subsequent job. This may be used to
provide a meta-scheduler with a job id space which is disjoint
from the interactive jobs. The default value is 1.
HeartbeatInterval
Defunct parameter. Interval of heartbeat for slurmd daemon is
half of SlurmdTimeout. Interval of heartbeat for slurmctld
daemon is half of SlurmctldTimeout.
InactiveLimit
The interval, in seconds, a job or job step is permitted to be
inactive before it is terminated. A job or job step is
considered inactive if the associated srun command is not
responding to slurm daemons. This could be due to the
termination of the srun command or the program being in a
stopped state. A batch job is considered inactive if it has no
active job steps (e.g. periods of pre- and post-processing).
This limit permits defunct jobs to be purged in a timely fashion
without waiting for their time limit to be reached. This value
should reflect the possibility that the srun command may be stopped
by a debugger or considerable time could be required for batch
job pre- and post-processing. This limit is ignored for jobs
running in partitions with the RootOnly flag set (the scheduler
running as root will be responsible for the job). The default
value is unlimited (zero). May not exceed 65533.
JobAcctType
Define the job accounting mechanism type. Acceptable values at
present include "jobacct/aix" (for AIX operating system),
"jobacct/linux" (for Linux operating system) and "jobacct/none"
(no accounting data collected). The default value is
"jobacct/none". In order to use the sacct tool, "jobacct/aix"
or "jobacct/linux" must be configured.
JobAcctLogFile
Define the location where job accounting logs are to be written.
For jobacct/none this parameter is ignored. For jobacct/linux
this is the fully-qualified file name for the data file.
JobAcctFrequency
Define the polling frequency to pass to the job accounting
plugin. For jobacct/none this parameter is ignored. For
jobacct/linux the parameter is the number of seconds between
polls.
JobCompLoc
The interpretation of this value depends upon the logging
mechanism specified by the JobCompType parameter.
JobCompType
Define the job completion logging mechanism type. Acceptable
values at present include "jobcomp/none", "jobcomp/filetxt", and
"jobcomp/script". The default value is "jobcomp/none", which
means that upon job completion the record of the job is purged
from the system. The value "jobcomp/filetxt" indicates that a
record of the job should be written to a text file specified by
the JobCompLoc parameter. The value "jobcomp/script" indicates
that a script specified by the JobCompLoc parameter is to be
executed with environment variables indicating the job
information.
JobCredentialPrivateKey
Fully qualified pathname of a file containing a private key used
for authentication by Slurm daemons.
JobCredentialPublicCertificate
Fully qualified pathname of a file containing a public key used
for authentication by Slurm daemons.
JobFileAppend
This option controls what to do if a job’s output or error file
exists when the job is started. If JobFileAppend is set to a
value of 1, then append to the existing file. By default, any
existing file is truncated. NOTE: This variable does not appear
in the output of the command "scontrol show config" in versions
of SLURM less than version 1.3.
KillTree
This option is mapped to "ProctrackType=proctrack/linuxproc".
It will be removed from a future release.
KillWait
The interval, in seconds, given to a job’s processes between the
SIGTERM and SIGKILL signals upon reaching its time limit. If
the job fails to terminate gracefully in the interval specified,
it will be forcibly terminated. The default value is 30
seconds. May not exceed 65533.
MailProg
Fully qualified pathname to the program used to send email per
user request. The default value is "/bin/mail".
MaxJobCount
The maximum number of jobs SLURM can have in its active database
at one time. Set the values of MaxJobCount and MinJobAge to
ensure the slurmctld daemon does not exhaust its memory or other
resources. Once this limit is reached, requests to submit
additional jobs will fail. The default value is 2000 jobs. This
value may not be reset via "scontrol reconfig". It only takes
effect upon restart of the slurmctld daemon. May not exceed
65533.
MessageTimeout
Time permitted for a round-trip communication to complete in
seconds. Default value is 10 seconds. For systems with shared
nodes, the slurmd daemon could be paged out and necessitate
higher values.
MinJobAge
The minimum age of a completed job before its record is purged
from SLURM’s active database. Set the values of MaxJobCount and
MinJobAge to ensure the slurmctld daemon does not exhaust its
memory or other resources. The default value is 300 seconds. A
value of zero prevents any job record purging. May not exceed
65533.
MpiDefault
Identifies the default type of MPI to be used. Srun may
override this configuration parameter in any case. Currently
supported versions include: mpichgm, mvapich, none (default,
which works for many other versions of MPI including LAM MPI and
Open MPI).
PluginDir
Identifies the places in which to look for SLURM plugins. This
is a colon-separated list of directories, like the PATH
environment variable. The default value is
"/usr/local/lib/slurm".
PlugStackConfig
Location of the config file for SLURM stackable plugins that use
the Stackable Plugin Architecture for Node job (K)control
(SPANK). This provides support for a highly configurable set of
plugins to be called before and/or after execution of each task
spawned as part of a user’s job step. Default location is
"plugstack.conf" in the same directory as the system slurm.conf.
For more information on SPANK plugins, see the spank(8) manual.
ProctrackType
Identifies the plugin to be used for process tracking. The
slurmd daemon uses this mechanism to identify all processes
which are children of processes it spawns for a user job. The
slurmd daemon must be restarted for a change in ProctrackType to
take effect. NOTE: "proctrack/linuxproc" and "proctrack/pgid"
can fail to identify all processes associated with a job since
processes can become a child of the init process (when the
parent process terminates) or change their process group. To
reliably track all processes, one of the other mechanisms
utilizing kernel modifications is preferable. NOTE:
"proctrack/linuxproc" is not compatible with "switch/elan".
Acceptable values at present include:
proctrack/aix which uses an AIX kernel extension and is
the default for AIX systems
proctrack/linuxproc which uses the Linux process tree using
parent process IDs
proctrack/rms which uses a Quadrics kernel patch and is the
default if "SwitchType=switch/elan"
proctrack/sgi_job which uses SGI’s Process Aggregates (PAGG)
kernel module, see http://oss.sgi.com/projects/pagg/ for
more information
proctrack/pgid which uses process group IDs and is the
default for all other systems
Prolog
Fully qualified pathname of a script for the slurmd to execute
whenever it is asked to run a job step from a new job
allocation. (e.g. "/usr/local/slurm/prolog"). The slurmd
executes the script before starting the job step. This may be
used to purge files, enable user login, etc. By default there
is no prolog. Any configured script is expected to complete
execution quickly (in less time than MessageTimeout).
NOTE: The Prolog script is ONLY run on any individual node when
it first sees a job step from a new allocation; it does not run
the Prolog immediately when an allocation is granted. If no job
steps from an allocation are run on a node, it will never run
the Prolog for that allocation. The Epilog, on the other hand,
always runs on every node of an allocation when the allocation
is released.
PropagatePrioProcess
Setting PropagatePrioProcess to "1" will cause a user’s job to
run with the same priority (aka nice value) as the user’s process
which launched the job on the submit node. If set to "0", or
left unset, the user’s job will inherit the scheduling priority
from the slurm daemon.
PropagateResourceLimits
A list of comma separated resource limit names. The slurmd
daemon uses these names to obtain the associated (soft) limit
values from the user’s process environment on the submit node.
These limits are then propagated and applied to the jobs that
will run on the compute nodes. This parameter can be useful
when system limits vary among nodes. Any resource limits that
do not appear in the list are not propagated. However, the user
can override this by specifying which resource limits to
propagate with the srun command’s "--propagate" option. If
neither of the ’propagate resource limit’ parameters are
specified, then the default action is to propagate all limits.
Only one of the parameters, either PropagateResourceLimits or
PropagateResourceLimitsExcept, may be specified. The following
limit names are supported by Slurm (although some options may
not be supported on some systems):
ALL All limits listed below
AS The maximum address space for a process
CORE The maximum size of core file
CPU The maximum amount of CPU time
DATA The maximum size of a process’s data segment
FSIZE The maximum size of files created
MEMLOCK The maximum size that may be locked into memory
NOFILE The maximum number of open files
NPROC The maximum number of processes available
RSS The maximum resident set size
STACK The maximum stack size
PropagateResourceLimitsExcept
A list of comma separated resource limit names. By default, all
resource limits will be propagated, (as described by the
PropagateResourceLimits parameter), except for the limits
appearing in this list. The user can override this by
specifying which resource limits to propagate with the srun
command’s "--propagate" option. See PropagateResourceLimits
above for a list of valid limit names.
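The interaction of the two parameters can be sketched as below. This
is an illustration under stated assumptions, not SLURM code: the
function name is hypothetical, limit names are compared without regard
to case, and the per-job override through srun’s "--propagate" option
is not modelled.

```python
LIMIT_NAMES = ["AS", "CORE", "CPU", "DATA", "FSIZE", "MEMLOCK",
               "NOFILE", "NPROC", "RSS", "STACK"]


def limits_to_propagate(propagate=None, propagate_except=None):
    """Return the limit names propagated for the given parameter values."""
    if propagate is not None and propagate_except is not None:
        raise ValueError("only one of the two parameters may be specified")

    def expand(value):
        names = [n.strip().upper() for n in value.split(",") if n.strip()]
        return LIMIT_NAMES if "ALL" in names else names

    if propagate is not None:
        listed = expand(propagate)
        return [n for n in LIMIT_NAMES if n in listed]
    if propagate_except is not None:
        excluded = expand(propagate_except)
        return [n for n in LIMIT_NAMES if n not in excluded]
    return list(LIMIT_NAMES)       # neither set: propagate all limits
```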
ReturnToService
If set to 1, then a non-responding (DOWN) node will become
available for use upon registration. Note that a DOWN node’s state
will be changed only if it was set DOWN due to being
non-responsive. If the node was set DOWN for any other reason
(low memory, prolog failure, epilog failure, etc.), its state
will not automatically be changed. The default value is 0,
which means that a node will remain in the DOWN state until a
system administrator explicitly changes its state (even if the
slurmd daemon registers and resumes communications).
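The rule can be condensed into a small predicate. The function name
and the "not responding" reason label are assumptions chosen for this
sketch; they are not identifiers taken from SLURM.

```python
def available_after_registration(return_to_service, down_reason):
    """Return True if a DOWN node becomes usable once its slurmd registers.

    down_reason is a hypothetical label; "not responding" marks a node
    that was set DOWN only because it stopped responding.
    """
    return return_to_service == 1 and down_reason == "not responding"
```

With ReturnToService=0 the predicate is always false, which matches
the requirement that an administrator change the node state by hand.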
SchedulerRootFilter
If set to ’1’ then the scheduler will filter and avoid RootOnly
partitions (letting the root user or process schedule these partitions).
Otherwise the scheduler will treat RootOnly partitions as any other
standard partition. Currently only supported by the sched/backfill
scheduler plugin.
SchedulerPort
The port number on which slurmctld should listen for connection
requests. This value is only used by the Maui Scheduler (see
SchedulerType). The default value is 7321.
SchedulerRootFilter
Identifies whether or not RootOnly partitions should be filtered
from any external scheduling activities. If set to 0, then
RootOnly partitions are treated like any other partition. If set
to 1, then RootOnly partitions are exempt from any external
scheduling activities. The default value is 1. Currently only
used by the built-in backfill scheduling module "sched/backfill"
(see SchedulerType).
SchedulerType
Identifies the type of scheduler to be used. Acceptable values
include "sched/builtin" for the built-in FIFO scheduler,
"sched/backfill" for a backfill scheduling module to augment the
default FIFO scheduling, "sched/hold" to hold all newly arriving
jobs if a file "/etc/slurm.hold" exists, otherwise use the
built-in FIFO scheduler, and "sched/wiki" for the Wiki interface
to the Maui Scheduler. The default value is "sched/builtin".
Backfill scheduling will initiate lower-priority jobs if doing
so does not delay the expected initiation time of any higher
priority job. Note that this backfill scheduler implementation
is relatively simple. It does not support partitions configured
to share resources (run multiple jobs on the same nodes) or
support jobs requesting specific nodes. When initially setting
the value to "sched/wiki", any pending jobs must have their
priority set to zero (held). When changing the value from
"sched/wiki", all pending jobs should have their priority changed
from zero to some large number. The scontrol command can be
used to change job priorities. The slurmctld daemon must be
restarted for a change in scheduler type to become effective.
SelectType
Identifies the type of resource selection algorithm to be used.
Acceptable values include
select/linear
for allocation of entire nodes assuming a one-dimensional
array of nodes in which sequentially ordered nodes are
preferable. This is the default value for non-BlueGene
systems.
select/cons_res
The resources within a node are individually allocated as
consumable resources. Note that whole nodes can be
allocated to jobs for selected partitions by using the
Shared=EXCLUSIVE option. See the partition Shared
parameter for more information.
select/bluegene
for a three-dimensional BlueGene system. The default
value is "select/bluegene" for BlueGene systems.
SelectTypeParameters
This parameter applies only to SelectType=select/cons_res.
CR_CPU CPUs are consumable resources. There is no notion of
sockets, cores or threads. On a multi-core system, each
core will be considered a CPU. On a multi-core and
hyperthreaded system, each thread will be considered a
CPU. On single-core systems, each CPU will be
considered a CPU.
CR_CPU_Memory
CPUs and memory are consumable resources.
CR_Core
Cores are consumable resources.
CR_Core_Memory
Cores and memory are consumable resources.
CR_Socket
Sockets are consumable resources.
CR_Socket_Memory
Sockets and memory are consumable resources.
CR_Memory
Memory is a consumable resource. NOTE: This implies
Shared=Yes for all partitions.
SlurmUser
The name of the user that the slurmctld daemon executes as. For
security purposes, a user other than "root" is recommended. The
default value is "root".
SlurmctldDebug
The level of detail to provide in the slurmctld daemon’s logs. Values
from 0 to 7 are legal, with ‘0’ being "quiet" operation and ‘7’
being insanely verbose. The default value is 3.
SlurmctldLogFile
Fully qualified pathname of a file into which the slurmctld
daemon’s logs are written. The default value is none (performs
logging via syslog).
SlurmctldPidFile
Fully qualified pathname of a file into which the slurmctld
daemon may write its process id. This may be used for automated
signal processing. The default value is
"/var/run/slurmctld.pid".
SlurmctldPort
The port number that the SLURM controller, slurmctld, listens to
for work. The default value is SLURMCTLD_PORT as established at
system build time. If none is explicitly specified, it will be
set to 6817. NOTE: Either slurmctld and slurmd daemons must not
execute on the same nodes or the values of SlurmctldPort and
SlurmdPort must be different.
SlurmctldTimeout
The interval, in seconds, that the backup controller waits for
the primary controller to respond before assuming control. The
default value is 120 seconds. May not exceed 65533.
SlurmdDebug
The level of detail to provide in the slurmd daemon’s logs. Values
from 0 to 7 are legal, with ‘0’ being "quiet" operation and ‘7’
being insanely verbose. The default value is 3.
SlurmdLogFile
Fully qualified pathname of a file into which the slurmd
daemon’s logs are written. The default value is none (performs
logging via syslog). Any "%h" within the name is replaced with
the hostname on which the slurmd is running.
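The "%h" substitution amounts to a plain string replacement, sketched
below. The function name and the example path are hypothetical, and
whether slurmd substitutes the short or the fully qualified host name
is not stated here, so the sketch simply uses whatever the system
reports unless a name is supplied.

```python
import socket


def expand_log_path(pattern, hostname=None):
    """Replace every "%h" in a SlurmdLogFile value with the host name."""
    host = hostname if hostname is not None else socket.gethostname()
    return pattern.replace("%h", host)
```

A shared pattern such as "/var/log/slurmd.%h.log" would then yield a
distinct log file for each compute node.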
SlurmdPidFile
Fully qualified pathname of a file into which the slurmd daemon
may write its process id. This may be used for automated signal
processing. The default value is "/var/run/slurmd.pid".
SlurmdPort
The port number that the SLURM compute node daemon, slurmd,
listens to for work. The default value is SLURMD_PORT as
established at system build time. If none is explicitly
specified, its value will be 6818. NOTE: Either slurmctld and
slurmd daemons must not execute on the same nodes or the values
of SlurmctldPort and SlurmdPort must be different.
SlurmdSpoolDir
Fully qualified pathname of a directory into which the slurmd
daemon’s state information and batch job script information are
written. This must be a common pathname for all nodes, but
should represent a directory which is local to each node
(reference a local file system). The default value is
"/var/spool/slurmd". NOTE: This directory is also used to store
slurmd’s shared memory lockfile, and should not be changed
unless the system is being cleanly restarted. If the location of
SlurmdSpoolDir is changed and slurmd is restarted, the new
daemon will attach to a different shared memory region and lose
track of any running jobs.
SlurmdTimeout
The interval, in seconds, that the SLURM controller waits for
slurmd to respond before configuring that node’s state to DOWN.
The default value is 300 seconds. A value of zero indicates the
node will not be tested by slurmctld to confirm the state of
slurmd, the node will not be automatically set to a DOWN state
indicating a non-responsive slurmd, and some other tool will
take responsibility for monitoring the state of each compute
node and its slurmd daemon. The value may not exceed 65533.
StateSaveLocation
Fully qualified pathname of a directory into which the SLURM
controller, slurmctld, saves its state (e.g.
"/usr/local/slurm/checkpoint"). SLURM state will be saved here to
recover from system failures. SlurmUser must be able to create
files in this directory. If you have a BackupController
configured, this location should be readable and writable by
both systems. The default value is "/tmp". If any slurm
daemons terminate abnormally, their core files will also be
written into this directory.
SrunEpilog
Fully qualified pathname of an executable to be run by srun
following the completion of a job step. The command line
arguments for the executable will be the command and arguments
of the job step. This configuration parameter may be overridden
by srun’s --epilog parameter.
SrunProlog
Fully qualified pathname of an executable to be run by srun
prior to the launch of a job step. The command line arguments
for the executable will be the command and arguments of the job
step. This configuration parameter may be overridden by srun’s
--prolog parameter.
SwitchType
Identifies the type of switch or interconnect used for
application communications. Acceptable values include
"switch/none" for switches not requiring special processing for
job launch or termination (Myrinet, Ethernet, and InfiniBand),
"switch/elan" for Quadrics Elan 3 or Elan 4 interconnect. The
default value is "switch/none". All SLURM daemons, commands and
running jobs must be restarted for a change in SwitchType to
take effect. If running jobs exist at the time slurmctld is
restarted with a new value of SwitchType, records of all jobs in
any state may be lost.
TaskEpilog
Fully qualified pathname of a program to be executed as the slurm
job’s owner after termination of each task. See TaskPlugin for
execution order details.
TaskPlugin
Identifies the type of task launch plugin, typically used to
provide resource management within a node (e.g. pinning tasks to
specific processors). Acceptable values include "task/none" for
systems requiring no special handling and "task/affinity" to
enable the --cpu_bind and/or --mem_bind srun options. The
default value is "task/none". If you use "task/affinity" and
encounter problems, it may be due to the variety of system calls
used to implement task affinity on different operating systems.
If that is the case, you may want to use Portable Linux Process
Affinity (PLPA, see http://www.open-mpi.org/software/plpa),
which is supported by SLURM. The order of task prolog/epilog
execution is as follows:
1. pre_launch(): function in TaskPlugin
2. TaskProlog: system-wide per task program defined in
slurm.conf
3. user prolog: job step specific task program defined using
srun’s --task-prolog option or SLURM_TASK_PROLOG
environment variable
4. Execute the job step’s task
5. user epilog: job step specific task program defined using
srun’s --task-epilog option or SLURM_TASK_EPILOG
environment variable
6. TaskEpilog: system-wide per task program defined in
slurm.conf
7. post_term(): function in TaskPlugin
TaskPluginParam
Optional parameters for the task plugin.
Cpusets Use cpusets to perform task affinity functions
Sched Use sched_setaffinity or plpa_sched_setaffinity (if
available) to bind tasks to processors. This is the
default mode of operation if no parameters are
specified.
TaskProlog
Fully qualified pathname of a program to be executed as the slurm
job’s owner prior to initiation of each task. Besides the
normal environment variables, this has SLURM_TASK_PID available
to identify the process ID of the task being started. Standard
output from this program of the form "export NAME=value" will be
used to set environment variables for the task being spawned.
See TaskPlugin for execution order details.
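As an illustration, a TaskProlog program could be as small as the
sketch below. The variable name MY_TASK_SCRATCH and the directory
layout are invented for the example; only SLURM_TASK_PID and the
"export NAME=value" output form come from the description above.

```python
import os


def prolog_output(environ):
    """Build the lines a TaskProlog would print for the task being started."""
    pid = environ.get("SLURM_TASK_PID", "unknown")
    # MY_TASK_SCRATCH is a hypothetical variable used only for illustration.
    return ["export MY_TASK_SCRATCH=/tmp/task_%s" % pid]


if __name__ == "__main__":
    for line in prolog_output(os.environ):
        print(line)    # standard output is read for "export NAME=value" lines
```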
TmpFS
Fully qualified pathname of the file system available to user
jobs for temporary storage. This parameter is used in
establishing a node’s TmpDisk space. The default value is
"/tmp".
TreeWidth
Slurmd daemons use a virtual tree network for communications.
TreeWidth specifies the width of the tree (i.e. the fanout).
The default value is 50, meaning each slurmd daemon can
communicate with up to 50 other slurmd daemons and over 2500
nodes can be contacted with two message hops. The default value
will work well for most clusters. Optimal system performance
can typically be achieved if TreeWidth is set to the square root
of the number of nodes in the cluster for systems having no more
than 2500 nodes or the cube root for larger systems.
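The arithmetic behind this guidance can be sketched as follows.
Rounding the root up to the next integer is an assumption of this
example, as are the function names; the reach calculation simply
counts the nodes contacted at each hop for a given fanout.

```python
def _ceil_root(n, k):
    """Smallest integer r with r**k >= n (exact, avoids float rounding)."""
    r = 1
    while r ** k < n:
        r += 1
    return r


def suggested_tree_width(node_count):
    """Square root up to 2500 nodes, cube root for larger systems."""
    return _ceil_root(node_count, 2 if node_count <= 2500 else 3)


def reachable(width, hops):
    """Number of nodes contacted within the given number of message hops."""
    return sum(width ** i for i in range(1, hops + 1))
```

For the default width of 50, reachable(50, 2) gives 2550, consistent
with the statement that over 2500 nodes can be contacted in two hops.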
UnkillableStepProgram
If the processes in a job step are determined to be unkillable
for a period of time specified by the UnkillableStepTimeout
variable, the program specified by the UnkillableStepProgram
string will be executed. This program can be used to take
special actions to clean up the unkillable processes. The
program will be run as the same user as the slurmd (usually
"root"). NOTE: This variable does not appear in the output of
the command "scontrol show config" in versions of SLURM less
than version 1.3.
UnkillableStepTimeout
The length of time, in seconds, that SLURM will wait before
deciding that processes in a job step are unkillable (after they
have been signalled with SIGKILL). The default timeout value is
60 seconds. NOTE: This variable does not appear in the output
of the command "scontrol show config" in versions of SLURM less
than version 1.3.
UsePAM
If set to 1, PAM (Pluggable Authentication Modules for Linux)
will be enabled. PAM is used to establish the upper bounds for
resource limits. With PAM support enabled, local system
administrators can dynamically configure system resource limits.
Changing the upper bound of a resource limit will not alter the
limits of running jobs, only jobs started after a change has
been made will pick up the new limits. The default value is 0
(not to enable PAM support). Remember that PAM also needs to be
configured to support SLURM as a service. For sites using PAM’s
directory based configuration option, a configuration file named
slurm should be created. The module-type, control-flags, and
module-path names that should be included in the file are:
auth required pam_localuser.so
auth required pam_shells.so
account required pam_unix.so
account required pam_access.so
session required pam_unix.so
For sites configuring PAM with a general configuration file, the
appropriate lines (see above), where slurm is the service-name,
should be added.
WaitTime
Specifies how many seconds the srun command should by default
wait after the first task terminates before terminating all
remaining tasks. The "--wait" option on the srun command line
overrides this value. If set to 0, this feature is disabled.
May not exceed 65533.
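Several of the parameters above share the upper bound of 65533. A site
might check a set of values before installing a new file with
something like the sketch below. The function and constant names are
hypothetical, and only parameters whose descriptions above state the
bound are included.

```python
UPPER_BOUND = 65533
BOUNDED = {name.lower() for name in (
    "InactiveLimit", "KillWait", "MaxJobCount", "MinJobAge",
    "SlurmctldTimeout", "SlurmdTimeout", "WaitTime")}


def over_limit(settings):
    """Return the names of bounded parameters whose value exceeds 65533."""
    return sorted(name for name, value in settings.items()
                  if name.lower() in BOUNDED and int(value) > UPPER_BOUND)
```

Keywords are matched without regard to case, in line with the rule
that the file contents are case insensitive.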
The configuration of nodes (or machines) to be managed by Slurm is also
specified in /etc/slurm.conf. Only the NodeName must be supplied in
the configuration file. All other node configuration information is
optional. It is advisable to establish baseline node configurations,
especially if the cluster is heterogeneous. Nodes which register to
the system with fewer than the configured resources (e.g. too little
memory) will be placed in the "DOWN" state to avoid scheduling jobs on
them. Establishing baseline configurations will also speed SLURM’s
scheduling process by permitting it to compare job requirements against
these (relatively few) configuration parameters and possibly avoid
having to check job requirements against every individual node’s
configuration. The resources checked at node registration time are:
Procs, RealMemory and TmpDisk. While baseline values for each of these
can be established in the configuration file, the actual values upon
node registration are recorded and these actual values may be used for
scheduling purposes (depending upon the value of FastSchedule in the
configuration file).
Default values can be specified with a record in which "NodeName" is
"DEFAULT". The default entry values will apply only to lines following
it in the configuration file and the default values can be reset
multiple times in the configuration file with multiple entries where
"NodeName=DEFAULT". The "NodeName=" specification must be placed on
every line describing the configuration of nodes. In fact, it is
generally possible and desirable to define the configurations of all
nodes in only a few lines. This convention permits significant
optimization in the scheduling of larger clusters. In order to support
the concept of jobs requiring consecutive nodes on some architectures,
node specifications should be placed in this file in consecutive order.
No single node name may be listed more than once in the configuration
file. Use "DownNodes=" to record the state of nodes which are
temporarily in a DOWN or DRAIN state without altering permanent
configuration information. A job step’s tasks are allocated to nodes
in the order the nodes appear in the configuration file. There is presently
no capability within SLURM to arbitrarily order a job step’s tasks.
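As an illustration of the "DEFAULT" record, the following sketch resets
the defaults once so that two classes of nodes can each be described on
a single line. The node names "tux" and "big" and all of the values are
hypothetical and are not taken from any real cluster.

```conf
# Defaults applied to the node lines that follow
NodeName=DEFAULT Procs=2 RealMemory=2048 TmpDisk=16384
NodeName=tux[0-31]
# Reset the defaults for a second, larger class of nodes
NodeName=DEFAULT Procs=4 RealMemory=8192 TmpDisk=65536
NodeName=big[0-7]
```

Listing the nodes in consecutive order, as here, also follows the
ordering advice given above.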
Multiple node names may be comma separated (e.g. "alpha,beta,gamma")
and/or a simple node range expression may optionally be used to specify
numeric ranges of nodes to avoid building a configuration file with
large numbers of entries. The node range expression can contain one
pair of square brackets with a sequence of comma separated numbers
and/or ranges of numbers separated by a "-" (e.g. "linux[0-64,128]", or
"lx[15,18,32-33]"). Note that the numeric ranges can include one or
more leading zeros to indicate the numeric portion has a fixed number
of digits (e.g. "linux[0000-1023]").
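The following fragment sketches how such expressions might appear on
"NodeName=" lines. The name "node" and the Procs values are
hypothetical; "lx" and "alpha,beta,gamma" are the names used in the
text above.

```conf
# "lx[15,18,32-33]" names four nodes: lx15, lx18, lx32 and lx33
NodeName=lx[15,18,32-33] Procs=2
# Leading zeros fix the width: node001 through node100
NodeName=node[001-100] Procs=2
# A comma separated list of plain names is also accepted
NodeName=alpha,beta,gamma Procs=2
```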
On BlueGene systems only, the square brackets should contain pairs of
three digit numbers separated by an "x". These numbers indicate the
boundaries of a rectangular prism (e.g. "bgl[000x144,400x544]"). See
BlueGene documentation for more details. Presently the numeric range
must be the last characters in the node name (e.g. "unit[0-31]rack1" is
invalid). The node configuration specifies the following information:
NodeName
Name that SLURM uses to refer to a node (or base partition for
BlueGene systems). Typically this would be the string that
"/bin/hostname -s" returns; however, it may be an arbitrary string
if NodeHostname is specified. If the NodeName is "DEFAULT", the
values specified with that record will apply to subsequent node
specifications unless explicitly set to other values in that
node record or replaced with a different set of default values.
For architectures in which the node order is significant, nodes
will be considered consecutive in the order defined. For
example, if the configuration for "NodeName=charlie" immediately
follows the configuration for "NodeName=baker" they will be
considered adjacent in the computer.
NodeHostname
The string that "/bin/hostname -s" returns. A node range
expression can be used to specify a set of nodes. If an
expression is used, the number of nodes identified by
NodeHostname on a line in the configuration file must be
identical to the number of nodes identified by NodeName. By
default, the NodeHostname will be identical in value to
NodeName.
NodeAddr
Name that a node should be referred to in establishing a
communications path. This name will be used as an argument to
the gethostbyname() function for identification. If a node
range expression is used to designate multiple nodes, they must
exactly match the entries in the NodeName (e.g.
"NodeName=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also
contain IP addresses. By default, the NodeAddr will be
identical in value to NodeName.
Feature
A comma delimited list of arbitrary strings indicative of some
characteristic associated with the node. There is no value
associated with a feature at this time; a node either has a
feature or it does not. If desired, a feature may contain a
numeric component indicating, for example, processor speed. By
default a node has no features.
RealMemory
Size of real memory on the node in MegaBytes (e.g. "2048"). The
default value is 1.
Procs Number of logical processors on the node (e.g. "2"). If Procs
is omitted, it will be inferred from Sockets, CoresPerSocket,
and ThreadsPerCore. The default value is 1.
Sockets
Number of physical processor sockets/chips on the node (e.g.
"2"). If Sockets is omitted, it will be inferred from Procs,
CoresPerSocket, and ThreadsPerCore. NOTE: If you have
multi-core processors, you will likely need to specify these
parameters. The default value is 1.
CoresPerSocket
Number of cores in a single physical processor socket (e.g.
"2"). The CoresPerSocket value describes physical cores, not
the logical number of processors per socket. NOTE: If you have
multi-core processors, you will likely need to specify this
parameter. The default value is 1.
ThreadsPerCore
Number of logical threads in a single physical core (e.g. "2").
The default value is 1.
Reason Identifies the reason for a node being in state "DOWN" or
"DRAIN". Use quotes to enclose a reason having more than one
word.
State State of the node with respect to the initiation of user jobs.
Acceptable values are "DOWN", "DRAIN" and "UNKNOWN". "DOWN"
indicates the node failed and is unavailable to be allocated
work. "DRAIN" indicates the node is unavailable to be allocated
work. "UNKNOWN" indicates the node’s state is undefined (BUSY
or IDLE), but will be established when the slurmd daemon on that
node registers. The default value is "UNKNOWN". Also see the
DownNodes parameter below.
TmpDisk
Total size of temporary disk storage in TmpFS in MegaBytes (e.g.
"16384"). TmpFS (for "Temporary File System") identifies the
location which jobs should use for temporary storage. Note this
does not indicate the amount of free space available to the user
on the node, only the total file system size. The system
administrator should ensure this file system is purged as
needed so that user jobs have access to most of this space. The
Prolog and/or Epilog programs (specified in the configuration
file) might be used to ensure the file system is kept clean.
The default value is 1.
Weight The priority of the node for scheduling purposes. All things
being equal, jobs will be allocated the nodes with the lowest
weight which satisfy their requirements. For example, a
heterogeneous collection of nodes might be placed into a single
partition for greater system utilization, responsiveness and
capability. It would be preferable to allocate smaller memory
nodes rather than larger memory nodes if either will satisfy a
job’s requirements. The units of weight are arbitrary, but
larger weights should be assigned to nodes with more processors,
memory, disk space, higher processor speed, etc. Weight is an
integer value with a default value of 1.
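The node records below are a sketch of how the parameters above might
be combined. The node names "quad" and "small", the feature name "fast"
and all numeric values are hypothetical; "lx" and "elx" follow the
NodeAddr example above.

```conf
# Dual-socket, dual-core nodes without hardware threads. Procs is
# omitted and left to be inferred from the three counts (presumably
# their product, 2 x 2 x 1 = 4).
NodeName=quad[0-15] Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=4096 TmpDisk=16384 Feature=fast Weight=2
# Smaller nodes get a lower weight so that they are allocated first
# when either class of node satisfies a job.
NodeName=small[0-15] Procs=2 RealMemory=2048 TmpDisk=16384 Weight=1
# NodeName is the name SLURM uses; NodeAddr is how the node is reached.
NodeName=lx[0-7] NodeAddr=elx[0-7] Procs=2
```

All parameters for a node record are kept on one line, which must stay
within the 1024 character limit noted in the DESCRIPTION.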
The "DownNodes=" configuration permits you to mark certain nodes as in
a DOWN or DRAIN state without altering the permanent configuration
information listed under a "NodeName=" specification.
DownNodes
Any node name, or list of node names, from the "NodeName="
specifications.
Reason Identifies the reason for a node being in state "DOWN" or
"DRAIN". Use quotes to enclose a reason having more than one
word.
State State of the node with respect to the initiation of user jobs.
Acceptable values are "DOWN", "DRAIN" and "UNKNOWN". "DOWN"
indicates the node failed and is unavailable to be allocated
work. "DRAIN" indicates the node is unavailable to be allocated
work. "UNKNOWN" indicates the node’s state is undefined (BUSY
or IDLE), but will be established when the slurmd daemon on that
node registers. The default value is "UNKNOWN".
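For instance, the following sketch marks three nodes as out of service
while leaving their "NodeName=" record untouched. The reasons given are
hypothetical; the "dev" node names follow the EXAMPLE section.

```conf
NodeName=dev[0-25] Procs=2 RealMemory=2000
# Record temporary states without editing the line above
DownNodes=dev20 State=DOWN Reason="power supply"
DownNodes=dev21,dev22 State=DRAIN Reason=maintenance
```

The first reason is quoted because it contains more than one word.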
The partition configuration permits you to establish different job
limits or access controls for various groups (or partitions) of nodes.
Nodes may be in more than one partition, making partitions serve as
general purpose queues. For example one may put the same set of nodes
into two different partitions, each with different constraints (time
limit, job sizes, groups allowed to use the partition, etc.). Jobs are
allocated resources within a single partition. Default values can be
specified with a record in which "PartitionName" is "DEFAULT". The
default entry values will apply only to lines following it in the
configuration file and the default values can be reset multiple times
in the configuration file with multiple entries where
"PartitionName=DEFAULT". The "PartitionName=" specification must be
placed on every line describing the configuration of partitions. NOTE:
Put all parameters for each partition on a single line. Each line of
partition configuration information should represent a different
partition. The partition configuration contains the following
information:
AllowGroups
Comma separated list of group IDs which may execute jobs in the
partition. If at least one group associated with the user
attempting to execute the job is in AllowGroups, that user will be
permitted to use this partition. Jobs executed as user root can
use any partition without regard to the value of AllowGroups.
If user root attempts to execute a job as another user (e.g.
using srun’s --uid option), this other user must be in one of
the groups identified by AllowGroups for the job to successfully
execute. The default value is "ALL".
Default
If this keyword is set, jobs submitted without a partition
specification will utilize this partition. Possible values are
"YES" and "NO". The default value is "NO".
Hidden Specifies if the partition and its jobs are to be hidden by
default. Hidden partitions will by default not be reported by
the SLURM APIs or commands. Possible values are "YES" and "NO".
The default value is "NO".
RootOnly
Specifies if only user ID zero (i.e. user root) may allocate
resources in this partition. User root may allocate resources
for any other user, but the request must be initiated by user
root. This option can be useful for a partition to be managed
by some external entity (e.g. a higher-level job manager) and
prevents users from directly using those resources. Possible
values are "YES" and "NO". The default value is "NO".
MaxNodes
Maximum count of nodes (or base partitions for BlueGene systems)
which may be allocated to any single job. The default value is
"UNLIMITED", which is represented internally as -1. This limit
does not apply to jobs executed by SlurmUser or user root.
MaxTime
Maximum wall-time limit for any job in minutes. The default
value is "UNLIMITED", which is represented internally as -1.
This limit does not apply to jobs executed by SlurmUser or user
root.
MinNodes
Minimum count of nodes (or base partitions for BlueGene systems)
which may be allocated to any single job. The default value is
1. This limit does not apply to jobs executed by SlurmUser or
user root.
Nodes Comma separated list of nodes (or base partitions for BlueGene
systems) which are associated with this partition. Node names
may be specified using the node range expression syntax
described above. A blank list of nodes (i.e. "Nodes= ") can be
used if one wants a partition to exist, but have no resources
(possibly on a temporary basis).
PartitionName
Name by which the partition may be referenced (e.g.
"Interactive"). This name can be specified by users when
submitting jobs. If the PartitionName is "DEFAULT", the values
specified with that record will apply to subsequent partition
specifications unless explicitly set to other values in that
partition record or replaced with a different set of default
values.
Shared Ability of the partition to execute more than one job at a time
on each node. Shared nodes will offer unpredictable performance
for application programs, but can provide higher system
utilization and responsiveness than otherwise possible.
Possible values are "EXCLUSIVE", "FORCE", "YES", and "NO".
"EXCLUSIVE" allocates entire nodes to jobs even with
select/cons_res configured. This can be used to allocate whole
nodes in some partitions and individual processors in other
partitions. "FORCE" makes all nodes in the partition available
for sharing with no means for users to disable it. "YES" makes
nodes in the partition available for sharing if and only if the
individual jobs permit sharing (see the srun "--share" option).
"NO" makes nodes unavailable for sharing under all
circumstances. The default value is "NO".
State State of partition or availability for use. Possible values are
"UP" or "DOWN". The default value is "UP".
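The partition records below sketch one way the parameters above might
be combined. The partition names, the node names "tux" and "big", the
group name "admin" and all limits are hypothetical.

```conf
# Defaults shared by the partition lines that follow
PartitionName=DEFAULT MaxTime=60 State=UP
# Used by jobs submitted without a partition specification
PartitionName=debug Nodes=tux[0-7] Default=YES MaxNodes=4
# The same nodes in two partitions with different limits and access
PartitionName=batch Nodes=tux[8-31] MinNodes=2 MaxTime=UNLIMITED Shared=NO
PartitionName=priv Nodes=tux[8-31] AllowGroups=admin Hidden=YES
# Resources reserved for a higher-level job manager running as root
PartitionName=managed Nodes=big[0-7] RootOnly=YES
```

Each partition occupies exactly one line, as required by the NOTE
above.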
RELOCATING CONTROLLERS
If the cluster’s computers used for the primary or backup controller
will be out of service for an extended period of time, it may be
desirable to relocate them. In order to do so, follow this procedure:
1. Stop the SLURM daemons
2. Modify the slurm.conf file appropriately
3. Distribute the updated slurm.conf file to all nodes
4. Restart the SLURM daemons
There should be no loss of any running or pending jobs. Ensure that
any nodes added to the cluster have the current slurm.conf file
installed.
CAUTION: If two nodes are simultaneously configured as the primary
controller (two nodes on which ControlMachine specifies the local host
and the slurmctld daemon is executing on each), system behavior will be
destructive. If a compute node has an incorrect ControlMachine or
BackupController parameter, that node may be rendered unusable, but no
other harm will result.
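Step 2 of the procedure might amount to an edit such as the following
sketch. The replacement hosts "dev2" and "dev3" and their addresses are
hypothetical; the original values are those of the EXAMPLE section and
are shown commented out only for comparison.

```conf
# Before: controllers on dev0 (primary) and dev1 (backup)
#ControlMachine=dev0
#ControlAddr=edev0
#BackupController=dev1
#BackupAddr=edev1
# After: both controllers moved to replacement hosts
ControlMachine=dev2
ControlAddr=edev2
BackupController=dev3
BackupAddr=edev3
```

The same edited file must reach every node before the daemons are
restarted, so that no node is left with a stale ControlMachine or
BackupController value.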
EXAMPLE
#
# Sample /etc/slurm.conf for dev[0-25].llnl.gov
# Author: John Doe
# Date: 11/06/2001
#
ControlMachine=dev0
ControlAddr=edev0
BackupController=dev1
BackupAddr=edev1
#
AuthType=auth/authd
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
FastSchedule=1
FirstJobId=65536
HeartbeatInterval=60
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm.job.log
KillWait=30
MaxJobCount=10000
MinJobAge=3600
PluginDir=/usr/local/lib:/usr/local/slurm/lib
ReturnToService=0
SchedulerType=sched/wiki
SchedulerPort=7004
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
SlurmctldPort=7002
SlurmdPort=7003
SlurmdSpoolDir=/usr/local/slurm/slurmd.spool
StateSaveLocation=/usr/local/slurm/slurm.state
SwitchType=switch/elan
TmpFS=/tmp
WaitTime=30
JobCredentialPrivateKey=/usr/local/slurm/private.key
JobCredentialPublicCertificate=/usr/local/slurm/public.cert
JobAcctType=jobacct/linux
JobAcctLogFile=/var/log/slurm_accounting.log
JobAcctParameters="Frequency=30,MaxSendRetries=5"
#
# Node Configurations
#
NodeName=DEFAULT Procs=2 RealMemory=2000 TmpDisk=64000
NodeName=DEFAULT State=UNKNOWN
NodeName=dev[0-25] NodeAddr=edev[0-25] Weight=16
# Update records for specific DOWN nodes
DownNodes=dev20 State=DOWN Reason="power,ETA=Dec25"
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=10 State=UP
PartitionName=debug Nodes=dev[0-8,18-25] Default=YES
PartitionName=batch Nodes=dev[9-17] MinNodes=4
PartitionName=long Nodes=dev[9-17] MaxTime=120 AllowGroups=admin
COPYING
Copyright (C) 2002-2007 The Regents of the University of California.
Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
UCRL-CODE-226842.
This file is part of SLURM, a resource management program. For
details, see <http://www.llnl.gov/linux/slurm/>.
SLURM is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.
SLURM is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
FILES
/etc/slurm.conf
SEE ALSO
bluegene.conf(5), getrlimit(2), gethostbyname(3), group(5),
hostname(1), scontrol(1), slurmctld(8), slurmd(8), spank(8), syslog(2),
wiki.conf(5)