Provided by: slurm-client_22.05.8-3_amd64

NAME

       nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

       nonstop.conf is an ASCII file which describes the configuration used for fault-tolerant
       computing with Slurm using the optional slurmctld/nonstop plugin.  This plugin provides a
       means for users to notify Slurm of nodes they believe are suspect, replace a job's failing
       or failed nodes, and extend a job's time limit in response to failures.  The file is
       always located in the same directory as slurm.conf.

       Parameter  names are case insensitive.  Any text following a "#" in the configuration file
       is treated as a comment through the end of that line.  Changes to the  configuration  file
       take  effect  upon  restart  of  Slurm  daemons,  daemon  receipt of the SIGHUP signal, or
       execution of the command "scontrol reconfigure" unless otherwise noted.  The configuration
       parameters available include:

       BackupAddr
              Communications address used for the backup slurmctld daemon.  This can either be a
              hostname or IP address.  This value would typically be the same  as  the  secondary
              SlurmctldHost in the slurm.conf file, when applicable.

       ControlAddr
              Communications  address  used  for  the  slurmctld  daemon.   This  can either be a
              hostname  or  IP  address.   This  value  would  typically  be  the  same  as   the
              SlurmctldHost in the slurm.conf file.
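
              As an illustration (the host names below are hypothetical), a site whose
              slurm.conf lists a primary and a secondary controller might mirror them here:

                     # slurm.conf (assumed):  SlurmctldHost=ctl1
                     #                        SlurmctldHost=ctl2
                     ControlAddr=ctl1
                     BackupAddr=ctl2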

       Debug  A  number  indicating  the level of additional logging desired for the plugin.  The
              default value is zero, which generates no additional logging.

       HotSpareCount
              This identifies how many nodes in each partition  should  be  maintained  as  spare
              resources.   When  a  job  fails,  this pool of resources will be depleted and then
              replenished when possible using idle resources.  The value should be a
              comma-delimited list of partition:count pairs, with each partition name separated
              from its node count by a colon.
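
              For instance, assuming partitions named "batch" and "debug" exist (hypothetical
              names), the following would keep four spare nodes in the first and none in the
              second:

                     HotSpareCount=batch:4,debug:0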

       MaxSpareNodeCount
              This  identifies the maximum number of nodes any single job may replace through the
              job's entire lifetime.  This could prevent a single job from  causing  all  of  the
              nodes in a cluster to fail.  By default, there is no maximum node count.

       Port   Port used for communications.  The default value is 6820.

       TimeLimitDelay
              If  a  job  requires replacement resources and none are immediately available, then
              permit a job to extend its time limit by the length  of  time  required  to  secure
              replacement  resources  up  to  the  number of minutes specified by TimeLimitDelay.
              This option will only take effect if no hot spare resources are  available  at  the
              time replacement resources are requested.  This time limit extension is in addition
              to the value calculated using TimeLimitExtend.  The default value is zero (no
              time limit extension).  The value may not exceed 65533 seconds.

       TimeLimitDrop
              Specifies  the  number  of  minutes  that  a job can extend its time limit for each
              failed or failing node removed from the job's allocation.   The  default  value  is
              zero (no time limit extension).  The value may not exceed 65533 seconds.

       TimeLimitExtend
              Specifies  the  number  of  minutes  that  a job can extend its time limit for each
              replaced node.  The default value is zero (no time limit extension).  The value may
              not exceed 65533 seconds.
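
              As a worked illustration of how the three TimeLimit parameters might combine (a
              sketch based on one reading of the descriptions above, not a statement of the
              plugin's exact arithmetic), consider these hypothetical values:

                     TimeLimitDelay=30
                     TimeLimitDrop=10
                     TimeLimitExtend=20

              Under these assumptions, a job that waits 12 minutes for one replacement node
              when no hot spare is available could extend its time limit by up to 12 + 20 = 32
              minutes.  If the wait were 45 minutes, the delay portion would be capped at 30,
              giving 30 + 20 = 50 minutes.  A job that instead drops one failed node without
              replacing it could extend its time limit by up to 10 minutes.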

       UserDrainAllow
              This  identifies  a comma-delimited list of user names or user IDs of users who are
              authorized to drain nodes they believe are failing.  Specify a value  of  "ALL"  to
              permit  any  user  to drain nodes.  By default, no users may drain nodes using this
              interface.

       UserDrainDeny
              This identifies a comma-delimited list of user names or user IDs of users  who  are
              NOT  authorized  to  drain  nodes they believe are failing.  Specifying a value for
              UserDrainDeny implicitly allows all other users to drain nodes (sets the  value  of
              UserDrainAllow to "ALL").
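
              A sketch of the two approaches, using hypothetical user names and a hypothetical
              user ID; a site would normally choose one line or the other:

                     # Only these users may drain nodes
                     UserDrainAllow=adam,brenda
                     # Alternatively: every user except user ID 1001 may drain nodes
                     #UserDrainDeny=1001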

EXAMPLE

       #
       # Sample nonstop.conf file
       # Date: 12 Feb 2013
       #
       ControlAddr=12.34.56.78
       BackupAddr=12.34.56.79
       Port=1234
       #
       HotSpareCount=batch:6,interactive:0
       MaxSpareNodeCount=4
       TimeLimitDelay=30
       TimeLimitExtend=20
       TimeLimitDrop=10
       UserDrainAllow=adam,brenda

COPYING

       Copyright (C) 2013-2022 SchedMD LLC. All rights reserved.

       Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without
       even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       GNU General Public License for more details.

SEE ALSO

       slurm.conf(5)