Provided by: opensm_3.3.23-3_amd64

NAME

       opensm - InfiniBand subnet manager and administration (SM/SA)

SYNOPSIS

       opensm  [--version]  [-F  | --config <file_name>] [-c(reate-config) <file_name>] [-g(uid)
       <GUID in hex>] [-l(mc) <LMC>] [-p(riority) <PRIORITY>] [--subnet_prefix <PREFIX  in  hex>]
       [--smkey  <SM_Key>]  [--sm_sl  <SL  number>]  [-r(eassign_lids)]  [-R  <engine  name(s)> |
       --routing_engine <engine  name(s)>]  [--do_mesh_analysis]  [--lash_start_vl  <vl  number>]
       [--nue_max_num_vls  <vl  number>]  [-A  |  --ucast_cache] [-z | --connect_roots] [-M <file
       name> | --lid_matrix_file <file name>] [-U <file name> | --lfts_file <file  name>]  [-S  |
       --sadb_file <file name>] [-a | --root_guid_file <path to file>] [-u | --cn_guid_file <path
       to file>] [-G | --io_guid_file <path to file>] [--port-shifting] [--scatter-ports  <random
       seed>]    [-H    |    --max_reverse_hops    <max    reverse    hops    allowed>]   [-X   |
       --guid_routing_order_file <path to file>] [-m | --ids_guid_file <path to file>]  [-o(nce)]
       [-s(weep)   <interval>]   [-t(imeout)   <milliseconds>]  [--retries  <number>]  [--maxsmps
       <number>] [--console [off | local | socket |  loopback]]  [--console-port  <port>]  [-i  |
       --ignore_guids  <equalize-ignore-guids-file>] [-w | --hop_weights_file <path to file>] [-O
       | --port_search_ordering_file <path to file>] [-O  |  --dimn_ports_file  <path  to  file>]
       (DEPRECATED)  [--dump_files_dir  <directory-name>]  [-f  <log file path> | --log_file <log
       file path> ] [-L | --log_limit <size in MB>]  [-e(rase_log_file)]  [-P(config)  <partition
       config file> ] [-N | --no_part_enforce] (DEPRECATED) [-Z | --part_enforce [both | in | out
       | off]] [-W | --allow_both_pkeys] [-Q  |  --qos  [-Y  |  --qos_policy_file  <file  name>]]
       [--congestion-control]  [--cckey  <key>]  [-y  |  --stay_on_fatal]  [-B  | --daemon] [-J |
       --pidfile <file_name>] [-I | --inactive]  [--perfmgr]  [--perfmgr_sweep_time_s  <seconds>]
       [--prefix_routes_file  <path>]  [--consolidate_ipv6_snm_req]  [--log_prefix <prefix text>]
       [--torus_config <path  to  file>]  [-v(erbose)]  [-V]  [-D  <flags>]  [-d(ebug)  <number>]
       [-h(elp)] [-?]

DESCRIPTION

       opensm  is  an  InfiniBand compliant Subnet Manager and Administration, and runs on top of
       OpenIB.

       opensm provides an implementation of an InfiniBand Subnet Manager and Administration. Such
       a software entity must be running in order to initialize the InfiniBand hardware (at least
       one per InfiniBand subnet).

       opensm also contains an experimental version of a performance manager.

       opensm defaults were designed to meet the common case usage on clusters with up to  a  few
       hundred  nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it,
       and sweep occasionally for changes.

       opensm attaches to a specific IB port on the local machine and configures only the  fabric
       connected  to it. (If the local machine has other IB ports, opensm will ignore the fabrics
       connected to those other ports). If no port is specified, it will select the first  "best"
       available port.

       opensm can present the available ports and prompt for a port number to attach to.

       By  default,  the  run  is logged to two files: /var/log/messages and /var/log/opensm.log.
       The first file will register only general major events, whereas the  second  will  include
       details  of  reported errors. All errors reported in this second file should be treated as
       indicators of IB fabric health issues.  (Note that when a fatal and non-recoverable  error
       occurs, opensm will exit.)  Both log files should include the message "SUBNET UP" if
       opensm was able to set up the subnet correctly.

OPTIONS

       --version
              Prints OpenSM version and exits.

       -F, --config <config file>
              The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf
              will be used (if it exists).

       -c, --create-config <file name>
              OpenSM will dump its configuration to the specified file and exit.  This is a way
              to generate an OpenSM configuration file template.

       -g, --guid <GUID in hex>
              This option specifies the local port GUID value  with  which  OpenSM  should  bind.
              OpenSM  may  be  bound  to 1 port at a time.  If GUID given is 0, OpenSM displays a
              list of possible port GUIDs and waits for user input.  Without -g, OpenSM tries  to
              use the default port.

       -l, --lmc <LMC value>
              This  option specifies the subnet's LMC value.  The number of LIDs assigned to each
              port is 2^LMC.  The LMC value must be in the range  0-7.   LMC  values  >  0  allow
              multiple  paths  between  ports.   LMC values > 0 should only be used if the subnet
              topology  actually  provides  multiple   paths   between   ports,   i.e.   multiple
              interconnects  between  switches.   Without  -l,  OpenSM defaults to LMC = 0, which
              allows one path between any two ports.
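
              As a small illustration of the arithmetic above, the following Python sketch
              computes how many LIDs each port receives for a given LMC and rejects values
              outside the 0-7 range. The function name lids_per_port is hypothetical and is
              not part of OpenSM.

```python
def lids_per_port(lmc: int) -> int:
    """Return the number of LIDs assigned to each port for a given LMC value."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc  # 2^LMC


# Print the LID count for a few representative LMC values.
for lmc in (0, 1, 3, 7):
    print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```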

       -p, --priority <Priority value>
              This option specifies the SM's PRIORITY.  This will affect the handover cases,
              where the master is chosen by priority and GUID.  The range is from 0 (default and
              lowest priority) to 15 (highest).

       --subnet_prefix <PREFIX in hex>
              This option specifies the subnet prefix to use on the fabric.  The default prefix
              is 0xfe80000000000000.

       --smkey <SM_Key value>
              This option specifies the SM's SM_Key (64 bits).  This will affect SM
              authentication.  Note that OpenSM version 3.2.1 and below used the default value
              '1' in host byte order.  This is fixed now, but you may need this option to
              interoperate with an old OpenSM running on a little endian machine.

       --sm_sl <SL number>
              This option sets the SL to use for communication with the SM/SA.  Defaults to 0.

       -r, --reassign_lids
              This option causes OpenSM to reassign LIDs to all end nodes.  Specifying  -r  on  a
              running subnet may disrupt subnet traffic.  Without -r, OpenSM attempts to preserve
              existing LID assignments resolving multiple use of same LID.

       -R, --routing_engine <Routing engine names>
              This option  chooses  routing  engine(s)  to  use  instead  of  Min  Hop  algorithm
              (default).   Multiple  routing engines can be specified separated by commas so that
              specific ordering of routing algorithms will be tried if  earlier  routing  engines
              fail.   If all configured routing engines fail, OpenSM will always attempt to route
              with Min Hop unless 'no_fallback' is included  in  the  list  of  routing  engines.
              Supported  engines:  minhop,  updn,  dnup, file, ftree, lash, dor, torus-2QoS, nue,
              dfsssp, sssp.

       --do_mesh_analysis
              This option enables additional analysis for the lash routing engine to precondition
              switch  port assignments in regular cartesian meshes which may reduce the number of
              SLs required to give a deadlock free routing.

       --lash_start_vl <vl number>
              This option sets the starting VL to use for the lash routing  algorithm.   Defaults
              to 0.

       --nue_max_num_vls <vl number>
              This option sets the maximum number of VLs to use for the Nue routing engine.
              Any number greater than or equal to 0 is allowed, and the default is 1 to enforce
              deadlock-freedom even if QoS is not enabled.  If set to 0, Nue routing will
              automatically determine and choose the maximum supported by the fabric.  If set to
              any integer >= 1, Nue uses min(max_supported,nue_max_num_vls).  As a rule of
              thumb, a higher nue_max_num_vls results in better path balancing.
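
              The selection rule described above can be sketched in Python as follows. This is
              only an illustration of the stated rule, not OpenSM code; the function name
              effective_nue_vls and the max_supported parameter (the number of VLs the fabric
              supports) are assumptions made for the example.

```python
def effective_nue_vls(nue_max_num_vls: int, max_supported: int) -> int:
    """Return the number of VLs Nue would use under the rule described above."""
    if nue_max_num_vls < 0:
        raise ValueError("nue_max_num_vls must be >= 0")
    if nue_max_num_vls == 0:
        # 0 means: use the maximum supported by the fabric.
        return max_supported
    return min(max_supported, nue_max_num_vls)


# Example: a fabric assumed to support 8 VLs.
for requested in (0, 1, 4, 16):
    print(requested, "->", effective_nue_vls(requested, max_supported=8))
```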

       -A, --ucast_cache
              This option enables the unicast routing cache and prevents routing recalculation
              (which is a heavy task in a large cluster) when no topology change was detected
              during the heavy sweep, or when the topology change does not require a new routing
              calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more
              of these nodes come back after being down.  A very common case that is handled by
              the unicast routing cache is host reboot, which otherwise would cause two full
              routing recalculations: one when the host goes down, and the other when the host
              comes back online.

       -z, --connect_roots
              This option forces the routing engines (up/down and fat-tree) to provide
              connectivity between root switches and in this way to be fully IBA compliant.  In
              many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.

       -M, --lid_matrix_file <file name>
              This  option  specifies  the name of the lid matrix dump file from where switch lid
              matrices (min hops tables) will be loaded.

       -U, --lfts_file <file name>
              This option specifies the name of the LFTs file from where switch forwarding tables
              will be loaded when using "file" routing engine.

       -S, --sadb_file <file name>
              This  option  specifies the name of the SA DB dump file from where SA database will
              be loaded.

       -a, --root_guid_file <file name>
              Set the root nodes for the Up/Down or  Fat-Tree  routing  algorithm  to  the  guids
              provided in the given file (one to a line).

       -u, --cn_guid_file <file name>
              Set  the  compute  nodes  for the Fat-Tree or DFSSSP/SSSP routing algorithms to the
              port GUIDs provided in the given file (one to a line).

       -G, --io_guid_file <file name>
              Set the I/O nodes for the Fat-Tree or DFSSSP/SSSP routing algorithms  to  the  port
              GUIDs provided in the given file (one to a line).
              In the case of Fat-Tree routing:
              I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches the wrong
              way around to improve connectivity.
              In the case of (DF)SSSP routing:
              Providing guids of compute and/or I/O nodes will ensure that  paths  towards  those
              nodes  are  as  much  separated  as  possible within their node category, i.e., I/O
              traffic will not share the same link if multiple links are available.

       --port-shifting
              This option enables a feature called port shifting.  In some fabrics,  particularly
              cluster  environments,  routes  commonly align and congest with other routes due to
              algorithmically unchanging traffic patterns.   This  routing  option  will  "shift"
              routing around in an attempt to alleviate this problem.

       --scatter-ports <random seed>
              This  option  is  used  to  randomize port selection in routing rather than using a
              round-robin algorithm (which is the default). Value supplied with option is used as
              a  random  seed.   If value is 0, which is the default, the scatter ports option is
              disabled.

       -H, --max_reverse_hops <max reverse hops allowed>
              Set the maximum number of reverse hops an I/O node is allowed to  make.  A  reverse
              hop is the use of a switch the wrong way around.

       -m, --ids_guid_file <file name>
              Name of the map file with the set of IDs which will be used by the Up/Down routing
              algorithm instead of node GUIDs (format: <guid> <id> per line).

       -X, --guid_routing_order_file <file name>
              Set the order port guids  will  be  routed  for  the  MinHop  and  Up/Down  routing
              algorithms to the guids provided in the given file (one to a line).

       -o, --once
              This option causes OpenSM to configure the subnet once, then exit.  Ports remain in
              the ACTIVE state.

       -s, --sweep <interval value>
              This option specifies the number of seconds between subnet sweeps.  Specifying -s 0
              disables sweeping.  Without -s, OpenSM defaults to a sweep interval of 10 seconds.

       -t, --timeout <value>
              This  option  specifies  the  time  in  milliseconds used for transaction timeouts.
              Timeout values should be > 0.  Without -t, OpenSM defaults to a  timeout  value  of
              200 milliseconds.

       --retries <number>
              This  option  specifies  the  number  of  retries  used  for transactions.  Without
              --retries, OpenSM defaults to 3 retries for transactions.

       --maxsmps <number>
              This option specifies the number of VL15 SMP MADs allowed on the wire  at  any  one
              time.    Specifying   --maxsmps  0  allows  unlimited  outstanding  SMPs.   Without
              --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.

       --console [off | local | loopback | socket]
              This option brings up the OpenSM console (default off).  Note, loopback and  socket
              open  a socket which can be connected to WITHOUT CREDENTIALS.  Loopback is safer if
              access to your SM host is controlled.  tcp_wrappers  (hosts.[allow|deny])  is  used
              with loopback and socket.  loopback and socket will only be available if OpenSM was
              built with  --enable-console-loopback  (default  yes)  and  --enable-console-socket
              (default no) respectively.

       --console-port <port>
              Specify an alternate telnet port for the socket console (default 10000).  Note that
              this option only appears if OpenSM was built with --enable-console-socket.

       -i, --ignore_guids <equalize-ignore-guids-file>
              This option provides the means to define a set of ports  (by  node  guid  and  port
              number) that will be ignored by the link load equalization algorithm.

       -w, --hop_weights_file <path to file>
              This  option  provides  weighting  factors  per  port  representing  a  hop cost in
              computing the lid matrix.  The file consists of lines containing a switch port GUID
              (specified  as  a  64  bit  hex  number,  with leading 0x), output port number, and
              weighting factor.  Any port not listed in the file defaults to a  weighting  factor
              of  1.   Lines  starting with # are comments.  Weights affect only the output route
              from the port, so many useful configurations will require weights to  be  specified
              in pairs.
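
              To make the file format concrete, here is a minimal Python sketch of a reader for
              such a file. It assumes whitespace-separated fields and decimal port numbers and
              weights, which the description above does not state explicitly. The function
              names and the sample GUIDs are hypothetical. The sample lists both ends of one
              link, reflecting the note that weights usually need to be specified in pairs.

```python
def parse_hop_weights(text: str) -> dict:
    """Parse hop-weights text into a {(guid, port): weight} mapping."""
    weights = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        guid_str, port_str, weight_str = line.split()
        if not guid_str.lower().startswith("0x"):
            raise ValueError(f"GUID must have a leading 0x: {guid_str}")
        weights[(int(guid_str, 16), int(port_str))] = int(weight_str)
    return weights


def hop_weight(weights: dict, guid: int, port: int) -> int:
    """Look up a weight; ports not listed default to a weighting factor of 1."""
    return weights.get((guid, port), 1)


# Hypothetical sample: one inter-switch link weighted in both directions.
sample = """\
# switch port GUID   output port   weight
0x0002c90200400001 5 3
0x0002c90200400002 7 3
"""

weights = parse_hop_weights(sample)
print(hop_weight(weights, 0x0002c90200400001, 5))
print(hop_weight(weights, 0x0002c90200400001, 6))
```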

       -O, --port_search_ordering_file <path to file>
              This option tweaks the routing.  It is suitable for two cases:

              1. While using the DOR routing algorithm.  This option provides a mapping between
              hypercube dimensions and ports on a per switch basis for the DOR routing engine.
              The file consists of lines containing a switch node GUID (specified as a 64 bit
              hex number, with leading 0x) followed by a list of non-zero port numbers,
              separated by spaces, one switch per line.  The order for the port numbers is in
              one to one correspondence to the dimensions.  Ports not listed on a line are
              assigned to the remaining dimensions, in port order.  Anything after a # is a
              comment.

              2. While using a general routing algorithm.  This option provides the order of
              the ports that would be chosen for routing from each switch, rather than searching
              for an appropriate port from port 1 to N.  The file consists of lines containing a
              switch node GUID (specified as a 64 bit hex number, with leading 0x) followed by a
              list of non-zero port numbers, separated by spaces, one switch per line.  In case
              of DOR, the order for the port numbers is in one to one correspondence to the
              dimensions.  Ports not listed on a line are assigned to the remaining dimensions,
              in port order.  Anything after a # is a comment.

       -O, --dimn_ports_file <path to file> (DEPRECATED)
              This is a deprecated flag. Please use  --port_search_ordering_file  instead.   This
              option  provides  a  mapping between hypercube dimensions and ports on a per switch
              basis for the DOR routing engine.  The file consists of lines containing  a  switch
              node GUID (specified as a 64 bit hex number, with leading 0x) followed by a list of
              non-zero port numbers, separated by spaces, one switch per line.  The order for the
              port  numbers  is in one to one correspondence to the dimensions.  Ports not listed
              on a line are assigned to the remaining dimensions, in port order.  Anything  after
              a # is a comment.

       -x, --honor_guid2lid
              This  option forces OpenSM to honor the guid2lid file, when it comes out of Standby
              state, if such file exists under OSM_CACHE_DIR, and is valid.  By default, this  is
              FALSE.

       --dump_files_dir <directory name>
              This option will set the directory to hold the file dumps.

       -f, --log_file <file name>
              This  option  defines  the  log  to be the given file.  By default, the log goes to
              /var/log/opensm.log.  For the log to go to standard output use -f stdout.

       -L, --log_limit <size in MB>
              This option defines maximal log file size in MB. When specified the log  file  will
              be truncated upon reaching this limit.

       -e, --erase_log_file
              This option will cause deletion of the log file (if it already exists).  By
              default, the log file is accumulative.

       -P, --Pconfig <partition config file>
              This option defines the optional partition configuration file.  The default name is
              /etc/opensm/partitions.conf.

       --prefix_routes_file <file name>
              Prefix  routes  control  how  the SA responds to path record queries for off-subnet
              DGIDs.  By default, the SA fails such queries.  The  PREFIX  ROUTES  section  below
              describes   the   format   of   the   configuration  file.   The  default  path  is
              /etc/opensm/prefix-routes.conf.

       -Q, --qos
              This option enables QoS setup. It is disabled by default.

       -Y, --qos_policy_file <file name>
              This  option  defines  the  optional  QoS  policy  file.  The   default   name   is
              /etc/opensm/qos-policy.conf.  See  QoS_management_in_OpenSM.txt  in  opensm doc for
              more information on configuring QoS policy via this file.

       --congestion_control
              (EXPERIMENTAL) This option enables congestion control configuration.  It is
              disabled by default.  See config file for congestion control configuration options.

       --cc_key <key>
              (EXPERIMENTAL) This option configures the CCkey to use when configuring congestion
              control.  Note that this option does not configure a new CCkey into switches and
              CAs.  Defaults to 0.

       -N, --no_part_enforce (DEPRECATED)
              This is a deprecated flag. Please use --part_enforce instead.  This option disables
              partition enforcement on switch external ports.

       -Z, --part_enforce [both | in | out | off]
              This  option  indicates the partition enforcement type (for switches).  Enforcement
              type can be inbound only (in), outbound only (out), both or disabled (off). Default
              is both.

       -W, --allow_both_pkeys
              This  option  indicates  whether  both  full  and  limited  membership  on the same
              partition can be configured in the PKeyTable. Default is not to allow both pkeys.

       -y, --stay_on_fatal
              This option will cause the SM not to exit on fatal initialization issues, that is,
              if the SM discovers duplicated guids or a badly configured 12x link with lane
              reversal.  By default, the SM will exit on these errors.

       -B, --daemon
              Run in daemon mode - OpenSM will run in the background.

       -J, --pidfile <file_name>
              Makes the SM write its own PID to the specified file when started in daemon mode.

       -I, --inactive
              Start SM in inactive rather than init  SM  state.   This  option  can  be  used  in
              conjunction  with the perfmgr so as to run a standalone performance manager without
              SM/SA.  However, this is NOT currently implemented in the performance manager.

       --perfmgr
              Enable the perfmgr.   Only  takes  effect  if  --enable-perfmgr  was  specified  at
              configure   time.    See  performance-manager-HOWTO.txt  in  opensm  doc  for  more
              information on running perfmgr.

       --perfmgr_sweep_time_s <seconds>
              Specify the sweep time for the performance  manager  in  seconds  (default  is  180
              seconds).  Only takes effect if --enable-perfmgr was specified at configure time.

       --consolidate_ipv6_snm_req
              Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

       --log_prefix <prefix text>
              This  option  specifies  the prefix to the syslog messages from OpenSM.  A suitable
              prefix can be used to identify the IB subnet in syslog messages when  two  or  more
              instances  of  OpenSM run in a single node to manage multiple fabrics. For example,
              in a dual-fabric (or dual-rail) IB cluster, the prefix for the first  fabric  could
              be "mpi" and the other fabric could be "storage".

       --torus_config <path to torus-2QoS config file>
              This  option  defines  the file name for the extra configuration information needed
              for    the    torus-2QoS    routing    engine.      The     default     name     is
              /etc/opensm/torus-2QoS.conf

       -v, --verbose
              This  option  increases  the  log  verbosity level.  The -v option may be specified
              multiple times to further increase the verbosity level.  See the -D option for more
              information about log verbosity.

       -V     This option sets the maximum verbosity level and forces log flushing.  The -V
              option is equivalent to '-D 0xFF -d 2'.  See the -D option for more information
              about log verbosity.

       -D <value>
              This option sets the log verbosity level.  A flags field must follow the -D option.
              A bit set/clear in the flags enables/disables a specific log level as follows:

               BIT    LOG LEVEL ENABLED
               ----   -----------------
               0x01 - ERROR (error messages)
               0x02 - INFO (basic messages, low volume)
               0x04 - VERBOSE (interesting stuff, moderate volume)
               0x08 - DEBUG (diagnostic, high volume)
               0x10 - FUNCS (function entry/exit, very high volume)
               0x20 - FRAMES (dumps all SMP and GMP frames)
               0x40 - ROUTING (dump FDB routing information)
               0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)

              Without -D, OpenSM defaults to ERROR + INFO (0x3).  Specifying -D  0  disables  all
              messages.  Specifying -D 0xFF enables all messages (see -V).  High verbosity levels
              may require increasing the transaction timeout with the -t option.
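
              Because the flags field is a bit mask, a value for -D is obtained by OR-ing the
              bits from the table above. The Python sketch below shows that arithmetic. The
              LogFlag class and the d_option helper are hypothetical names used only for this
              illustration.

```python
from enum import IntFlag


class LogFlag(IntFlag):
    """Log level bits accepted by the -D option, as listed in the table above."""
    ERROR = 0x01
    INFO = 0x02
    VERBOSE = 0x04
    DEBUG = 0x08
    FUNCS = 0x10
    FRAMES = 0x20
    ROUTING = 0x40
    SYS = 0x80


def d_option(flags: int) -> str:
    """Format a flags value as a -D command line argument."""
    return f"-D 0x{int(flags):X}"


# The documented default: ERROR + INFO.
default_flags = LogFlag.ERROR | LogFlag.INFO

# All eight bits set, which is what -V selects for the log level.
all_flags = 0
for flag in LogFlag:
    all_flags |= int(flag)

print(d_option(default_flags))
print(d_option(all_flags))
print(d_option(LogFlag.ERROR | LogFlag.VERBOSE))
```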

       -d, --debug <value>
              This option specifies a debug option.  These options are not normally needed.   The
              number following -d selects the debug option to enable as follows:

               OPT   Description
               ---    -----------------
               -d0  - Ignore other SM nodes
               -d1  - Force single threaded dispatching
               -d2  - Force log flushing after each log message
               -d3  - Disable multicast support

       -h, --help
              Display this usage info then exit.

       -?     Display this usage info then exit.

ENVIRONMENT VARIABLES

       The following environment variables control opensm behavior:

       OSM_TMP_DIR  - controls the directory in which the temporary files generated by opensm are
       created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs.  By  default,
       this   directory   is   /var/log.  Note  that  --dump_files_dir  command  line  option  or
       dump_files_dir  option  in  option/config  file  takes  precedence  over  this  environment
       variable.
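
       The precedence described above can be sketched in Python as follows. The function
       resolve_dump_dir and its parameters are hypothetical. The text only says that the command
       line option or the config file option overrides OSM_TMP_DIR, so the relative order of
       those two (command line first) is an assumption of this sketch.

```python
import os


def resolve_dump_dir(cli_dir=None, conf_dir=None, environ=None):
    """Return the directory for dump files under the precedence described above."""
    env = os.environ if environ is None else environ
    if cli_dir:
        return cli_dir  # --dump_files_dir on the command line (assumed to win)
    if conf_dir:
        return conf_dir  # dump directory option from the config file
    # Fall back to the environment variable, then to the documented default.
    return env.get("OSM_TMP_DIR", "/var/log")


print(resolve_dump_dir(environ={}))
print(resolve_dump_dir(environ={"OSM_TMP_DIR": "/tmp/osm"}))
print(resolve_dump_dir(cli_dir="/srv/osm-dumps", environ={"OSM_TMP_DIR": "/tmp/osm"}))
```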

       OSM_CACHE_DIR  -  opensm  stores  certain  data  to the disk such that subsequent runs are
       consistent. The default directory used is  /var/cache/opensm.   The  following  files  are
       included in it:

        guid2lid  - stores the LID range assigned to each GUID
        guid2mkey - stores the MKey previously assigned to each GUID
        neighbors - stores a map of the GUIDs at either end of each link
                    in the fabric

NOTES

       When  opensm  receives a HUP signal, it starts a new heavy sweep as if a trap was received
       or a topology change was found.

       Also, SIGUSR1 can be used  to  trigger  a  reopen  of  /var/log/opensm.log  for  logrotate
       purposes.

PARTITION CONFIGURATION

       The  default  name of OpenSM partitions configuration file is /etc/opensm/partitions.conf.
       The default may be changed by using the --Pconfig (-P) option with OpenSM.

       The default partition will be created by OpenSM unconditionally, even when the partition
       configuration file does not exist or cannot be accessed.

       The default partition has P_Key value 0x7fff.  OpenSM's port will always have full
       membership in the default partition.  All other end ports will have full membership if
       the partition configuration file is not found or cannot be accessed, or limited
       membership if the file exists and can be accessed but there is no rule for the Default
       partition.

       Effectively, this amounts to the same as if one of the following rules appeared in the
       partition configuration file.

       In the case of no rule for the Default partition:

       Default=0x7fff : ALL=limited, SELF=full ;

       In the case of no partition configuration file, or if the file cannot be accessed:

       Default=0x7fff : ALL=full ;

       File Format

       Comments:

       Anything following a '#' character on a line is a comment and is ignored by the parser.

       General file format:

       <Partition Definition>:[<newline>]<Partition Properties>;

            Partition Definition:
              [PartitionName][=PKey][,indx0][,ipoib_bc_flags][,defmember=full|limited|both]

               PartitionName  - string, will be used with logging. When
                                omitted, empty string will be used.
               PKey           - P_Key value for this partition. Only low 15
                                bits will be used. When omitted will be
                                autogenerated.
               indx0          - indicates that this pkey should be inserted in
                                block 0 index 0.
               ipoib_bc_flags - used to indicate/specify IPoIB capability of
                                this partition.

               defmember=full|limited|both - specifies default membership for
                                port guid list. Default is limited.

            ipoib_bc_flags:
               ipoib_flag|[mgroup_flag]*

               ipoib_flag:
                   ipoib  - indicates that this partition may be used for
                            IPoIB, as a result the IPoIB broadcast group will
                            be created with the mgroup_flag flags given,
                            if any.

            Partition Properties:
              [<Port list>|<MCast Group>]* | <Port list>

            Port list:
               <Port Specifier>[,<Port Specifier>]

            Port Specifier:
               <PortGUID>[=[full|limited|both]]

               PortGUID         - GUID of partition member EndPort.
                                  Hexadecimal numbers should start from
                                  0x, decimal numbers are accepted too.
               full, limited,   - indicates full and/or limited membership for
               both               this port.  When omitted (or unrecognized)
                                  limited membership is assumed.  Both
                                  indicates both full and limited membership
                                  for this port.

            MCast Group:
               mgid=gid[,mgroup_flag]*<newline>

                                - gid specified is verified to be a Multicast
                                  address.  IP groups are verified to match
                                  the rate and mtu of the broadcast group.
                                  The P_Key bits of the mgid for IP groups are
                                  verified to either match the P_Key specified
                                  in by "Partition Definition" or if they are
                                  0x0000 the P_Key will be copied into those
                                  bits.

            mgroup_flag:
               rate=<val>  - specifies rate for this MC group
                             (default is 3 (10 Gb/s))
               mtu=<val>   - specifies MTU for this MC group
                             (default is 4 (2048))
               sl=<val>    - specifies SL for this MC group
                             (default is 0)
               scope=<val> - specifies scope for this MC group
                             (default is 2 (link local)).  Multiple scope
                             settings are permitted for a partition.
                             NOTE: This overwrites the scope nibble of the
                                   specified mgid.  Furthermore specifying
                                   multiple scope settings will result in
                                   multiple MC groups being created.
               Q_Key=<val>     - specifies the Q_Key for this MC group
                                 (default: 0x0b1b for IP groups, 0 for other
                                  groups)
                                 WARNING: changing this for the broadcast
                                          group may break IPoIB on client
                                          nodes!!
               TClass=<val>    - specifies tclass for this MC group
                                 (default is 0)
               FlowLabel=<val> - specifies FlowLabel for this MC group
                                 (default is 0)

               NOTE: All mgroup_flag flags MUST be separated by comma (,).

       Note that values for rate, mtu, and scope,  for  both  partitions  and  multicast  groups,
       should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
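
       As a quick reference for the encoded values mentioned above, the following sketch maps
       sizes and speeds to the IBTA codes. The tables list only the long-standing base
       encodings (newer rates are omitted), and the helper names mtu_code and rate_code are
       illustrative, not part of OpenSM.

```python
# IBTA encodings for the MTU and Rate fields (base values only).
IBTA_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}
IBTA_RATE_GBPS = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40, 8: 60, 9: 80, 10: 120}


def _lookup(table, wanted, what):
    """Return the code whose decoded value equals `wanted`."""
    for code, value in table.items():
        if value == wanted:
            return code
    raise ValueError(f"{wanted} is not a listed IBTA {what}")


def mtu_code(mtu_bytes):
    """Value to write after 'mtu=' for a payload size in bytes."""
    return _lookup(IBTA_MTU_BYTES, mtu_bytes, "MTU")


def rate_code(gbps):
    """Value to write after 'rate=' for a link rate in Gbps."""
    return _lookup(IBTA_RATE_GBPS, gbps, "rate")


if __name__ == "__main__":
    # Builds the flags for a 2048-byte MTU, 10 Gbps multicast group.
    print(f"rate={rate_code(10)},mtu={mtu_code(2048)}")
```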

       There are several useful keywords for PortGUID definition:

        - 'ALL' means all end ports in this subnet.
        - 'ALL_CAS' means all Channel Adapter end ports in this subnet.
        - 'ALL_SWITCHES' means all Switch end ports in this subnet.
        - 'ALL_ROUTERS' means all Router end ports in this subnet.
        - 'SELF' means subnet manager's port.

       Empty list means no ports in this partition.

       Notes:

       White space is permitted between delimiters ('=', ',',':',';').

       PartitionName does not need to be unique, but PKey does.  If a PKey is repeated, those
       partition configurations will be merged and the first PartitionName will be used (see
       also next note).

       It is possible to split a partition configuration into more than one definition, but then
       the PKey should be explicitly specified (otherwise different PKey values will be
       generated for those definitions).
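
       The merge rule in the notes above can be pictured with a small model. This is only a
       sketch of the documented behavior (same PKey merges, first name wins), not OpenSM's
       parser; the function name merge_partitions and the tuple layout are hypothetical, and
       letting a later membership override an earlier one for the same port is an assumption.

```python
def merge_partitions(definitions):
    """Merge partition definitions that share a PKey.

    definitions: iterable of (name, pkey, members) in file order, where
    members maps a port GUID or keyword to its membership string.
    Returns {pkey: (first_name, merged_members)}.
    """
    merged = {}
    for name, pkey, members in definitions:
        if pkey not in merged:
            # First definition for this PKey supplies the partition name.
            merged[pkey] = (name, dict(members))
        else:
            # Later definitions only contribute members (assumed override).
            merged[pkey][1].update(members)
    return merged


if __name__ == "__main__":
    defs = [
        ("YetAnotherOne", 0x300, {"SELF": "full"}),
        ("YetAnotherOne", 0x300, {"ALL": "limited"}),
    ]
    print(merge_partitions(defs))
```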

       Examples:

        Default=0x7fff : ALL, SELF=full ;
        Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;

        NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

        YetAnotherOne = 0x300 : SELF=full ;
        YetAnotherOne = 0x300 : ALL=limited ;

        ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
        # 0x123453, 0x123454 will be limited
        ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
        # 0x123456, 0x123457 will be limited
        ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
        ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
        ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;

        # multicast groups added to default
        Default=0x7fff,ipoib:
               mgid=ff12:401b::0707,sl=1 # random IPv4 group
               mgid=ff12:601b::16    # MLDv2-capable routers
               mgid=ff12:401b::16    # IGMP
               mgid=ff12:601b::2     # All routers
               mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
               ALL=full;

       Note:

       The following rule is equivalent to how OpenSM used to run prior to the partition manager:

        Default=0x7fff,ipoib:ALL=full;

QOS CONFIGURATION

       There is a set of QoS related low-level configuration parameters.  All of these parameter
       names are prefixed by the "qos_" string. Here is a full list of these parameters:

        qos_max_vls    - The maximum number of VLs that will be on the subnet
        qos_high_limit - The limit of High Priority component of VL
                         Arbitration table (IBA 7.6.9)
        qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
                         template
        qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
                         template
                         Both VL arbitration templates are pairs of
                         VL and weight
        qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
                         a list of VLs corresponding to SLs 0-15 (Note
                         that VL15 used here means drop this SL)

       Typical default values (hard-coded in OpenSM initialization) are:

        qos_max_vls 15
        qos_high_limit 0
        qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
        qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
        qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

       The syntax is compatible with the rest of the OpenSM configuration options, and values
       may be stored in the OpenSM config file (cached options file).
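
       To make the value formats concrete, here is a minimal sketch that decodes the VL
       arbitration and SL2VL strings shown above. The helper names (parse_vlarb, parse_sl2vl,
       dropped_sls) are illustrative only; OpenSM's own parser is not reproduced here.

```python
def parse_vlarb(text):
    """'0:4,1:0' -> [(0, 4), (1, 0)], a list of (VL, weight) pairs."""
    pairs = []
    for item in text.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(text):
    """'0,1,2' -> [0, 1, 2]; list index is the SL, value is the VL."""
    return [int(vl) for vl in text.split(",") if vl.strip()]


def dropped_sls(sl2vl):
    """SLs mapped to VL15, which in this table means 'drop this SL'."""
    return [sl for sl, vl in enumerate(sl2vl) if vl == 15]


if __name__ == "__main__":
    table = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
    print(len(table), dropped_sls(table))
```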

       In addition to the above, separate sets of QoS configuration parameters may be defined
       for various target types. The currently supported targets are CAs, routers, switch
       external ports, and switch enhanced port 0. The names of such specialized parameters are
       prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:

        qos_ca_  - QoS configuration parameters set for CAs.
        qos_rtr_ - parameters set for routers.
        qos_sw0_ - parameters set for switches' port 0.
        qos_swe_ - parameters set for switches' external ports.

       Examples:
        qos_sw0_max_vls=2
        qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
        qos_swe_high_limit=0

PREFIX ROUTES

       Prefix routes control how the SA responds to path record queries for off-subnet DGIDs.  By
       default,  the  SA  fails  such  queries.  Note that IBA does not specify how the SA should
       obtain off-subnet path record information.  The prefix routes configuration is meant as  a
       stop-gap until the specification is completed.

       Each  line  in  the  configuration  file  is  a  64-bit  prefix followed by a 64-bit GUID,
       separated by white space.  The GUID specifies the router port on  the  local  subnet  that
       will handle the prefix.  Blank lines are ignored, as is anything between a # character and
       the end of the line.  The prefix and GUID are both in hex; the leading 0x is optional.
       Either,  or  both,  can  be  wild-carded  by specifying an asterisk instead of an explicit
       prefix or GUID.

       When responding to a path record query for an off-subnet DGID,  opensm  searches  for  the
       first  prefix  match  in the configuration file.  Therefore, the order of the lines in the
       configuration  file  is  important:  a  wild-carded  prefix  at  the  beginning   of   the
       configuration  file  renders  all  subsequent  lines  useless.  If there is no match, then
       opensm fails the query.  It is legal to repeat prefixes in the configuration file; opensm
       will  return the path to the first available matching router.  A configuration file with a
       single line where both prefix and GUID are wild-carded means  that  a  path  record  query
       specifying  any  off-subnet DGID should return a path to the first available router.  This
       configuration yields  the  same  behavior  formerly  achieved  by  compiling  opensm  with
       -DROUTER_EXP which has been obsoleted.
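
       The file format and lookup order described above can be sketched as follows. This is
       one reading of the text, not OpenSM's code: the scan continues past matching lines
       whose router is unavailable, and a wild-carded GUID picks the first entry of the
       caller-supplied list of available routers. The names parse_prefix_routes and
       find_router are hypothetical.

```python
def _field(token):
    """'*' is a wildcard (None); anything else is hex, 0x optional."""
    return None if token == "*" else int(token, 16)


def parse_prefix_routes(text):
    """Return [(prefix, guid), ...] in file order; None means wildcard."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # blank lines are ignored
        prefix, guid = line.split()
        routes.append((_field(prefix), _field(guid)))
    return routes


def find_router(routes, dgid_prefix, available_routers):
    """Return the router GUID for an off-subnet prefix, or None to fail."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue
        if guid is None:
            if available_routers:
                return available_routers[0]
        elif guid in available_routers:
            return guid
    return None


if __name__ == "__main__":
    conf = "fec0000000000001 0x0002c90000000001  # first site\n* *\n"
    print(parse_prefix_routes(conf))
```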

MKEY CONFIGURATION

       OpenSM supports configuring a single management key (MKey) for use across the subnet.

       The following configuration options are available:

        m_key                  - the 64-bit MKey to be used on the subnet
                                 (IBA 14.2.4)
        m_key_protection_level - the numeric value of the MKey ProtectBits
                                 (IBA 14.2.4.1)
        m_key_lease_period     - the number of seconds a CA will wait for a
                                 response from the SM before resetting the
                                 protection level to 0 (IBA 14.2.4.2).

       OpenSM will configure all ports with the MKey specified by m_key, defaulting to a value of
       0. An m_key value of 0 disables MKey protection on the subnet.  Switches and  HCAs  with  a
       non-zero  MKey  will  not accept requests to change their configuration unless the request
       includes the proper MKey.

       MKey Protection Levels

       MKey protection levels modify how switches and CAs respond to SMPs lacking a  valid  MKey.
       OpenSM  will  configure  each  port's  ProtectBits  to  support  the  level defined by the
       m_key_protection_level parameter.  If  no  parameter  is  specified,  OpenSM  defaults  to
       operating at protection level 0.

       There are currently 4 protection levels defined by the IBA:

        0 - Queries return valid data, including MKey.  Configuration changes
            are not allowed unless the request contains a valid MKey.
        1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
            unless the request contains a valid MKey.
        2 - Neither queries nor configuration changes are allowed, unless the
            request contains a valid MKey.
        3 - Identical to 2.  Maintained for backwards compatibility.
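
       The table above can be restated as a small decision function. This is a model of the
       documented port behavior for illustration, not code from OpenSM or from any device;
       the name handle_smp and its return convention are assumptions.

```python
def handle_smp(level, is_set, supplied_mkey, port_mkey):
    """Model how a port treats one SMP at a given protection level.

    Returns (accepted, reported_mkey). reported_mkey is the MKey value the
    requester would see in the reply, or None when the request is refused.
    """
    # An MKey of 0 on the port disables protection altogether.
    valid = port_mkey == 0 or supplied_mkey == port_mkey
    if valid:
        return True, port_mkey
    # Configuration changes always need a valid MKey; at levels 2 and 3
    # queries need one as well.
    if is_set or level >= 2:
        return False, None
    # Level 0 exposes the MKey in queries, level 1 reports it as 0.
    return True, (port_mkey if level == 0 else 0)


if __name__ == "__main__":
    for lvl in range(4):
        print(lvl, handle_smp(lvl, False, 0, 0xABC))
```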

       MKey Lease Period

       InfiniBand supports an MKey lease timeout, which is intended to allow administrators or a
       new SM to recover/reset lost MKeys on a fabric.

       If MKeys are enabled on the subnet and a switch or CA receives a request that  requires  a
       valid  MKey  but  does not contain one, it warns the SM by sending a trap (Bad M_Key, Trap
       256).  If the MKey lease period is non-zero, it also starts a countdown timer for the time
       specified  by  the lease period.  If a SM (or other agent) responds with the correct MKey,
       the timer is stopped and reset.  Should the timer reach zero, the switch or CA will  reset
       its MKey protection level to 0, exposing the MKey and allowing recovery.

       OpenSM will initialize all ports to use the MKey lease period, in seconds, given by
       m_key_lease_period in the config file.  If no m_key_lease_period is specified, a default
       of 0 will be used.

       OpenSM  normally  quickly  responds  to  all  Bad_M_Key traps, resetting the lease timers.
       Additionally, OpenSM's subnet sweeps will also cancel any  running  timers.   For  maximum
       protection  against  accidentally-exposed  MKeys,  the  MKey  lease  time  should be a few
       multiples of the subnet sweep time.  If OpenSM detects at startup that your sweep interval
       is  greater than your MKey lease period, it will reset the lease period to be greater than
       the sweep interval.  Similarly, if sweeping is disabled at startup, it will be  re-enabled
       with an interval less than the MKey lease period.

       If OpenSM is required to recover a subnet for which it is missing mkeys, it must do so one
       switch level at a time.  As such, the total time to recover the subnet may be as  long  as
       the  mkey  lease  period  multiplied  by  the maximum number of hops between the SM and an
       endpoint, plus one.
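
       As a rough worked example of that bound, the sketch below reads "plus one" as applying
       to the hop count, i.e. lease period times (hops + 1). That reading is an assumption,
       and the function name is illustrative.

```python
def worst_case_recovery_seconds(lease_period, max_hops):
    """Upper bound on recovery time, read as lease_period * (max_hops + 1).

    lease_period: MKey lease period in seconds.
    max_hops: maximum number of hops between the SM and any endpoint.
    """
    if lease_period < 0 or max_hops < 0:
        raise ValueError("lease_period and max_hops must be non-negative")
    return lease_period * (max_hops + 1)


if __name__ == "__main__":
    # A 60 second lease on a fabric with at most 4 hops from the SM.
    print(worst_case_recovery_seconds(60, 4))
```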

       MKey Effects on Diagnostic Utilities

       Setting a MKey may have a detrimental effect on diagnostic software  run  on  the  subnet,
       unless your diagnostic software is able to retrieve MKeys from the SA or can be explicitly
       configured with the proper MKey.  This is particularly true at protection level  2,  where
       CAs will ignore queries for management information that do not contain the proper MKey.

ROUTING

       OpenSM now offers ten routing engines:

       1.   Min  Hop  Algorithm - based on the minimum hops to each node where the path length is
       optimized.

       2.  UPDN Unicast routing algorithm - also based on the minimum hops to each node,  but  it
       is  constrained  to  ranking rules. This algorithm should be chosen if the subnet is not a
       pure Fat Tree, and deadlock may occur due to a loop in the subnet.

       3. DNUP Unicast routing algorithm - similar to UPDN but allows routing  in  fabrics  which
       have some CA nodes attached closer to the roots than some switch nodes.

       4.   Fat Tree Unicast routing algorithm - this algorithm optimizes routing for congestion-
       free "shift" communication pattern.  It should be chosen if a subnet is a  symmetrical  or
       almost  symmetrical fat-tree of various types, not just K-ary-N-Trees: non-constant K, not
       fully staffed, any Constant Bisectional Bandwidth (CBB) ratio.  Similar to UPDN, Fat  Tree
       routing is constrained to ranking rules.

       5.  LASH  unicast  routing  algorithm  -  uses  InfiniBand  virtual layers (SL) to provide
       deadlock-free shortest-path routing while also distributing the paths between layers. LASH
       is  an  alternative  deadlock-free  topology-agnostic routing algorithm to the non-minimal
       UPDN algorithm avoiding the use of a potentially congested root node.

       6. DOR Unicast routing algorithm - based  on  the  Min  Hop  algorithm,  but  avoids  port
       equalization  except  for  redundant  links  between the same two switches.  This provides
       deadlock free routes for hypercubes when the fabric is  cabled  as  a  hypercube  and  for
       meshes when cabled as a mesh (see details below).

       7.  Torus-2QoS  unicast  routing algorithm - a DOR-based routing algorithm specialized for
       2D/3D torus topologies.  Torus-2QoS provides deadlock-free routing  while  supporting  two
       quality  of  service (QoS) levels.  In addition it is able to route around multiple failed
       fabric links or a single failed fabric switch without introducing deadlocks,  and  without
       changing path SL values granted before the failure.

       8. DFSSSP unicast routing algorithm - a deadlock-free single-source-shortest-path routing,
       which uses the SSSP algorithm (see algorithm 9.) as the base to optimize link  utilization
       and uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.

       9. SSSP unicast routing algorithm - a single-source-shortest-path routing algorithm, which
       globally balances the number of routes per link to optimize link utilization. This routing
       algorithm has no restrictions in terms of the underlying topology.

       10.  Nue unicast routing algorithm - a 100%-applicable and deadlock-free routing which can
       be used for any arbitrary or faulty network topology and any number of virtual lanes (this
       includes  the  absence  of  VLs  as well). Paths are globally balanced w.r.t. the number of
       routes per link, and are kept as short as possible while enforcing deadlock-freedom within
       the VL constraint.

       OpenSM  also  supports  a  file  method  which  can load routes from a table. See ´Modular
       Routing Engine´ for more information on this.

       The basic routing algorithm is comprised of two stages:

       1. MinHop matrix calculation
          How many hops are required to get from each port to each LID?
          The algorithm to fill these tables is different if you run standard
          (min hop) or Up/Down.
          For standard routing, a "relaxation" algorithm is used to propagate
          min hop from every destination LID through neighbor switches.
          For Up/Down routing, a BFS from every target is used. The BFS tracks
          link direction (up or down) and avoids steps that would go up after
          a down step was used.

       2. Once MinHop matrices exist, each switch is visited and for each
          target LID a decision is made as to what port should be used to get
          to that LID.
          This step is common to standard and Up/Down routing. Each port has a
          counter counting the number of target LIDs going through it.
          When there are multiple alternative ports with the same MinHop to a
          LID, the one with fewer previously assigned LIDs is selected.
          If LMC > 0, more checks are added: within each group of LIDs
          assigned to the same target port,
          a. use only ports which have the same MinHop
          b. first prefer the ones that go to a different systemImageGuid
             (than the previous LID of the same LMC group)
          c. if none - prefer those which go through another NodeGuid
          d. fall back to the number of paths method (if all go to the same
             node).
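
       Stage 2 for the LMC = 0 case can be sketched as below: among the ports with the
       minimum hop count to a LID, pick the one with the fewest LIDs already assigned. The
       data layout and the name assign_ports are hypothetical, and breaking ties by lowest
       port number is an assumption.

```python
def assign_ports(min_hops, lids):
    """Build one switch's forwarding table from its MinHop matrix.

    min_hops: {port: {lid: hops}} for a single switch.
    lids: target LIDs in the order they are processed.
    Returns {lid: port}.
    """
    load = {port: 0 for port in min_hops}  # per-port LID counters
    lft = {}
    for lid in lids:
        reachable = {p: h[lid] for p, h in min_hops.items() if lid in h}
        best = min(reachable.values())
        candidates = [p for p, hops in reachable.items() if hops == best]
        # Least-loaded candidate wins; port number breaks ties (assumed).
        port = min(candidates, key=lambda p: (load[p], p))
        lft[lid] = port
        load[port] += 1
    return lft


if __name__ == "__main__":
    hops = {1: {10: 1, 11: 2, 12: 2}, 2: {10: 2, 11: 2, 12: 2}}
    print(assign_ports(hops, [10, 11, 12]))
```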

       Effect of Topology Changes

       OpenSM will preserve existing routing in any case where there is no change in  the  fabric
       switches unless the -r (--reassign_lids) option is specified.

       -r
       --reassign_lids
                 This option causes OpenSM to reassign LIDs to all
                 end nodes. Specifying -r on a running subnet
                 may disrupt subnet traffic.
                 Without -r, OpenSM attempts to preserve existing
                 LID assignments resolving multiple use of same LID.

       If  a link is added or removed, OpenSM does not recalculate the routes that do not have to
       change. A route has to change if the port is no longer UP or no longer  the  MinHop.  When
       routing changes are performed, the same algorithm for balancing the routes is invoked.

       In the case of file based routing, any topology changes are currently ignored.
       The 'file' routing engine just loads the LFTs from the file specified, with no reaction to
       real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected
       nodes, and LFTs for non-existent switches will be skipped. Multicast is  not  affected  by
       'file' routing engine (this uses min hop tables).

       Min Hop Algorithm

       The  Min Hop algorithm is invoked by default if no routing algorithm is specified.  It can
       also be invoked by specifying '-R minhop'.

       The Min Hop algorithm is divided into two stages: computation of min-hop tables  on  every
       switch  and  LFT  output  port  assignment.  Link  subscription is also equalized with the
       ability to override based on port GUID. The latter is supplied by:

       -i <equalize-ignore-guids-file>
       --ignore_guids <equalize-ignore-guids-file>
                 This option provides the means to define a set of ports
                 (by guid) that will be ignored by the link load
                 equalization algorithm. Note that only endports (CA,
                 switch port 0, and router ports) and not switch external
                 ports are supported.

       LMC awareness routes based on (remote) system or switch basis.

       Purpose of UPDN Algorithm

       The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet.
       A  loop-deadlock is a situation in which it is no longer possible to send data between any
       two hosts connected through the loop. As such, the UPDN routing algorithm should  be  used
       if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due,
       for example, to high pressure).

       The UPDN algorithm is based on the following main stages:

       1.  Auto-detect root nodes - based on the CA hop length from any switch in the  subnet,  a
       statistical  histogram is built for each switch (hop num vs number of occurrences). If the
       histogram reflects a specific column (higher than others) for a certain node, then  it  is
       marked as a root node. Since the algorithm is statistical, it may not find any root nodes.
       The list of the root nodes found by this auto-detect stage is used by the ranking  process
       stage.

           Note 1: The user can override the node list manually.
           Note 2: If this stage cannot find any root nodes, and the user did
                   not specify a guid list file, OpenSM defaults back to the
                   Min Hop routing algorithm.

       2.   Ranking  process - All root switch nodes (found in stage 1) are assigned a rank of 0.
       Using the BFS  algorithm,  the  rest  of  the  switch  nodes  in  the  subnet  are  ranked
       incrementally.  This  ranking aids in the process of enforcing rules that ensure loop-free
       paths.

       3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from each (CA or
       switch)  node  in  the  subnet.  During the BFS process, the FDB table of each switch node
       traversed by BFS is updated, in reference to the starting node, based on the ranking rules
       and guid values.

       At  the  end  of  the  process,  the updated FDB tables ensure loop-free paths through the
       subnet.
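
       The ranking stage and the Up/Down rule can be illustrated with a short sketch. It
       assumes a plain adjacency map of switches and treats a hop between equal-rank switches
       as "down", which is a simplification (the text above says OpenSM also uses GUID
       values). The names rank_switches and is_updn_legal are hypothetical.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """BFS ranking: roots get rank 0, each further level gets rank + 1."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in rank:
                rank[nbr] = rank[node] + 1
                queue.append(nbr)
    return rank


def is_updn_legal(path, rank):
    """A path is legal if it never goes up after it has gone down."""
    went_down = False
    for here, there in zip(path, path[1:]):
        going_up = rank[there] < rank[here]
        if going_up and went_down:
            return False
        if not going_up:
            went_down = True
    return True


if __name__ == "__main__":
    fabric = {"S": ["A", "B"], "A": ["S", "B"], "B": ["S", "A"]}
    ranks = rank_switches(fabric, ["S"])
    print(ranks, is_updn_legal(["A", "S", "B"], ranks))
```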

       Note: Up/Down routing does not allow LID routing communication between switches  that  are
       located  inside spine "switch systems".  The reason is that there is no way to allow a LID
       route between them that does not break the Up/Down rule.  One ramification of this is that
       you cannot run SM on switches other than the leaf switches of the fabric.

       UPDN Algorithm Usage

       Activation through OpenSM

       Use  '-R  updn'  option  (instead  of  old  '-u') to activate the UPDN algorithm.  Use '-a
       <root_guid_file>' for adding an UPDN guid file that contains the root nodes  for  ranking.
       If the `-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

       Notes on the guid list file:

       1.    A valid guid file specifies one guid in each line. Lines with an invalid format will
       be discarded.
       2.   The user should specify the root switch  guids.  However,  it  is  also  possible  to
       specify  CA guids; OpenSM will use the guid of the switch (if it exists) that connects the
       CA to the subnet as a root node.

       Purpose of DNUP Algorithm

       The DNUP algorithm is designed to serve a similar purpose to UPDN. However it is  intended
       to  work  in  network  topologies  which are unsuited to UPDN due to nodes being connected
       closer to the roots than some of the  switches.   An  example  would  be  a  fabric  which
       contains nodes and uplinks connected to the same switch. The operation of DNUP is the same
       as UPDN with the exception of the ranking process.  In DNUP all switch nodes are ranked
       based solely on their distance from CA nodes: all switch nodes directly connected to at
       least one CA are assigned a value of 1, and all other switch nodes are assigned a value
       of one more than the minimum rank of all neighbor switch nodes.
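
       The DNUP ranking rule amounts to a multi-source breadth-first search that starts from
       every switch with an attached CA. A minimal sketch, assuming a switch-to-switch
       adjacency map; the name dnup_rank is illustrative:

```python
from collections import deque


def dnup_rank(switch_links, switches_with_cas):
    """Rank switches by distance from CAs.

    switch_links: {switch: [neighbor switches]}.
    switches_with_cas: switches that have at least one CA attached.
    Returns {switch: rank}; CA-attached switches have rank 1.
    """
    rank = {sw: 1 for sw in switches_with_cas}
    queue = deque(switches_with_cas)
    while queue:
        sw = queue.popleft()
        for nbr in switch_links[sw]:
            if nbr not in rank:
                # BFS order guarantees this is 1 + the minimum neighbor rank.
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank


if __name__ == "__main__":
    links = {"leaf": ["mid"], "mid": ["leaf", "top"], "top": ["mid"]}
    # Both "leaf" and "top" host CAs, so "mid" is the only rank-2 switch.
    print(dnup_rank(links, ["leaf", "top"]))
```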

       Fat-tree Routing Algorithm

       The  fat-tree algorithm optimizes routing for "shift" communication pattern.  It should be
       chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various  types.   It
       supports not just K-ary-N-Trees: it also handles non-constant K, cases where not all
       leaves (CAs) are present, and any CBB ratio.  As in UPDN, fat-tree also prevents
       credit-loop-deadlocks.

       If  the  root guid file is not provided ('-a' or '--root_guid_file' options), the topology
       has to be pure fat-tree that complies with the following rules:
         - Tree rank should be between two and eight (inclusively)
         - Switches of the same rank should have the same number
           of UP-going port groups*, unless they are root switches,
           in which case they shouldn't have UP-going ports at all.
         - Switches of the same rank should have the same number
           of DOWN-going port groups, unless they are leaf switches.
         - Switches of the same rank should have the same number
           of ports in each UP-going port group.
         - Switches of the same rank should have the same number
           of ports in each DOWN-going port group.
         - All the CAs have to be at the same tree level (rank).

       If the root guid file is provided, the topology doesn't have to be pure fat-tree,  and  it
       should only comply with the following rules:
         - Tree rank should be between two and eight (inclusively)
         - All the Compute Nodes** have to be at the same tree level (rank).
           Note that non-compute node CAs are allowed here to be at different
           tree ranks.

       * ports that are connected to the same remote switch are referenced as ´port group´.

       **  list  of  compute  nodes  (CNs)  can  be  specified by ´-u´ or ´--cn_guid_file´ OpenSM
       options.

       Topologies that do not comply cause a fallback to min hop routing.   Note  that  this  can
       also occur on link failures which cause the topology to no longer be "pure" fat-tree.

       Note  that  although  fat-tree  algorithm  supports  trees with non-integer CBB ratio, the
       routing will not be as balanced as in case of integer CBB ratio.   In  addition  to  this,
       although the algorithm allows leaf switches to have any number of CAs, the closer the tree
       is to being fully populated, the more effective the "shift" communication pattern will be.
       In  general,  even  if  the  root  list is provided, the closer the topology to a pure and
       symmetrical fat-tree, the more optimal the routing will be.

       The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the
       same directory where the OpenSM log resides. This ordering file provides the CN order that
       may be used to create an efficient communication pattern that matches the routing tables.

       Routing between non-CN nodes

       The use of the cn_guid_file option allows non-CN nodes to be located on  different  levels
       in  the  fat  tree.   In  such case, it is not guaranteed that the Fat Tree algorithm will
       route between two non-CN nodes.  To solve this problem, a list  of  non-CN  nodes  can  be
       specified by the ´-G´ or ´--io_guid_file´ option.  These nodes will be allowed to use
       switches the wrong way round a specific number of times (specified by ´-H´ or
       ´--max_reverse_hops´).  With the proper max_reverse_hops and io_guid_file values, you can
       ensure full connectivity in the Fat Tree.

       Please note that using max_reverse_hops creates routes that use the switch in a
       counter-stream way.  This option should never be used to connect nodes with high
       bandwidth traffic between them! It should only be used to allow connectivity for HA
       purposes or similar.  Also, having routes the other way around can in theory cause
       credit loops.

       Use these options with extreme care!

       Activation through OpenSM

       Use  '-R  ftree'  option to activate the fat-tree algorithm.  Use '-a <root_guid_file>' to
       provide root nodes for ranking. If the `-a' option is not  used,  routing  algorithm  will
       detect roots automatically.  Use '-u <root_cn_file>' to provide the list of compute nodes.
       If the `-u' option is not used, all the CAs are considered as compute nodes.

       Note: LMC > 0 is not supported by fat-tree routing. If  this  is  specified,  the  default
       routing algorithm is invoked instead.

       LASH Routing Algorithm

       LASH  is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path
       routing  algorithm  that  enables   topology   agnostic   deadlock-free   routing   within
       communication networks.

       When  computing the routing function, LASH analyzes the network topology for the shortest-
       path routes between all pairs of sources  /  destinations  and  groups  these  paths  into
       virtual layers in such a way as to avoid deadlock.

       Note that LASH analyzes routes and ensures deadlock freedom between switch pairs. The
       link between an HCA and a switch does not need virtual layers, as deadlock will not arise
       between switch and HCA.

       In more detail, the algorithm works as follows:

       1)  LASH  determines the shortest-path between all pairs of source / destination switches.
       Note, LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs  and  there  is  no
       guarantee  that  the  return  path  for  a  given DST/SRC will be the reverse of the route
       SRC/DST.

       2) LASH then begins an SL assignment process where a route is assigned to a layer (SL)  if
       the  addition of that route does not cause deadlock within that layer. This is achieved by
       maintaining and analysing a channel dependency graph for each layer.  Once  the  potential
       addition  of  a  path  could  lead  to  deadlock, LASH opens a new layer and continues the
       process.

       3) Once this stage has been completed, it is highly likely that the first layers processed
       will  contain  more paths than the latter ones.  To better balance the use of layers, LASH
       moves paths from one layer to another so that the number of paths in each  layer  averages
       out.
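
       Step 2 can be sketched as follows: each layer keeps a channel dependency graph (CDG),
       and a route joins a layer only if the CDG stays acyclic. This is an illustration under
       stated assumptions, not the OpenSM implementation: routes are given as switch lists,
       layers are tried first-fit in order, and the balancing of step 3 is omitted. All
       function names are hypothetical.

```python
def has_cycle(graph):
    """Depth-first search for a cycle in {node: set(successors)}."""
    state = {}  # 1 = on the current DFS path, 2 = finished

    def visit(node):
        state[node] = 1
        for nxt in graph.get(node, ()):
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in list(graph))


def dependencies(route):
    """Pairs of consecutive channels (directed links) used by a route."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(routes):
    """Return the layer index chosen for each route, in input order."""
    layers = []  # one CDG per layer
    chosen = []
    for route in routes:
        deps = dependencies(route)
        for index in range(len(layers) + 1):
            if index == len(layers):
                layers.append({})  # no existing layer fits: open a new one
            trial = {k: set(v) for k, v in layers[index].items()}
            for first, second in deps:
                trial.setdefault(first, set()).add(second)
            if not has_cycle(trial):
                layers[index] = trial
                chosen.append(index)
                break
    return chosen


if __name__ == "__main__":
    ring = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
    print(assign_layers(ring))
```

       In the four-switch ring used here, the fourth clockwise route would close a dependency
       cycle in the first layer, so it is placed in a second one.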

       Note, the implementation of LASH in opensm attempts to use as few layers as possible. This
       number can be less than the number of actual layers available.

       In general LASH is a very flexible algorithm. It can, for  example,  reduce  to  Dimension
       Order Routing in certain topologies, it is topology agnostic and fares well in the face of
       faults.

       It has been shown that  for  both  regular  and  irregular  topologies,  LASH  outperforms
       Up/Down.  The  reason  for this is that LASH distributes the traffic more evenly through a
       network, avoiding the bottleneck issues related to a root node and always routes shortest-
       path.

       The algorithm was developed by Simula Research Laboratory.

       Use the '-R lash -Q' option to activate the LASH algorithm.

       Note: QoS support has to be turned on in order that SL/VL mappings are used.

       Note:  LMC  >  0  is  not supported by the LASH routing. If this is specified, the default
       routing algorithm is invoked instead.

       For open regular cartesian meshes the DOR algorithm is the ideal  routing  algorithm.  For
       toroidal  meshes  on the other hand there are routing loops that can cause deadlocks. LASH
       can  be  used  to  route  these  cases.  The  performance  of  LASH  can  be  improved  by
       preconditioning  the  mesh in cases where there are multiple links connecting switches and
       also in cases where the switches are not cabled consistently. An option exists for LASH to
       do  this.  To invoke this use '-R lash -Q --do_mesh_analysis'. This will add an additional
       phase that analyses the mesh to try to determine the dimension and size of a mesh.  If  it
       determines that the mesh looks like an open or closed cartesian mesh it reorders the ports
       in dimension order before the rest of the LASH algorithm runs.

       DOR Routing Algorithm

       The Dimension Order Routing algorithm is based on  the  Min  Hop  algorithm  and  so  uses
       shortest  paths.   Instead  of  spreading traffic out across different paths with the same
       shortest distance, it chooses among the available shortest paths based on an  ordering  of
       dimensions.  Each port must be consistently cabled to represent a hypercube dimension or a
       mesh dimension.  Alternatively, the -O option can be  used  to  assign  a  custom  mapping
       between the ports on a given switch, and the associated dimension.  Paths are grown from a
       destination back to a source using the lowest dimension (port) of available paths at  each
       step.   This  provides  the ordering necessary to avoid deadlock.  When there are multiple
       links between any two switches, they still represent only one  dimension  and  traffic  is
       balanced  across  them unless port equalization is turned off.  In the case of hypercubes,
       the same port must be used throughout the fabric to represent the hypercube dimension  and
       match  on  both  ends of the cable, or the -O option used to accomplish the alignment.  In
       the case of meshes, the dimension should consistently use the same pair of ports, one port
       on  one  end  of the cable, and the other port on the other end, continuing along the mesh
       dimension, or the -O option used as an override.

       Use '-R dor' option to activate the DOR algorithm.
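
       The idea of dimension ordering can be shown on an open mesh with coordinate-addressed
       switches. This is a generic illustration of the ordering principle only; OpenSM works
       on port numbers and grows paths from the destination, which this sketch does not
       model. The name dor_path is hypothetical.

```python
def dor_path(src, dst):
    """Walk from src to dst, finishing each dimension before the next.

    src, dst: coordinate tuples of equal length on an open mesh.
    Returns the list of coordinates visited, including both endpoints.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of dimensions")
    current = list(src)
    path = [tuple(current)]
    for dim in range(len(current)):  # lowest dimension first
        step = 1 if dst[dim] > current[dim] else -1
        while current[dim] != dst[dim]:
            current[dim] += step
            path.append(tuple(current))
    return path


if __name__ == "__main__":
    print(dor_path((0, 0), (2, 1)))
```

       Because every route resolves dimensions in the same fixed order, no route turns from a
       higher dimension back into a lower one, which is the ordering property that avoids
       deadlock on open meshes.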

       DFSSSP and SSSP Routing Algorithm

       The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is designed to  optimize
       link utilization through global balancing of routes, while supporting arbitrary topologies.
       The DFSSSP routing algorithm uses InfiniBand  virtual  lanes  (SL)  to  provide  deadlock-
       freedom.

       The DFSSSP algorithm consists of five major steps:
       1)  It  discovers  the subnet and models the subnet as a directed multigraph in which each
       node represents a node of the physical network and each edge represents one  direction  of
       the full-duplex links used to connect the nodes.
       2) A loop, which iterates over all CAs and switches of the subnet, will perform three steps
       to generate the linear forwarding tables for each switch:
       2.1) use Dijkstra's algorithm to find the shortest path from  all  nodes  to  the  current
       selected destination;
       2.2) update the edge weights in the graph, i.e. add the number of routes, which use a link
       to reach the destination, to the link/edge;
       2.3) update the LFT of each switch with the outgoing port which was used  in  the  current
       step to route the traffic to the destination node.
       3)  After  the number of available virtual lanes or layers in the subnet is detected and a
       channel dependency graph is initialized for  each  layer,  the  algorithm  will  put  each
       possible route of the subnet into the first layer.
       4)  A  loop  iterates  over all channel dependency graphs (CDG) and performs the following
       substeps:
       4.1) search for a cycle in the current CDG;
       4.2) when a cycle is found, i.e. a possible deadlock is present, one edge is selected  and
       all  routes,  which  induced  this  edge,  are  moved  to  the "next higher" virtual layer
       (CDG[i+1]);
       4.3) the cycle search is continued until all cycles are broken and routes are moved "up".
       5) When the number of layers needed to remove all cycles in all CDGs does not exceed the
       number of available SLs/VLs, the routing is deadlock-free and a relation table is
       generated, which contains the assignment of each source-to-destination route to an SL.
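
       Steps 3) to 5) can be sketched in Python as follows. This is a minimal illustration under
       stated assumptions, not the OpenSM code: the function names, the route and channel
       identifiers, and the choice of which cycle edge to break (here, the back edge found by a
       depth-first search) are hypothetical.

```python
def build_cdg(routes):
    """Map each channel dependency (c1, c2) to the ids of the routes inducing it."""
    cdg = {}
    for rid, channels in routes.items():
        for c1, c2 in zip(channels, channels[1:]):
            cdg.setdefault((c1, c2), set()).add(rid)
    return cdg


def find_cycle_edge(cdg):
    """Return one dependency edge that closes a cycle, or None if the CDG is acyclic."""
    succ = {}
    for c1, c2 in cdg:
        succ.setdefault(c1, []).append(c2)
    GREY, BLACK = 1, 2
    color = {}

    def visit(channel):
        color[channel] = GREY
        for nxt in succ.get(channel, []):
            if color.get(nxt) == GREY:
                return (channel, nxt)  # back edge: a cycle exists
            if nxt not in color:
                edge = visit(nxt)
                if edge:
                    return edge
        color[channel] = BLACK
        return None

    for channel in list(succ):
        if channel not in color:
            edge = visit(channel)
            if edge:
                return edge
    return None


def assign_layers(routes, num_layers):
    """Return {route id: layer}, or None if num_layers is not enough."""
    # Step 3: every route starts in the first layer.
    layers = [dict(routes)] + [{} for _ in range(num_layers - 1)]
    for i in range(num_layers):
        while True:
            cdg = build_cdg(layers[i])
            edge = find_cycle_edge(cdg)        # step 4.1
            if edge is None:
                break                          # step 4.3: this layer is acyclic
            if i + 1 == num_layers:
                return None                    # step 5: out of layers
            for rid in cdg[edge]:              # step 4.2: move routes "up"
                layers[i + 1][rid] = layers[i].pop(rid)
    return {rid: i for i, layer in enumerate(layers) for rid in layer}


# Hypothetical 4-switch ring; each route takes two consecutive clockwise channels.
ring_routes = {
    "r0": ["c01", "c12"],
    "r1": ["c12", "c23"],
    "r2": ["c23", "c30"],
    "r3": ["c30", "c01"],
}
two_layers = assign_layers(ring_routes, 2)
one_layer = assign_layers(ring_routes, 1)
print(two_layers, one_layer)
```

       With two layers, moving a single route breaks the cycle; with only one layer the sketch
       reports failure by returning None.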

       Note on SSSP:
       This algorithm does not perform steps 3)-5) and therefore cannot be considered
       deadlock-free for all topologies. On the one hand, you can choose this algorithm for very
       large networks (5,000+ CAs and deadlock-free by design) to reduce the runtime of the
       algorithm. On the other hand, you might use the SSSP routing algorithm as an alternative
       when all deadlock-free routing algorithms fail to route the network for whatever reason.
       In the latter case, SSSP was designed to deliver equal or higher bandwidth than the Min
       Hop routing algorithm due to better congestion avoidance.
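
       Steps 2.1) to 2.3), which SSSP shares with DFSSSP, can be sketched in Python as follows.
       This is a simplified illustration rather than the OpenSM implementation: the function
       name sssp_route and the sample fabric are hypothetical, parallel links are not modelled,
       every node counts as a route source, and all initial edge weights are assumed to be 1.

```python
import heapq


def sssp_route(links, destinations):
    """links: iterable of (u, v) full-duplex links.

    Returns (lft, weight) where lft[node][dest] is the next hop toward dest and
    weight[(u, v)] is the accumulated weight of the directed edge u -> v.
    """
    weight, adj = {}, {}
    for u, v in links:
        for a, b in ((u, v), (v, u)):
            weight[(a, b)] = 1
            adj.setdefault(a, []).append(b)
    lft = {node: {} for node in adj}

    for dest in destinations:
        # 2.1) Dijkstra from dest, relaxing each edge a -> b in reverse.
        dist, next_hop = {dest: 0}, {}
        heap = [(0, dest)]
        while heap:
            d, b = heapq.heappop(heap)
            if d > dist[b]:
                continue  # stale heap entry
            for a in adj[b]:
                nd = d + weight[(a, b)]
                if nd < dist.get(a, float("inf")):
                    dist[a] = nd
                    next_hop[a] = b
                    heapq.heappush(heap, (nd, a))

        # 2.2) Add the number of routes crossing each link to its weight.
        used = {}
        for src in next_hop:
            node = src
            while node != dest:
                nxt = next_hop[node]
                used[(node, nxt)] = used.get((node, nxt), 0) + 1
                node = nxt
        for edge, count in used.items():
            weight[edge] += count

        # 2.3) Record the outgoing hop toward dest in each node's table.
        for node, nxt in next_hop.items():
            lft[node][dest] = nxt
    return lft, weight


# Hypothetical fabric: S1 reaches S2 through either switch A or switch B.
fabric = [("S1", "A"), ("S1", "B"), ("A", "S2"), ("B", "S2"),
          ("S2", "D1"), ("S2", "D2"), ("C1", "S1")]
lft, weight = sssp_route(fabric, ["D1", "D2"])
print(lft["S1"])
```

       Because the weights grow along the path chosen for D1, the route toward D2 takes the
       other intermediate switch, which is the balancing effect the algorithm aims for.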

       Notes for usage:
       a) running DFSSSP: '-R dfsssp -Q'
       a.1) QoS has to be configured to equally spread the load on the available  SL  or  virtual
       lanes
       a.2) applications must perform a path record query to get the path SL for each route,
       which the application will use to transmit packets
       b) running SSSP:   '-R sssp'
       c) both algorithms support LMC > 0

       Hints for optimizing I/O traffic:
       Having more nodes (I/O and compute) connected to a switch than incoming links  can  result
       in  a  'bad'  routing  of  the I/O traffic as long as (DF)SSSP routing is not aware of the
       dedicated I/O nodes, i.e., in the following network configuration CN1-CN3 might  send  all
       I/O traffic via Link2 to IO1,IO2:

            CN1         Link1        IO1
               \       /----\       /
         CN2 -- Switch1      Switch2 -- CN4
               /       \----/       \
            CN3         Link2        IO2

       To  prevent  this  from happening (DF)SSSP can use both the compute node guid file and the
       I/O guid file specified by the ´-u´  or  ´--cn_guid_file´  and  ´-G´  or  ´--io_guid_file´
       options  (similar  to  the  Fat-Tree  routing).  This ensures that traffic towards compute
       nodes and I/O nodes is balanced separately and therefore distributed as much  as  possible
       across  the  available links. Port GUIDs, as listed by ibstat, must be specified (not Node
       GUIDs).
       The priority for the optimization is as follows:
         compute nodes -> I/O nodes -> other nodes
       Possible use case scenarios:
       a) neither ´-u´ nor ´-G´ is specified: all nodes are treated as ´other nodes´ and
       therefore balanced equally;
       b) ´-G´ is specified: traffic towards I/O nodes will be balanced optimally;
       c)  the  system  has  three  node  types,  such  as  login/admin, compute and I/O, but the
       balancing focus should be I/O, then one has to use ´-u´ and ´-G´ with I/O guids listed  in
       cn_guid_file and compute node guids listed in io_guid_file;
       d) ...
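
       One way to read the priority list above is as an ordering of routing destinations by node
       class. The following Python sketch illustrates that reading only; it is an assumption
       about how the priority is applied, and the function name order_destinations and the
       sample GUID values are hypothetical.

```python
def order_destinations(port_guids, cn_guids=(), io_guids=()):
    """Order port GUIDs as compute nodes -> I/O nodes -> other nodes.

    The sort is stable, so the original order is kept within each class.
    """
    cn, io = set(cn_guids), set(io_guids)

    def rank(guid):
        if guid in cn:
            return 0
        if guid in io:
            return 1
        return 2

    return sorted(port_guids, key=rank)


ports = [0x10, 0x20, 0x30, 0x40]
# Scenario a): no GUID files, every port is an "other node".
plain = order_destinations(ports)
# Both files given: 0x30 is a compute node, 0x20 is an I/O node.
classified = order_destinations(ports, cn_guids=[0x30], io_guids=[0x20])
print(plain, classified)
```

       Scenario c) then amounts to swapping the two GUID lists so that the I/O ports receive
       the first rank.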

       Torus-2QoS Routing Algorithm

       Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics; see
       torus-2QoS(8) for full documentation.

       Use '-R torus-2QoS -Q' or  '-R  torus-2QoS,no_fallback  -Q'  to  activate  the  torus-2QoS
       algorithm.

       Nue Routing Algorithm

       Use either `-R nue' or `-R nue -Q --nue_max_num_vls <int>' to activate Nue.

       Note:  if  `--nue_max_num_vls'  is  specified  and  unequal to 1, then QoS support must be
       turned on, so that SL2VL mappings are valid and applications comply with suggested SLs  to
       avoid credit-loops. For more details on QoS and Nue see below.

       The implementation of Nue routing for OpenSM is a 100%-applicable, balanced, and deadlock-
       free unicast routing  engine  (which  also  configures  multicast  tables,  see  'Note  on
       multicast' below). The key points of this algorithm are the following:
         - 100% fault-tolerant, oblivious routing strategy
         - topology-agnostic, i.e., applicable to every topology (no matter if topology
           is regular, irregular after faults, or random)
         - 100% deadlock-free routing within the resource limits (i.e., it never
           exceeds the given number of available virtual lanes, and it does not
           necessarily require virtual lanes) for every topology
         - very good path balancing and therefore high throughput (even better when
           using METIS, see notes below)
         - QoS (via SLs/VLs) + deadlock-freedom can be combined (since both rely on
           VLs), e.g., using VL0-3 for Nue's deadlock-freedom (and 1. QoS level) and
           VL4-7 as second QoS level
         - forwarding tables are fast to calculate: O(n^2 * log n), however slightly
           slower compared to topology-aware routings (for obvious reasons), and
         - the path-to-VL mapping only depends on the destination, which may be useful
           for scalable, efficient path resolution and caching mechanisms.
       From  a  very  high level perspective, Nue routing is similar to DFSSSP (see above) in the
       sense that both use Dijkstra and edge weight updates for path  balancing,  and  paths  are
       mapped to virtual layers assuming a 1:1 mapping of SL2VL tables.  However, the fundamental
       difference is that  Nue  routing  doesn't  perform  the  path  calculation  on  the  graph
       representing  the  real  fabric, and instead routes directly within the channel dependency
       graph. This approach allows Nue routing  to  place  routing  restrictions  (to  avoid  any
       credit-loops)  in  an  on-demand manner, which overcomes the problem of all other good VL-
       based algorithms.  Meaning, the competitors cannot control or limit the use  of  VLs,  and
       might  run  out of them and have to give up. On the flip side, Nue may have to use detours
       for a few routes, and hence cannot really be considered "shortest-path"  routing,  because
       it is impossible to accomplish deadlock-free, shortest-path routing with a limited number
       of available virtual lanes for arbitrary network topologies.

       Note on the use of METIS library with Nue:
       Nue routing may have to separate the LIDs into multiple subsets, one for every virtual
       layer,  if multiple layers are used. Nue has two options to perform this partitioning (not
       to be confused with IB partitions); the first is a fairly simple semi-random assignment of
       LIDs  to  layers/subsets,  and the second partitioning uses the METIS library to partition
       the network graph into k approximately equal sized parts. The latter  approach  has  shown
       better results in terms of path balancing and avoidance of using fallback paths, and hence
       it is HIGHLY advised to install/use the METIS library with OpenSM (enforced via `--enable-
       metis'  configure flag when building OpenSM). For the rare case, that METIS isn't packaged
       with the Linux distro, here is a link to the official  website  to  download  and  install
       METIS 5.1.0 manually:
          http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
       OpenSM's  configure  script  also provides options in case METIS header and library aren't
       found in the default path.

       Runtime options for Nue:
       The behavior of Nue routing can be directly influenced by the osm.conf parameter (which is
       also available as command line option):
         - nue_max_num_vls: controls/limits the number of virtual lanes/layers which
              Nue is allowed to use (detailed explanation in osm.conf file).
       Furthermore,   Nue   supports   TRUE   and   FALSE   settings   of  avoid_throttled_links,
       use_ucast_cache, and qos (more on this hereafter); and lmc > 0.

       Notes on Quality of Service (QoS):
       The advantage of Nue is that it works with AND without QoS being enabled, i.e., the  usage
       of  SLs/VLs  for  deadlock-freedom  can  be  avoided.  Here  are  the three possible usage
       scenarios:
         - neither setting `--nue_max_num_vls <int>' nor `-Q': Nue assumes that only 1
              virtual layer (identical to physical network; or OperVLs equal to VL0) is
              usable and all paths are to be calculated within this one layer. Hence,
              there is no need for special SL2VL mappings in the network and the use of
              specific SLs by applications.
         - setting `-Q' but not `--nue_max_num_vls <int>': This combination works like
              the previous one, meaning the SL returned for path record requests is not
              defined by Nue, since all paths are deadlock-free without using VLs.
              However, any separate QoS settings may influence the SL returned to
              applications.
         - setting `-Q --nue_max_num_vls <int>' with int != 1: In this configuration,
              applications have to query and obey the SL for path records as returned
              by Nue because otherwise the deadlock-freedom cannot be guaranteed
              anymore. Furthermore, errors in the fabric may require applications to
              repath to avoid message deadlocks. Since Nue operates on virtual layers,
              admins should configure the SL2VL mapping tables in a homogeneous 1:1
              manner across the entire subnet to separate the layers.
       As an additional note, using more  VLs  for  Nue  usually  improves  the  overall  network
       throughput,  so  there  are  trade  offs  admins may have to consider when configuring the
       subnet manager with Nue routing.
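
       The three scenarios, together with the earlier note that QoS must be on whenever
       `--nue_max_num_vls' is unequal to 1, can be summarized in a short Python sketch. The
       function name nue_mode and its return shape are hypothetical; the sketch only restates
       the rules given in this section and does not model OpenSM itself.

```python
def nue_mode(qos=False, max_num_vls=1):
    """Return (virtual layers requested, apps must obey the SL returned by Nue).

    qos corresponds to -Q and max_num_vls to --nue_max_num_vls; leaving
    max_num_vls at 1 stands for not using more than one layer.
    """
    if max_num_vls != 1:
        if not qos:
            raise ValueError("--nue_max_num_vls unequal to 1 requires -Q")
        # Third scenario: deadlock-freedom depends on applications using the SL.
        return max_num_vls, True
    # First and second scenario: one layer, no SL requirement imposed by Nue.
    return 1, False


no_options = nue_mode()
qos_only = nue_mode(qos=True)
qos_four_vls = nue_mode(qos=True, max_num_vls=4)
print(no_options, qos_only, qos_four_vls)
```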

       Note on multicast:
       The Nue routing engine configures multicast forwarding tables by utilizing a spanning tree
       calculation rooted at a subnet switch suggested by OpenSM. This spanning tree for a mcast
       group will try  to  use  the  least  overloaded  links  (w.r.t  the  ucast  paths-per-link
       metric/weight)  in the fabric. However, Nue routing currently does not guarantee deadlock-
       freedom for the set of multicast routes on all topologies,  nor  for  the  combination  of
       deadlock-free  unicast  routes  with  additional  multicast  routes. Assuming, for a given
       topology the calculated mcast routes are dl-free, then an admin may fix the latter problem
       by   separating   the   VLs,   e.g.,   using  VL0-6  for  unicast  routing  by  specifying
       `--nue_max_num_vls 7' and utilizing VL7 for multicast.

       Routing References

       To learn more about deadlock-free routing, see the article "Deadlock Free Message  Routing
       in Multiprocessor Interconnection Networks" by William J Dally and Charles L Seitz (1985).

       To  learn more about the up/down algorithm, see the article "Effective Strategy to Compute
       Forwarding Tables for InfiniBand Networks" by Jose Carlos Sancho, Antonio Robles, and Jose
       Duato at the Universidad Politecnica de Valencia.

       To  learn  more  about  LASH  and  the  flexibility behind it, the requirement for layers,
       performance comparisons to other algorithms, see the following articles:

       "Layered Routing in Irregular Networks", Lysne et al, IEEE Transactions  on  Parallel  and
       Distributed Systems, VOL.16, No12, December 2005.

       "Routing for the ASI Fabric Manager", Solheim et al. IEEE Communications Magazine, Vol.44,
       No.7, July 2006.

       "Layered Shortest Path (LASH) Routing in Irregular System Area  Networks",  Skeie  et  al.
       IEEE Computer Society Communication Architecture for Clusters 2002.

       To learn more about the DFSSSP and SSSP routing algorithm, see the articles:
       J.  Domke,  T.  Hoefler  and  W.  Nagel:  Deadlock-Free  Oblivious  Routing  for Arbitrary
       Topologies,  In  Proceedings  of  the  25th  IEEE  International  Parallel  &  Distributed
       Processing Symposium (IPDPS 2011)
       T.  Hoefler,  T.  Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand
       Networks, In 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009)

       To learn more about the Nue routing algorithm, see the article "Routing on the  Dependency
       Graph:  A  New Approach to Deadlock-Free High-Performance Routing" by J. Domke, T. Hoefler
       and S. Matsuoka (published in HPDC'16).

       Modular Routing Engine

       Modular routing engine structure allows for the ease of "plugging" new routing modules.

       Currently, only unicast callbacks are supported. Multicast can be added later.

       One existing routing module is up-down "updn", which  may  be  activated  with  '-R  updn'
       option (instead of old '-u').

       General usage is: $ opensm -R 'module-name'

       There is also a trivial routing module which is able to load LFT tables from a file.

       Main features:

        - this will load switch LFTs and/or LID matrices (min hops tables)
        - this will load switch LFTs according to the path entries introduced
          in the file
        - no additional checks will be performed (such as "is port connected",
          etc.)
        - in case when fabric LIDs were changed this will try to reconstruct
          LFTs correctly if endport GUIDs are represented in the file
          (in order to disable this, GUIDs may be removed from the file
           or zeroed)

       The file format is compatible with the output of the 'ibroute' utility and, for the whole
       fabric, can be generated with the dump_lfts.sh script.

       To activate file based routing module, use:

         opensm -R file -U /path/to/lfts_file

       If the lfts_file is not found or is in error, the default routing algorithm is utilized.

       The ability to dump switch lid matrices (aka min hops tables) to file and  later  to  load
       these is also supported.

       The  usage is similar to unicast forwarding tables loading from a lfts file (introduced by
       'file' routing engine), but new lid  matrix  file  name  should  be  specified  by  -M  or
       --lid_matrix_file option. For example:

         opensm -R file -M ./opensm-lid-matrix.dump

       The  dump  file is named ´opensm-lid-matrix.dump´ and will be generated in standard opensm
       dump directory (/var/log by default) when OSM_LOG_ROUTING logging flag is set.

       When routing engine 'file' is activated, but the lfts file is not specified or cannot be
       opened, the default lid matrix algorithm will be used.

       There  is  also  a  switch forwarding tables dumper which generates a file compatible with
       dump_lfts.sh output. This file can be used as  input  for  forwarding  tables  loading  by
       'file'  routing  engine.   Both or one of options -U and -M can be specified together with
       ´-R file´.

PER MODULE LOGGING CONFIGURATION

       To enable per module logging, configure per_module_logging_file to the per module  logging
       config file name in the opensm options file. To disable, configure per_module_logging_file
       to (null) there.

       The per module logging config file format is a set of lines with module name  and  logging
       level as follows:

        <module name><separator><logging level>

        <module name> is the file name including .c
        <separator> is either = , space, or tab
        <logging level> is the same levels as used in the coarse/overall
        logging as follows:

        BIT    LOG LEVEL ENABLED
        ----   -----------------
        0x01 - ERROR (error messages)
        0x02 - INFO (basic messages, low volume)
        0x04 - VERBOSE (interesting stuff, moderate volume)
        0x08 - DEBUG (diagnostic, high volume)
        0x10 - FUNCS (function entry/exit, very high volume)
        0x20 - FRAMES (dumps all SMP and GMP frames)
        0x40 - ROUTING (dump FDB routing information)
        0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
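
       The line format and the level bits above can be illustrated with a short Python sketch.
       It is not OpenSM's parser: the helper names are hypothetical, the separator list is read
       as "=", space, or tab, and the logging level is assumed to be written as a number such
       as 0x43. The module name in the sample line is only an example.

```python
import re

# Bit values and names taken from the table above.
LOG_BITS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    0x80: "SYS",
}


def parse_line(line):
    """Split '<module name><separator><logging level>' into (name, level)."""
    parts = re.split(r"[= \t]+", line.strip(), maxsplit=1)
    if len(parts) != 2:
        raise ValueError("expected '<module name><separator><logging level>'")
    # Base 0 accepts both hexadecimal (0x43) and decimal (67) notation.
    return parts[0], int(parts[1], 0)


def decode_level(level):
    """List the names of the log levels enabled in a bit mask."""
    return [name for bit, name in LOG_BITS.items() if level & bit]


module, level = parse_line("osm_ucast_mgr.c = 0x43")
enabled = decode_level(level)
print(module, hex(level), enabled)
```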

FILES

       /etc/opensm/opensm.conf
              default OpenSM config file.

       /etc/opensm/ib-node-name-map
              default node name map file.  See ibnetdiscover for more information on format.

       /etc/opensm/partitions.conf
              default partition config file

       /etc/opensm/qos-policy.conf
              default QOS policy config file

       /etc/opensm/prefix-routes.conf
              default prefix routes file

       /etc/opensm/per-module-logging.conf
              default per module logging config file

       /etc/opensm/torus-2QoS.conf
              default torus-2QoS config file

AUTHORS

       Hal Rosenstock
              <hal@mellanox.com>

       Sasha Khapyorsky
              <sashak@voltaire.com>

       Eitan Zahavi
              <eitan@mellanox.co.il>

       Yevgeny Kliteynik
              <kliteyn@mellanox.co.il>

       Thomas Sodring
              <tsodring@simula.no>

       Ira Weiny
              <weiny2@llnl.gov>

       Dale Purdy
              <purdy@sgi.com>

SEE ALSO

       torus-2QoS(8), torus-2QoS.conf(5).