jammy (8) torus-2QoS.8.gz

Provided by: opensm_3.3.23-2_amd64 bug

NAME

       torus-2QoS - Routing engine for OpenSM subnet manager

DESCRIPTION

       Torus-2QoS  is  routing  algorithm  designed  for  large-scale  2D/3D  torus fabrics.  The
       torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:

         – Routing that is free of credit loops.
         – Two levels of Quality of Service (QoS), assuming switches support eight data VLs and
           channel adapters support two data VLs.
         – The ability to route around a single failed switch, and/or multiple failed links,
           without
           – introducing credit loops, or
           – changing path SL values.
         – Very short run times, with good scaling properties as fabric size increases.

UNICAST ROUTING

       Unicast routing in torus-2QoS is based on Dimension Order Routing (DOR).   It  avoids  the
       deadlocks that would otherwise occur in a DOR-routed torus using the concept of a dateline
       for each torus dimension.  It encodes into a path SL which datelines the path crosses,  as
       follows:

           sl = 0;
           for (d = 0; d < torus_dimensions; d++) {
            /* path_crosses_dateline(d) returns 0 or 1 */
            sl |= path_crosses_dateline(d) << d;
           }

       On  a  3D  torus  this consumes three SL bits, leaving one SL bit unused.  Torus-2QoS uses
       this SL bit to implement two QoS levels.

       Torus-2QoS also makes use of the output port dependence of switch  SL2VL  maps  to  encode
       into  one  VL  bit  the  information encoded in three SL bits.  It computes in which torus
       coordinate direction each inter-switch link "points", and writes SL2VL maps for such ports
       as follows:

           for (sl = 0; sl < 16; sl++) {
            /* cdir(port) computes which torus coordinate direction
             * a switch port "points" in; returns 0, 1, or 2
             */
            sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
           }

       Thus,  on  a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS
       consumes eight SL values (SL bits 0-2) and two VL values (VL  bit  0)  per  QoS  level  to
       provide deadlock-free routing.

       Torus-2QoS  routes  around  link  failure  by  "taking  the  long  way around" any 1D ring
       interrupted by link failure.  For example, consider the 2D 6x5 torus below, where switches
       are denoted by [+a-zA-Z]:
                                         |    |    |    |    |    |
                                    4  --+----+----+----+----+----+--
                                         |    |    |    |    |    |
                                    3  --+----+----+----D----+----+--
                                         |    |    |    |    |    |
                                    2  --+----+----I----r----+----+--
                                         |    |    |    |    |    |
                                    1  --m----S----n----T----o----p--
                                         |    |    |    |    |    |
                                  y=0  --+----+----+----+----+----+--
                                         |    |    |    |    |    |

                                       x=0    1    2    3    4    5

       For  a  pristine fabric the path from S to D would be S-n-T-r-D.  In the event that either
       link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D.  Note that it can
       do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken
       by failure, path segments using it cannot contribute  to  deadlock,  and  the  x-direction
       dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring.

       One result of this is that torus-2QoS can route around many simultaneous link failures, as
       long as no 1D ring is broken into disjoint segments.  For example, if links  n-T  and  T-o
       have  both  failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n.
       Torus-2QoS checks for such issues, reports if they are found, and refuses  to  route  such
       fabrics.

       Note  that in the case where there are multiple parallel links between a pair of switches,
       torus-2QoS will allocate routes across such links in a round-robin fashion, based on ports
       at  the  path  destination  switch  that  are  active and not used for inter-switch links.
       Should a link that is one of several such parallel links fail,  routes  are  redistributed
       across  the remaining links.  When the last of such a set of parallel links fails, traffic
       is rerouted as described above.

       Handling a failed switch under DOR requires introducing into a path at least one turn that
       would  be  otherwise "illegal", i.e., not allowed by DOR rules.  Torus-2QoS will introduce
       such a turn as close as possible to the failed switch in order to route around it.

       In the above example, suppose switch T has failed, and consider the  path  from  S  to  D.
       Torus-2QoS  will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine
       torus, by introducing an early turn at n.  Normal DOR rules will cause traffic arriving at
       switch  I to be forwarded to switch r; for traffic arriving from I due to the "early" turn
       at n, this will generate an "illegal" turn at I.

       Torus-2QoS will also use the input port dependence of SL2VL maps to set VL  bit  1  (which
       would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal
       under DOR.  This causes the first hop after any such turn to use  a  separate  set  of  VL
       values, and prevents deadlock in the presence of a single failed switch.

       For any given path, only the hops after a turn that is illegal under DOR can contribute to
       a credit loop that leads to deadlock.  So in the example above with failed switch  T,  the
       location  of  the  illegal turn at I in the path from S to D requires that any credit loop
       caused by that turn must encircle the failed switch at T.  Thus the second and later  hops
       after  the  illegal  turn  at I (i.e., hop r-D) cannot contribute to a credit loop because
       they cannot be used to construct a loop encircling T.  The hop I-r uses a separate VL,  so
       it cannot contribute to a credit loop encircling T.

       Extending this argument shows that in addition to being capable of routing around a single
       switch failure without introducing deadlock, torus-2QoS can  also  route  around  multiple
       failed  switches  on  the condition they are adjacent in the last dimension routed by DOR.
       For example, consider the following case on a 6x6 2D torus:
                                         |    |    |    |    |    |
                                    5  --+----+----+----+----+----+--
                                         |    |    |    |    |    |
                                    4  --+----+----+----D----+----+--
                                         |    |    |    |    |    |
                                    3  --+----+----I----u----+----+--
                                         |    |    |    |    |    |
                                    2  --+----+----q----R----+----+--
                                         |    |    |    |    |    |
                                    1  --m----S----n----T----o----p--
                                         |    |    |    |    |    |
                                  y=0  --+----+----+----+----+----+--
                                         |    |    |    |    |    |

                                       x=0    1    2    3    4    5

       Suppose switches T and R have failed, and consider the path from S to D.  Torus-2QoS  will
       generate  the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a
       VL with bit 1 set.

       As a further example, consider a case that torus-2QoS cannot route without  deadlock:  two
       failed switches adjacent in a dimension that is not the last dimension routed by DOR; here
       the failed switches are O and T:
                                         |    |    |    |    |    |
                                    5  --+----+----+----+----+----+--
                                         |    |    |    |    |    |
                                    4  --+----+----+----+----+----+--
                                         |    |    |    |    |    |
                                    3  --+----+----+----+----D----+--
                                         |    |    |    |    |    |
                                    2  --+----+----I----q----r----+--
                                         |    |    |    |    |    |
                                    1  --m----S----n----O----T----p--
                                         |    |    |    |    |    |
                                  y=0  --+----+----+----+----+----+--
                                         |    |    |    |    |    |

                                       x=0    1    2    3    4    5

       In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D.  With
       failed  switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with illegal turn
       at switch I, and with hop I-q using a VL with bit 1  set.   In  contrast  to  the  earlier
       examples,  the  second  hop after the illegal turn, q-r, can be used to construct a credit
       loop encircling the failed switches.

MULTICAST ROUTING

       Since torus-2QoS uses all four available SL bits, and the three  data  VL  bits  that  are
       typically  available  in current switches, there is no way to use SL/VL values to separate
       multicast traffic from unicast traffic.  Thus, torus-2QoS must generate multicast  routing
       such  that  credit  loops  cannot  arise  from a combination of multicast and unicast path
       segments.

       It turns out that it is possible to construct spanning trees for  multicast  routing  that
       have  that property.  For the 2D 6x5 torus example above, here is the full-fabric spanning
       tree that torus-2QoS will construct, where "x" is the root switch and each "+" is  a  non-
       root switch:
                                    4    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                    3    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                    2    +----+----+----x----+----+
                                         |    |    |    |    |    |
                                    1    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                  y=0    +    +    +    +    +    +

                                       x=0    1    2    3    4    5

       For  multicast traffic routed from root to tip, every turn in the above spanning tree is a
       legal DOR turn.

       For traffic routed from tip to root, and some traffic routed through the root,  turns  are
       not  legal DOR turns.  However, to construct a credit loop, the union of multicast routing
       on this spanning tree with DOR unicast routing can only provide 3 of the  4  turns  needed
       for the loop.

       In  addition,  if  none  of  the  above spanning tree branches crosses a dateline used for
       unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0  or
       SL  8  (recall  that  torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast
       traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a
       torus.

       Torus-2QoS  uses  these  ideas  to  create  a master spanning tree.  Every multicast group
       spanning tree will be constructed as a subset of the master tree, with the  same  root  as
       the master tree.

       Such  multicast group spanning trees will in general not be optimal for groups which are a
       subset of the full fabric. However, this compromise must be made to enable support for two
       QoS levels on a torus while preventing credit loops.

       In  the  presence  of link or switch failures that result in a fabric for which torus-2QoS
       can generate credit-loop-free unicast routes, it is also possible  to  generate  a  master
       spanning  tree  for multicast that retains the required properties.  For example, consider
       that same 2D 6x5 torus, with the  link  from  (2,2)  to  (3,2)  failed.   Torus-2QoS  will
       generate the following master spanning tree:
                                    4    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                    3    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                    2  --+----+----+    x----+----+--
                                         |    |    |    |    |    |
                                    1    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                  y=0    +    +    +    +    +    +

                                       x=0    1    2    3    4    5

       Two  things  are  notable about this master spanning tree.  First, assuming the x dateline
       was between x=5 and x=0, this spanning tree  has  a  branch  that  crosses  the  dateline.
       However,  just  as  for unicast, crossing a dateline on a 1D ring (here, the ring for y=2)
       that is broken by a failure cannot contribute to a torus credit loop.

       Second, this spanning tree is no longer optimal even for multicast groups  that  encompass
       the  entire  fabric.  That, unfortunately, is a compromise that must be made to retain the
       other desirable properties of torus-2QoS routing.

       In the event that a single switch fails, torus-2QoS will generate a master  spanning  tree
       that  has  no "extra" turns by appropriately selecting a root switch.  In the 2D 6x5 torus
       example, assume now that the switch at (3,2), i.e., the root for a pristine fabric, fails.
       Torus-2QoS will generate the following master spanning tree for that case:
                                                  |
                                    4    +    +    +    +    +    +
                                         |    |    |    |    |    |
                                    3    +    +    +    +    +    +
                                         |    |    |         |    |
                                    2    +    +    +         +    +
                                         |    |    |         |    |
                                    1    +----+----x----+----+----+
                                         |    |    |    |    |    |
                                  y=0    +    +    +    +    +    +
                                                  |

                                       x=0    1    2    3    4    5

       Assuming  the  y  dateline  was  between y=4 and y=0, this spanning tree has a branch that
       crosses a dateline.  However, again this cannot contribute to credit loops as it occurs on
       a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.

TORUS TOPOLOGY DISCOVERY

       The algorithm used by torus-2QoS to construct the torus topology from the undirected graph
       representing the fabric requires that the  radix  of  each  dimension  be  configured  via
       torus-2QoS.conf.   It  also  requires  that the torus topology be "seeded"; for a 3D torus
       this requires configuring four switches that define the three coordinate directions of the
       torus.

       Given  this starting information, the algorithm is to examine the cube formed by the eight
       switch locations bounded by the corners (x,y,z)  and  (x+1,y+1,z+1).   Based  on  switches
       already  placed into the torus topology at some of these locations, the algorithm examines
       4-loops of inter-switch links to find the one that is consistent with a face of  the  cube
       of  switch  locations,  and  adds  its  swiches  to the discovered topology in the correct
       locations.

       Because the algorithm is based on examining the topology of 4-loops of links, a torus with
       one   or   more  radix-4  dimensions  requires  extra  initial  seed  configuration.   See
       torus-2QoS.conf(5)  for  details.   Torus-2QoS  will  detect  and  report  when   it   has
       insufficient configuration for a torus with radix-4 dimensions.

       In the event the torus is significantly degraded, i.e., there are many missing switches or
       links, it may happen that torus-2QoS is unable to  place  into  the  torus  some  switches
       and/or links that were discovered in the fabric, and will generate a warning in that case.
       A similar condition occurs if torus-2QoS is misconfigured, i.e.,  the  radix  of  a  torus
       dimension  as  configured  does  not match the radix of that torus dimension as wired, and
       many switches/links in the fabric will not be placed into the torus.

QUALITY OF SERVICE CONFIGURATION

       OpenSM will not program switches and channel adapters with SL2VL maps  or  VL  arbitration
       configuration   unless   it  is  invoked  with  -Q.   Since  torus-2QoS  depends  on  such
       functionality for correct operation, always invoke OpenSM with -Q when  torus-2QoS  is  in
       the list of routing engines.

       Any quality of service configuration method supported by OpenSM will work with torus-2QoS,
       subject to the following limitations and considerations.

       For all routing engines supported by OpenSM  except  torus-2QoS,  there  is  a  one-to-one
       correspondence  between  QoS  level  and  SL.   Torus-2QoS can only support two quality of
       service levels, so only  the  high-order  bit  of  any  SL  value  used  for  unicast  QoS
       configuration will be honored by torus-2QoS.

       For multicast QoS configuration, only SL values 0 and 8 should be used with torus-2QoS.

       Since  SL  to  VL  map configuration must be under the complete control of torus-2QoS, any
       configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and  will be ignored, and a warning
       will be generated.

       For  inter-switch  links,  Torus-2QoS uses VL values 0-3 to implement one of its supported
       QoS levels, and VL values 4-7 to implement the  other.  For  endport  links  (CA,  router,
       switch  management  port),  Torus-2QoS uses VL value 0 for one of its supported QoS levels
       and VL value 1 to implement the other.  Hard-to-diagnose application issues may  arise  if
       traffic  is  not  delivered  fairly  across  each of these two VL ranges. For inter-switch
       links, Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs
       in  the  range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration
       configuration does not meet this constraint, so all torus-2QoS users should  configure  VL
       arbitration      via      qos_ca_vlarb_high,     qos_swe_vlarb_high,     qos_ca_vlarb_low,
       qos_swe_vlarb_low, etc.

       Note that torus-2QoS maps SL values to VL values differently for inter-switch and  endport
       links.  This is why qos_vlarb_high and qos_vlarb_low should not be used, as using them may
       result in VL arbitration for a QoS level being different  across  inter-switch  links  vs.
       across endport links.

OPERATIONAL CONSIDERATIONS

       Any  routing  algorithm  for  a torus IB fabric must employ path SL values to avoid credit
       loops.  As a result, all applications run over such fabrics must  perform  a  path  record
       query  to  obtain the correct path SL for connection setup.  Applications that use rdma_cm
       for connection setup will automatically meet this requirement.

       If a change in fabric topology causes changes in path SL values required to route  without
       credit  loops, in general all applications would need to repath to avoid message deadlock.
       Since torus-2QoS has the ability to reroute after a single switch failure without changing
       path  SL  values,  repathing  by  running  applications is not required when the fabric is
       routed with torus-2QoS.

       Torus-2QoS can provide unchanging path  SL  values  in  the  presence  of  subnet  manager
       failover  provided that all OpenSM instances have the same idea of dateline location.  See
       torus-2QoS.conf(5) for details.

       Torus-2QoS will detect configurations of failed switches and links  that  prevent  routing
       that is free of credit loops, and will log warnings and refuse to route.  If "no_fallback"
       was configured in the list of OpenSM routing engines, then no other  routing  engine  will
       attempt  to  route  the  fabric.   In  that  case all paths that do not transit the failed
       components will continue to work, and the subset of paths that are still operational  will
       continue  to  remain  free  of credit loops.  OpenSM will continue to attempt to route the
       fabric after every sweep interval, and after any change (such as a link up) in the  fabric
       topology.  When the fabric components are repaired, full functionality will be restored.

       In  the  event  OpenSM  was  configured  to allow some other engine to route the fabric if
       torus-2QoS fails, then credit loops and message deadlock  are  likely  if  torus-2QoS  had
       previously routed the fabric successfully.  Even if the other engine is capable of routing
       a torus without credit loops, applications that built  connections  with  path  SL  values
       granted  under  torus-2QoS will likely experience message deadlock under routing generated
       by a different engine, unless they repath.

       To verify that a torus fabric is routed free of credit loops, use ibdmchk to analyze  data
       collected via ibdiagnet -vlr.

FILES

       /etc/opensm/opensm.conf
              default OpenSM config file.

       /etc/opensm/qos-policy.conf
              default QoS policy config file.

       /etc/opensm/torus-2QoS.conf
              default torus-2QoS config file.

SEE ALSO

       opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).