Provided by: libfabric-dev_1.17.0-3ubuntu1_amd64

NAME

       fi_collective - Collective operations

       fi_join_collective
              Operation where a subset of peers join a new collective group.

       fi_barrier / fi_barrier2
              Collective  operation  that  does  not  complete  until  all peers have entered the
              barrier call.

       fi_broadcast
              A single sender transmits data to all peers, including itself.

       fi_alltoall
              Each peer distributes a slice of its local data to all peers.

       fi_allreduce
              Collective operation where all peers provide input to an atomic operation,
              with the reduced result returned to every peer.

       fi_allgather
              Each peer sends a complete copy of its local data to all peers.

       fi_reduce_scatter
              Collective call where data is collected from all peers and merged (reduced).
              The result of the reduction is distributed back to the peers, with each peer
              receiving a slice of the result.

       fi_reduce
              Collective  call  where  data is collected from all peers to a root peer and merged
              (reduced).

       fi_scatter
              A single sender distributes (scatters) a slice of its local data to all peers.

       fi_gather
              All peers send their data to a root peer.

       fi_query_collective
              Returns information about which collective operations are supported by a  provider,
              and limitations on the collective.

SYNOPSIS

              #include <rdma/fi_collective.h>

              int fi_join_collective(struct fid_ep *ep, fi_addr_t coll_addr,
                  const struct fid_av_set *set,
                  uint64_t flags, struct fid_mc **mc, void *context);

              ssize_t fi_barrier(struct fid_ep *ep, fi_addr_t coll_addr,
                  void *context);

              ssize_t fi_barrier2(struct fid_ep *ep, fi_addr_t coll_addr,
                  uint64_t flags, void *context);

              ssize_t fi_broadcast(struct fid_ep *ep, void *buf, size_t count, void *desc,
                  fi_addr_t coll_addr, fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_alltoall(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_allreduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_allgather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_reduce_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc,
                  fi_addr_t coll_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_reduce(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype, enum fi_op op,
                  uint64_t flags, void *context);

              ssize_t fi_scatter(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              ssize_t fi_gather(struct fid_ep *ep, const void *buf, size_t count,
                  void *desc, void *result, void *result_desc, fi_addr_t coll_addr,
                  fi_addr_t root_addr, enum fi_datatype datatype,
                  uint64_t flags, void *context);

              int fi_query_collective(struct fid_domain *domain,
                  enum fi_collective_op coll, struct fi_collective_attr *attr, uint64_t flags);

ARGUMENTS

       ep     Fabric endpoint on which to initiate collective operation.

       set    Address vector set defining the collective membership.

       mc     Multicast group associated with the collective.

       buf    Local data buffer that specifies the first operand of the collective operation.

       datatype
              Datatype associated with atomic operands.

       op     Atomic operation to perform.

       result Local data buffer to store the result of the collective operation.

       desc / result_desc
              Data  descriptor  associated  with  the  local data buffer and local result buffer,
              respectively.

       coll_addr
              Address referring to the collective group of endpoints.

       root_addr
              Single endpoint that is the source or destination of collective data.

       flags  Additional flags to apply to the atomic operation.

       context
              User specified pointer to associate with the operation.  This parameter is  ignored
              if  the  operation  will  not  generate  a successful completion, unless an op flag
              specifies the context parameter be used for required input.

DESCRIPTION (EXPERIMENTAL APIs)

       The collective APIs are new to the 1.9 libfabric release.  Although efforts have been
       made to design the APIs such that they align well with applications and are implementable
       by the providers, the APIs should be considered experimental and may be subject to change
       in future versions of the library until the experimental tag has been removed.

       In general, collective operations can be thought of as coordinated atomic operations
       between a set of peer endpoints.  Readers should refer to the fi_atomic(3)  man  page  for
       details on the atomic operations and datatypes defined by libfabric.

       A  collective  operation  is  a  group communication exchange.  It involves multiple peers
       exchanging data with  other  peers  participating  in  the  collective  call.   Collective
       operations require close coordination by all participating members.  All participants must
       invoke the same collective call before  any  single  member  can  complete  its  operation
       locally.  As a result, collective calls can strain the fabric, as well as local and remote
       data buffers.

       Libfabric collective interfaces target fabrics that support  offloading  portions  of  the
       collective  communication  into  network  switches,  NICs, and other devices.  However, no
       implementation requirement is placed on the provider.

       The first step in using a collective call is identifying  the  peer  endpoints  that  will
       participate.   Collective  membership  follows  one  of  two  models,  both  supported  by
       libfabric.  In the first model, the application  manages  the  membership.   This  usually
       means  that  the  application  is  performing a collective operation itself using point to
       point communication to identify the  members  who  will  participate.   Additionally,  the
       application may be interacting with a fabric resource manager to reserve network resources
       needed to execute collective operations.  In  this  model,  the  application  will  inform
       libfabric that the membership has already been established.

       A  separate  model  moves  the membership management under libfabric and directly into the
       provider.  In this model, the application must  identify  which  peer  addresses  will  be
       members.   That  information  is  conveyed  to  the  libfabric  provider,  which  is  then
       responsible for coordinating the creation  of  the  collective  group.   In  the  provider
       managed  model,  the  provider  will usually perform the necessary collective operation to
       establish the communication group and interact with any fabric management agents.

       In both models, the collective membership is communicated to the provider by creating  and
       configuring  an  address  vector  set (AV set).  An AV set represents an ordered subset of
       addresses in an address vector (AV).  Details on creating and configuring an  AV  set  are
       available in fi_av_set(3).

       Once an AV set has been programmed with the collective membership information, an endpoint
       is  joined  to  the  set.   This  uses  the  fi_join_collective  operation  and   operates
       asynchronously.   This differs from how an endpoint is associated synchronously with an AV
       using the fi_ep_bind() call.  Upon completion  of  the  fi_join_collective  operation,  an
       fi_addr  is  provided  that  is  used  as  the  target  address when invoking a collective
       operation.

       For developer convenience, a set of collective APIs are defined.  Collective  APIs  differ
       from  message  and RMA interfaces in that the format of the data is known to the provider,
       and the collective may  perform  an  operation  on  that  data.   This  aligns  collective
       operations closely with the atomic interfaces.

   Join Collective (fi_join_collective)
       This  call  attaches  an  endpoint  to  a  collective  membership group.  Libfabric treats
       collective members as a multicast group, and  the  fi_join_collective  call  attaches  the
       endpoint  to  that multicast group.  By default, the endpoint will join the group based on
       the data transfer capabilities of the endpoint.  For example, if  the  endpoint  has  been
       configured  to  both send and receive data, then the endpoint will be able to initiate and
       receive transfers to and from the collective.  The input flags may  be  used  to  restrict
       access to the collective group, subject to endpoint capability limitations.

       Join  collective  operations  complete  asynchronously,  and may involve fabric transfers,
       dependent on the provider implementation.  An endpoint must be bound  to  an  event  queue
       prior to calling fi_join_collective.  The result of the join operation will be reported to
       the EQ as an FI_JOIN_COMPLETE event.  Applications cannot issue collective transfers until
       receiving  notification  that the join operation has completed.  Note that an endpoint may
       begin receiving messages from the collective group as soon as the  join  completes,  which
       can occur prior to the FI_JOIN_COMPLETE event being generated.

       The  join  collective operation is itself a collective operation.  All participating peers
       must call fi_join_collective before any individual peer will  report  that  the  join  has
       completed.  Application managed collective memberships are an exception.  With application
       managed memberships, the fi_join_collective call may be completed locally  without  fabric
       communication.   For  provider  managed  memberships, the join collective call requires as
       input a coll_addr that refers to  either  an  address  associated  with  an  AV  set  (see
       fi_av_set_addr)  or  an  existing  collective  group  (obtained through a previous call to
       fi_join_collective).  The fi_join_collective call will create a new  collective  subgroup.
       If application managed memberships are used, coll_addr should be set to FI_ADDR_UNAVAIL.

       Applications  must  call  fi_close on the collective group to disconnect the endpoint from
       the group.  After a join operation has completed, the  fi_mc_addr  call  may  be  used  to
       retrieve  the  address  associated  with the multicast group.  See fi_cm(3) for additional
       details on fi_mc_addr().
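
        As an illustrative sketch (not specific to any provider), a provider managed join
        might proceed as follows.  The endpoint ep, event queue eq, and AV set av_set are
        assumed to have already been created and bound as described in fi_endpoint(3),
        fi_eq(3), and fi_av_set(3); error handling is omitted for brevity.

               fi_addr_t coll_addr, group_addr;
               struct fid_mc *mc;
               struct fi_eq_entry entry;
               uint32_t event;
               ssize_t ret;

               /* Obtain the address referring to the AV set membership. */
               fi_av_set_addr(av_set, &coll_addr);

               /* Start the asynchronous join; the result is reported to the EQ. */
               fi_join_collective(ep, coll_addr, av_set, 0, &mc, NULL);

               /* Wait for the join to complete; on success, event holds
                * FI_JOIN_COMPLETE. */
               do {
                   ret = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
               } while (ret == -FI_EAGAIN);

               /* Use the multicast address as the target of collective calls. */
               group_addr = fi_mc_addr(mc);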

   Barrier (fi_barrier)
        The fi_barrier operation provides a mechanism to synchronize peers.  Barrier does not
        result in any data being transferred at the application level.  A barrier does not
        complete locally until all peers have invoked the barrier call.  This signifies to the
        local application that work completed by peers prior to their entering the barrier has
        finished.

   Barrier (fi_barrier2)
        The fi_barrier2 operation is the same as fi_barrier, but takes an additional parameter
        for passing operation flags.
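
        For example, a peer might synchronize with the group as follows, where group_addr is
        the collective address from the join sketch above and cq is the completion queue
        bound to the endpoint; error handling is omitted.

               struct fi_cq_entry comp;
               ssize_t ret;

               /* Enter the barrier; the local completion indicates that all
                * peers have entered their barrier call. */
               fi_barrier(ep, group_addr, NULL);

               /* Wait for the barrier completion on the bound CQ. */
               do {
                   ret = fi_cq_read(cq, &comp, 1);
               } while (ret == -FI_EAGAIN);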

   Broadcast (fi_broadcast)
       fi_broadcast  transfers  an array of data from a single sender to all other members of the
       collective group.  The input buf parameter is treated as the transmit buffer if the  local
       rank  is the root, otherwise it is the receive buffer.  The broadcast operation acts as an
       atomic write or read to a data array.  As a result, the format  of  the  data  in  buf  is
       specified through the datatype parameter.  Any non-void datatype may be broadcast.

       The  following  diagram  shows  an example of broadcast being used to transfer an array of
       integers to a group of peers.

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
               |____^    ^
               |_________|
               broadcast
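
        As a sketch, the broadcast above might be issued as shown below, reusing group_addr
        from the join example.  The root_addr value is assumed to identify the root peer,
        and completions are reaped from the bound CQ as in the barrier example.

               uint64_t data[3] = {1, 5, 9};   /* significant only at the root */

               /* At the root, data is transmitted; at every other peer, data
                * is overwritten with the received array. */
               fi_broadcast(ep, data, 3, NULL, group_addr, root_addr,
                       FI_UINT64, 0, NULL);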

   All to All (fi_alltoall)
       The fi_alltoall collective involves distributing (or scattering) different portions of  an
       array  of data to peers.  It is best explained using an example.  Here three peers perform
       an all to all collective to exchange different entries in an integer array.

              [1]   [2]   [3]
              [5]   [6]   [7]
              [9]  [10]  [11]
                 \   |   /
                 All to all
                 /   |   \
              [1]   [5]   [9]
              [2]   [6]  [10]
              [3]   [7]  [11]

       Each peer sends a piece of its data to the other peers.

       All to all operations may be performed on any non-void datatype.  However, all to all does
       not perform an operation on the data itself, so no operation is specified.
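
        A sketch of this exchange follows, reusing group_addr from the join example.  The
        interpretation of count shown here (the number of elements directed at each peer)
        is an assumption for illustration; consult the provider documentation for the
        exact buffer layout requirements.

               uint64_t send[3] = {1, 5, 9};   /* entry i is directed at peer i */
               uint64_t recv[3];               /* entry i arrives from peer i */

               fi_alltoall(ep, send, 1, NULL, recv, NULL,
                       group_addr, FI_UINT64, 0, NULL);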

   All Reduce (fi_allreduce)
       fi_allreduce  can be described as all peers providing input into an atomic operation, with
       the result copied back to each peer.  Conceptually,  this  can  be  viewed  as  each  peer
       issuing  a  multicast  atomic  operation  to  all  other  peers, fetching the results, and
       combining them.  The combining of the results  is  referred  to  as  the  reduction.   The
       fi_allreduce()  operation  takes  as  input  an  array  of  data  and the specified atomic
       operation to perform.  The results of the reduction are written into the result buffer.

       Any non-void datatype may be specified.  Valid atomic operations are listed below  in  the
       fi_query_collective  call.   The  following  diagram  shows  an  example  of an all reduce
       operation involving summing an array of integers between three peers.

               [1]  [1]  [1]
               [5]  [5]  [5]
               [9]  [9]  [9]
                 \   |   /
                    sum
                 /   |   \
               [3]  [3]  [3]
              [15] [15] [15]
              [27] [27] [27]
                All Reduce
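
        The summation above might be issued as follows, reusing group_addr from the join
        example; the completion is reported to the bound CQ as in the barrier example.

               uint64_t local[3] = {1, 5, 9};  /* this peer's contribution */
               uint64_t result[3];             /* reduced array, same on all peers */

               /* Sum the arrays contributed by all peers; every peer receives
                * the full reduced array. */
               fi_allreduce(ep, local, 3, NULL, result, NULL,
                       group_addr, FI_UINT64, FI_SUM, 0, NULL);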

   All Gather (fi_allgather)
       Conceptually, all gather can be viewed as the  opposite  of  the  scatter  component  from
       reduce-scatter.   All gather collects data from all peers into a single array, then copies
       that array back to each peer.

              [1]  [5]  [9]
                \   |   /
               All gather
                /   |   \
              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]

       All gather may be performed on any  non-void  datatype.   However,  all  gather  does  not
       perform an operation on the data itself, so no operation is specified.
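
        For illustration, the three peer gather above might be issued as follows, reusing
        group_addr from the join example; the result buffer is assumed to hold each peer's
        contribution in AV set order.

               uint64_t local = 5;             /* this peer's element */
               uint64_t result[3];             /* one element per peer */

               fi_allgather(ep, &local, 1, NULL, result, NULL,
                       group_addr, FI_UINT64, 0, NULL);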

   Reduce-Scatter (fi_reduce_scatter)
       The  fi_reduce_scatter collective is similar to an fi_allreduce operation, followed by all
       to all.  With reduce scatter, all peers provide input into an atomic operation, similar to
       all  reduce.   However,  rather  than  the  full  result  being  copied to each peer, each
       participant receives only a slice of the result.

       This is shown by the following example:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                    |
                   [3]
                  [15]
                  [27]
                    |
                 scatter
                /   |   \
              [3] [15] [27]

       The reduce scatter call supports the same datatype and atomic operation as fi_allreduce.
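
        A sketch of the example above follows, reusing group_addr from the join example.
        The buffer sizing shown, with each peer contributing the full array and receiving
        a single slice, mirrors the diagram and is an assumption for illustration.

               uint64_t local[3] = {1, 5, 9};  /* this peer's contribution */
               uint64_t slice;                 /* this peer's share of the result */

               fi_reduce_scatter(ep, local, 3, NULL, &slice, NULL,
                       group_addr, FI_UINT64, FI_SUM, 0, NULL);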

   Reduce (fi_reduce)
        The fi_reduce collective is the first half of an fi_allreduce operation.  With reduce,
        all peers provide input into an atomic operation, with the results collected by a
        single `root' endpoint.

       This is shown by the following example, with the leftmost peer identified as the root:

              [1]  [1]  [1]
              [5]  [5]  [5]
              [9]  [9]  [9]
                \   |   /
                   sum (reduce)
                  /
               [3]
              [15]
              [27]

       The reduce call supports the same datatype and atomic operation as fi_allreduce.
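
        The reduction above might be issued as follows, reusing group_addr from the join
        example; root_addr identifies the root peer, and the result buffer is significant
        only at the root.

               uint64_t local[3] = {1, 5, 9};  /* this peer's contribution */
               uint64_t result[3];             /* written only at the root */

               fi_reduce(ep, local, 3, NULL, result, NULL,
                       group_addr, root_addr, FI_UINT64, FI_SUM, 0, NULL);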

   Scatter (fi_scatter)
       The fi_scatter collective is the second half of an fi_reduce_scatter operation.  The  data
       from a single `root' endpoint is split and distributed to all peers.

       This is shown by the following example:

               [3]
              [15]
              [27]
                  \
                 scatter
                /   |   \
              [3] [15] [27]

       The  scatter operation is used to distribute results to the peers.  No atomic operation is
       performed on the data.
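
        For illustration, the distribution above might be issued as follows; the source
        buffer is significant only at the root, and the slice sizing shown mirrors the
        diagram and is an assumption.

               uint64_t full[3] = {3, 15, 27}; /* significant only at the root */
               uint64_t slice;                 /* this peer's received slice */

               fi_scatter(ep, full, 3, NULL, &slice, NULL,
                       group_addr, root_addr, FI_UINT64, 0, NULL);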

   Gather (fi_gather)
       The fi_gather operation is used to collect (gather) the results from all peers  and  store
       them at a `root' peer.

       This is shown by the following example, with the leftmost peer identified as the root.

              [1]  [5]  [9]
                \   |   /
                  gather
                 /
              [1]
              [5]
              [9]

       The gather operation does not perform any operation on the data itself.
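
        A sketch of the collection above follows; the result buffer is significant only at
        the root and is assumed to hold contributions in AV set order.

               uint64_t local = 5;             /* this peer's element */
               uint64_t result[3];             /* written only at the root */

               fi_gather(ep, &local, 1, NULL, result, NULL,
                       group_addr, root_addr, FI_UINT64, 0, NULL);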

   Query Collective Attributes (fi_query_collective)
       The  fi_query_collective  call  reports  which  collective operations are supported by the
       underlying provider, for suitably configured endpoints.  Collective operations  needed  by
       an  application  that  are  not  supported  by  the  provider  must  be implemented by the
       application.  The query call checks whether a  provider  supports  a  specific  collective
       operation for a given datatype and operation, if applicable.

        The name of the collective, as well as the datatype and associated operation, if
        applicable, are provided as input to fi_query_collective.

        The coll parameter may reference one of these collectives: FI_BARRIER, FI_BROADCAST,
        FI_ALLTOALL, FI_ALLREDUCE, FI_ALLGATHER, FI_REDUCE_SCATTER, FI_REDUCE, FI_SCATTER, or
        FI_GATHER.  Additional details on the collective operation are specified through the
        struct fi_collective_attr parameter.  For collectives that act on data, the operation
        and related datatype must be specified through the given attributes.

               struct fi_collective_attr {
                   enum fi_op op;
                   enum fi_datatype datatype;
                   struct fi_atomic_attr datatype_attr;
                   size_t max_members;
                   uint64_t mode;
               };

       For a description of struct fi_atomic_attr, see fi_atomic(3).

        op     On input, this specifies the atomic operation involved with the collective
               call.  This should be set to one of the following values: FI_MIN, FI_MAX,
               FI_SUM, FI_PROD, FI_LOR, FI_LAND, FI_BOR, FI_BAND, FI_LXOR, FI_BXOR,
               FI_ATOMIC_READ, FI_ATOMIC_WRITE, or FI_NOOP.  For collectives that do not
               exchange application data (fi_barrier), this should be set to FI_NOOP.

       datatype
               On input, specifies the datatype of the data being modified by the collective.
              This  should  be  set  to one of the following values: FI_INT8, FI_UINT8, FI_INT16,
              FI_UINT16,  FI_INT32,  FI_UINT32,   FI_INT64,   FI_UINT64,   FI_FLOAT,   FI_DOUBLE,
              FI_FLOAT_COMPLEX,  FI_DOUBLE_COMPLEX,  FI_LONG_DOUBLE,  FI_LONG_DOUBLE_COMPLEX,  or
              FI_VOID.  For collectives that do not exchange application data (fi_barrier),  this
              should be set to FI_VOID.

       datatype_attr.count
              The maximum number of elements that may be used with the collective.

        datatype_attr.size
              The  size  of  the  datatype  as  supported  by  the provider.  Applications should
              validate the size  of  datatypes  that  differ  based  on  the  platform,  such  as
              FI_LONG_DOUBLE.

       max_members
              The maximum number of peers that may participate in a collective operation.

       mode   This field is reserved and should be 0.

       If  a collective operation is supported, the query call will return FI_SUCCESS, along with
       attributes on the limits for using that collective operation through the provider.
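
        For example, an application might check for provider support of a 64-bit unsigned
        sum allreduce as follows, where domain refers to an open domain object.

               struct fi_collective_attr attr = {
                   .op = FI_SUM,
                   .datatype = FI_UINT64,
                   .mode = 0,
               };

               /* On success, attr.max_members and attr.datatype_attr.count
                * report the provider's limits for this collective. */
               if (fi_query_collective(domain, FI_ALLREDUCE, &attr, 0) == FI_SUCCESS) {
                   /* allreduce on FI_UINT64 with FI_SUM is supported */
               }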

   Completions
       Collective operations map to underlying fi_atomic operations.  For a discussion of  atomic
       completion  semantics,  see  fi_atomic(3).   The  completion,  ordering,  and atomicity of
       collective operations match those defined for point to point atomic operations.

FLAGS

       The following flags are defined for the specified operations.

       FI_SCATTER
              Applies to fi_query_collective.  When set, requests attribute  information  on  the
              reduce-scatter collective operation.

RETURN VALUE

       Returns  0  on  success.   On  error,  a  negative  value corresponding to fabric errno is
       returned.  Fabric errno values are defined in rdma/fi_errno.h.

ERRORS

       -FI_EAGAIN
              See fi_msg(3) for a detailed description of handling FI_EAGAIN.

       -FI_EOPNOTSUPP
              The requested atomic operation is not supported on this endpoint.

       -FI_EMSGSIZE
              The number of collective operations in a single request exceeds that  supported  by
              the underlying provider.

NOTES

       Collective operations map to atomic operations.  As such, they follow most of the same
       conventions and restrictions as point to point atomic operations.  This includes data
       atomicity, data alignment, and message ordering semantics.  See fi_atomic(3) for
       additional information on the datatypes and operations defined for atomic and collective
       operations.

SEE ALSO

       fi_getinfo(3), fi_av(3), fi_atomic(3), fi_cm(3)

AUTHORS

       OpenFabrics.