Provided by: libfabric-dev_1.17.0-3ubuntu1_amd64

NAME

       fi_export_fid / fi_import_fid
              Share a fabric object between different providers or resources

       struct fid_peer_av
              An address vector sharable between independent providers

       struct fid_peer_av_set
              An AV set sharable between independent providers

       struct fid_peer_cq
              A completion queue that may be shared between independent providers

       struct fid_peer_srx
              A shared receive context that may be shared between independent providers

SYNOPSIS

              #include <rdma/fabric.h>
              #include <rdma/fi_ext.h>

              int fi_export_fid(struct fid *fid, uint64_t flags,
                  struct fid **expfid, void *context);

              int fi_import_fid(struct fid *fid, struct fid *expfid, uint64_t flags);

ARGUMENTS

       fid    Returned fabric identifier for opened object.

       expfid Exported fabric object that may be shared with another provider.

       flags  Control flags for the operation.

       context
              User defined context that will be associated with a fabric object.

DESCRIPTION

       NOTICE:  The  peer APIs described by this man page are developmental and may change between
       libfabric versions.  The data structures and API  definitions  should  not  be  considered
       stable  between  versions.   Providers  being used as peers must target the same libfabric
       version.

       Functions defined in this man page are typically used by  providers  to  communicate  with
       other  providers,  known  as peer providers, or by other libraries to communicate with the
       libfabric core, known as peer libraries.  Most middleware and applications should not need
       to access this functionality, as the documentation mainly targets provider developers.

       Peer  providers  are  a way for independently developed providers to be used together in a
       tight fashion, such that layering overhead and duplicate  provider  functionality  can  be
       avoided.   Peer  providers are linked by having one provider export specific functionality
       to another.  This is done by having one provider export a sharable  fabric  object  (fid),
       which is imported by one or more peer providers.

       As  an example, a provider which uses TCP to communicate with remote peers may wish to use
       the shared memory provider to communicate with local peers.  To remove layering  overhead,
       the  TCP  based  provider  may  export its completion queue and shared receive context and
       import those into the shared memory provider.

       The general mechanisms used to share fabric objects between peer  providers  are  similar,
       independent  from  the  object  being  shared.   However,  because  the goal of using peer
       providers is to avoid overhead, providers must be explicitly written to support  the  peer
       provider mechanisms.

       There  are  two  peer  provider  models.  In the example listed above, both peers are full
       providers in their own right and usable in a stand-alone fashion.  In a second model,  one
       of  the peers is known as an offload provider.  An offload provider implements a subset of
       the libfabric API and targets the use of specific  acceleration  hardware.   For  example,
       network  switches  may  support  collective  operations, such as barrier or broadcast.  An
       offload provider may be written specifically to leverage this capability; however, such  a
       provider  is  not usable for general purposes.  As a result, an offload provider is paired
       with a main peer provider.

PEER AV

       The peer AV allows the sharing of addressing metadata between providers.  It  specifically
       targets  the use case of having a main provider paired with an offload provider, where the
       offload provider leverages the communication that has already been established through the
       main provider.  In other situations, such as the pairing of a TCP provider with a shared
       memory provider mentioned above, each peer will likely have its own AV that is not shared.

       The setup for a peer AV is similar to the setup for a shared  CQ,  described  below.   The
       owner  of  the  AV creates a fid_peer_av object that links back to its actual fid_av.  The
       fid_peer_av is then imported by the offload provider.

       Peer AVs are configured by the owner calling the peer’s fi_av_open() call, passing in  the
       FI_PEER flag, and pointing the context parameter to struct fi_peer_av_context.

       The data structures to support peer AVs are:

              struct fid_peer_av;

              struct fi_ops_av_owner {
                  size_t  size;
                  int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
                  fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
              };

              struct fid_peer_av {
                  struct fid fid;
                  struct fi_ops_av_owner *owner_ops;
              };

              struct fi_peer_av_context {
                  size_t size;
                  struct fid_peer_av *av;
              };

   fi_ops_av_owner::query()
       This  call  returns  current attributes for the peer AV.  The owner sets the fields of the
       input struct fi_av_attr based on the current state of the AV for return to the caller.

   fi_ops_av_owner::ep_addr()
       This lookup function returns the fi_addr of the address associated with  the  given  local
       endpoint.   If  the  address  of the local endpoint has not been inserted into the AV, the
       function should return FI_ADDR_NOTAVAIL.
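
       The owner side of these two callbacks can be sketched as follows.  This is a minimal
       illustration, not the implementation used by any provider.  The reduced struct fid,
       struct fid_ep and struct fi_av_attr definitions are stand-ins so that the sketch builds
       without the libfabric headers, and struct owner_av, owner_av_init() and
       owner_av_insert_ep() are hypothetical names introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without libfabric; real code would
 * include <rdma/fabric.h> and <rdma/fi_ext.h> instead. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)
struct fid { void *context; };
struct fid_ep { struct fid fid; };
struct fi_av_attr { size_t count; };      /* reduced to one field */

struct fid_peer_av;
struct fi_ops_av_owner {
    size_t size;
    int (*query)(struct fid_peer_av *av, struct fi_av_attr *attr);
    fi_addr_t (*ep_addr)(struct fid_peer_av *av, struct fid_ep *ep);
};
struct fid_peer_av {
    struct fid fid;
    struct fi_ops_av_owner *owner_ops;
};

/* Hypothetical owner AV: a fixed table of local endpoints in which the
 * table index doubles as the fi_addr_t. */
#define OWNER_AV_MAX 8
struct owner_av {
    struct fid_peer_av peer_av;           /* first member, so casts are valid */
    struct fid_ep *local_ep[OWNER_AV_MAX];
    size_t used;
};

/* query(): report the current state of the AV to the caller. */
static int owner_av_query(struct fid_peer_av *av, struct fi_av_attr *attr)
{
    struct owner_av *oav = (struct owner_av *) av;

    attr->count = oav->used;
    return 0;
}

/* ep_addr(): look up the fi_addr of a local endpoint, if inserted. */
static fi_addr_t owner_av_ep_addr(struct fid_peer_av *av, struct fid_ep *ep)
{
    struct owner_av *oav = (struct owner_av *) av;

    for (size_t i = 0; i < oav->used; i++) {
        if (oav->local_ep[i] == ep)
            return (fi_addr_t) i;
    }
    return FI_ADDR_NOTAVAIL;
}

static struct fi_ops_av_owner owner_av_ops = {
    .size = sizeof(struct fi_ops_av_owner),
    .query = owner_av_query,
    .ep_addr = owner_av_ep_addr,
};

static void owner_av_init(struct owner_av *oav)
{
    oav->peer_av.fid.context = oav;
    oav->peer_av.owner_ops = &owner_av_ops;
    oav->used = 0;
}

static fi_addr_t owner_av_insert_ep(struct owner_av *oav, struct fid_ep *ep)
{
    if (oav->used == OWNER_AV_MAX)
        return FI_ADDR_NOTAVAIL;
    oav->local_ep[oav->used] = ep;
    return (fi_addr_t) oav->used++;
}
```

       An offload provider that imported &oav->peer_av would then call
       peer_av->owner_ops->ep_addr(peer_av, ep) and treat FI_ADDR_NOTAVAIL as "not inserted".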

PEER AV SET

       The peer AV set allows the sharing of collective addressing data  between  providers.   It
       specifically targets the use case of pairing a main provider with a collective offload
       provider.  The setup for a peer AV set is similar to a shared CQ,  described  below.   The
       owner  of  the  AV set creates a fid_peer_av_set object that links back to its fid_av_set.
       The fid_peer_av_set is imported by the offload provider.

       Peer AV sets are configured by the owner calling the peer’s fi_av_set_open() call, passing
       in    the    FI_PEER_AV   flag,   and   pointing   the   context   parameter   to   struct
       fi_peer_av_set_context.

       The data structures to support peer AV sets are:

              struct fi_ops_av_set_owner {
                  size_t  size;
                  int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
                             size_t *count);
              };

              struct fid_peer_av_set {
                  struct fid fid;
                  struct fi_ops_av_set_owner *owner_ops;
              };

              struct fi_peer_av_set_context {
                  size_t size;
                  struct fid_peer_av_set *av_set;
              };

   fi_ops_av_set_owner::members()
       This call returns an array of AV addresses that are members of the AV set.   The  size  of
       the array is specified through the count parameter.  On return, count is set to the number
       of addresses in the AV set.  If the input count value is too small, the  function  returns
       -FI_ETOOSMALL.  Otherwise, the function returns an array of fi_addr values.
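
       The count handshake described above can be sketched as follows.  This is an assumed,
       simplified owner implementation: the struct fid stand-in and the numeric FI_ETOOSMALL
       value are placeholders for the real libfabric headers, and struct owner_av_set and
       owner_av_set_init() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins so the sketch builds without libfabric headers. */
typedef uint64_t fi_addr_t;
#define FI_ETOOSMALL 257                  /* placeholder errno value */
struct fid { void *context; };

struct fid_peer_av_set;
struct fi_ops_av_set_owner {
    size_t size;
    int (*members)(struct fid_peer_av_set *av, fi_addr_t *addr,
               size_t *count);
};
struct fid_peer_av_set {
    struct fid fid;
    struct fi_ops_av_set_owner *owner_ops;
};

/* Hypothetical owner AV set backed by a caller-provided address array. */
struct owner_av_set {
    struct fid_peer_av_set peer_set;      /* first member, so casts are valid */
    const fi_addr_t *addrs;
    size_t n;
};

/* members(): *count is the array capacity on input and the number of
 * addresses in the set on return, whether or not the copy happened. */
static int owner_members(struct fid_peer_av_set *av, fi_addr_t *addr,
                         size_t *count)
{
    struct owner_av_set *set = (struct owner_av_set *) av;
    size_t capacity = *count;

    *count = set->n;
    if (capacity < set->n)
        return -FI_ETOOSMALL;
    if (set->n)
        memcpy(addr, set->addrs, set->n * sizeof(*addr));
    return 0;
}

static struct fi_ops_av_set_owner owner_av_set_ops = {
    .size = sizeof(struct fi_ops_av_set_owner),
    .members = owner_members,
};

static void owner_av_set_init(struct owner_av_set *set,
                              const fi_addr_t *addrs, size_t n)
{
    set->peer_set.fid.context = set;
    set->peer_set.owner_ops = &owner_av_set_ops;
    set->addrs = addrs;
    set->n = n;
}
```

       A peer can use the -FI_ETOOSMALL return to learn the required size: it calls members()
       once, resizes its array to the returned count, and calls again.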

PEER CQ

       The  peer  CQ defines a mechanism by which a peer provider may insert completions into the
       CQ owned by another provider.  This avoids the overhead of the libfabric user  needing  to
       access multiple CQs.

       To set up a peer CQ, a provider creates a fid_peer_cq object, which links back to the
       provider’s actual fid_cq.  The fid_peer_cq object is then imported by a peer provider.
       The fid_peer_cq defines callbacks that the providers use to communicate with each other.
       The provider that allocates the fid_peer_cq is known as the owner, with the other provider
       referred to as the peer.  An owner may set up peer relationships with multiple providers.

       Peer  CQs  are  configured  by  the owner calling the peer’s fi_cq_open() call.  The owner
       passes in the FI_PEER flag to  fi_cq_open().   When  FI_PEER  is  specified,  the  context
       parameter  passed into fi_cq_open() must reference a struct fi_peer_cq_context.  Providers
       that do not support peer CQs must fail the fi_cq_open() call with  -FI_EINVAL  (indicating
       an  invalid  flag).   The  fid_peer_cq referenced by struct fi_peer_cq_context must remain
       valid until the peer’s CQ is closed.

       The data structures to support peer CQs are defined as follows:

              struct fi_ops_cq_owner {
                  size_t  size;
                  ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
                      size_t len, void *buf, uint64_t data, uint64_t tag, fi_addr_t src);
                  ssize_t (*writeerr)(struct fid_peer_cq *cq,
                      const struct fi_cq_err_entry *err_entry);
              };

              struct fid_peer_cq {
                  struct fid fid;
                  struct fi_ops_cq_owner *owner_ops;
              };

              struct fi_peer_cq_context {
                  size_t size;
                  struct fid_peer_cq *cq;
              };

       For struct fid_peer_cq, the owner  initializes  the  fid  and  owner_ops  fields.   struct
       fi_ops_cq_owner is used by the peer to communicate with the owning provider.

       If manual progress is needed on the peer CQ, the owner should drive progress by calling
       fi_cq_read() with the buf parameter set to NULL and count equal to 0.  The peer provider
       should set the other functions that attempt to read the peer’s CQ (e.g. fi_cq_readerr,
       fi_cq_sread) to return -FI_ENOSYS.

   fi_ops_cq_owner::write()
       This call directs the owner to insert new completions into the CQ.  The fi_cq_attr::format
       field,  along  with other related attributes, determines which input parameters are valid.
       Parameters that are not reported as part of a completion are ignored  by  the  owner,  and
       should  be set to 0, NULL, or other appropriate value by the user.  For example, if source
       addressing is not returned with a completion, then the src  parameter  should  be  set  to
       FI_ADDR_NOTAVAIL and ignored on input.

       The  owner  is  responsible  for locking, event signaling, and handling CQ overflow.  Data
       passed through the write callback is relative to the user.  For example, the fi_addr_t  is
       relative  to the peer’s AV.  The owner is responsible for converting the address if source
       addressing is needed.

       (TBD: should CQ overflow push back to the user for flow control?  Do  we  need  backoff  /
       resume callbacks in ops_cq_user?)

   fi_ops_cq_owner::writeerr()
       The  behavior  of  this  call  is  similar  to  the  write() ops.  It inserts a completion
       indicating that a data transfer has failed into the CQ.
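
       A minimal sketch of owner-side write() and writeerr() callbacks follows.  It assumes a
       CQ that only reports the operation context, so the remaining write() parameters are
       ignored.  The reduced struct definitions and the numeric FI_EAGAIN value are stand-ins
       for the libfabric headers; struct owner_cq and its helpers are hypothetical.  Returning
       -FI_EAGAIN when the queue is full is only one possible overflow policy, chosen here for
       illustration, and locking and event signaling are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>                    /* ssize_t */

/* Stand-ins so the sketch builds without libfabric headers. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)
#define FI_EAGAIN 11                      /* placeholder errno value */
struct fid { void *context; };
struct fi_cq_err_entry { void *op_context; int err; };   /* reduced */

struct fid_peer_cq;
struct fi_ops_cq_owner {
    size_t size;
    ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
        size_t len, void *buf, uint64_t data, uint64_t tag, fi_addr_t src);
    ssize_t (*writeerr)(struct fid_peer_cq *cq,
        const struct fi_cq_err_entry *err_entry);
};
struct fid_peer_cq {
    struct fid fid;
    struct fi_ops_cq_owner *owner_ops;
};

/* Hypothetical owner CQ: a small fixed array of completions. */
#define OWNER_CQ_DEPTH 4
struct owner_cq {
    struct fid_peer_cq peer_cq;           /* first member, so casts are valid */
    void *op_context[OWNER_CQ_DEPTH];
    int err[OWNER_CQ_DEPTH];              /* 0 for successful completions */
    size_t count;
};

static ssize_t owner_cq_push(struct owner_cq *cq, void *op_context, int err)
{
    if (cq->count == OWNER_CQ_DEPTH)
        return -FI_EAGAIN;                /* assumed overflow policy */
    cq->op_context[cq->count] = op_context;
    cq->err[cq->count] = err;
    cq->count++;
    return 0;
}

static ssize_t owner_cq_write(struct fid_peer_cq *cq, void *context,
        uint64_t flags, size_t len, void *buf, uint64_t data,
        uint64_t tag, fi_addr_t src)
{
    /* Context-only format: the other parameters are not reported. */
    (void) flags; (void) len; (void) buf;
    (void) data; (void) tag; (void) src;
    return owner_cq_push((struct owner_cq *) cq, context, 0);
}

static ssize_t owner_cq_writeerr(struct fid_peer_cq *cq,
        const struct fi_cq_err_entry *err_entry)
{
    return owner_cq_push((struct owner_cq *) cq, err_entry->op_context,
                         err_entry->err);
}

static struct fi_ops_cq_owner owner_cq_ops = {
    .size = sizeof(struct fi_ops_cq_owner),
    .write = owner_cq_write,
    .writeerr = owner_cq_writeerr,
};

static void owner_cq_init(struct owner_cq *cq)
{
    cq->peer_cq.fid.context = cq;
    cq->peer_cq.owner_ops = &owner_cq_ops;
    cq->count = 0;
}
```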

   EXAMPLE PEER CQ SETUP
       The above description defines the generic mechanism for  sharing  CQs  between  providers.
       This  section outlines one possible implementation to demonstrate the use of the APIs.  In
       the example, provider A uses provider B as a peer for data transfers  targeting  endpoints
       on the local node.

              1. Provider A is configured to use provider B as a peer.  This may be coded
                 into provider A or set through an environment variable.
              2. The application calls:
                 fi_cq_open(domain_a, attr, &cq_a, app_context)
              3. Provider A allocates cq_a and automatically configures it to be used
                 as a peer cq.
              4. Provider A takes these steps:
                 allocate peer_cq and reference cq_a
                 set peer_cq_context->cq = peer_cq
                 set attr_b.flags |= FI_PEER
                 fi_cq_open(domain_b, attr_b, &cq_b, peer_cq_context)
              5. Provider B allocates a cq, but configures it such that all completions
                 are written to the peer_cq.  The cq ops to read from the cq are
                 set to enosys calls.
              6. Provider B inserts its own callbacks into the peer_cq object.  It
                 creates a reference between the peer_cq object and its own cq.
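
       The setup steps above can be sketched in C as follows.  This is an illustration under
       stated assumptions rather than real provider code: peer_b_cq_open() stands in for
       provider B's fi_cq_open() implementation, owner_a_attach_peer() for step 4 inside
       provider A, and the FI_PEER, FI_EINVAL and FI_ENOSYS values are placeholders for the
       definitions in the libfabric headers.  Provider B is assumed to support only peer CQs
       here, and the size check on the context is an added assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>                    /* ssize_t */

/* Stand-ins; the flag and errno values are placeholders. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)
#define FI_PEER   (1ULL << 63)
#define FI_EINVAL 22
#define FI_ENOSYS 38
struct fid { void *context; };
struct fi_cq_attr { uint64_t flags; };    /* reduced to one field */
struct fi_cq_err_entry { void *op_context; int err; };   /* reduced */

struct fid_peer_cq;
struct fi_ops_cq_owner {
    size_t size;
    ssize_t (*write)(struct fid_peer_cq *cq, void *context, uint64_t flags,
        size_t len, void *buf, uint64_t data, uint64_t tag, fi_addr_t src);
    ssize_t (*writeerr)(struct fid_peer_cq *cq,
        const struct fi_cq_err_entry *err_entry);
};
struct fid_peer_cq { struct fid fid; struct fi_ops_cq_owner *owner_ops; };
struct fi_peer_cq_context { size_t size; struct fid_peer_cq *cq; };

/* ---- Provider B (peer) ---- */
struct peer_b_cq {
    struct fid_peer_cq *owner_cq;         /* must outlive this CQ */
};

/* Step 5: B accepts the imported peer_cq instead of creating a queue. */
static int peer_b_cq_open(struct fi_cq_attr *attr, struct peer_b_cq *cq,
                          void *context)
{
    struct fi_peer_cq_context *peer_ctx = context;

    if (!(attr->flags & FI_PEER) || !peer_ctx ||
        peer_ctx->size < sizeof(*peer_ctx) || !peer_ctx->cq)
        return -FI_EINVAL;
    cq->owner_cq = peer_ctx->cq;
    return 0;
}

/* Step 5: reading errors from B's CQ is not supported. */
static ssize_t peer_b_cq_readerr(struct peer_b_cq *cq)
{
    (void) cq;
    return -FI_ENOSYS;
}

/* B reports a finished operation by writing into the owner's CQ. */
static ssize_t peer_b_complete(struct peer_b_cq *cq, void *op_context)
{
    return cq->owner_cq->owner_ops->write(cq->owner_cq, op_context,
                                          0, 0, NULL, 0, 0, FI_ADDR_NOTAVAIL);
}

/* ---- Provider A (owner) ---- */
struct owner_a_cq {
    struct fid_peer_cq peer_cq;           /* first member, so casts are valid */
    void *last_context;
    size_t completions;
};

static ssize_t owner_a_write(struct fid_peer_cq *cq, void *context,
        uint64_t flags, size_t len, void *buf, uint64_t data,
        uint64_t tag, fi_addr_t src)
{
    struct owner_a_cq *cq_a = (struct owner_a_cq *) cq;

    (void) flags; (void) len; (void) buf;
    (void) data; (void) tag; (void) src;
    cq_a->last_context = context;
    cq_a->completions++;
    return 0;
}

static ssize_t owner_a_writeerr(struct fid_peer_cq *cq,
        const struct fi_cq_err_entry *err_entry)
{
    return owner_a_write(cq, err_entry->op_context, 0, 0, NULL, 0, 0,
                         FI_ADDR_NOTAVAIL);
}

static struct fi_ops_cq_owner owner_a_ops = {
    .size = sizeof(struct fi_ops_cq_owner),
    .write = owner_a_write,
    .writeerr = owner_a_writeerr,
};

/* Step 4: build the peer context and open B's CQ with FI_PEER set. */
static int owner_a_attach_peer(struct owner_a_cq *cq_a,
                               struct peer_b_cq *cq_b)
{
    struct fi_peer_cq_context peer_cq_context = {
        .size = sizeof(peer_cq_context),
        .cq = &cq_a->peer_cq,
    };
    struct fi_cq_attr attr_b = { .flags = FI_PEER };

    cq_a->peer_cq.fid.context = cq_a;
    cq_a->peer_cq.owner_ops = &owner_a_ops;
    cq_a->last_context = NULL;
    cq_a->completions = 0;
    return peer_b_cq_open(&attr_b, cq_b, &peer_cq_context);
}
```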

PEER DOMAIN

       The peer domain allows a provider to access the operations of a domain object of its peer.
       For example, an offload provider can use a peer domain to register memory buffers with the
       main provider.

       The setup of a peer domain is similar to the setup for a peer CQ outlined above.  The
       owner’s domain object is imported directly into the peer.

       Peer domains are configured by the owner calling the peer’s fi_domain2() call.  The  owner
       passes  in  the  FI_PEER  flag  to  fi_domain2().   When FI_PEER is specified, the context
       parameter  passed  into  fi_domain2()  must  reference  a  struct  fi_peer_domain_context.
       Providers  that  do  not  support  peer  domains  must  fail  the  fi_domain2()  call with
       -FI_EINVAL.  The fid_domain referenced by struct fi_peer_domain_context must remain  valid
       until the peer’s domain is closed.

       The data structures to support peer domains are defined as follows:

              struct fi_peer_domain_context {
                  size_t size;
                  struct fid_domain *domain;
              };

PEER EQ

       The  peer  EQ  defines  a mechanism by which a peer provider may insert events into the EQ
       owned by another provider.  This avoids the overhead of  the  libfabric  user  needing  to
       access multiple EQs.

       The setup of a peer EQ is similar to the setup for a peer CQ outlined above.  The owner’s
       EQ object is imported directly into the peer provider.

       Peer EQs are configured by the owner calling the  peer’s  fi_eq_open()  call.   The  owner
       passes  in  the  FI_PEER  flag  to  fi_eq_open().   When FI_PEER is specified, the context
       parameter passed into fi_eq_open() must reference a struct fi_peer_eq_context.   Providers
       that  do  not support peer EQs must fail the fi_eq_open() call with -FI_EINVAL (indicating
       an invalid flag).  The fid_eq referenced by struct fi_peer_eq_context  must  remain  valid
       until the peer’s EQ is closed.

       The data structures to support peer EQs are defined as follows:

              struct fi_peer_eq_context {
                  size_t size;
                  struct fid_eq *eq;
              };

PEER SRX

       The peer SRX defines a mechanism by which peer providers may share a common shared receive
       context.  This avoids the overhead of having separate receive queues, can eliminate memory
       copies, and ensures correct application level message ordering.

       The  setup  of  a  peer  SRX  is  similar  to  the  setup for a peer CQ outlined above.  A
       fid_peer_srx object links the owner of the SRX with the  peer  provider.   Peer  SRXs  are
       configured  by  the  owner  calling the peer’s fi_srx_context() call with the FI_PEER flag
       set.   The   context   parameter   passed   to   fi_srx_context()   must   be   a   struct
       fi_peer_srx_context.

       The  owner provider initializes all elements of the fid_peer_srx and referenced structures
       (fi_ops_srx_owner and fi_ops_srx_peer), with the exception of the fi_ops_srx_peer callback
       functions.   Those  must  be  initialized by the peer provider prior to returning from the
       fi_srx_context() call and are used by the owner to control peer actions.

       The data structures to support peer SRXs are defined as follows:

              struct fid_peer_srx;

              /* Castable to dlist_entry */
              struct fi_peer_rx_entry {
                  struct fi_peer_rx_entry *next;
                  struct fi_peer_rx_entry *prev;
                  struct fid_peer_srx *srx;
                  fi_addr_t addr;
                  size_t size;
                  uint64_t tag;
                  uint64_t flags;
                  void *context;
                  size_t count;
                  void **desc;
                  void *peer_context;
                  void *user_context;
                  struct iovec *iov;
              };

              struct fi_ops_srx_owner {
                  size_t size;
                  int (*get_msg)(struct fid_peer_srx *srx, fi_addr_t addr,
                                 size_t size, struct fi_peer_rx_entry **entry);
                  int (*get_tag)(struct fid_peer_srx *srx, fi_addr_t addr,
                                 uint64_t tag, struct fi_peer_rx_entry **entry);
                  int (*queue_msg)(struct fi_peer_rx_entry *entry);
                  int (*queue_tag)(struct fi_peer_rx_entry *entry);
                  void (*free_entry)(struct fi_peer_rx_entry *entry);
              };

              struct fi_ops_srx_peer {
                  size_t size;
                  int (*start_msg)(struct fid_peer_srx *srx);
                  int (*start_tag)(struct fid_peer_srx *srx);
                  int (*discard_msg)(struct fid_peer_srx *srx);
                  int (*discard_tag)(struct fid_peer_srx *srx);
              };

              struct fid_peer_srx {
                  struct fid_ep ep_fid;
                  struct fi_ops_srx_owner *owner_ops;
                  struct fi_ops_srx_peer *peer_ops;
              };

              struct fi_peer_srx_context {
                  size_t size;
                  struct fid_peer_srx *srx;
              };

       The ownership of structure field values and callback functions is similar to those defined
       for peer CQs, relative to owner versus peer ops.

   fi_ops_srx_owner::get_msg() / get_tag()
       These calls are invoked by the peer provider to obtain the receive buffer(s) where an
       incoming message should be placed.  The peer provider passes in the relevant fields to
       request a matching rx_entry from the owner.  If source addressing is required, the addr
       is passed in; otherwise, the address is set to FI_ADDR_NOTAVAIL.  The size field
       indicates the received message size.  This field is used by the owner when handling
       multi-receive data buffers, but may be ignored otherwise.  The peer provider is
       responsible for checking that an incoming message fits within the provided buffer space.
       The tag parameter is used for tagged messages.

       An fi_peer_rx_entry is allocated by the owner, whether or not a match was found.  If a
       match was found, the owner returns FI_SUCCESS and the rx_entry is filled in with the
       appropriate receive fields for the peer to process.  If no match was found, the owner
       returns -FI_ENOENT; the rx_entry is still valid but does not correspond to an existing
       posted receive.  When the peer gets -FI_ENOENT, it should allocate whatever resources it
       needs to process the message later (on start_msg/tag) and set rx_entry->peer_context
       appropriately, followed by a call to the owner’s queue_msg/tag.  The get and queue calls
       should be serialized.  When the owner gets a matching receive for the queued unexpected
       message, it calls the peer’s start function to notify the peer of the updated rx_entry
       (or the peer’s discard function if the message is to be discarded).  (TBD: The peer may
       need to update the src addr if the remote endpoint is inserted into the AV after the
       message has been received.)

   fi_ops_srx_peer::start_msg() / start_tag()
       These calls indicate that an asynchronous get_msg() or get_tag() has completed
       and a buffer is now available to receive the message.  Control of the fi_peer_rx_entry  is
       returned to the peer provider and has been initialized for receiving the incoming message.

   fi_ops_srx_peer::discard_msg() / discard_tag()
       Indicates  that the message and data associated with the specified fi_peer_rx_entry should
       be discarded.  This often indicates that the application has  canceled  or  discarded  the
       receive operation.  No completion should be generated by the peer provider for a discarded
       message.  Control of the fi_peer_rx_entry is returned to the peer provider.
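
       A peer-side view of the start and discard callbacks might look like the sketch below.
       Every name in it is hypothetical.  struct toy_rx_entry is a reduced stand-in for struct
       fi_peer_rx_entry with a single flat buffer in place of iov and count, and the callbacks
       take the entry directly so that the peer can reach its saved state, which is an
       assumption of this sketch.  The FI_ETRUNC value is a placeholder, and a real provider
       would report truncation through an error completion rather than only a return code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FI_ETRUNC 265                     /* placeholder errno value */

/* Reduced stand-in for struct fi_peer_rx_entry. */
struct toy_rx_entry {
    void *buf;                            /* receive buffer from the owner */
    size_t buf_len;
    void *peer_context;                   /* peer's state for a queued message */
};

/* Peer-private copy of an unexpected message, kept until start/discard. */
struct staged_msg {
    size_t len;
    unsigned char data[64];
};

static int completions;                   /* counts completions the peer generated */

/* Save an unexpected message; the result is stored in peer_context
 * before the entry is queued with the owner. */
static struct staged_msg *peer_stage(const void *data, size_t len)
{
    struct staged_msg *m;

    if (len > sizeof(m->data))
        return NULL;
    m = malloc(sizeof(*m));
    if (!m)
        return NULL;
    m->len = len;
    memcpy(m->data, data, len);
    return m;
}

/* start: a matching receive was posted, so deliver the staged data.
 * The peer checks that the message fits in the provided buffer. */
static int peer_start_msg(struct toy_rx_entry *entry)
{
    struct staged_msg *m = entry->peer_context;
    int ret = 0;

    if (m->len > entry->buf_len) {
        ret = -FI_ETRUNC;
    } else {
        memcpy(entry->buf, m->data, m->len);
        completions++;
    }
    free(m);
    entry->peer_context = NULL;
    return ret;
}

/* discard: release the staged data without generating a completion. */
static int peer_discard_msg(struct toy_rx_entry *entry)
{
    free(entry->peer_context);
    entry->peer_context = NULL;
    return 0;
}
```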

   EXAMPLE PEER SRX SETUP
       The above description defines the generic mechanism for sharing  SRXs  between  providers.
       This  section outlines one possible implementation to demonstrate the use of the APIs.  In
       the example, provider A uses provider B as a peer for data transfers  targeting  endpoints
       on the local node.

              1. Provider A is configured to use provider B as a peer.  This may be coded
                 into provider A or set through an environment variable.
              2. The application calls:
                 fi_srx_context(domain_a, attr, &srx_a, app_context)
              3. Provider A allocates srx_a and automatically configures it to be used
                 as a peer srx.
              4. Provider A takes these steps:
                 allocate peer_srx and reference srx_a
                 set peer_srx_context->srx = peer_srx
                 set attr_b.flags |= FI_PEER
                 fi_srx_context(domain_b, attr_b, &srx_b, peer_srx_context)
              5. Provider B allocates an srx, but configures it such that all receive
                 buffers are obtained from the peer_srx.  The srx ops to post receives are
                 set to enosys calls.
              6. Provider B inserts its own callbacks into the peer_srx object.  It
                 creates a reference between the peer_srx object and its own srx.

   EXAMPLE PEER SRX RECEIVE FLOW
       The following outlines show simplified example software flows for receive message
       handling using a peer SRX.  The first flow demonstrates the case where a receive buffer is
       waiting when the message arrives.

              1. Application calls fi_recv() / fi_trecv() on owner.
              2. Owner queues the receive buffer.
              3. A message is received by the peer provider.
              4. The peer calls owner->get_msg() / get_tag().
              5. The owner removes the queued receive buffer and returns it to
                 the peer.  The get entry call will complete with FI_SUCCESS.
              6. When the peer finishes processing the message and completes it on its own
                 CQ, the peer will call free_entry to free the entry with the owner.

       The  second  case  below  shows the flow when a message arrives before the application has
       posted the matching receive buffer.

              1. A message is received by the peer provider.
              2. The peer calls owner->get_msg() / get_tag().
              3. The owner fails to find a matching receive buffer.
              4. The owner allocates a rx_entry with any known fields and returns -FI_ENOENT.
              5. The peer allocates any resources needed to handle the asynchronous processing
                 and sets peer_context accordingly.
              6. The peer calls the owner's queue function and the owner queues the peer request
                 on an unexpected/pending list.
              7. The application calls fi_recv() / fi_trecv() on owner, posting the
                 matching receive buffer.
              8. The owner matches the receive with the queued message on the peer.
              9. The owner removes the queued request, fills in the rest of the known fields
                 and calls the peer->start_msg() / start_tag() function.
              10. When the peer finishes processing the message and completes it on its own
                  CQ, the peer will call free_entry to free the entry with the owner.
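
       Both flows can be modeled with two queues on the owner, as in the toy sketch below.
       It is not taken from any provider: struct toy_srx and struct toy_rx_entry are reduced,
       hypothetical stand-ins for the peer SRX structures, only tagged matching on an exact
       tag is modeled, the start callback is assumed to receive the entry, and the errno
       values are placeholders.  Locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FI_ENOENT 2                       /* placeholder errno values */
#define FI_ENOMEM 12

struct toy_srx;
struct toy_rx_entry {
    struct toy_rx_entry *next;
    struct toy_srx *srx;
    uint64_t tag;
    void *buf;                            /* NULL while the message is unexpected */
    void *peer_context;                   /* set by the peer before queueing */
};

struct toy_srx {
    struct toy_rx_entry *posted;          /* receives waiting for a message */
    struct toy_rx_entry *unexpected;      /* messages waiting for a receive */
    int (*start_tag)(struct toy_rx_entry *entry);   /* peer callback */
};

/* Unlink and return the first entry on the list with a matching tag. */
static struct toy_rx_entry *take(struct toy_rx_entry **list, uint64_t tag)
{
    for (; *list; list = &(*list)->next) {
        if ((*list)->tag == tag) {
            struct toy_rx_entry *e = *list;
            *list = e->next;
            e->next = NULL;
            return e;
        }
    }
    return NULL;
}

/* Append to the tail so that entries are matched in arrival order. */
static void push(struct toy_rx_entry **list, struct toy_rx_entry *e)
{
    while (*list)
        list = &(*list)->next;
    e->next = NULL;
    *list = e;
}

/* owner get_tag: return a posted receive, or a fresh entry plus -FI_ENOENT. */
static int owner_get_tag(struct toy_srx *srx, uint64_t tag,
                         struct toy_rx_entry **entry)
{
    *entry = take(&srx->posted, tag);
    if (*entry)
        return 0;
    *entry = calloc(1, sizeof(**entry));
    if (!*entry)
        return -FI_ENOMEM;
    (*entry)->srx = srx;
    (*entry)->tag = tag;
    return -FI_ENOENT;
}

/* owner queue_tag: park an unexpected message until a receive is posted. */
static int owner_queue_tag(struct toy_rx_entry *entry)
{
    push(&entry->srx->unexpected, entry);
    return 0;
}

static void owner_free_entry(struct toy_rx_entry *entry)
{
    free(entry);
}

/* Application-facing post on the owner, analogous to fi_trecv(). */
static int owner_post_trecv(struct toy_srx *srx, void *buf, uint64_t tag)
{
    struct toy_rx_entry *e = take(&srx->unexpected, tag);

    if (e) {                              /* message arrived first */
        e->buf = buf;
        return srx->start_tag(e);
    }
    e = calloc(1, sizeof(*e));
    if (!e)
        return -FI_ENOMEM;
    e->srx = srx;
    e->tag = tag;
    e->buf = buf;
    push(&srx->posted, e);
    return 0;
}

/* Minimal peer callback for demonstration: remember the started entry. */
static struct toy_rx_entry *peer_started;

static int peer_start_tag(struct toy_rx_entry *entry)
{
    peer_started = entry;
    return 0;
}
```

       In the first flow the peer sees owner_get_tag() return 0 with buf already set.  In the
       second it sees -FI_ENOENT, fills in peer_context, calls owner_queue_tag(), and is later
       called back through start_tag once owner_post_trecv() finds the queued entry.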

fi_export_fid / fi_import_fid

       The fi_export_fid function is reserved for future use.

       The fi_import_fid call may be used to import a fabric object  created  and  owned  by  the
       libfabric  user.   This  allows  upper  level  libraries or the application to override or
       define low-level libfabric behavior.   Details  on  specific  uses  of  fi_import_fid  are
       outside the scope of this documentation.

FI_PEER_TRANSFER

       Providers  frequently  send control messages to their remote counterparts as part of their
       wire protocol.  For example, a provider may send an  ACK  message  to  guarantee  reliable
       delivery  of  a  message  or  to  meet  a requested completion semantic.  When two or more
       providers are coordinating as peers, it can be more efficient if control messages for both
       peer  providers  go over the same transport.  In some cases, such as when one of the peers
       is an offload provider, it may even be required.  Peer transfers define the  mechanism  by
       which such communication occurs.

       Peer  transfers  enable  one  peer  to send and receive data transfers over its associated
       peer.   Providers  that  require  this  functionality  indicate  this   by   setting   the
       FI_PEER_TRANSFER flag as a mode bit, i.e. fi_info::mode.

       To use such a provider as a peer, the main, or owner, provider must set up peer transfers
       by opening a peer transfer endpoint and accepting transfers with this flag set.  Setup  of
       peer transfers involves the following data structures:

              struct fi_ops_transfer_peer {
                  size_t size;
                  ssize_t (*complete)(struct fid_ep *ep, struct fi_cq_tagged_entry *buf,
                          fi_addr_t *src_addr);
                  ssize_t (*comperr)(struct fid_ep *ep, struct fi_cq_err_entry *buf);
              };

              struct fi_peer_transfer_context {
                  size_t size;
                  struct fi_info *info;
                  struct fid_ep *ep;
                  struct fi_ops_transfer_peer *peer_ops;
              };

       Peer transfer contexts form a virtual link between endpoints allocated on each of the peer
       providers.  The setup of a peer transfer context occurs through  the  fi_endpoint2()  API.
       The main provider calls fi_endpoint2(), passing in the FI_PEER_TRANSFER flag.  When
       specified, the context parameter must reference the struct fi_peer_transfer_context
       defined above.

       The  size  field indicates the size of struct fi_peer_transfer_context being passed to the
       peer.  This is used for backward compatibility.  The info field is optional.  If given, it
       defines  the  attributes  of  the  main  provider’s objects.  It may be used to report the
       capabilities and restrictions on peer transfers, such as whether  memory  registration  is
       required, maximum message sizes, data and completion ordering semantics, and so forth.  If
       the importing provider cannot meet these restrictions, it  must  fail  the  fi_endpoint2()
       call.

       The  peer_ops field contains callbacks from the main provider into the peer and is used to
       report the completion (success or failure) of peer initiated data transfers.  The callback
       functions  defined  in struct fi_ops_transfer_peer must be set by the peer provider before
       returning from the fi_endpoint2() call.  Actions that the  peer  provider  can  take  from
       within the completion callbacks are mostly unrestricted, and can include any of the
       following types of operations: initiation of additional data transfers, writing events  to
       the  owner’s CQ or EQ, and memory registration/deregistration.  The owner must ensure that
       deadlock cannot occur prior to invoking the peer’s callback should the peer invoke any  of
       these  operations.   Further,  the  owner  must  avoid recursive calls into the completion
       callbacks.
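
       One way an owner might avoid recursive calls into the completion callbacks is to queue
       completions and drain them from a single outer call, as sketched below.  This is an
       assumed design, not something the interface mandates.  struct owner_xfer and
       owner_report() are hypothetical, the reduced struct fi_cq_tagged_entry and the FI_EAGAIN
       value are stand-ins for the libfabric headers, and demo_complete() plays the role of a
       peer callback that starts a follow-up transfer which completes immediately.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>                    /* ssize_t */

/* Stand-ins so the sketch builds without libfabric headers. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_NOTAVAIL ((fi_addr_t) -1)
#define FI_EAGAIN 11                      /* placeholder errno value */
struct fid_ep;                            /* opaque in this sketch */
struct fi_cq_tagged_entry { void *op_context; };     /* reduced */
struct fi_cq_err_entry { void *op_context; int err; };   /* reduced */

struct fi_ops_transfer_peer {
    size_t size;
    ssize_t (*complete)(struct fid_ep *ep, struct fi_cq_tagged_entry *buf,
            fi_addr_t *src_addr);
    ssize_t (*comperr)(struct fid_ep *ep, struct fi_cq_err_entry *buf);
};

/* Hypothetical owner state for one peer transfer endpoint. */
#define PENDING_MAX 8
struct owner_xfer {
    struct fi_ops_transfer_peer *peer_ops;    /* filled in by the peer */
    struct fid_ep *peer_ep;
    struct fi_cq_tagged_entry pending[PENDING_MAX];
    size_t npending;
    int in_callback;                      /* set while draining */
};

/* Report a finished peer transfer.  If a callback is already running,
 * the completion is only queued and the outer call delivers it. */
static int owner_report(struct owner_xfer *ox, void *op_context)
{
    if (ox->npending == PENDING_MAX)
        return -FI_EAGAIN;
    ox->pending[ox->npending++].op_context = op_context;
    if (ox->in_callback)
        return 0;

    ox->in_callback = 1;
    for (size_t i = 0; i < ox->npending; i++) {
        struct fi_cq_tagged_entry entry = ox->pending[i];
        fi_addr_t src = FI_ADDR_NOTAVAIL;

        /* The callback may call owner_report() again; npending is
         * re-read on every iteration, so new entries are picked up. */
        ox->peer_ops->complete(ox->peer_ep, &entry, &src);
    }
    ox->npending = 0;
    ox->in_callback = 0;
    return 0;
}

/* Demonstration peer callback that tracks its own nesting depth. */
static struct owner_xfer *demo_owner;
static int demo_depth, demo_max_depth, demo_calls;

static ssize_t demo_complete(struct fid_ep *ep,
        struct fi_cq_tagged_entry *buf, fi_addr_t *src_addr)
{
    (void) ep; (void) src_addr;
    demo_depth++;
    if (demo_depth > demo_max_depth)
        demo_max_depth = demo_depth;
    if (++demo_calls == 1)                /* follow-up transfer completes at once */
        owner_report(demo_owner, buf->op_context);
    demo_depth--;
    return 0;
}
```

       With this arrangement the follow-up completion is delivered by the loop after the first
       callback returns, so the callback depth never exceeds one.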

RETURN VALUE

       Returns FI_SUCCESS on success.  On error, a negative value corresponding to  fabric  errno
       is returned.  Fabric errno values are defined in rdma/fi_errno.h.

SEE ALSO

       fi_provider(7), fi_provider(3), fi_cq(3)

AUTHORS

       OpenFabrics.