Provided by: libfabric-dev_1.5.3-1_amd64

NAME

       fi_endpoint - Fabric endpoint operations

       fi_endpoint / fi_scalable_ep / fi_passive_ep / fi_close
              Allocate or close an endpoint.

       fi_ep_bind
              Associate  an  endpoint  with  hardware resources, such as event queues, completion
              queues, counters, address vectors, or shared transmit/receive contexts.

       fi_scalable_ep_bind
              Associate a scalable endpoint with an address vector

       fi_pep_bind
              Associate a passive endpoint with an event queue

       fi_enable
              Transitions an active endpoint into an enabled state.

       fi_cancel
              Cancel a pending asynchronous data transfer

       fi_ep_alias
              Create an alias to the endpoint

       fi_control
              Control endpoint operation.

       fi_getopt / fi_setopt
              Get or set endpoint options.

       fi_rx_context / fi_tx_context / fi_srx_context / fi_stx_context
              Open a transmit or receive context.

       fi_rx_size_left / fi_tx_size_left (DEPRECATED)
               Query the lower bound on how many RX/TX operations may be posted without an
               operation returning -FI_EAGAIN.  These functions have been deprecated and will
               be removed in a future version of the library.

SYNOPSIS

              #include <rdma/fabric.h>

              #include <rdma/fi_endpoint.h>

              int fi_endpoint(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **ep, void *context);

              int fi_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                  struct fid_ep **sep, void *context);

              int fi_passive_ep(struct fid_fabric *fabric, struct fi_info *info,
                  struct fid_pep **pep, void *context);

              int fi_tx_context(struct fid_ep *sep, int index,
                  struct fi_tx_attr *attr, struct fid_ep **tx_ep,
                  void *context);

              int fi_rx_context(struct fid_ep *sep, int index,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_stx_context(struct fid_domain *domain,
                  struct fi_tx_attr *attr, struct fid_stx **stx,
                  void *context);

              int fi_srx_context(struct fid_domain *domain,
                  struct fi_rx_attr *attr, struct fid_ep **rx_ep,
                  void *context);

              int fi_close(struct fid *ep);

              int fi_ep_bind(struct fid_ep *ep, struct fid *fid, uint64_t flags);

              int fi_scalable_ep_bind(struct fid_ep *sep, struct fid *fid, uint64_t flags);

              int fi_pep_bind(struct fid_pep *pep, struct fid *fid, uint64_t flags);

              int fi_enable(struct fid_ep *ep);

              int fi_cancel(struct fid_ep *ep, void *context);

              int fi_ep_alias(struct fid_ep *ep, struct fid_ep **alias_ep, uint64_t flags);

              int fi_control(struct fid *ep, int command, void *arg);

              int fi_getopt(struct fid *ep, int level, int optname,
                  void *optval, size_t *optlen);

              int fi_setopt(struct fid *ep, int level, int optname,
                  const void *optval, size_t optlen);

              DEPRECATED ssize_t fi_rx_size_left(struct fid_ep *ep);

              DEPRECATED ssize_t fi_tx_size_left(struct fid_ep *ep);

ARGUMENTS

       fid : On creation, specifies a fabric or access domain.  On bind, identifies the event
       queue, completion queue, counter, or address vector to bind to the endpoint.  In other
       cases, it is the fabric identifier of an associated resource.

       info : Details about the fabric interface endpoint to be opened, obtained from fi_getinfo.

       ep : A fabric endpoint.

       sep : A scalable fabric endpoint.

       pep : A passive fabric endpoint.

       context : Context associated with the endpoint or asynchronous operation.

       index : Index to retrieve a specific transmit/receive context.

       attr : Transmit or receive context attributes.

       flags : Additional flags to apply to the operation.

       command : Command of control operation to perform on endpoint.

       arg : Optional control argument.

       level : Protocol level at which the desired option resides.

       optname : The protocol option to read or set.

       optval : The option value that was read or to set.

       optlen : The size of the optval buffer.

DESCRIPTION

       Endpoints are transport level communication portals.  There are two  types  of  endpoints:
       active  and  passive.  Passive endpoints belong to a fabric domain and are most often used
       to listen for incoming connection requests.  However, a passive endpoint may  be  used  to
       reserve  a  fabric  address  that  can be granted to an active endpoint.  Active endpoints
       belong to access domains and can perform data transfers.

       Active endpoints may be  connection-oriented  or  connectionless,  and  may  provide  data
       reliability.    The  data  transfer  interfaces  --  messages  (fi_msg),  tagged  messages
       (fi_tagged),  RMA  (fi_rma),  and  atomics  (fi_atomic)  --  are  associated  with  active
       endpoints.   In  basic configurations, an active endpoint has transmit and receive queues.
       In general, operations that generate traffic on the fabric  are  posted  to  the  transmit
       queue.   This  includes  all  RMA and atomic operations, along with sent messages and sent
       tagged messages.  Operations that post buffers for receiving incoming data  are  submitted
       to the receive queue.

       Active  endpoints are created in the disabled state.  They must transition into an enabled
       state before accepting data transfer operations, including  posting  of  receive  buffers.
       The  fi_enable  call  is used to transition an active endpoint into an enabled state.  The
       fi_connect and fi_accept calls will also transition an endpoint into the enabled state, if
        it is not already enabled.

       In  order to transition an endpoint into an enabled state, it must be bound to one or more
       fabric resources.  An endpoint that will generate asynchronous completions, either through
       data  transfer  operations  or  communication  establishment  events, must be bound to the
       appropriate completion  queues  or  event  queues,  respectively,  before  being  enabled.
       Unconnected endpoints must be bound to an address vector.

       Once an endpoint has been activated, it may be associated with an address vector.  Receive
       buffers may be posted to it and calls may be made to  connection  establishment  routines.
       Connectionless endpoints may also perform data transfers.

       The  behavior  of  an  endpoint  may  be adjusted by setting its control data and protocol
       options.   This  allows  the  underlying  provider   to   redirect   function   calls   to
       implementations optimized to meet the desired application behavior.

       If  an  endpoint  experiences  a  critical  error, it will transition back into a disabled
       state.  Critical errors are reported through the event queue associated with the  EP.   In
       certain cases, a disabled endpoint may be re-enabled.  The ability to transition back into
       an enabled state is provider specific and depends on the type of error that  the  endpoint
       experienced.   When  an  endpoint is disabled as a result of a critical error, all pending
       operations are discarded.

   fi_endpoint / fi_passive_ep / fi_scalable_ep
       fi_endpoint allocates a new  active  endpoint.   fi_passive_ep  allocates  a  new  passive
       endpoint.   fi_scalable_ep  allocates a scalable endpoint.  The properties and behavior of
       the endpoint are defined based  on  the  provided  struct  fi_info.   See  fi_getinfo  for
       additional  details  on  fi_info.  fi_info flags that control the operation of an endpoint
       are defined below.  See section SCALABLE ENDPOINTS.

       If an active endpoint is allocated in order to accept a connection  request,  the  fi_info
       parameter  must  be the same as the fi_info structure provided with the connection request
       (FI_CONNREQ) event.

       An active endpoint may acquire the properties of a passive endpoint by setting the fi_info
       handle  field  to the passive endpoint fabric descriptor.  This is useful for applications
       that need to reserve the fabric address of an endpoint prior to knowing  if  the  endpoint
       will  be used on the active or passive side of a connection.  For example, this feature is
       useful for simulating socket semantics.  Once an active endpoint acquires  the  properties
       of a passive endpoint, the passive endpoint is no longer bound to any fabric resources and
       must no longer be used.  The user is expected to close the passive endpoint after  opening
       the active endpoint in order to free up any lingering resources that had been used.
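
        As an illustration, the following sketch allocates an active endpoint from the fi_info
        structure returned by fi_getinfo.  Error handling is abbreviated, and the hints
        structure is assumed to have been allocated with fi_allocinfo and filled in by the
        application.

               struct fi_info *info;
               struct fid_fabric *fabric;
               struct fid_domain *domain;
               struct fid_ep *ep;
               int ret;

               /* hints is assumed to have been set up by the application */
               ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
               if (ret)
                   return ret;

               /* Open the fabric and access domain described by the returned info */
               ret = fi_fabric(info->fabric_attr, &fabric, NULL);
               if (ret)
                   return ret;
               ret = fi_domain(fabric, info, &domain, NULL);
               if (ret)
                   return ret;

               /* Allocate the active endpoint; it is created in the disabled state */
               ret = fi_endpoint(domain, info, &ep, NULL);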

   fi_close
        Closes an endpoint and releases all resources associated with it.

       When  closing  a  scalable endpoint, there must be no opened transmit contexts, or receive
       contexts associated with the scalable endpoint.  If resources are  still  associated  with
       the scalable endpoint when attempting to close, the call will return -FI_EBUSY.

       Outstanding  operations  posted to the endpoint when fi_close is called will be discarded.
       Discarded  operations  will  silently  be   dropped,   with   no   completions   reported.
       Additionally,  a  provider may discard previously completed operations from the associated
       completion queue(s).  The behavior to discard completed operations is provider specific.

   fi_ep_bind
       fi_ep_bind is used to associate an endpoint with hardware resources.  The  common  use  of
       fi_ep_bind  is  to  direct  asynchronous  operations  associated  with  an  endpoint  to a
       completion queue.  An endpoint must be bound with CQs capable of reporting completions for
       any  asynchronous  operation  initiated  on the endpoint.  This is true even for endpoints
       which are configured to suppress successful completions, in  order  that  operations  that
       complete  in  error  may  be  reported  to the user.  For passive endpoints, this requires
       binding the endpoint with an EQ that supports the communication management (CM) domain.

       An active endpoint may direct asynchronous completions to different CQs, based on the type
       of  operation.  This is specified using fi_ep_bind flags.  The following flags may be used
       separately or OR'ed together when binding an endpoint to a completion domain CQ.

       FI_TRANSMIT : Directs the completion of outbound data transfer requests to  the  specified
       completion queue.  This includes send message, RMA, and atomic operations.

       FI_RECV  :  Directs the notification of inbound data transfers to the specified completion
       queue.   This  includes  received   messages.    This   binding   automatically   includes
       FI_REMOTE_WRITE, if applicable to the endpoint.

       FI_SELECTIVE_COMPLETION : By default, data transfer operations generate completion entries
       into a completion queue after they have successfully completed.  Applications can use this
       bind    flag    to    selectively    enable    when   completions   are   generated.    If
       FI_SELECTIVE_COMPLETION is specified, data transfer operations will not  generate  entries
       for  successful  completions  unless  FI_COMPLETION  is set as an operational flag for the
       given operation.  FI_SELECTIVE_COMPLETION must be OR'ed with  FI_TRANSMIT  and/or  FI_RECV
       flags.

       When  FI_SELECTIVE_COMPLETION is set, the user must determine when a request that does NOT
       have FI_COMPLETION set has completed indirectly, usually based  on  the  completion  of  a
       subsequent  operation.   Use of this flag may improve performance by allowing the provider
       to avoid writing a completion entry for every operation.

       Example: An application can selectively generate send completions by using  the  following
       general approach:

                fi_tx_attr::op_flags = 0; // default - no completion
                fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
                fi_send(ep, ...);                   // no completion
                fi_sendv(ep, ...);                  // no completion
                fi_sendmsg(ep, ..., FI_COMPLETION); // completion!
                fi_inject(ep, ...);                 // no completion

       Example:  An  application  can  selectively  disable  send  completions  by  modifying the
       operational flags:

                fi_tx_attr::op_flags = FI_COMPLETION; // default - completion
                fi_ep_bind(ep, cq, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);
                fi_send(ep, ...);       // completion
                fi_sendv(ep, ...);      // completion
                fi_sendmsg(ep, ..., 0); // no completion!
                fi_inject(ep, ...);     // no completion!

       Example: Omitting FI_SELECTIVE_COMPLETION when binding will generate completions  for  all
       non-fi_inject calls:

                fi_tx_attr::op_flags = 0;
                fi_ep_bind(ep, cq, FI_TRANSMIT);    // default - completion
                fi_send(ep, ...);                   // completion
                fi_sendv(ep, ...);                  // completion
                fi_sendmsg(ep, ..., 0);             // completion!
                fi_sendmsg(ep, ..., FI_COMPLETION); // completion
                fi_sendmsg(ep, ..., FI_INJECT|FI_COMPLETION); // completion!
                fi_inject(ep, ...);                 // no completion!

       An  endpoint may also, or instead, be bound to a fabric counter.  When binding an endpoint
       to a counter, the following flags may be specified.

       FI_SEND : Increments the specified counter whenever a message transfer initiated over  the
       endpoint  has  completed  successfully or in error.  Sent messages include both tagged and
       normal message operations.

       FI_RECV : Increments the specified  counter  whenever  a  message  is  received  over  the
       endpoint.  Received messages include both tagged and normal message operations.

       FI_READ  : Increments the specified counter whenever an RMA read or atomic fetch operation
       initiated from the endpoint has completed successfully or in error.

       FI_WRITE : Increments the specified counter whenever an  RMA  write  or  atomic  operation
       initiated from the endpoint has completed successfully or in error.

       FI_REMOTE_READ  :  Increments  the  specified counter whenever an RMA read or atomic fetch
       operation is initiated from a remote endpoint that targets the  given  endpoint.   Use  of
       this flag requires that the endpoint be created using FI_RMA_EVENT.

       FI_REMOTE_WRITE  :  Increments  the  specified  counter  whenever  an  RMA write or atomic
       operation is initiated from a remote endpoint that targets the  given  endpoint.   Use  of
       this flag requires that the endpoint be created using FI_RMA_EVENT.

       An  endpoint  may  only  be bound to a single CQ or counter for a given type of operation.
        For example, an EP may not bind to two counters both using FI_WRITE.  Furthermore,
       providers may limit CQ and counter bindings to endpoints of the same endpoint type (DGRAM,
       MSG, RDM, etc.).

       Connectionless endpoints must be bound to a single address vector.

       If an endpoint is using a shared transmit and/or receive context, the shared contexts must
       be  bound  to  the  endpoint.   CQs,  counters,  AV,  and shared contexts must be bound to
       endpoints before they are enabled.
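
        For example, a connectionless endpoint might be bound to a single CQ for both transmit
        and receive completions, a counter that tracks completed sends, and an address vector.
        The sketch below assumes the cq, cntr, and av objects were opened against the same
        domain as the endpoint; error handling is abbreviated.

               int ret;

               /* One CQ reports both transmit and receive completions */
               ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
               if (ret)
                   return ret;

               /* Count completed message sends on a fabric counter */
               ret = fi_ep_bind(ep, &cntr->fid, FI_SEND);
               if (ret)
                   return ret;

               /* Connectionless endpoints must also bind an address vector */
               ret = fi_ep_bind(ep, &av->fid, 0);
               if (ret)
                   return ret;

               /* All resources are bound; the endpoint may now be enabled */
               ret = fi_enable(ep);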

   fi_scalable_ep_bind
       fi_scalable_ep_bind is used to associate a scalable endpoint with an address vector.   See
       section  on  SCALABLE ENDPOINTS.  A scalable endpoint has a single transport level address
       and can support multiple transmit and receive contexts.  The transmit and receive contexts
       share  the  transport-level address.  Address vectors that are bound to scalable endpoints
       are implicitly bound to any transmit  or  receive  contexts  created  using  the  scalable
       endpoint.

   fi_enable
       This  call  transitions  the  endpoint into an enabled state.  An endpoint must be enabled
       before it may be used to perform data transfers.  Enabling an endpoint  typically  results
       in  hardware  resources  being assigned to it.  Endpoints making use of completion queues,
       counters, event queues, and/or address vectors must be bound to them before being enabled.

       Calling connect or accept on an endpoint will implicitly enable an endpoint if it has  not
       already been enabled.

       fi_enable  may also be used to re-enable an endpoint that has been disabled as a result of
       experiencing a critical error.  Applications should check the return value from  fi_enable
        to see if a disabled endpoint has successfully been re-enabled.
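
        For example, an application recovering from a critical error might attempt to re-enable
        the endpoint as sketched below, tearing it down if the provider does not support
        recovery.

               int ret;

               /* Attempt to recover an endpoint disabled by a critical error;
                * whether re-enabling succeeds is provider specific */
               ret = fi_enable(ep);
               if (ret) {
                   /* Recovery not supported; release the endpoint */
                   fi_close(&ep->fid);
               }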

   fi_cancel
       fi_cancel  attempts  to  cancel  an  outstanding  asynchronous  operation.   Canceling  an
       operation causes the fabric provider to search for the  operation  and,  if  it  is  still
       pending,  complete  it as having been canceled.  An error queue entry will be available in
        the associated error queue with error code FI_ECANCELED.  On the other hand, if the
       operation  completed  before  the  call  to  fi_cancel, then the completion status of that
       operation will be available in the associated completion queue.  No specific entry related
       to  fi_cancel itself will be posted.  If multiple outstanding operations match the context
       parameter, only one will be canceled.  In this case, the operation which  is  canceled  is
       provider  specific.   The  cancel  operation  is  asynchronous, but will complete within a
       bounded period of time.
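
        The sketch below cancels a previously posted receive, identified by the &recv_ctx
        pointer that was passed as the context argument to the original fi_recv call, and then
        retrieves the resulting FI_ECANCELED entry from the bound completion queue.  The loop
        assumes the operation was still pending when canceled; error handling is abbreviated.

               struct fi_cq_err_entry err_entry;
               ssize_t rd;
               int ret;

               /* Request cancellation of the receive posted with &recv_ctx */
               ret = fi_cancel(ep, &recv_ctx);
               if (ret)
                   return ret;

               /* The canceled operation completes in error with FI_ECANCELED */
               do {
                   rd = fi_cq_readerr(cq, &err_entry, 0);
               } while (rd == -FI_EAGAIN);

               if (rd > 0 && err_entry.err == FI_ECANCELED) {
                   /* err_entry.op_context matches &recv_ctx */
               }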

   fi_ep_alias
       This call creates an alias to the specified endpoint.   Conceptually,  an  endpoint  alias
       provides  an  alternate  software  path  from  the  application to the underlying provider
       hardware.  An alias EP differs from its parent endpoint only by its default data  transfer
       flags.  For example, an alias EP may be configured to use a different completion mode.  By
       default, an alias EP inherits the same data transfer flags as  the  parent  endpoint.   An
       application can use fi_control to modify the alias EP operational flags.

       When  allocating  an  alias,  an  application may configure either the transmit or receive
       operational flags.  This avoids needing a separate call to fi_control to set those  flags.
       The  flags passed to fi_ep_alias must include FI_TRANSMIT or FI_RECV (not both) with other
       operational  flags  OR'ed  in.   This  will  override  the  transmit  or  receive   flags,
       respectively,  for  operations  posted  through the alias endpoint.  All allocated aliases
       must be closed for the underlying endpoint to be released.
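
        For example, an endpoint configured to suppress successful send completions may be
        paired with an alias whose transmit flags request completions, so the application can
        post through the alias whenever a completion is needed.  This is a sketch; alias
        support and flag handling are provider specific, and buf, len, desc, dest_addr, and ctx
        are placeholders.

               struct fid_ep *alias_ep;
               int ret;

               /* The alias shares all resources with ep, but its transmit
                * operations default to generating completions */
               ret = fi_ep_alias(ep, &alias_ep, FI_TRANSMIT | FI_COMPLETION);
               if (ret)
                   return ret;

               fi_send(ep, buf, len, desc, dest_addr, NULL);        // no completion
               fi_send(alias_ep, buf, len, desc, dest_addr, &ctx);  // completion

               /* All aliases must be closed before ep can be released */
               fi_close(&alias_ep->fid);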

   fi_control
       The control operation is used to adjust the default behavior of an  endpoint.   It  allows
       the  underlying  provider  to redirect function calls to implementations optimized to meet
       the desired application behavior.  As a result, calls to fi_ep_control must be  serialized
       against all other calls to an endpoint.

       The  base  operation of an endpoint is selected during creation using struct fi_info.  The
       following control commands and arguments may be assigned to an endpoint.

        FI_GETOPSFLAG -- uint64_t *flags : Used to retrieve the current value of the flags
        associated with the data transfer operations initiated on the endpoint.  The control
        argument must include FI_TRANSMIT or FI_RECV (not both) to indicate the type of data
        transfer flags to be returned.  See below for a list of control flags.

        FI_SETOPSFLAG -- uint64_t *flags : Used to change the data transfer operation flags
        associated with an endpoint.  The control argument must include FI_TRANSMIT or FI_RECV
        (not both) to indicate the type of data transfer that the flags should apply to, with
        other flags OR'ed in.  The given flags will override the previous transmit or receive
        attributes that were set when the endpoint was created.  Valid control flags are
        defined below.

        FI_BACKLOG -- int *value : This option only applies to passive endpoints.  It is used
        to set the connection request backlog for listening endpoints.

       FI_GETWAIT  (void  **)  :  This  command  allows  the user to retrieve the file descriptor
       associated with a socket endpoint.  The fi_control arg  parameter  should  be  an  address
        where a pointer to the returned file descriptor will be written.  See fi_eq.3 for
        additional details on using fi_control with FI_GETWAIT.  The file descriptor may be used
        for
       notification that the endpoint is ready to send or receive data.
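
        The sketch below uses fi_control to query the current transmit operation flags and to
        retrieve the wait file descriptor of a socket endpoint.  FI_GETWAIT is only meaningful
        for endpoints that expose a file descriptor; error handling is abbreviated.

               uint64_t flags;
               int fd;
               int ret;

               /* On input, flags selects the transmit flags; on output it
                * holds the current operational flags */
               flags = FI_TRANSMIT;
               ret = fi_control(&ep->fid, FI_GETOPSFLAG, &flags);
               if (ret)
                   return ret;

               /* Retrieve the file descriptor backing a socket endpoint;
                * fd may then be passed to poll() or select() */
               ret = fi_control(&ep->fid, FI_GETWAIT, &fd);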

   fi_getopt / fi_setopt
       Endpoint  protocol  operations  may  be  retrieved using fi_getopt or set using fi_setopt.
        Applications specify the level at which the desired option resides, identify the option,
        and provide input/output buffers to get or set the option.  fi_setopt provides an
        application a way to adjust low-level protocol and implementation specific details of an
        endpoint.

        The following option levels and option names and parameters are defined.  A usage sketch
        follows the list of options.

        FI_OPT_ENDPOINT

        • FI_OPT_MIN_MULTI_RECV - size_t : Defines the minimum receive buffer space available
          when the receive buffer is released by the provider (see FI_MULTI_RECV).  Modifying
          this value is only guaranteed to set the minimum buffer space needed on receives
          posted after the value has been changed.  It is recommended that applications that
          want to override the default MIN_MULTI_RECV value set this option before enabling the
          corresponding endpoint.

       • FI_OPT_CM_DATA_SIZE  -  size_t  : Defines the size of available space in CM messages for
         user-defined data.  This value limits the amount of data that applications can  exchange
         between  peer  endpoints using the fi_connect, fi_accept, and fi_reject operations.  The
         size returned is dependent upon the properties of the endpoint, except in  the  case  of
         passive  endpoints,  in which the size reflects the maximum size of the data that may be
         present as part of a connection request event.  This option is read only.
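
        The sketch below sets the minimum multi-receive buffer space before enabling an
        endpoint and queries the CM data size.  Error handling is abbreviated.

               size_t min_space = 1024;
               size_t cm_size, len;
               int ret;

               /* Request at least 1 KiB of remaining space before a
                * multi-recv buffer is released by the provider */
               ret = fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                               &min_space, sizeof(min_space));
               if (ret)
                   return ret;

               /* Query how much user data may accompany connection
                * management operations (read-only option) */
               len = sizeof(cm_size);
               ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CM_DATA_SIZE,
                               &cm_size, &len);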

   fi_rx_size_left (DEPRECATED)
       This function has been deprecated and will be removed in a future version of the  library.
       It may not be supported by all providers.

       The  fi_rx_size_left  call  returns a lower bound on the number of receive operations that
       may be  posted  to  the  given  endpoint  without  that  operation  returning  -FI_EAGAIN.
       Depending  on  the  specific  details of the subsequently posted receive operations (e.g.,
       number of iov entries, which receive function is called, etc.), it may be possible to post
       more receive operations than originally indicated by fi_rx_size_left.

   fi_tx_size_left (DEPRECATED)
       This  function has been deprecated and will be removed in a future version of the library.
       It may not be supported by all providers.

       The fi_tx_size_left call returns a lower bound on the number of transmit  operations  that
       may  be  posted  to  the  given  endpoint  without  that  operation  returning -FI_EAGAIN.
       Depending on the specific details of the subsequently posted  transmit  operations  (e.g.,
       number  of  iov  entries,  which transmit function is called, etc.), it may be possible to
       post more transmit operations than originally indicated by fi_tx_size_left.

ENDPOINT ATTRIBUTES

       The fi_ep_attr structure defines the  set  of  attributes  associated  with  an  endpoint.
       Endpoint  attributes  may  be  further  refined  using  the  transmit  and receive context
       attributes as shown below.

              struct fi_ep_attr {
                  enum fi_ep_type type;
                  uint32_t        protocol;
                  uint32_t        protocol_version;
                  size_t          max_msg_size;
                  size_t          msg_prefix_size;
                  size_t          max_order_raw_size;
                  size_t          max_order_war_size;
                  size_t          max_order_waw_size;
                  uint64_t        mem_tag_format;
                  size_t          tx_ctx_cnt;
                  size_t          rx_ctx_cnt;
                  size_t          auth_key_size;
                  uint8_t         *auth_key;
              };

   type - Endpoint Type
       If specified, indicates the type of fabric  interface  communication  desired.   Supported
       types are:

       FI_EP_UNSPEC  : The type of endpoint is not specified.  This is usually provided as input,
       with other attributes of the endpoint or the provider selecting the type.

       FI_EP_MSG : Provides a reliable,  connection-oriented  data  transfer  service  with  flow
       control that maintains message boundaries.

        FI_EP_DGRAM : Supports connectionless, unreliable datagram communication.  Message
       boundaries are maintained, but the maximum message size may be limited to the fabric  MTU.
       Flow control is not guaranteed.

       FI_EP_RDM  :  Reliable  datagram  message.  Provides a reliable, unconnected data transfer
       service with flow control that maintains message boundaries.

       FI_EP_SOCK_STREAM : Data streaming endpoint with TCP socket-like  semantics.   Provides  a
       reliable,  connection-oriented  data  transfer  service  that  does  not  maintain message
       boundaries.  FI_EP_SOCK_STREAM is most useful for applications designed around  using  TCP
       sockets.   See  the  SOCKET  ENDPOINT section for additional details and restrictions that
       apply to stream endpoints.

       FI_EP_SOCK_DGRAM : A connectionless, unreliable datagram  endpoint  with  UDP  socket-like
       semantics.   FI_EP_SOCK_DGRAM  is  most  useful for applications designed around using UDP
       sockets.  See the SOCKET ENDPOINT section for additional  details  and  restrictions  that
       apply to datagram socket endpoints.

   Protocol
       Specifies the low-level end to end protocol employed by the provider.  A matching protocol
       must be used  by  communicating  endpoints  to  ensure  interoperability.   The  following
       protocol  values  are  defined.   Provider  specific protocols are also allowed.  Provider
       specific protocols will be indicated by having the upper bit of the protocol value set  to
       one.

       FI_PROTO_UNSPEC  : The protocol is not specified.  This is usually provided as input, with
       other attributes of the socket or the provider selecting the actual protocol.

       FI_PROTO_RDMA_CM_IB_RC : The protocol runs over Infiniband reliable-connected queue pairs,
       using the RDMA CM protocol for connection establishment.

       FI_PROTO_IWARP : The protocol runs over the Internet wide area RDMA protocol transport.

       FI_PROTO_IB_UD : The protocol runs over Infiniband unreliable datagram queue pairs.

       FI_PROTO_PSMX  :  The  protocol  is  based  on an Intel proprietary protocol known as PSM,
       performance scaled messaging.  PSMX is an extended version of the PSM protocol to  support
       the libfabric interfaces.

       FI_PROTO_UDP  :  The  protocol sends and receives UDP datagrams.  For example, an endpoint
       using FI_PROTO_UDP will be able to communicate with a remote peer that is  using  Berkeley
       SOCK_DGRAM sockets using IPPROTO_UDP.

       FI_PROTO_SOCK_TCP : The protocol is layered over TCP packets.

   protocol_version - Protocol Version
       Identifies  which  version  of  the  protocol  is  employed by the provider.  The protocol
        version allows providers to extend an existing protocol in a backward compatible manner,
        for example by adding support for additional features or functionality.  Providers that
       support different versions of the same protocol should inter-operate, but only when  using
       the capabilities defined for the lesser version.

   max_msg_size - Max Message Size
       Defines the maximum size for an application data transfer as a single operation.

   msg_prefix_size - Message Prefix Size
       Specifies  the  size  of  any  required message prefix buffer space.  This field will be 0
        unless the FI_MSG_PREFIX mode is enabled.  If msg_prefix_size is > 0, the specified value
        will be a multiple of 8 bytes.

   Max RMA Ordered Size
       The maximum ordered size specifies the delivery order of transport data into target memory
       for RMA and atomic operations.  Data  ordering  is  separate,  but  dependent  on  message
       ordering  (defined  below).   Data  ordering  is  unspecified  where  message order is not
       defined.

       Data ordering refers to the access of target memory by subsequent operations.   When  back
       to  back  RMA  read  or  write operations access the same registered memory location, data
       ordering indicates whether the second operation reads or writes the  target  memory  after
       the  first  operation has completed.  Because RMA ordering applies between two operations,
       and not within a single data transfer, ordering is  defined  per  byte-addressable  memory
       location.  I.e.  ordering specifies whether location X is accessed by the second operation
       after the first operation.  Nothing is implied about the completion of the first operation
       before the second operation is initiated.

       In order to support large data transfers being broken into multiple packets and sent using
       multiple paths through the fabric, data ordering may be limited to transfers of a specific
       size  or  less.   Providers specify when data ordering is maintained through the following
       values.  Note that even if data ordering is not maintained, message ordering may be.

       max_order_raw_size : Read after write size.  If set,  an  RMA  or  atomic  read  operation
       issued  after  an  RMA or atomic write operation, both of which are smaller than the size,
       will be ordered.  Where the target memory  locations  overlap,  the  RMA  or  atomic  read
       operation will see the results of the previous RMA or atomic write.

       max_order_war_size  :  Write  after  read  size.  If set, an RMA or atomic write operation
       issued after an RMA or atomic read operation, both of which are  smaller  than  the  size,
       will  be  ordered.   The  RMA  or  atomic read operation will see the initial value of the
       target memory location before a subsequent RMA or atomic write updates the value.

       max_order_waw_size : Write after write size.  If set, an RMA  or  atomic  write  operation
       issued  after  an  RMA or atomic write operation, both of which are smaller than the size,
       will be ordered.  The target memory location will reflect the results of the second RMA or
       atomic write.

       An  order  size  value  of  0  indicates  that  ordering is not guaranteed.  A value of -1
       guarantees ordering for any data size.

   mem_tag_format - Memory Tag Format
       The memory tag format is a bit array used to convey the number of tagged bits supported by
       a  provider.   Additionally,  it may be used to divide the bit array into separate fields.
       The mem_tag_format optionally begins with a series of bits set to 0, to signify bits which
       are ignored by the provider.  Following the initial prefix of ignored bits, the array will
       consist of alternating groups of bits set to all 1's or  all  0's.   Each  group  of  bits
       corresponds  to a tagged field.  The implication of defining a tagged field is that when a
       mask is applied to the tagged bit array, all bits belonging to a single field will  either
       be set to 1 or 0, collectively.

       For  example,  a  mem_tag_format of 0x30FF indicates support for 14 tagged bits, separated
       into 3 fields.  The first field consists of 2-bits, the second field 4-bits, and the final
       field  8-bits.   Valid  masks for such a tagged field would be a bitwise OR'ing of zero or
       more of the following values: 0x3000, 0x0F00, and 0x00FF.

       By identifying fields within a tag, a provider  may  be  able  to  optimize  their  search
       routines.  An application which requests tag fields must provide tag masks that either set
       all mask bits corresponding to a field to all 0 or all 1.  When negotiating tag fields, an
       application  can  request  a  specific  number of fields of a given size.  A provider must
       return a tag format that supports the requested number of fields, with each field being at
       least  the  size  requested, or fail the request.  A provider may increase the size of the
       fields.  When reporting completions (see FI_CQ_FORMAT_TAGGED), the provider  must  provide
       the exact value of the received tag, clearing out any unsupported tag bits.

       It  is  recommended  that  field  sizes  be  ordered from smallest to largest.  A generic,
       unstructured tag and mask can  be  achieved  by  requesting  a  bit  array  consisting  of
       alternating 1's and 0's.

   tx_ctx_cnt - Transmit Context Count
       Number  of  transmit  contexts  to  associate  with the endpoint.  If not specified (0), 1
       context will be assigned if the endpoint supports outbound transfers.   Transmit  contexts
       are  independent transmit queues that may be separately configured.  Each transmit context
       may be bound to a separate CQ, and no ordering is defined between contexts.  Additionally,
       no synchronization is needed when accessing contexts in parallel.

       If the count is set to the value FI_SHARED_CONTEXT, the endpoint will be configured to use
       a shared transmit context, if supported by the provider.  Providers that  do  not  support
       shared transmit contexts will fail the request.

       See the scalable endpoint and shared contexts sections for additional details.
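
        For instance, an application requesting a shared transmit context could set the
        corresponding hint before calling fi_getinfo, as sketched below.

               struct fi_info *hints = fi_allocinfo();

               /* Ask for a shared transmit context; providers that do not
                * support shared contexts will fail the request */
               hints->ep_attr->tx_ctx_cnt = FI_SHARED_CONTEXT;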

   rx_ctx_cnt - Receive Context Count
       Number  of  receive  contexts to associate with the endpoint.  If not specified, 1 context
       will be assigned if  the  endpoint  supports  inbound  transfers.   Receive  contexts  are
       independent processing queues that may be separately configured.  Each receive context may
       be bound to a separate CQ, and no ordering is defined between contexts.  Additionally,  no
       synchronization is needed when accessing contexts in parallel.

       If the count is set to the value FI_SHARED_CONTEXT, the endpoint will be configured to use
       a shared receive context, if supported by the provider.  Providers  that  do  not  support
       shared receive contexts will fail the request.

       See the scalable endpoint and shared contexts sections for additional details.

   auth_key_size - Authorization Key Length
       The  length of the authorization key in bytes.  This field will be 0 if authorization keys
       are not available or used.  This field is ignored unless the fabric  is  opened  with  API
       version 1.5 or greater.

   auth_key - Authorization Key
       If  supported  by the fabric, an authorization key (a.k.a.  job key) to associate with the
       endpoint.  An authorization key is used to limit communication  between  endpoints.   Only
       peer  endpoints  that  are  programmed  to use the same authorization key may communicate.
       Authorization keys are often used to implement job keys, to ensure that processes  running
       in different jobs do not accidentally cross traffic.  The domain authorization key will be
       used if auth_key_size is set to 0.  This field is ignored unless the fabric is opened with
       API version 1.5 or greater.

TRANSMIT CONTEXT ATTRIBUTES

       Attributes specific to the transmit capabilities of an endpoint are specified using struct
       fi_tx_attr.

              struct fi_tx_attr {
                  uint64_t  caps;
                  uint64_t  mode;
                  uint64_t  op_flags;
                  uint64_t  msg_order;
                  uint64_t  comp_order;
                  size_t    inject_size;
                  size_t    size;
                  size_t    iov_limit;
                  size_t    rma_iov_limit;
              };

   caps - Capabilities
       The requested capabilities of the context.  The capabilities must be  a  subset  of  those
       requested  of  the associated endpoint.  See the CAPABILITIES section of fi_getinfo(3) for
       capability details.  If the caps field is 0 on input to fi_getinfo(3), the caps value from
       the fi_info structure will be used.

   mode
       The  operational  mode  bits  of  the  context.   The  mode bits will be a subset of those
       associated with the endpoint.  See the MODE section of fi_getinfo(3) for details.  A  mode
       value  of  0 will be ignored on input to fi_getinfo(3), with the mode value of the fi_info
       structure used instead.  On return from fi_getinfo(3), the mode will be set only to  those
       constraints specific to transmit operations.

   op_flags - Default transmit operation flags
        Flags that control the default behavior of operations submitted against the context.
        Applicable flags are listed in the Operation Flags section.

   msg_order - Message Ordering
       Message ordering refers to the order in which transport layer headers (as  viewed  by  the
       application)  are  processed.  Relaxed message order enables data transfers to be sent and
       received out of order, which may improve performance by utilizing multiple  paths  through
       the  fabric from the initiating endpoint to a target endpoint.  Message order applies only
       between a single source and destination endpoint pair.  Ordering between different  target
       endpoints is not defined.

       Message  order  is  determined  using a set of ordering bits.  Each set bit indicates that
       ordering is maintained between data transfers of the specified  type.   Message  order  is
       defined  for  [read  |  write | send] operations submitted by an application after [read |
       write | send] operations.

       Message ordering only applies to  the  end  to  end  transmission  of  transport  headers.
        Message ordering is necessary for, but does not guarantee, the order in which message
        data is sent or received by the transport layer.  Message ordering requires matching
        ordering semantics on the receiving side of a data transfer operation in order to
        guarantee that ordering is met.

       FI_ORDER_NONE : No ordering is specified.  This value may be used as  input  in  order  to
       obtain the default message order supported by the provider.  FI_ORDER_NONE is an alias for
       the value 0.

       FI_ORDER_RAR : Read after read.  If set, RMA and atomic read operations are transmitted in
       the order submitted relative to other RMA and atomic read operations.  If not set, RMA and
       atomic reads may be transmitted out of order from their submission.

       FI_ORDER_RAW : Read after write.  If set, RMA and atomic read operations  are  transmitted
       in  the  order submitted relative to RMA and atomic write operations.  If not set, RMA and
       atomic reads may be transmitted ahead of RMA and atomic writes.

       FI_ORDER_RAS : Read after send.  If set, RMA and atomic read operations are transmitted in
       the  order  submitted relative to message send operations, including tagged sends.  If not
       set, RMA and atomic reads may be transmitted ahead of sends.

       FI_ORDER_WAR : Write after read.  If set, RMA and atomic write operations are  transmitted
       in  the  order  submitted relative to RMA and atomic read operations.  If not set, RMA and
       atomic writes may be transmitted ahead of RMA and atomic reads.

       FI_ORDER_WAW : Write after write.  If set, RMA and atomic write operations are transmitted
       in the order submitted relative to other RMA and atomic write operations.  If not set, RMA
       and atomic writes may be transmitted out of order from their submission.

       FI_ORDER_WAS : Write after send.  If set, RMA and atomic write operations are  transmitted
       in  the  order  submitted relative to message send operations, including tagged sends.  If
       not set, RMA and atomic writes may be transmitted ahead of sends.

       FI_ORDER_SAR : Send after read.  If set, message send operations, including tagged  sends,
       are  transmitted  in  order  submitted relative to RMA and atomic read operations.  If not
       set, message sends may be transmitted ahead of RMA and atomic reads.

       FI_ORDER_SAW : Send after write.  If set, message send operations, including tagged sends,
       are  transmitted  in  order submitted relative to RMA and atomic write operations.  If not
       set, message sends may be transmitted ahead of RMA and atomic writes.

       FI_ORDER_SAS : Send after send.  If set, message send operations, including tagged  sends,
        are transmitted in the order submitted relative to other message sends.  If not set,
       message sends may be transmitted out of order from their submission.

   comp_order - Completion Ordering
       Completion ordering refers to the order in which completed requests are written  into  the
       completion  queue.   Completion  ordering is similar to message order.  Relaxed completion
       order may enable faster reporting of completed transfers, allow acknowledgments to be sent
       over  different  fabric  paths, and support more sophisticated retry mechanisms.  This can
       result in  lower-latency  completions,  particularly  when  using  unconnected  endpoints.
       Strict  completion ordering may require that providers queue completed operations or limit
       available optimizations.

       For transmit requests, completion ordering depends on  the  endpoint  communication  type.
       For  unreliable  communication,  completion ordering applies to all data transfer requests
       submitted to an endpoint.  For reliable communication, completion ordering only applies to
       requests  that target a single destination endpoint.  Completion ordering of requests that
       target different endpoints over a reliable transport is not defined.

       Applications should  specify  the  completion  ordering  that  they  support  or  require.
       Providers  should  return  the  completion  order  that  they  actually  provide, with the
       constraint that the returned ordering is stricter than that specified by the  application.
       Supported completion order values are:

       FI_ORDER_NONE  :  No  ordering is defined for completed operations.  Requests submitted to
       the transmit context may complete in any order.

       FI_ORDER_STRICT : Requests complete in the order  in  which  they  are  submitted  to  the
       transmit context.

   inject_size
       The  requested  inject  operation  size  (see  the  FI_INJECT  flag) that the context will
       support.  This is the maximum size data transfer that can be  associated  with  an  inject
       operation (such as fi_inject) or may be used with the FI_INJECT data transfer flag.

   size
       The  size  of  the  context.   The  size  is  specified  as the minimum number of transmit
       operations that may be posted to the endpoint without the operation returning -FI_EAGAIN.

   iov_limit
       This is the maximum number of IO vectors (scatter-gather elements) that  a  single  posted
       operation may reference.

   rma_iov_limit
       This  is  the  maximum  number  of RMA IO vectors (scatter-gather elements) that an RMA or
       atomic operation may reference.  The rma_iov_limit corresponds to the rma_iov_count values
       in  RMA and atomic operations.  See struct fi_msg_rma and struct fi_msg_atomic in fi_rma.3
       and fi_atomic.3, for additional details.  This limit applies to both the number of RMA  IO
       vectors  that  may  be  specified when initiating an operation from the local endpoint, as
       well as the maximum number of IO vectors that may be carried in a single  request  from  a
       remote endpoint.

RECEIVE CONTEXT ATTRIBUTES

       Attributes  specific to the receive capabilities of an endpoint are specified using struct
       fi_rx_attr.

              struct fi_rx_attr {
                  uint64_t  caps;
                  uint64_t  mode;
                  uint64_t  op_flags;
                  uint64_t  msg_order;
                  uint64_t  comp_order;
                  size_t    total_buffered_recv;
                  size_t    size;
                  size_t    iov_limit;
              };

   caps - Capabilities
       The requested capabilities of the context.  The capabilities must be  a  subset  of  those
        requested of the associated endpoint.  See the CAPABILITIES section of fi_getinfo(3) for
       capability details.  If the caps field is 0 on input to fi_getinfo(3), the caps value from
       the fi_info structure will be used.

   mode
       The  operational  mode  bits  of  the  context.   The  mode bits will be a subset of those
       associated with the endpoint.  See the MODE section of fi_getinfo(3) for details.  A  mode
       value  of  0 will be ignored on input to fi_getinfo(3), with the mode value of the fi_info
       structure used instead.  On return from fi_getinfo(3), the mode will be set only to  those
       constraints specific to receive operations.

   op_flags - Default receive operation flags
        Flags that control the default behavior of operations submitted against the context.
        Applicable flags are listed in the Operation Flags section.

   msg_order - Message Ordering
       For a description of message ordering, see the msg_order field  in  the  Transmit  Context
       Attribute  section.   Receive context message ordering defines the order in which received
       transport message headers are processed when received by an endpoint.

       The following ordering flags,  as  defined  for  transmit  ordering,  also  apply  to  the
       processing    of   received   operations:   FI_ORDER_NONE,   FI_ORDER_RAR,   FI_ORDER_RAW,
       FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS,  FI_ORDER_SAR,  FI_ORDER_SAW,  and
       FI_ORDER_SAS.

   comp_order - Completion Ordering
       For a description of completion ordering, see the comp_order field in the Transmit Context
       Attribute section.

       FI_ORDER_NONE : No ordering is defined for completed operations.  Receive  operations  may
       complete in any order, regardless of their submission order.

       FI_ORDER_STRICT  : Receive operations complete in the order in which they are processed by
       the receive context, based on the receive side msg_order attribute.

       FI_ORDER_DATA : When set, this bit indicates that received data is written into memory  in
       order.  Data ordering applies to memory accessed as part of a single operation and between
       operations if message ordering is guaranteed.

   total_buffered_recv
       This field is supported for backwards  compatibility  purposes.   It  is  a  hint  to  the
       provider  of  the  total  available  space  that may be needed to buffer messages that are
       received for which there is no matching receive operation.  The  provider  may  adjust  or
        ignore this value.  The allocation of internal network buffering among received messages
        is provider specific.  For instance, a provider may limit the size of messages which can
        be buffered or the amount of buffering allocated to a single message.

       If  receive side buffering is disabled (total_buffered_recv = 0) and a message is received
        by an endpoint, then the behavior is dependent on whether resource management has been
        enabled (whether FI_RM_ENABLED has been set or not).  See the Resource Management
        section of fi_domain.3 for further clarification.  It is recommended that applications
        enable resource management if they anticipate receiving unexpected messages, rather
        than modifying this value.

   size
       The size of the context.   The  size  is  specified  as  the  minimum  number  of  receive
       operations that may be posted to the endpoint without the operation returning -FI_EAGAIN.

   iov_limit
       This  is  the  maximum number of IO vectors (scatter-gather elements) that a single posted
        operation may reference.

SCALABLE ENDPOINTS

       A scalable endpoint is a communication portal that supports multiple transmit and  receive
       contexts.   Scalable  endpoints  are  loosely  modeled  after  the  networking  concept of
       transmit/receive side scaling, also known as multi-queue.  Support for scalable  endpoints
       is  domain specific.  Scalable endpoints may improve the performance of multi-threaded and
       parallel applications, by allowing threads to  access  independent  transmit  and  receive
       queues.   A  scalable  endpoint has a single transport level address, which can reduce the
       memory requirements  needed  to  store  remote  addressing  data,  versus  using  standard
       endpoints.   Scalable  endpoints cannot be used directly for communication operations, and
       require the application to explicitly create transmit and receive  contexts  as  described
       below.

   fi_tx_context
       Transmit  contexts  are independent transmit queues.  Ordering and synchronization between
        contexts are not defined.  Conceptually, a transmit context behaves similarly to a
        send-only endpoint.  A transmit context may be configured with fewer capabilities than
        the base
       endpoint and with different attributes (such as ordering  requirements  and  inject  size)
       than other contexts associated with the same scalable endpoint.  Each transmit context has
       its own completion queue.  The number of transmit contexts associated with an endpoint  is
       specified during endpoint creation.

       The fi_tx_context call is used to retrieve a specific context, identified by an index (see
       above for details on transmit context attributes).   Providers  may  dynamically  allocate
       contexts  when  fi_tx_context  is  called,  or  may  statically  create  all contexts when
       fi_endpoint is invoked.  By default, a transmit context inherits  the  properties  of  its
       associated  endpoint.   However,  applications  may  request  context  specific attributes
       through the attr parameter.  Support for  per  transmit  context  attributes  is  provider
       specific  and not guaranteed.  Providers will return the actual attributes assigned to the
       context through the attr parameter, if provided.

   fi_rx_context
       Receive contexts are independent receive queues for receiving incoming data.  Ordering and
        synchronization between contexts are not guaranteed.  Conceptually, a receive context
        behaves similarly to a receive-only endpoint.  A receive context may be configured with
       fewer  capabilities than the base endpoint and with different attributes (such as ordering
       requirements and inject size) than  other  contexts  associated  with  the  same  scalable
       endpoint.   Each  receive  context  has  its  own completion queue.  The number of receive
       contexts associated with an endpoint is specified during endpoint creation.

        Receive contexts are often associated with steering flows that determine which incoming
        packets targeting a scalable endpoint each context will process.  However, receive
        contexts may be targeted directly by the initiator, if supported by the underlying
        protocol.  Such contexts are referred to as 'named'.  Support for named contexts must be
        indicated by setting the FI_NAMED_RX_CTX capability in caps when the corresponding
        endpoint is created.  Support for named receive contexts is coordinated with address
        vectors.  See fi_av(3) and
       fi_rx_addr(3).

       The fi_rx_context call is used to retrieve a specific context, identified by an index (see
       above  for  details  on  receive  context attributes).  Providers may dynamically allocate
       contexts when fi_rx_context  is  called,  or  may  statically  create  all  contexts  when
       fi_endpoint  is  invoked.   By  default,  a receive context inherits the properties of its
       associated endpoint.   However,  applications  may  request  context  specific  attributes
       through  the  attr  parameter.   Support  for  per  receive context attributes is provider
       specific and not guaranteed.  Providers will return the actual attributes assigned to  the
       context through the attr parameter, if provided.
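
        The sketch below creates a scalable endpoint with two transmit and two receive
        contexts, binds an address vector to it, and retrieves one context of each type.  Each
        context would then be bound to its own CQ and enabled before use; error handling is
        abbreviated.

               struct fid_ep *sep, *tx0, *rx0;
               int ret;

               /* Request the context counts through the endpoint attributes */
               info->ep_attr->tx_ctx_cnt = 2;
               info->ep_attr->rx_ctx_cnt = 2;

               ret = fi_scalable_ep(domain, info, &sep, NULL);
               if (ret)
                   return ret;

               /* An AV bound to the scalable endpoint is implicitly bound to
                * every context created from it */
               ret = fi_scalable_ep_bind(sep, &av->fid, 0);
               if (ret)
                   return ret;

               /* Retrieve context 0 of each type; passing NULL for attr
                * inherits the endpoint's attributes */
               ret = fi_tx_context(sep, 0, NULL, &tx0, NULL);
               if (ret)
                   return ret;
               ret = fi_rx_context(sep, 0, NULL, &rx0, NULL);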

SHARED CONTEXTS

       Shared  contexts  are  transmit  and  receive contexts explicitly shared among one or more
       endpoints.  A shareable context allows an application to use a single  dedicated  provider
       resource  among  multiple  transport  addressable  endpoints.  This can greatly reduce the
       resources needed to manage communication over multiple endpoints by multiplexing  transmit
       and/or  receive  processing, with the potential cost of serializing access across multiple
       endpoints.  Support for shareable contexts is domain specific.

       Conceptually, shareable transmit contexts are transmit queues that may be accessed by many
       endpoints.   The  use  of  a  shared  transmit context is mostly opaque to an application.
       Applications must allocate and bind shared transmit contexts to endpoints, but  operations
       are  posted  directly  to  the endpoint.  Shared transmit contexts are not associated with
       completion queues or counters.  Completed operations are posted to the CQs  bound  to  the
       endpoint.  An endpoint may only be associated with a single shared transmit context.

       Unlike  shared  transmit  contexts,  applications  interact  directly  with shared receive
       contexts.  Users post receive buffers directly to  a  shared  receive  context,  with  the
       buffers  usable  by  any  endpoint  bound  to  the shared receive context.  Shared receive
       contexts are not  associated  with  completion  queues  or  counters.   Completed  receive
       operations  are  posted  to  the  CQs  bound  to  the  endpoint.   An endpoint may only be
       associated with a single receive context, and all connectionless endpoints associated with
       a shared receive context must also share the same address vector.

       Endpoints  associated  with  a shared transmit context may use dedicated receive contexts,
       and vice-versa.  Or an endpoint may use shared transmit and receive contexts.   And  there
       is  no  requirement  that  the  same group of endpoints sharing a context of one type also
       share the context of an alternate type.  Furthermore, an endpoint may use a shared context
       of one type, but a scalable set of contexts of the alternate type.

   fi_stx_context
       This  call  is  used  to  open  a shareable transmit context (see above for details on the
       transmit context attributes).  Endpoints associated with a shared  transmit  context  must
       use  a  subset of the transmit context's attributes.  Note that this is the reverse of the
       requirement for transmit contexts for scalable endpoints.

   fi_srx_context
       This allocates a shareable receive context (see above for details on the  receive  context
       attributes).   Endpoints associated with a shared receive context must use a subset of the
       receive context's attributes.  Note that this  is  the  reverse  of  the  requirement  for
       receive contexts for scalable endpoints.
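
       A minimal sketch of allocating shared contexts and binding them to an active endpoint
       follows; 'domain' and 'ep' are assumed to have been opened already, NULL attributes are
       assumed to request provider defaults, and error handling is omitted.

              struct fid_stx *stx;
              struct fid_ep *srx;

              /* Allocate the shared contexts from the access domain. */
              fi_stx_context(domain, NULL, &stx, NULL);
              fi_srx_context(domain, NULL, &srx, NULL);

              /* Bind both to the endpoint.  Transmit operations are still
               * posted to 'ep'; receive buffers are posted to 'srx'. */
              fi_ep_bind(ep, &stx->fid, 0);
              fi_ep_bind(ep, &srx->fid, 0);

              fi_enable(ep);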

SOCKET ENDPOINTS

       The  following  feature  and  description  should  be  considered experimental.  Until the
       experimental tag is removed, the interfaces, semantics,  and  data  structures  associated
       with socket endpoints may change between library versions.

       This section applies to endpoints of type FI_EP_SOCK_STREAM and FI_EP_SOCK_DGRAM, commonly
       referred to as socket endpoints.

       Socket endpoints are defined with semantics that allow them to more easily be  adopted  by
       developers  familiar  with  the  UNIX socket API, or by middleware that exposes the socket
       API, while still taking advantage of high-performance hardware features.

       The key difference between socket endpoints and other active endpoints is that socket
       endpoints use synchronous data transfers.  Buffers passed into send and receive operations
       revert to the control of the application upon returning from  the  function  call.   As  a
       result, no data transfer completions are reported to the application, and socket endpoints
       are not associated with completion queues or counters.

       Socket endpoints support a subset of message operations:  fi_send,  fi_sendv,  fi_sendmsg,
       fi_recv, fi_recvv, fi_recvmsg, and fi_inject.  Because data transfers are synchronous, the
       return value from send and receive operations indicates the number of bytes transferred on
       success, or a negative value on error, including -FI_EAGAIN if the endpoint cannot send or
       receive any data because of full or empty queues, respectively.
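
       As a sketch of these semantics, the loop below sends 'len' bytes from 'buf' over a
       connected FI_EP_SOCK_STREAM endpoint, retrying while the transmit queue is full.  The
       destination address and context arguments are assumed to be ignored for connected socket
       endpoints, and partial transfers are assumed possible.

              size_t sent = 0;
              ssize_t rc;

              while (sent < len) {
                  /* Returns bytes transferred or a negative fabric errno;
                   * -FI_EAGAIN indicates a full transmit queue. */
                  rc = fi_send(ep, (char *) buf + sent, len - sent,
                               NULL, 0, NULL);
                  if (rc == -FI_EAGAIN)
                      continue;   /* or wait on the endpoint's fd */
                  if (rc < 0)
                      break;      /* handle error */
                  sent += rc;
              }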

       Socket endpoints are associated  with  event  queues  and  address  vectors,  and  process
       connection  management  events  asynchronously,  similar  to other endpoints.  Unlike UNIX
       sockets, socket endpoints must still be declared as either active or passive.

       Socket endpoints behave like non-blocking sockets.  In order to support  select  and  poll
       semantics,  active socket endpoints are associated with a file descriptor that is signaled
       whenever the endpoint is ready to send and/or receive data.  The file  descriptor  may  be
       retrieved using fi_control.
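
       A hedged sketch of waiting on that descriptor follows; it assumes the provider returns
       the file descriptor through the FI_GETWAIT control command, and it omits error handling.

              #include <poll.h>

              int fd;
              struct pollfd pfd;

              /* Assumption: FI_GETWAIT yields the endpoint's descriptor. */
              fi_control(&ep->fid, FI_GETWAIT, &fd);

              pfd.fd = fd;
              pfd.events = POLLIN | POLLOUT;

              /* Block until the endpoint is ready to send or receive. */
              poll(&pfd, 1, -1);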

OPERATION FLAGS

       Operation  flags  are  obtained  by  OR-ing the following flags together.  Operation flags
       define the default flags applied to an endpoint's data transfer operations, where a  flags
       parameter  is  not  available.  Data transfer operations that take flags as input override
       the op_flags value of transmit or receive context attributes of an endpoint.
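
       For example, default flags may be requested through the transmit attributes when the
       endpoint is created.  The sketch below assumes 'info' was returned by fi_getinfo and uses
       FI_INJECT, described below, purely for illustration.

              /* Make FI_INJECT the default for transmit operations that do
               * not accept a flags argument. */
              info->tx_attr->op_flags = FI_INJECT;

              fi_endpoint(domain, info, &ep, NULL);

              /* A call that does take flags, e.g. fi_sendmsg(ep, &msg, 0),
               * overrides this default for that single transfer. */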

       FI_INJECT : Indicates that all outbound data buffers should  be  returned  to  the  user's
       control  immediately  after a data transfer call returns, even if the operation is handled
       asynchronously.  This may require that the provider copy the data into a local buffer  and
       transfer  out of that buffer.  A provider can limit the total amount of send data that may
       be buffered and/or the size of a single send that  can  use  this  flag.   This  limit  is
       indicated using inject_size (see inject_size above).

       FI_MULTI_RECV : Applies to posted receive operations.  This flag allows the user to post a
       single buffer that will receive multiple incoming messages.   Received  messages  will  be
       packed  into  the receive buffer until the buffer has been consumed.  Use of this flag may
       cause a single posted receive operation to generate multiple completions as  messages  are
       placed  into  the buffer.  The placement of received data into the buffer may be subjected
       to provider specific alignment restrictions.  The buffer will be released by the  provider
       when   the   available   buffer   space   falls   below   the   specified   minimum   (see
       FI_OPT_MIN_MULTI_RECV).
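
       A minimal sketch of posting a multi-receive buffer is shown below; 'ep', 'buf', and 'len'
       are assumed to exist, memory registration descriptors are omitted, and error handling is
       skipped.

              struct iovec iov = { .iov_base = buf, .iov_len = len };
              struct fi_msg msg = {
                  .msg_iov = &iov, .iov_count = 1, .addr = FI_ADDR_UNSPEC,
              };
              size_t min_free = 1024;

              /* Release the buffer once less than 1 KB of space remains. */
              fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                        &min_free, sizeof min_free);

              /* One posted buffer absorbs multiple incoming messages. */
              fi_recvmsg(ep, &msg, FI_MULTI_RECV);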

       FI_COMPLETION : Indicates that a completion entry should be generated  for  data  transfer
       operations.  This flag only applies to operations issued on endpoints that were bound to a
       CQ or counter with the FI_SELECTIVE_COMPLETION flag.  See the fi_ep_bind section above for
       more detail.

       FI_INJECT_COMPLETE  :  Indicates  that  a  completion  should be generated when the source
       buffer(s) may be reused.  A completion guarantees that the buffers will not be  read  from
       again  and the application may reclaim them.  No other guarantees are made with respect to
       the state of the operation.

       Note: This flag is used to control when a completion entry is inserted into  a  completion
       queue.   It  does  not  apply to operations that do not generate a completion queue entry,
       such as the fi_inject operation, and is not  subject  to  the  inject_size  message  limit
       restriction.

       FI_TRANSMIT_COMPLETE  :  Indicates that a completion should be generated when the transmit
       operation has completed relative to the local provider.  The exact behavior  is  dependent
       on the endpoint type.

       For reliable endpoints:

       Indicates  that  a completion should be generated when the operation has been delivered to
       the peer endpoint.  A completion guarantees that the operation is no longer  dependent  on
       the  fabric  or  local  resources.  The state of the operation at the peer endpoint is not
       defined.

       For unreliable endpoints:

       Indicates that a completion should be generated when the operation has been  delivered  to
       the  fabric.   A  completion guarantees that the operation is no longer dependent on local
       resources.  The state of the operation within the fabric is not defined.

       FI_DELIVERY_COMPLETE : Indicates that a  completion  should  not  be  generated  until  an
       operation has been processed by the destination endpoint(s).  A completion guarantees that
       the result of the operation is available.

       This completion mode applies only to reliable endpoints.  For operations that return  data
       to the initiator, such as RMA read or atomic-fetch, the source endpoint is also considered
       a destination endpoint.  This is the default completion mode for such operations.

       FI_MULTICAST : Indicates that data transfers will target multicast addresses  by  default.
       Any  fi_addr_t  passed  into  a  data  transfer  operation  will be treated as a multicast
       address.

NOTES

       Users should call fi_close to release all resources allocated to the fabric endpoint.

       Endpoints allocated with the FI_CONTEXT mode set must typically provide struct  fi_context
       as their per-operation context parameter.  (See fi_getinfo(3) for details.)  However, when
       FI_SELECTIVE_COMPLETION is enabled to suppress completion entries,  and  an  operation  is
       initiated without the FI_COMPLETION flag set, then the context parameter is ignored.  An
       application does not need to pass in a valid struct fi_context into such data transfers.
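
       For example, when FI_CONTEXT is in effect and a completion is desired, each outstanding
       transfer might supply its own context structure, as sketched below ('ep', 'buf', 'len',
       and 'dest_addr' are assumed to exist).

              struct fi_context ctx;

              /* 'ctx' must remain valid until the matching completion is
               * retrieved; the completion's op_context points back to it. */
              fi_send(ep, buf, len, NULL, dest_addr, &ctx);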

       Operations that complete in error and that are not associated with a valid operational
       context will use the endpoint context in any error reporting structures.

       Although  applications  typically  associate individual completions with either completion
       queues or counters, an endpoint can be attached to both a counter  and  completion  queue.
       When combined with using selective completions, this allows an application to use counters
       to track successful completions, with  a  CQ  used  to  report  errors.   Operations  that
       complete  with  an error increment the error counter and generate a completion event.  The
       generation   of   entries   going   to   the   CQ   can   then   be    controlled    using
       FI_SELECTIVE_COMPLETION.
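
       A sketch of this arrangement follows; 'cntr' and 'cq' are assumed to have been opened with
       fi_cntr_open and fi_cq_open, and the bindings occur before fi_enable.

              /* Successful sends increment the counter. */
              fi_ep_bind(ep, &cntr->fid, FI_SEND);

              /* With selective completion, transfers posted without
               * FI_COMPLETION generate no CQ entry on success, but failed
               * transfers are still reported through the CQ's error queue. */
              fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_SELECTIVE_COMPLETION);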

       As  mentioned  in fi_getinfo(3), the ep_attr structure can be used to query providers that
       support various endpoint attributes.  fi_getinfo can return provider info structures  that
       can  support  the  minimal  set  of  requirements  (such  that  the  application maintains
       correctness).   However,  it  can  also  return  provider  info  structures  that   exceed
       application  requirements.  As an example, consider an application requesting msg_order as
       FI_ORDER_NONE.  The resulting output from fi_getinfo may have all the ordering bits set.  The
       application  can reset the ordering bits it does not require before creating the endpoint.
       The provider is free to implement a stricter ordering than is required by the application.
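
       Continuing that example, the returned info may be adjusted before creating the endpoint;
       the sketch below assumes 'info' came from fi_getinfo.

              /* No ordering was required, so clear whatever the provider
               * reported; it may still order more strictly internally. */
              info->tx_attr->msg_order = FI_ORDER_NONE;
              info->rx_attr->msg_order = FI_ORDER_NONE;

              fi_endpoint(domain, info, &ep, NULL);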

RETURN VALUES

       Returns 0 on success.  On error,  a  negative  value  corresponding  to  fabric  errno  is
       returned.   For  fi_cancel,  a  return  value  of  0 indicates that the cancel request was
       submitted for processing.

       Fabric errno values are defined in rdma/fi_errno.h.

ERRORS

       -FI_EDOMAIN : A resource domain was not bound to the endpoint or an attempt  was  made  to
       bind multiple domains.

       -FI_ENOCQ : The endpoint has not been configured with the necessary completion queue.

       -FI_EOPBADSTATE : The endpoint's state does not permit the requested operation.

SEE ALSO

       fi_getinfo(3), fi_domain(3), fi_msg(3), fi_tagged(3), fi_rma(3)

AUTHORS

       OpenFabrics.