Ubuntu Manpage: fi_psm2 - The PSM2 Fabric Provider

Provided by: libfabric-dev_1.17.0-3build2_amd64

NAME

       fi_psm2 - The PSM2 Fabric Provider

OVERVIEW

       The psm2 provider runs over the PSM 2.x interface that is supported by the Intel Omni-Path
       Fabric.  PSM 2.x has all the PSM 1.x features plus a set of new  functions  with  enhanced
       capabilities.   Since  PSM  1.x  and PSM 2.x are not ABI compatible the psm2 provider only
       works with PSM 2.x and doesn’t support Intel TrueScale Fabric.

LIMITATIONS

The psm2 provider doesn’t support all the features defined in the libfabric API. Here are
some of the limitations:

Endpoint types
Only support non-connection based types FI_DGRAM and FI_RDM

Endpoint capabilities
Endpoints can support any combination of data transfer capabilities FI_TAGGED,
FI_MSG, FI_ATOMICS, and FI_RMA. These capabilities can be further refined by
FI_SEND, FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit
the direction of operations.

FI_MULTI_RECV is supported for non-tagged message queue only.

Scalable endpoints are supported if the underlying PSM2 library supports multiple
endpoints. This condition must be satisfied both when the provider is built and when the
provider is used. See the Scalable endpoints section for more information.

Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA, FI_RMA_EVENT,
FI_SOURCE, and FI_SOURCE_ERR. Furthermore, FI_NAMED_RX_CTX is supported when scalable
endpoints are enabled.

Modes FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. That means, any
request belonging to these two categories that generates a completion must pass as
the operation context a valid pointer to type struct fi_context, and the space
referenced by the pointer must remain untouched until the request has completed.
If none of FI_TAGGED and FI_MSG is asked for, the FI_CONTEXT mode is not required.

Progress
The psm2 provider requires manual progress. The application is expected to call
fi_cq_read or fi_cntr_read function from time to time when no other libfabric
function is called to ensure progress is made in a timely manner. The provider
does support auto progress mode. However, the performance can be significantly
impacted if the application purely depends on the provider to make auto progress.

Scalable endpoints
Scalable endpoints support depends on the multi-EP feature of the PSM2 library. If
the PSM2 library supports this feature, the availability is further controlled by
an environment variable PSM2_MULTI_EP. The psm2 provider automatically sets this
variable to 1 if it is not set. The feature can be disabled explicitly by setting
PSM2_MULTI_EP to 0.

When creating a scalable endpoint, the exact number of contexts requested should be set in
the “fi_info” structure passed to the fi_scalable_ep function. This number should be set
in “fi_info->ep_attr->tx_ctx_cnt” or “fi_info->ep_attr->rx_ctx_cnt” or both, whichever
greater is used. The psm2 provider allocates all requested contexts upfront when the
scalable endpoint is created. The same context is used for both Tx and Rx.

For optimal performance, it is advised to avoid having multiple threads accessing the same
context, either directly by posting send/recv/read/write request, or indirectly by polling
associated completion queues or counters.

Using the scalable endpoint as a whole in communication functions is not supported.
Instead, individual tx context or rx context of the scalable endpoint should be used.
Similarly, using the address of the scalable endpoint as the source address or destination
address doesn’t collectively address all the tx/rx contexts. It addresses only the first
tx/rx context, instead.

Shared Tx contexts
In order to achieve the purpose of saving PSM context by using shared Tx context,
the endpoints bound to the shared Tx contexts need to be Tx only. The reason is
that Rx capability always requires a PSM context, which can also be automatically
used for Tx. As the result, allocating a shared Tx context for Rx capable
endpoints actually consumes one extra context instead of saving some.

Unsupported features
These features are unsupported: connection management, passive endpoint, and shared
receive context.

RUNTIME PARAMETERS

The psm2 provider checks for the following environment variables:

FI_PSM2_UUID
PSM requires that each job has a unique ID (UUID). All the processes in the same
job need to use the same UUID in order to be able to talk to each other. The PSM
reference manual advises to keep UUID unique to each job. In practice, it
generally works fine to reuse UUID as long as (1) no two jobs with the same UUID
are running at the same time; and (2) previous jobs with the same UUID have exited
normally. If running into “resource busy” or “connection failure” issues with
unknown reason, it is advisable to manually set the UUID to a value different from
the default.

The default UUID is 00FF00FF-0000-0000-0000-00FF0F0F00FF.

It is possible to create endpoints with UUID different from the one set here. To achieve
that, set `info->ep_attr->auth_key' to the uuid value and `info->ep_attr->auth_key_size'
to its size (16 bytes) when calling fi_endpoint() or fi_scalable_ep(). It is still true
that an endpoint can only communicate with endpoints with the same UUID.

FI_PSM2_NAME_SERVER
The psm2 provider has a simple built-in name server that can be used to resolve an
IP address or host name into a transport address needed by the fi_av_insert call.
The main purpose of this name server is to allow simple client-server type
applications (such as those in fabtests) to be written purely with libfabric,
without using any out-of-band communication mechanism. For such applications, the
server would run first to allow endpoints be created and registered with the name
server, and then the client would call fi_getinfo with the node parameter set to
the IP address or host name of the server. The resulting fi_info structure would
have the transport address of the endpoint created by the server in the dest_addr
field. Optionally the service parameter can be used in addition to node. Notice
that the service number is interpreted by the provider and is not a TCP/IP port
number.

The name server is on by default. It can be turned off by setting the variable to 0.
This may save a small amount of resource since a separate thread is created when the name
server is on.

The provider detects OpenMPI and MPICH runs and changes the default setting to off.

FI_PSM2_TAGGED_RMA
The RMA functions are implemented on top of the PSM Active Message functions. The
Active Message functions have limit on the size of data can be transferred in a
single message. Large transfers can be divided into small chunks and be pipe-
lined. However, the bandwidth is sub-optimal by doing this way.

The psm2 provider use PSM tag-matching message queue functions to achieve higher bandwidth
for large size RMA. It takes advantage of the extra tag bits available in PSM2 to
separate the RMA traffic from the regular tagged message queue.

The option is on by default. To turn it off set the variable to 0.

FI_PSM2_DELAY
Time (seconds) to sleep before closing PSM endpoints. This is a workaround for a
bug in some versions of PSM library.

The default setting is 0.

FI_PSM2_TIMEOUT
Timeout (seconds) for gracefully closing PSM endpoints. A forced closing will be
issued if timeout expires.

The default setting is 5.

FI_PSM2_CONN_TIMEOUT
Timeout (seconds) for establishing connection between two PSM endpoints.

The default setting is 5.

FI_PSM2_PROG_INTERVAL
When auto progress is enabled (asked via the hints to fi_getinfo), a progress
thread is created to make progress calls from time to time. This option set the
interval (microseconds) between progress calls.

The default setting is 1 if affinity is set, or 1000 if not. See FI_PSM2_PROG_AFFINITY.

FI_PSM2_PROG_AFFINITY
When set, specify the set of CPU cores to set the progress thread affinity to. The
format is <start>[:<end>[:<stride>]][,<start>[:<end>[:<stride>]]]*, where each
triplet <start>:<end>:<stride> defines a block of core_ids. Both <start> and <end>
can be either the core_id (when >=0) or core_id - num_cores (when <0).

By default affinity is not set.

FI_PSM2_INJECT_SIZE
Maximum message size allowed for fi_inject and fi_tinject calls. This is an
experimental feature to allow some applications to override default inject size
limitation. When the inject size is larger than the default value, some inject
calls might block.

The default setting is 64.

FI_PSM2_LOCK_LEVEL
When set, dictate the level of locking being used by the provider. Level 2 means
all locks are enabled. Level 1 disables some locks and is suitable for runs that
limit the access to each PSM2 context to a single thread. Level 0 disables all
locks and thus is only suitable for single threaded runs.

To use level 0 or level 1, wait object and auto progress mode cannot be used because they
introduce internal threads that may break the conditions needed for these levels.

The default setting is 2.

FI_PSM2_LAZY_CONN
There are two strategies on when to establish connections between the PSM2
endpoints that OFI endpoints are built on top of. In eager connection mode,
connections are established when addresses are inserted into the address vector.
In lazy connection mode, connections are established when addresses are used the
first time in communication. Eager connection mode has slightly lower critical
path overhead but lazy connection mode scales better.

This option controls how the two connection modes are used. When set to 1, lazy
connection mode is always used. When set to 0, eager connection mode is used when
required conditions are all met and lazy connection mode is used otherwise. The
conditions for eager connection mode are: (1) multiple endpoint (and scalable endpoint)
support is disabled by explicitly setting PSM2_MULTI_EP=0; and (2) the address vector type
is FI_AV_MAP.

The default setting is 0.

FI_PSM2_DISCONNECT
The provider has a mechanism to automatically send disconnection notifications to
all connected peers before the local endpoint is closed. As the response, the
peers call psm2_ep_disconnect to clean up the connection state at their side. This
allows the same PSM2 epid be used by different dynamically started processes
(clients) to communicate with the same peer (server). This mechanism, however,
introduce extra overhead to the finalization phase. For applications that never
reuse epids within the same session such overhead is unnecessary.

This option controls whether the automatic disconnection notification mechanism should be
enabled. For client-server application mentioned above, the client side should set this
option to 1, but the server should set it to 0.

The default setting is 0.

FI_PSM2_TAG_LAYOUT
Select how the 96-bit PSM2 tag bits are organized. Currently three choices are
available: tag60 means 32-4-60 partitioning for CQ data, internal protocol flags,
and application tag. tag64 means 4-28-64 partitioning for internal protocol flags,
CQ data, and application tag. auto means to choose either tag60 or tag64 based on
the hints passed to fi_getinfo – tag60 is used if remote CQ data support is
requested explicitly, either by passing non-zero value via
hints->domain_attr->cq_data_size or by including FI_REMOTE_CQ_DATA in hints->caps,
otherwise tag64 is used. If tag64 is the result of automatic selection, fi_getinfo
also returns a second instance of the provider with tag60 layout.

The default setting is auto.

Notice that if the provider is compiled with macro PSMX2_TAG_LAYOUT defined to 1 (means
tag60) or 2 (means tag64), the choice is fixed at compile time and this runtime option
will be disabled.

PSM2 EXTENSIONS

       The  psm2  provider  supports limited low level parameter setting through the fi_set_val()
       and fi_get_val() functions.  Currently the following parameters can be set via the  domain
       fid: • .RS 2

       FI_PSM2_DISCONNECT *
              Overwite  the global runtime parameter FI_PSM2_DISCONNECT for this domain.  See the
              RUNTIME PARAMETERS section for details.

       Valid parameter names are defined in the header file rdma/fi_ext_psm2.h.

AUTHORS

       OpenFabrics.

NAME

OVERVIEW

LIMITATIONS

RUNTIME PARAMETERS

PSM2 EXTENSIONS

SEE ALSO

AUTHORS