plucky (7) librpma.7.gz

Provided by: librpma-dev_1.3.0-2build2_amd64

NAME

       librpma - remote persistent memory access library

SYNOPSIS

             #include <librpma.h>
             cc ... -lrpma

DESCRIPTION

       librpma  is a C library to simplify accessing persistent memory (PMem) on remote hosts over Remote Direct
       Memory Access (RDMA).

       The librpma library provides two schemes of operation: Remote Memory Access and Messaging. Both of
       them are available over a connection established between two peers. Both of these schemes can make
       use of PMem as well as DRAM for the sake of building efficient and scalable Remote Persistent
       Memory Access (RPMA) applications.

REMOTE MEMORY ACCESS

       The librpma library implements four basic API calls dedicated to accessing remote memory:

       •  rpma_read() - initiates transferring data from the remote memory to the local memory,

       •  rpma_write() - initiates transferring data from the local memory to the remote memory,

       •  rpma_atomic_write() - works like rpma_write(), but it transfers exactly 8 bytes of data
          (RPMA_ATOMIC_WRITE_ALIGNMENT) and stores them atomically in the remote memory (see
          rpma_atomic_write(3) for details and restrictions), and

       •  rpma_flush()  -  initiates  finalizing  a  transfer  of  data  to the remote memory. Possible types of
          rpma_flush() operation:

          •  RPMA_FLUSH_TYPE_PERSISTENT - flush data down to the persistent domain,

          •  RPMA_FLUSH_TYPE_VISIBILITY - flush data deep enough to make it visible on the remote node.

       All of the above functions take a flags argument that sets the completion notification indicator:

       •  RPMA_F_COMPLETION_ON_ERROR - generates the completion only on error,

       •  RPMA_F_COMPLETION_ALWAYS - generates the completion regardless of a result of the operation.

       All of these operations are considered finished when the respective completion is generated.
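
       The following is a minimal sketch (error handling elided; conn, src_mr, dst_mr, len and FLUSH_ID
       are assumed to exist) of posting a write followed by a persistent flush, with a completion
       requested only for the flush:

             /* post the write; generate a completion only on error */
             rpma_write(conn, dst_mr, 0, src_mr, 0, len,
                             RPMA_F_COMPLETION_ON_ERROR, NULL);

             /* flush the written data down to the persistent domain and
              * request a completion regardless of the result */
             rpma_flush(conn, dst_mr, 0, len, RPMA_FLUSH_TYPE_PERSISTENT,
                             RPMA_F_COMPLETION_ALWAYS, FLUSH_ID);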

DIRECT WRITE TO PMEM

       Direct Write to PMem is a feature of a platform and its configuration which allows an RDMA-capable
       network interface to write data to the platform's PMem in a persistent way. It may be impossible
       because of, for example, caching mechanisms on the data's path. When Direct Write to PMem is
       impossible, operating as if it were possible may corrupt data on PMem, which is why Direct Write to
       PMem is not enabled by default.

       On current Intel platforms, the only thing you have to do in order to enable Direct Write to PMem
       is to turn off Intel Direct Data I/O (DDIO). Depending on the platform, you can turn off DDIO
       either globally for the whole platform or for a specific PCIe Root Port. For details, please see
       the manual of your platform.

       When you have a platform which allows Direct Write to PMem, you have to declare it in your peer's
       configuration. The peer's configuration has to be transferred to all the peers that want to execute
       rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem and applied to the
       connection object which safeguards access to the PMem.

       •  rpma_peer_cfg_set_direct_write_to_pmem() - declare Direct Write to PMem support

       •  rpma_peer_cfg_get_descriptor() - get the descriptor of the peer configuration

       •  rpma_peer_cfg_from_descriptor() - create a peer configuration from the descriptor

       •  rpma_conn_apply_remote_peer_cfg() - apply remote peer cfg to the connection

       For details on how to use these APIs please see
       https://github.com/pmem/rpma/tree/main/examples/05-flush-to-persistent.
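
       A minimal sketch of this declaration and exchange (error handling elided; the descriptor buffer and
       its transfer between the peers are application-specific):

             /* on the peer with PMem: declare Direct Write to PMem support */
             struct rpma_peer_cfg *pcfg = NULL;
             rpma_peer_cfg_new(&pcfg);
             rpma_peer_cfg_set_direct_write_to_pmem(pcfg, true);
             size_t desc_size;
             rpma_peer_cfg_get_descriptor_size(pcfg, &desc_size);
             rpma_peer_cfg_get_descriptor(pcfg, descriptor);
             /* ... send the descriptor to the other peer ... */

             /* on the peer executing rpma_flush(): apply the received cfg */
             rpma_peer_cfg_from_descriptor(descriptor, desc_size, &pcfg);
             rpma_conn_apply_remote_peer_cfg(conn, pcfg);
             rpma_peer_cfg_delete(&pcfg);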

CLIENT OPERATION

       A client is the active side of the process of establishing a connection. The role of a peer during
       the process of establishing a connection does not determine the direction of the data flow (neither
       via Remote Memory Access nor via Messaging). After establishing the connection, both peers have the
       same capabilities.

       The client, in order to establish a connection, has to perform the following steps:

       •  rpma_conn_req_new() - create a new outgoing connection request object

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the connection both peers can perform Remote Memory Access and/or Messaging over the
       connection.

       The client, in order to close a connection, has to perform the following steps:

       •  rpma_conn_disconnect() - initiate disconnection

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_delete() - delete the closed connection
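
       Put together, a minimal client-side sketch (error handling elided; peer, addr and port are assumed
       to exist) may look as follows:

             struct rpma_conn_req *req = NULL;
             struct rpma_conn *conn = NULL;
             enum rpma_conn_event event;

             rpma_conn_req_new(peer, addr, port, NULL, &req);
             rpma_conn_req_connect(&req, NULL, &conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_ESTABLISHED is expected */

             /* ... Remote Memory Access and/or Messaging ... */

             rpma_conn_disconnect(conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_CLOSED is expected */
             rpma_conn_delete(&conn);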

SERVER OPERATION

       A server is the passive side of the process of establishing a connection. Note  that  after  establishing
       the connection both peers have the same capabilities.

       The server, in order to establish a connection, has to perform the following steps:

       •  rpma_ep_listen() - create a listening endpoint

       •  rpma_ep_next_conn_req() - obtain an incoming connection request

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the connection both peers can perform Remote Memory Access and/or Messaging over the
       connection.

       The server, in order to close a connection, has to perform the following steps:

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_disconnect() - disconnect the connection

       •  rpma_conn_delete() - delete the closed connection

       When no more incoming connections are expected, the server can stop waiting for them:

       •  rpma_ep_shutdown() - stop listening and delete the endpoint
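
       Put together, a minimal server-side sketch handling a single connection (error handling elided;
       peer, addr and port are assumed to exist):

             struct rpma_ep *ep = NULL;
             struct rpma_conn_req *req = NULL;
             struct rpma_conn *conn = NULL;
             enum rpma_conn_event event;

             rpma_ep_listen(peer, addr, port, &ep);
             rpma_ep_next_conn_req(ep, NULL, &req);
             rpma_conn_req_connect(&req, NULL, &conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_ESTABLISHED is expected */

             /* ... serve the connection until the client disconnects ... */

             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_CLOSED is expected */
             rpma_conn_disconnect(conn);
             rpma_conn_delete(&conn);
             rpma_ep_shutdown(&ep);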

MEMORY MANAGEMENT

       Every piece of memory (either volatile or persistent) must be registered and its usage must be  specified
       in  order  to  be  used in Remote Memory Access or Messaging. This can be done using the following memory
       management librpma functions:

       •  rpma_mr_reg() which registers a memory region and creates a local memory registration object and

       •  rpma_mr_dereg() which deregisters the memory region and deletes the local memory registration object.

       A description of the registered memory region sometimes has to be transferred via the network to
       the other side of the connection. In order to do that, a network-transferable description of the
       provided memory region (called a 'descriptor') has to be created using rpma_mr_get_descriptor(). On
       the other side of the connection, the received descriptor should be decoded using
       rpma_mr_remote_from_descriptor(), which creates a remote memory region structure that allows for
       Remote Memory Access.
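
       A minimal sketch of registering a memory region and exchanging its descriptor (error handling
       elided; peer, ptr, size and the desc buffer are assumed to exist, and the descriptor transfer is
       application-specific):

             /* register the memory as a source of rpma_read() */
             struct rpma_mr_local *mr = NULL;
             rpma_mr_reg(peer, ptr, size, RPMA_MR_USAGE_READ_SRC, &mr);

             /* get the descriptor and send it to the other side */
             size_t desc_size;
             rpma_mr_get_descriptor_size(mr, &desc_size);
             rpma_mr_get_descriptor(mr, desc);

             /* on the other side, after receiving the descriptor: */
             struct rpma_mr_remote *remote_mr = NULL;
             rpma_mr_remote_from_descriptor(desc, desc_size, &remote_mr);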

MESSAGING

       The librpma messaging API allows transferring messages (buffers of arbitrary data) between the
       peers. Transferring messages requires preparing buffers (memory regions) on the remote side to
       receive the sent data. The received data are written to those dedicated buffers, so the sender does
       not need a respective remote memory region object to send a message. The memory buffers used for
       messaging have to be registered using rpma_mr_reg() prior to the rpma_send() or rpma_recv()
       function call.

       The librpma library implements the following messaging API:

       •  rpma_send() - initiates the send operation which transfers a message from the local memory to
          the other side of the connection,

       •  rpma_recv() - initiates the receive operation which prepares a buffer for a message sent from
          the other side of the connection,

       •  rpma_conn_req_recv() - works like rpma_recv(), but it may be used before the connection is
          established.

       All of these operations are considered finished when the respective completion is generated.
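
       A minimal sketch of a message exchange (error handling elided; conn, recv_mr, send_mr, MSG_SIZE,
       RECV_ID and SEND_ID are assumed to exist):

             /* receiver: post a buffer before the message arrives */
             rpma_recv(conn, recv_mr, 0, MSG_SIZE, RECV_ID);

             /* sender: transfer MSG_SIZE bytes from the local memory */
             rpma_send(conn, send_mr, 0, MSG_SIZE,
                             RPMA_F_COMPLETION_ALWAYS, SEND_ID);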

COMPLETIONS

       RDMA operations generate completions that notify the user that the respective operation has been
       completed.

       The following operations (identified by their completion opcodes) are available in librpma:

       •  IBV_WC_RDMA_READ - RMA read operation

       •  IBV_WC_RDMA_WRITE - RMA write operation

       •  IBV_WC_SEND - messaging send operation

       •  IBV_WC_RECV - messaging receive operation

       •  IBV_WC_RECV_RDMA_WITH_IMM - messaging receive operation for RMA write operation with immediate data

       All operations generate a completion on error. The operations posted with the RPMA_F_COMPLETION_ALWAYS flag
       also  generate  a  completion on success.  Completion codes are reused from the libibverbs library, where
       the IBV_WC_SUCCESS status indicates the successful completion of an operation. Completions are  collected
       in  the  completion  queue (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on
       queues).

       The librpma library implements the following API for handling completions:

       •  rpma_conn_get_cq() gets the connection's main CQ,

       •  rpma_conn_get_rcq() gets the connection's receive CQ,

       •  rpma_cq_wait() waits for an incoming completion from the specified CQ (main or receive CQ) - if
          it succeeds, the completion can be collected using rpma_cq_get_wc(),

       •  rpma_cq_get_wc() receives the next available completion of an already posted operation.
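
       A minimal sketch of collecting a single completion from the connection's main CQ (error handling
       elided; conn is assumed to exist):

             struct rpma_cq *cq = NULL;
             struct ibv_wc wc;
             int got;

             rpma_conn_get_cq(conn, &cq);
             rpma_cq_wait(cq);
             rpma_cq_get_wc(cq, 1, &wc, &got);
             if (got == 1 && wc.status == IBV_WC_SUCCESS &&
                             wc.opcode == IBV_WC_RDMA_READ) {
                     /* the posted rpma_read() has completed successfully;
                      * wc.wr_id carries the op_context of the operation */
             }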

PEER

       A  peer is an abstraction representing an RDMA-capable device.  All other RPMA objects have to be created
       in the context of a peer.  A peer allows one to:

       •  establish connections (Client Operation)

       •  register memory regions (Memory Management)

       •  create endpoints for listening for incoming connections (Server Operation)

       To create a peer, a user first has to obtain an RDMA device context for a given IPv4/IPv6 address
       using rpma_utils_get_ibv_context(). Then a new peer object can be created using rpma_peer_new() and
       deleted using rpma_peer_delete().
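
       A minimal sketch of creating a peer (error handling elided; addr is assumed to exist;
       RPMA_UTIL_IBV_CONTEXT_LOCAL is meant for the passive (server) side, RPMA_UTIL_IBV_CONTEXT_REMOTE
       for the active (client) side):

             struct ibv_context *ibv_ctx = NULL;
             struct rpma_peer *peer = NULL;

             rpma_utils_get_ibv_context(addr, RPMA_UTIL_IBV_CONTEXT_LOCAL,
                             &ibv_ctx);
             rpma_peer_new(ibv_ctx, &peer);

             /* ... connections, memory registrations, endpoints ... */

             rpma_peer_delete(&peer);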

SYNCHRONOUS AND ASYNCHRONOUS MODES

       By default, all endpoints and connections operate in the synchronous mode where:

       •  rpma_ep_next_conn_req(),

       •  rpma_cq_wait() and

       •  rpma_conn_next_event()

       are blocking calls. You  can  make  those  API  calls  non-blocking  by  modifying  the  respective  file
       descriptors:

       •  rpma_ep_get_fd() - provides a file descriptor for rpma_ep_next_conn_req()

       •  rpma_cq_get_fd() - provides a file descriptor for rpma_cq_wait()

       •  rpma_conn_get_event_fd() - provides a file descriptor for rpma_conn_next_event()

       When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:

                int flags = fcntl(fd, F_GETFL);
                fcntl(fd, F_SETFL, flags | O_NONBLOCK);

       Such a change makes the respective API call non-blocking.

       The provided file descriptors can also be used for scalable I/O handling, e.g. with epoll(7).

       Please see the example showing how to make use of RPMA file descriptors:
       https://github.com/pmem/rpma/tree/main/examples/06-multiple-connections
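
       For example, the CQ's file descriptor (made non-blocking as shown above) can be watched with
       epoll(7); a minimal sketch (error handling elided; cq is assumed to exist):

             int fd, epfd = epoll_create1(0);
             struct epoll_event ev = { .events = EPOLLIN };

             rpma_cq_get_fd(cq, &fd);
             ev.data.fd = fd;
             epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

             /* block until the CQ's file descriptor becomes ready ... */
             epoll_wait(epfd, &ev, 1, -1);
             /* ... so that the following call returns without blocking */
             rpma_cq_wait(cq);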

QUEUES, PERFORMANCE AND RESOURCE USE

       Remote Memory Access operations, Messaging operations and  their  Completions  consume  space  in  queues
       allocated  in  an RDMA-capable network interface (RNIC) hardware for each of the connections. You must be
       aware of the existence of these queues:

       •  completion queue (CQ) where completions of  operations  are  placed,  either  when  a  completion  was
          required  by  a  user  (RPMA_F_COMPLETION_ALWAYS)  or  a completion with an error occurred. All Remote
          Memory Access operations and Messaging operations can consume CQ space.

       •  send queue (SQ) where all Remote Memory Access operations and rpma_send() operations are placed before
          they are executed by RNIC.

       •  receive queue (RQ) where rpma_recv() entries are placed before they are consumed by an
          rpma_send() coming from the other side of the connection.

       You must assume SQ and RQ entries occupy space in their respective queues until:

       •  a respective operation's completion is generated or

       •  a completion of an operation, which was scheduled later, is generated.

       You must also be aware that an RNIC has limited resources, so it is impossible to store a very long
       set of queues for many possibly existing connections. If all of the queues do not fit into the
       RNIC's resources, it will start using the platform's memory for this purpose. In this case, the
       performance will be degraded because of inevitable cache misses.

       Because the length of the queues has such a profound impact on the performance of an RPMA
       application, you can configure the length of each of the queues separately for each of the
       connections:

       •  rpma_conn_cfg_set_cq_size() - set the length of the CQ

       •  rpma_conn_cfg_set_sq_size() - set the length of the SQ

       •  rpma_conn_cfg_set_rq_size() - set the length of the RQ

       When the connection configuration object is ready, it has to be passed to either
       rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.
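
       A minimal sketch of setting the queue lengths on the client side (error handling elided; peer, addr
       and port are assumed to exist, and the queue lengths are arbitrary):

             struct rpma_conn_cfg *cfg = NULL;
             struct rpma_conn_req *req = NULL;

             rpma_conn_cfg_new(&cfg);
             rpma_conn_cfg_set_cq_size(cfg, 16);
             rpma_conn_cfg_set_sq_size(cfg, 8);
             rpma_conn_cfg_set_rq_size(cfg, 8);

             /* on the server, pass cfg to rpma_ep_next_conn_req() instead */
             rpma_conn_req_new(peer, addr, port, cfg, &req);
             rpma_conn_cfg_delete(&cfg);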

THREAD SAFETY

       The analysis of thread safety of the librpma library is described in detail in the THREAD_SAFETY.md
       file:

               https://github.com/pmem/rpma/blob/main/THREAD_SAFETY.md

ON-DEMAND PAGING SUPPORT

       On-Demand-Paging (ODP) is a technique that simplifies  the  memory  registration  process  (for  example,
       applications  no longer need to pin down the underlying physical pages of the address space and track the
       validity of the mappings). On-Demand Paging is available if both the hardware and the kernel support  it.
       The detailed description of ODP can be found here:

            https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x

       The state of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable() function,
       which queries the RDMA device context's capabilities and checks if it supports On-Demand Paging.

       The librpma library uses ODP automatically if it is supported. ODP support is required to  register  PMem
       memory region mapped from File System DAX (FSDAX).
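
       A minimal sketch of the capability check (error handling elided; ibv_ctx is assumed to be obtained
       using rpma_utils_get_ibv_context()):

             int is_odp_capable = 0;
             rpma_utils_ibv_context_is_odp_capable(ibv_ctx,
                             &is_odp_capable);
             if (!is_odp_capable) {
                     /* registering PMem mapped from FSDAX will not work */
             }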

DEBUGGING AND ERROR HANDLING

       Any librpma function that may fail returns a negative error code. Checking whether the returned
       value is non-negative is the only programmatically available way to verify that the API call
       succeeded. The exact meaning of all error codes is described in the manual of each function.
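
       For example, a call can be checked as follows (conn, dst_mr, src_mr and len are assumed to exist):

             int ret = rpma_read(conn, dst_mr, 0, src_mr, 0, len,
                             RPMA_F_COMPLETION_ALWAYS, NULL);
             if (ret) {
                     /* ret is a negative error code, e.g. RPMA_E_INVAL;
                      * see rpma_read(3) for its exact meaning */
             }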

       The librpma library implements a logging API which may provide additional information in case of an
       error as well as during normal operation, according to the currently set logging threshold levels.

       The function that will handle all generated log messages can be set using rpma_log_set_function().
       The logging function can be either the default logging function (built into the library) or a
       user-defined, thread-safe function. The default logging function can write messages to syslog(3)
       and stderr(3). The logging threshold level can be set or read using rpma_log_set_threshold() and
       rpma_log_get_threshold(), respectively.
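
       A minimal sketch of configuring logging with the default logging function:

             /* use the default logging function (built into the library) */
             rpma_log_set_function(RPMA_LOG_USE_DEFAULT_FUNCTION);

             /* log to syslog(3) all messages up to the INFO level */
             rpma_log_set_threshold(RPMA_LOG_THRESHOLD,
                             RPMA_LOG_LEVEL_INFO);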

       There is an example of the usage of the logging functions:
       https://github.com/pmem/rpma/tree/main/examples/log

EXAMPLES

       See https://github.com/pmem/rpma/tree/main/examples for examples of using the librpma API.

ACKNOWLEDGEMENTS

       librpma is built on top of the libibverbs and librdmacm APIs.

DEPRECATING

       The use of API calls which are marked as deprecated should be avoided, because they will be removed
       in a new major release.

       NOTE: API calls deprecated in a 0.X release will usually be removed in the 0.(X+1) release.

SEE ALSO

       https://pmem.io/rpma/