Ubuntu Manpage: librpma - remote persistent memory access library

NAME

       librpma - remote persistent memory access library

SYNOPSIS

             #include <librpma.h>
             cc ... -lrpma

DESCRIPTION

       librpma is a C library to simplify accessing persistent memory (PMem) on remote hosts over
       Remote Direct Memory Access (RDMA).

       The librpma library provides two possible schemes of operation: Remote Memory  Access  and
       Messaging.  Both  of  them  are available over a connection established between two peers.
       Both of these schemes can make use of PMem as well  as  DRAM  for  the  sake  of  building
       efficient and scalable Remote Persistent Memory Accessing (RPMA) applications.

REMOTE MEMORY ACCESS

       The librpma library implements four basic API calls dedicated to accessing remote
       memory:

       •  rpma_read() - initiates transferring data from the remote memory to the local memory,

       •  rpma_write() - initiates transferring data from the local memory to the remote memory,

       •  rpma_atomic_write() - works like rpma_write(), but it transfers 8 bytes of data
          (RPMA_ATOMIC_WRITE_ALIGNMENT) and stores them atomically in the remote memory
          (see rpma_atomic_write(3) for details and restrictions), and

       •  rpma_flush() - initiates finalizing a transfer of data to the remote  memory.  Possible
          types of rpma_flush() operation:

          •  RPMA_FLUSH_TYPE_PERSISTENT - flush data down to the persistent domain,

          •  RPMA_FLUSH_TYPE_VISIBILITY - flush data deep enough to make it visible on the remote
             node.

       All the above functions use  the  attribute  flags  to  set  the  completion  notification
       indicator:

       •  RPMA_F_COMPLETION_ON_ERROR - generates the completion only on error,

       •  RPMA_F_COMPLETION_ALWAYS - generates the completion regardless of the result of the
          operation.

       All of these operations are considered finished when the respective completion is
       generated.

DIRECT WRITE TO PMEM

       Direct Write to PMem is a feature of a platform and its configuration which allows an
       RDMA-capable network interface to write data to the platform's PMem in a persistent way.
       It may be impossible because of, e.g., caching mechanisms on the data's path. When Direct
       Write to PMem is impossible, operating as if it were possible may corrupt data on PMem,
       which is why Direct Write to PMem is not enabled by default.

       On the current Intel platforms, the only thing you have to do in order to enable Direct
       Write to PMem is to turn off Intel Direct Data I/O (DDIO). Depending on the platform,
       DDIO can be turned off either globally for the whole platform or for a specific PCIe
       Root Port. For details, please see the manual of your platform.

       When your platform allows Direct Write to PMem, you have to declare it in your peer's
       configuration. The peer's configuration has to be transferred to all the peers which
       want to execute rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem
       and applied to the connection object which safeguards access to PMem.

       •  rpma_peer_cfg_set_direct_write_to_pmem() - declare Direct Write to PMem support

       •  rpma_peer_cfg_get_descriptor() - get the descriptor of the peer configuration

       •  rpma_peer_cfg_from_descriptor() - create a peer configuration from the descriptor

       •  rpma_conn_apply_remote_peer_cfg() - apply the remote peer configuration to the
          connection

       For details on how to use these APIs please see
       https://github.com/pmem/rpma/tree/master/examples/05-flush-to-persistent.

CLIENT OPERATION

       A client is the active side of the process of establishing a connection. The role of a
       peer during the process of establishing a connection does not determine the direction of
       the data flow (neither via Remote Memory Access nor via Messaging). After the connection
       is established, both peers have the same capabilities.

       The client, in order to establish a connection, has to perform the following steps:

       •  rpma_conn_req_new() - create a new outgoing connection request object

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the  connection  both  peers  can perform Remote Memory Access and/or
       Messaging over the connection.

       The client, in order to close a connection, has to perform the following steps:

       •  rpma_conn_disconnect() - initiate disconnection

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_delete() - delete the closed connection

SERVER OPERATION

       A server is the passive side of the process of establishing a connection. Note that  after
       establishing the connection both peers have the same capabilities.

       The server, in order to establish a connection, has to perform the following steps:

       •  rpma_ep_listen() - create a listening endpoint

       •  rpma_ep_next_conn_req() - obtain an incoming connection request

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the  connection  both  peers  can perform Remote Memory Access and/or
       Messaging over the connection.

       The server, in order to close a connection, has to perform the following steps:

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_disconnect() - disconnect the connection

       •  rpma_conn_delete() - delete the closed connection

       When no more incoming connections are expected, the server can stop waiting for them:

       •  rpma_ep_shutdown() - stop listening and delete the endpoint

MEMORY MANAGEMENT

       Every piece of memory (either volatile or persistent) must be  registered  and  its  usage
       must  be  specified  in order to be used in Remote Memory Access or Messaging. This can be
       done using the following memory management librpma functions:

       •  rpma_mr_reg() which registers a memory region and creates a local  memory  registration
          object and

       •  rpma_mr_dereg()  which  deregisters  the  memory  region  and  deletes the local memory
          registration object.

       A description of the registered memory region sometimes has to be transferred via the
       network to the other side of the connection. In order to do that, a network-transferable
       description of the provided memory region (called 'descriptor') has to be created using
       rpma_mr_get_descriptor(). On the other side of the connection the received descriptor
       should be decoded using rpma_mr_remote_from_descriptor(). It creates a remote memory
       region object that allows for Remote Memory Access.

MESSAGING

       The librpma messaging API allows transferring messages (buffers of arbitrary data) between
       the peers. Transferring messages requires preparing buffers (memory regions) on the remote
       side to receive the sent data. The received data are written to those dedicated buffers,
       so the sender does not need a respective remote memory region object to send a message.
       The memory buffers used for messaging have to be registered using rpma_mr_reg() prior to
       calling rpma_send() or rpma_recv().

       The librpma library implements the following messaging API:

       •  rpma_send() - initiates the send operation which transfers a message from the local
          memory to the other side of the connection,

       •  rpma_recv() - initiates the receive operation which prepares a buffer for a message
          sent from the other side of the connection,

       •  rpma_conn_req_recv() - works like rpma_recv(), but it may be used before the connection
          is established.

       All of these operations are considered finished when the respective completion is
       generated.

COMPLETIONS

       RDMA operations generate completions that notify a user that the respective operation has
       been completed.

       The following operations are available in librpma:

       •  IBV_WC_RDMA_READ - RMA read operation

       •  IBV_WC_RDMA_WRITE - RMA write operation

       •  IBV_WC_SEND - messaging send operation

       •  IBV_WC_RECV - messaging receive operation

       •  IBV_WC_RECV_RDMA_WITH_IMM  -  messaging  receive operation for RMA write operation with
          immediate data

       All operations generate a completion on error. The operations posted with the
       RPMA_F_COMPLETION_ALWAYS flag also generate a completion on success. Completion codes are
       reused from the libibverbs library, where the IBV_WC_SUCCESS status indicates the
       successful completion of an operation. Completions are collected in the completion queue
       (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on queues).

       The librpma library implements the following API for handling completions:

       •  rpma_conn_get_cq() gets the connection's main CQ,

       •  rpma_conn_get_rcq() gets the connection's receive CQ,

       •  rpma_cq_wait() waits for an incoming completion from the specified CQ (main or  receive
          CQ) - if it succeeds the completion can be collected using rpma_cq_get_wc(),

       •  rpma_cq_get_wc() receives the next available completion of an already posted operation.

PEER

       A peer is an abstraction representing an RDMA-capable device.  All other RPMA objects have
       to be created in the context of a peer.  A peer allows one to:

       •  establish connections (Client Operation)

       •  register memory regions (Memory Management)

       •  create endpoints for listening for incoming connections (Server Operation)

       To create a peer, a user first has to obtain an RDMA device context for the given
       IPv4/IPv6 address using rpma_utils_get_ibv_context(). Then a new peer object can be
       created using rpma_peer_new() and deleted using rpma_peer_delete().

SYNCHRONOUS AND ASYNCHRONOUS MODES

       By default, all endpoints and connections operate in the synchronous mode where:

       •  rpma_ep_next_conn_req(),

       •  rpma_cq_wait() and

       •  rpma_conn_next_event()

       are blocking calls. You can make those API calls non-blocking by modifying the  respective
       file descriptors:

       •  rpma_ep_get_fd() - provides a file descriptor for rpma_ep_next_conn_req()

       •  rpma_cq_get_fd() - provides a file descriptor for rpma_cq_wait()

       •  rpma_conn_get_event_fd() - provides a file descriptor for rpma_conn_next_event()

       When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:

               int flags = fcntl(fd, F_GETFL);
               fcntl(fd, F_SETFL, flags | O_NONBLOCK);

       Such a change makes the respective API call non-blocking automatically.
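
       The same steps can be wrapped in a small helper that also checks the results of
       fcntl(2). This is only a minimal sketch: set_nonblocking() is a hypothetical name, not a
       librpma call, and fd is assumed to come from one of the rpma_*_get_fd() functions listed
       above.

```c
#include <fcntl.h>

/*
 * set_nonblocking: hypothetical helper (not a librpma call). It adds
 * O_NONBLOCK to the file status flags of fd and keeps the other flags.
 * Returns 0 on success and -1 on failure (errno is set by fcntl(2)).
 */
int
set_nonblocking(int fd)
{
	/* read the current file status flags */
	int flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;

	/* write them back with O_NONBLOCK added */
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -1;

	return 0;
}
```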

       The provided file descriptors can also be used for scalable I/O handling like epoll(7).
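
       As an illustration of waiting on several such descriptors at once, the sketch below
       uses poll(2) instead of epoll(7) only to stay short and portable; the same descriptors
       could be registered with epoll_ctl(2) instead. wait_any_readable() and MAX_WATCHED_FDS
       are hypothetical names, not part of librpma, and the descriptors are assumed to come
       from the rpma_*_get_fd() functions listed above.

```c
#include <poll.h>
#include <stddef.h>

#define MAX_WATCHED_FDS 16 /* arbitrary limit chosen for this sketch */

/*
 * wait_any_readable: hypothetical helper (not a librpma call). It waits up
 * to timeout_ms milliseconds until one of the n descriptors is readable.
 * Returns the index of the first readable descriptor, -1 on timeout
 * and -2 on error (including n out of range).
 */
int
wait_any_readable(const int *fds, size_t n, int timeout_ms)
{
	struct pollfd pfds[MAX_WATCHED_FDS];

	if (n == 0 || n > MAX_WATCHED_FDS)
		return -2;

	for (size_t i = 0; i < n; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}

	int ret = poll(pfds, (nfds_t)n, timeout_ms);
	if (ret < 0)
		return -2; /* poll(2) failed, errno is set */
	if (ret == 0)
		return -1; /* nothing became readable in time */

	for (size_t i = 0; i < n; i++) {
		if (pfds[i].revents & POLLIN)
			return (int)i;
	}

	return -2; /* only POLLERR/POLLHUP/POLLNVAL were reported */
}
```

       A caller would typically follow a non-negative result with the API call that belongs to
       the reported descriptor, for example rpma_cq_wait() for a descriptor obtained from
       rpma_cq_get_fd().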

       Please   see   the   example   showing   how   to  make  use  of  RPMA  file  descriptors:
       https://github.com/pmem/rpma/tree/master/examples/06-multiple-connections

QUEUES, PERFORMANCE AND RESOURCE USE

       Remote Memory Access operations, Messaging operations and their Completions consume  space
       in  queues  allocated in an RDMA-capable network interface (RNIC) hardware for each of the
       connections. You must be aware of the existence of these queues:

       •  completion queue (CQ) where completions of operations are placed, either when a
          completion was requested by a user (RPMA_F_COMPLETION_ALWAYS) or when an operation
          completed with an error. All Remote Memory Access operations and Messaging operations
          can consume CQ space.

       •  send  queue  (SQ)  where all Remote Memory Access operations and rpma_send() operations
          are placed before they are executed by RNIC.

       •  receive queue (RQ) where rpma_recv() entries are placed before they are consumed by the
          rpma_send() coming from the other side of the connection.

       You must assume that SQ and RQ entries occupy their place in the respective queue until:

       •  a respective operation's completion is generated or

       •  a completion of an operation which was scheduled later is generated.
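
       The effect of this rule on the required SQ length can be illustrated with a simplified
       model. It is only a sketch under stated assumptions, not librpma code: peak_sq_usage()
       is a hypothetical function, every operation is assumed to take exactly one SQ entry, and
       the caller is assumed to collect each requested completion before posting the next
       operation.

```c
#include <stddef.h>

/*
 * peak_sq_usage: hypothetical model (not a librpma call). It posts n_ops
 * operations one by one and requests a completion for every
 * signal_every-th operation (0 means no completion is ever requested).
 * Collecting a completion releases the entry of the operation itself and
 * the entries of all operations posted before it.
 * Returns the highest number of SQ entries occupied at the same time.
 */
size_t
peak_sq_usage(size_t n_ops, size_t signal_every)
{
	size_t in_use = 0;
	size_t peak = 0;

	for (size_t i = 1; i <= n_ops; i++) {
		in_use++; /* the posted operation takes one SQ entry */
		if (in_use > peak)
			peak = in_use;

		/* its completion releases it and all earlier operations */
		if (signal_every != 0 && i % signal_every == 0)
			in_use = 0;
	}

	return peak;
}
```

       Under these assumptions peak_sq_usage(100, 8) evaluates to 8, whereas
       peak_sq_usage(100, 0) evaluates to 100 because no entry is ever released. A real
       application may need additional headroom, for example when completions are collected
       with a delay.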

       You must also be aware that an RNIC has limited resources, so it cannot hold very long
       queues for many simultaneously existing connections. If the queues do not all fit into
       the RNIC's resources, the RNIC starts using the platform's memory for this purpose. In
       this case, the performance will be degraded because of inevitable cache misses.

       Because the length of the queues has such a profound impact on the performance of an RPMA
       application, you can configure the length of each queue separately for each connection:

       •  rpma_conn_cfg_set_cq_size() - set length of CQ

       •  rpma_conn_cfg_set_sq_size() - set length of SQ

       •  rpma_conn_cfg_set_rq_size() - set length of RQ

       When the connection configuration object is ready, it has to be passed to either
       rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.

THREAD SAFETY

       The analysis of thread safety of the librpma library is described in detail in the
       THREAD_SAFETY.md file:

               https://github.com/pmem/rpma/blob/master/THREAD_SAFETY.md

ON-DEMAND PAGING SUPPORT

       On-Demand-Paging (ODP) is a technique that simplifies the memory registration process (for
       example, applications no longer need to pin down the  underlying  physical  pages  of  the
       address  space  and  track the validity of the mappings). On-Demand Paging is available if
       both the hardware and the kernel support it. The detailed description of ODP can be  found
       here:

            https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x

       The state of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable()
       function, which queries the RDMA device context's capabilities and checks if it supports
       On-Demand Paging.

       The librpma library uses ODP automatically if it is supported. ODP support is required to
       register a PMem memory region mapped from File System DAX (FSDAX).

DEBUGGING AND ERROR HANDLING

       A librpma function that may fail returns a negative error code on failure. Checking
       whether the returned value is non-negative is the only programmatic way to verify that
       the API call succeeded. The exact meaning of all error codes is described in the manual
       of each function.

       The  librpma  library  implements the logging API which may give additional information in
       case of an error and during normal operation as well, according  to  the  current  logging
       threshold levels.

       The function that handles all generated log messages can be set using
       rpma_log_set_function(). The logging function can be either the default logging function
       (built into the library) or a user-defined, thread-safe function. The default logging
       function can write messages to syslog(3) and stderr(3). The logging threshold level can be
       set using rpma_log_set_threshold() and read using rpma_log_get_threshold().

       An example of using the logging functions is available at:
       https://github.com/pmem/rpma/tree/master/examples/log

EXAMPLES

       See https://github.com/pmem/rpma/tree/master/examples for examples of  using  the  librpma
       API.

ACKNOWLEDGEMENTS

       librpma is built on top of the libibverbs and librdmacm APIs.

DEPRECATING

       API calls which are marked as deprecated should be avoided, because they will be removed
       in a new major release.

       NOTE: API calls deprecated in the 0.X release will usually be removed in the 0.(X+1)
       release.