plucky (7) librpma.7.gz

Provided by: librpma-dev_1.3.0-2build2_amd64

NAME

       librpma - remote persistent memory access library

SYNOPSIS

             #include <librpma.h>
             cc ... -lrpma

DESCRIPTION

       librpma  is a C library to simplify accessing persistent memory (PMem) on remote hosts over Remote Direct
       Memory Access (RDMA).

       The librpma library provides two schemes of operation: Remote Memory Access and Messaging. Both of
       them are available over a connection established between two peers. Both of these schemes can make
       use of PMem as well as DRAM for the sake of building efficient and scalable Remote Persistent
       Memory Access (RPMA) applications.

REMOTE MEMORY ACCESS

       The librpma library implements four basic API calls dedicated to accessing remote memory:

       •  rpma_read() - initiates transferring data from the remote memory to the local memory,

       •  rpma_write() - initiates transferring data from the local memory to the remote memory,

       •  rpma_atomic_write() - works like rpma_write(), but it transfers exactly 8 bytes of data
          (RPMA_ATOMIC_WRITE_ALIGNMENT) and stores them atomically in the remote memory (see
          rpma_atomic_write(3) for details and restrictions), and

       •  rpma_flush()  -  initiates  finalizing  a  transfer  of  data  to the remote memory. Possible types of
          rpma_flush() operation:

          •  RPMA_FLUSH_TYPE_PERSISTENT - flush data down to the persistent domain,

          •  RPMA_FLUSH_TYPE_VISIBILITY - flush data deep enough to make it visible on the remote node.

       All of the above functions take a flags argument that sets the completion notification indicator:

       •  RPMA_F_COMPLETION_ON_ERROR - generates the completion only on error,

       •  RPMA_F_COMPLETION_ALWAYS - generates the completion regardless of a result of the operation.

       All of these operations are considered finished when the respective completion is generated.
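
       The following is a minimal sketch (error handling elided; conn, src_mr, dst_mr, len and FLUSH_ID
       are assumed to exist) of posting a write followed by a persistent flush, with a completion
       requested only for the flush:

             /* post the write; generate a completion only on error */
             rpma_write(conn, dst_mr, 0, src_mr, 0, len,
                             RPMA_F_COMPLETION_ON_ERROR, NULL);

             /* flush the written data down to the persistent domain and
              * request a completion regardless of the result */
             rpma_flush(conn, dst_mr, 0, len, RPMA_FLUSH_TYPE_PERSISTENT,
                             RPMA_F_COMPLETION_ALWAYS, FLUSH_ID);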

DIRECT WRITE TO PMEM

       Direct Write to PMem is a feature of a platform and its configuration which allows an RDMA-capable
       network interface to write data to the platform's PMem in a persistent way. It may be impossible
       because of, for example, caching mechanisms on the data's path. When Direct Write to PMem is
       impossible, operating as if it were possible may corrupt data on PMem, which is why Direct Write to
       PMem is not enabled by default.

       On current Intel platforms, the only thing you have to do in order to enable Direct Write to PMem
       is to turn off Intel Direct Data I/O (DDIO). Depending on the platform, you can turn off DDIO
       either globally for the whole platform or for a specific PCIe Root Port. For details, please see
       the manual of your platform.

       When you have a platform which allows Direct Write to PMem, you have to declare it in your peer's
       configuration. The peer's configuration has to be transferred to all the peers that want to execute
       rpma_flush() with RPMA_FLUSH_TYPE_PERSISTENT against the platform's PMem and applied to the
       connection object which safeguards access to the PMem.

       •  rpma_peer_cfg_set_direct_write_to_pmem() - declare Direct Write to PMem support

       •  rpma_peer_cfg_get_descriptor() - get the descriptor of the peer configuration

       •  rpma_peer_cfg_from_descriptor() - create a peer configuration from the descriptor

       •  rpma_conn_apply_remote_peer_cfg() - apply remote peer cfg to the connection

       For details on how to use these APIs please see
       https://github.com/pmem/rpma/tree/main/examples/05-flush-to-persistent.
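
       A minimal sketch of this declaration and exchange (error handling elided; the descriptor buffer and
       its transfer between the peers are application-specific):

             /* on the peer with PMem: declare Direct Write to PMem support */
             struct rpma_peer_cfg *pcfg = NULL;
             rpma_peer_cfg_new(&pcfg);
             rpma_peer_cfg_set_direct_write_to_pmem(pcfg, true);
             size_t desc_size;
             rpma_peer_cfg_get_descriptor_size(pcfg, &desc_size);
             rpma_peer_cfg_get_descriptor(pcfg, descriptor);
             /* ... send the descriptor to the other peer ... */

             /* on the peer executing rpma_flush(): apply the received cfg */
             rpma_peer_cfg_from_descriptor(descriptor, desc_size, &pcfg);
             rpma_conn_apply_remote_peer_cfg(conn, pcfg);
             rpma_peer_cfg_delete(&pcfg);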

CLIENT OPERATION

       A client is the active side of the process of establishing a connection. The role of a peer during
       the process of establishing a connection does not determine the direction of the data flow (neither
       via Remote Memory Access nor via Messaging). After establishing the connection, both peers have the
       same capabilities.

       The client, in order to establish a connection, has to perform the following steps:

       •  rpma_conn_req_new() - create a new outgoing connection request object

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the connection both peers can perform Remote Memory Access and/or Messaging over the
       connection.

       The client, in order to close a connection, has to perform the following steps:

       •  rpma_conn_disconnect() - initiate disconnection

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_delete() - delete the closed connection
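
       Put together, a minimal client-side sketch (error handling elided; peer, addr and port are assumed
       to exist) may look as follows:

             struct rpma_conn_req *req = NULL;
             struct rpma_conn *conn = NULL;
             enum rpma_conn_event event;

             rpma_conn_req_new(peer, addr, port, NULL, &req);
             rpma_conn_req_connect(&req, NULL, &conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_ESTABLISHED is expected */

             /* ... Remote Memory Access and/or Messaging ... */

             rpma_conn_disconnect(conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_CLOSED is expected */
             rpma_conn_delete(&conn);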

SERVER OPERATION

       A server is the passive side of the process of establishing a connection. Note  that  after  establishing
       the connection both peers have the same capabilities.

       The server, in order to establish a connection, has to perform the following steps:

       •  rpma_ep_listen() - create a listening endpoint

       •  rpma_ep_next_conn_req() - obtain an incoming connection request

       •  rpma_conn_req_connect() - initiate processing the connection request

       •  rpma_conn_next_event() - wait for the RPMA_CONN_ESTABLISHED event

       After  establishing  the connection both peers can perform Remote Memory Access and/or Messaging over the
       connection.

       The server, in order to close a connection, has to perform the following steps:

       •  rpma_conn_next_event() - wait for the RPMA_CONN_CLOSED event

       •  rpma_conn_disconnect() - disconnect the connection

       •  rpma_conn_delete() - delete the closed connection

       When no more incoming connections are expected, the server can stop waiting for them:

       •  rpma_ep_shutdown() - stop listening and delete the endpoint
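
       Put together, a minimal server-side sketch handling a single connection (error handling elided;
       peer, addr and port are assumed to exist):

             struct rpma_ep *ep = NULL;
             struct rpma_conn_req *req = NULL;
             struct rpma_conn *conn = NULL;
             enum rpma_conn_event event;

             rpma_ep_listen(peer, addr, port, &ep);
             rpma_ep_next_conn_req(ep, NULL, &req);
             rpma_conn_req_connect(&req, NULL, &conn);
             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_ESTABLISHED is expected */

             /* ... serve the connection until the client disconnects ... */

             rpma_conn_next_event(conn, &event);
             /* here event == RPMA_CONN_CLOSED is expected */
             rpma_conn_disconnect(conn);
             rpma_conn_delete(&conn);
             rpma_ep_shutdown(&ep);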

MEMORY MANAGEMENT

       Every piece of memory (either volatile or persistent) must be registered and its usage must be  specified
       in  order  to  be  used in Remote Memory Access or Messaging. This can be done using the following memory
       management librpma functions:

       •  rpma_mr_reg() which registers a memory region and creates a local memory registration object and

       •  rpma_mr_dereg() which deregisters the memory region and deletes the local memory registration object.

       A description of the registered memory region sometimes has to be transferred via the network to
       the other side of the connection. In order to do that, a network-transferable description of the
       provided memory region (called a 'descriptor') has to be created using rpma_mr_get_descriptor(). On
       the other side of the connection, the received descriptor should be decoded using
       rpma_mr_remote_from_descriptor(), which creates a remote memory region structure that allows for
       Remote Memory Access.
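
       A minimal sketch of registering a memory region and exchanging its descriptor (error handling
       elided; peer, ptr, size and the desc buffer are assumed to exist, and the descriptor transfer is
       application-specific):

             /* register the memory as a source of rpma_read() */
             struct rpma_mr_local *mr = NULL;
             rpma_mr_reg(peer, ptr, size, RPMA_MR_USAGE_READ_SRC, &mr);

             /* get the descriptor and send it to the other side */
             size_t desc_size;
             rpma_mr_get_descriptor_size(mr, &desc_size);
             rpma_mr_get_descriptor(mr, desc);

             /* on the other side, after receiving the descriptor: */
             struct rpma_mr_remote *remote_mr = NULL;
             rpma_mr_remote_from_descriptor(desc, desc_size, &remote_mr);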

MESSAGING

       The librpma messaging API allows transferring messages (buffers of arbitrary data) between the
       peers. Transferring messages requires preparing buffers (memory regions) on the remote side to
       receive the sent data. The received data are written to those dedicated buffers, so the sender does
       not need a respective remote memory region object to send a message. The memory buffers used for
       messaging have to be registered using rpma_mr_reg() prior to the rpma_send() or rpma_recv()
       function call.

       The librpma library implements the following messaging API:

       •  rpma_send() - initiates the send operation which transfers a message from the local memory to
          the other side of the connection,

       •  rpma_recv() - initiates the receive operation which prepares a buffer for a message sent from
          the other side of the connection,

       •  rpma_conn_req_recv() - works like rpma_recv(), but it may be used before the connection is
          established.

       All of these operations are considered finished when the respective completion is generated.
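
       A minimal sketch of a message exchange (error handling elided; conn, recv_mr, send_mr, MSG_SIZE,
       RECV_ID and SEND_ID are assumed to exist):

             /* receiver: post a buffer before the message arrives */
             rpma_recv(conn, recv_mr, 0, MSG_SIZE, RECV_ID);

             /* sender: transfer MSG_SIZE bytes from the local memory */
             rpma_send(conn, send_mr, 0, MSG_SIZE,
                             RPMA_F_COMPLETION_ALWAYS, SEND_ID);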

COMPLETIONS

       RDMA operations generate completions that notify the user that the respective operation has been
       completed.

       The following operations (identified by their completion opcodes) are available in librpma:

       •  IBV_WC_RDMA_READ - RMA read operation

       •  IBV_WC_RDMA_WRITE - RMA write operation

       •  IBV_WC_SEND - messaging send operation

       •  IBV_WC_RECV - messaging receive operation

       •  IBV_WC_RECV_RDMA_WITH_IMM - messaging receive operation for RMA write operation with immediate data

       All operations generate a completion on error. The operations posted with the RPMA_F_COMPLETION_ALWAYS flag
       also  generate  a  completion on success.  Completion codes are reused from the libibverbs library, where
       the IBV_WC_SUCCESS status indicates the successful completion of an operation. Completions are  collected
       in  the  completion  queue (CQ) (see the QUEUES, PERFORMANCE AND RESOURCE USE section for more details on
       queues).

       The librpma library implements the following API for handling completions:

       •  rpma_conn_get_cq() gets the connection's main CQ,

       •  rpma_conn_get_rcq() gets the connection's receive CQ,

       •  rpma_cq_wait() waits for an incoming completion from the specified CQ (main or receive CQ) - if
          it succeeds, the completion can be collected using rpma_cq_get_wc(),

       •  rpma_cq_get_wc() receives the next available completion of an already posted operation.
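
       A minimal sketch of collecting a single completion from the connection's main CQ (error handling
       elided; conn is assumed to exist):

             struct rpma_cq *cq = NULL;
             struct ibv_wc wc;
             int got;

             rpma_conn_get_cq(conn, &cq);
             rpma_cq_wait(cq);
             rpma_cq_get_wc(cq, 1, &wc, &got);
             if (got == 1 && wc.status == IBV_WC_SUCCESS &&
                             wc.opcode == IBV_WC_RDMA_READ) {
                     /* the posted rpma_read() has completed successfully;
                      * wc.wr_id carries the op_context of the operation */
             }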

PEER

       A  peer is an abstraction representing an RDMA-capable device.  All other RPMA objects have to be created
       in the context of a peer.  A peer allows one to:

       •  establish connections (Client Operation)

       •  register memory regions (Memory Management)

       •  create endpoints for listening for incoming connections (Server Operation)

       To create a peer, a user first has to obtain an RDMA device context for a given IPv4/IPv6 address
       using rpma_utils_get_ibv_context(). Then a new peer object can be created using rpma_peer_new() and
       deleted using rpma_peer_delete().
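
       A minimal sketch of creating a peer (error handling elided; addr is assumed to exist;
       RPMA_UTIL_IBV_CONTEXT_LOCAL is meant for the passive (server) side, RPMA_UTIL_IBV_CONTEXT_REMOTE
       for the active (client) side):

             struct ibv_context *ibv_ctx = NULL;
             struct rpma_peer *peer = NULL;

             rpma_utils_get_ibv_context(addr, RPMA_UTIL_IBV_CONTEXT_LOCAL,
                             &ibv_ctx);
             rpma_peer_new(ibv_ctx, &peer);

             /* ... connections, memory registrations, endpoints ... */

             rpma_peer_delete(&peer);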

SYNCHRONOUS AND ASYNCHRONOUS MODES

       By default, all endpoints and connections operate in the synchronous mode where:

       •  rpma_ep_next_conn_req(),

       •  rpma_cq_wait() and

       •  rpma_conn_next_event()

       are blocking calls. You  can  make  those  API  calls  non-blocking  by  modifying  the  respective  file
       descriptors:

       •  rpma_ep_get_fd() - provides a file descriptor for rpma_ep_next_conn_req()

       •  rpma_cq_get_fd() - provides a file descriptor for rpma_cq_wait()

       •  rpma_conn_get_event_fd() - provides a file descriptor for rpma_conn_next_event()

       When you have a file descriptor, you can make it non-blocking using fcntl(2) as follows:

                int flags = fcntl(fd, F_GETFL);
                fcntl(fd, F_SETFL, flags | O_NONBLOCK);

       Such a change makes the respective API call non-blocking.

       The provided file descriptors can also be used for scalable I/O handling, e.g. with epoll(7).

       Please see the example showing how to make use of RPMA file descriptors:
       https://github.com/pmem/rpma/tree/main/examples/06-multiple-connections
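
       For example, the CQ's file descriptor (made non-blocking as shown above) can be watched with
       epoll(7); a minimal sketch (error handling elided; cq is assumed to exist):

             int fd, epfd = epoll_create1(0);
             struct epoll_event ev = { .events = EPOLLIN };

             rpma_cq_get_fd(cq, &fd);
             ev.data.fd = fd;
             epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

             /* block until the CQ's file descriptor becomes ready ... */
             epoll_wait(epfd, &ev, 1, -1);
             /* ... so that the following call returns without blocking */
             rpma_cq_wait(cq);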

QUEUES, PERFORMANCE AND RESOURCE USE

       Remote Memory Access operations, Messaging operations and  their  Completions  consume  space  in  queues
       allocated  in  an RDMA-capable network interface (RNIC) hardware for each of the connections. You must be
       aware of the existence of these queues:

       •  completion queue (CQ) where completions of  operations  are  placed,  either  when  a  completion  was
          required  by  a  user  (RPMA_F_COMPLETION_ALWAYS)  or  a completion with an error occurred. All Remote
          Memory Access operations and Messaging operations can consume CQ space.

       •  send queue (SQ) where all Remote Memory Access operations and rpma_send() operations are placed before
          they are executed by RNIC.

       •  receive queue (RQ) where rpma_recv() entries are placed before they are consumed by an
          rpma_send() coming from the other side of the connection.

       You must assume SQ and RQ entries occupy space in their respective queues until:

       •  a respective operation's completion is generated or

       •  a completion of an operation, which was scheduled later, is generated.

       You must also be aware that an RNIC has limited resources, so it is impossible to store a very long
       set of queues for many possibly existing connections. If all of the queues do not fit into the
       RNIC's resources, it will start using the platform's memory for this purpose. In this case, the
       performance will be degraded because of inevitable cache misses.

       Because the length of the queues has such a profound impact on the performance of an RPMA
       application, you can configure the length of each of the queues separately for each of the
       connections:

       •  rpma_conn_cfg_set_cq_size() - set the length of the CQ

       •  rpma_conn_cfg_set_sq_size() - set the length of the SQ

       •  rpma_conn_cfg_set_rq_size() - set the length of the RQ

       When the connection configuration object is ready, it has to be passed to either
       rpma_conn_req_new() or rpma_ep_next_conn_req() for the settings to take effect.
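
       A minimal sketch of setting the queue lengths on the client side (error handling elided; peer, addr
       and port are assumed to exist, and the queue lengths are arbitrary):

             struct rpma_conn_cfg *cfg = NULL;
             struct rpma_conn_req *req = NULL;

             rpma_conn_cfg_new(&cfg);
             rpma_conn_cfg_set_cq_size(cfg, 16);
             rpma_conn_cfg_set_sq_size(cfg, 8);
             rpma_conn_cfg_set_rq_size(cfg, 8);

             /* on the server, pass cfg to rpma_ep_next_conn_req() instead */
             rpma_conn_req_new(peer, addr, port, cfg, &req);
             rpma_conn_cfg_delete(&cfg);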

THREAD SAFETY

       The analysis of thread safety of the librpma library is described in detail in the THREAD_SAFETY.md
       file:

               https://github.com/pmem/rpma/blob/main/THREAD_SAFETY.md

ON-DEMAND PAGING SUPPORT

       On-Demand-Paging (ODP) is a technique that simplifies  the  memory  registration  process  (for  example,
       applications  no longer need to pin down the underlying physical pages of the address space and track the
       validity of the mappings). On-Demand Paging is available if both the hardware and the kernel support  it.
       The detailed description of ODP can be found here:

            https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x

       The state of ODP support can be checked using the rpma_utils_ibv_context_is_odp_capable() function,
       which queries the RDMA device context's capabilities and checks if it supports On-Demand Paging.

       The librpma library uses ODP automatically if it is supported. ODP support is required to  register  PMem
       memory region mapped from File System DAX (FSDAX).
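
       A minimal sketch of the capability check (error handling elided; ibv_ctx is assumed to be obtained
       using rpma_utils_get_ibv_context()):

             int is_odp_capable = 0;
             rpma_utils_ibv_context_is_odp_capable(ibv_ctx,
                             &is_odp_capable);
             if (!is_odp_capable) {
                     /* registering PMem mapped from FSDAX will not work */
             }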

DEBUGGING AND ERROR HANDLING

       Any librpma function that may fail returns a negative error code. Checking whether the returned
       value is non-negative is the only programmatically available way to verify that the API call
       succeeded. The exact meaning of all error codes is described in the manual of each function.
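
       For example, a call can be checked as follows (conn, dst_mr, src_mr and len are assumed to exist):

             int ret = rpma_read(conn, dst_mr, 0, src_mr, 0, len,
                             RPMA_F_COMPLETION_ALWAYS, NULL);
             if (ret) {
                     /* ret is a negative error code, e.g. RPMA_E_INVAL;
                      * see rpma_read(3) for its exact meaning */
             }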

       The librpma library implements a logging API which may provide additional information in case of an
       error as well as during normal operation, according to the currently set logging threshold levels.

       The function that will handle all generated log messages can be set using rpma_log_set_function().
       The logging function can be either the default logging function (built into the library) or a
       user-defined, thread-safe function. The default logging function can write messages to syslog(3)
       and stderr(3). The logging threshold level can be set or read using rpma_log_set_threshold() and
       rpma_log_get_threshold(), respectively.
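
       A minimal sketch of configuring logging with the default logging function:

             /* use the default logging function (built into the library) */
             rpma_log_set_function(RPMA_LOG_USE_DEFAULT_FUNCTION);

             /* log to syslog(3) all messages up to the INFO level */
             rpma_log_set_threshold(RPMA_LOG_THRESHOLD,
                             RPMA_LOG_LEVEL_INFO);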

       There is an example of the usage of the logging functions:
       https://github.com/pmem/rpma/tree/main/examples/log

EXAMPLES

       See https://github.com/pmem/rpma/tree/main/examples for examples of using the librpma API.

ACKNOWLEDGEMENTS

       librpma is built on top of the libibverbs and librdmacm APIs.

DEPRECATING

       The use of API calls which are marked as deprecated should be avoided, because they will be removed
       in a new major release.

       NOTE: API calls deprecated in a 0.X release will usually be removed in the 0.(X+1) release.

SEE ALSO

       https://pmem.io/rpma/