Provided by: rds-tools_1.4.1-OFED-1.4.2-1_amd64


       RDS-rdma - Zerocopy Interface for RDMA over RDS


       This  manual page describes the zerocopy interface of RDS, which was added in RDSv3. For a
       description of the basic RDS interface, please refer to rds(7).

       The principal mode of operation for RDS  zerocopy  is  like  this:  one  participant  (the
       client) wishes to initiate a direct transfer to or from some area of memory in its process
       address space.  This memory does not have to be aligned.

       The client obtains a handle for this  region  of  memory,  and  passes  it  to  the  other
       participant  (the  server). This is called the RDMA cookie. To the application, the cookie
       is an opaque 64bit data type.

       The client sends this handle to the server application, along with other  details  of  the
       RDMA  request  (such  as  which  data  to  transfer  to that memory area).  Throughout the
       following discussion, we will refer to this message as the RDMA request.

       The server uses this RDMA cookie  to  initiate  the  requested  RDMA  transfer.  The  RDMA
       transfer  is  combined  atomically  with  a  normal RDS message, which is delivered to the
       client. This message is called the RDMA ACK in the following discussion. Atomic in this
       context means that either both the RDMA succeeds and the RDMA ACK is delivered, or neither
       does.

       Thus, when the client receives the  RDMA  ACK,  it  knows  that  the  RDMA  has  completed
       successfully. It can then release the RDMA cookie for this memory region, if it wishes to.

       RDMA  operations  are not reliable, in the sense that unlike normal RDS messages, RDS RDMA
       operations may fail, and get dropped.


       The interface is currently based on control messages (ancillary data) sent or received via
       the  sendmsg(2)  and  recvmsg(2)  system calls. Optionally, an older interface can be used
       that is based on the setsockopt(2)  system  call.  However,  we  recommend  using  control
       messages, as this reduces the number of system calls required.

   Control message interface
       With  the  control message interface, the RDMA cookie is passed to the server out-of-band,
       included in an extension header attached to the RDS message.

       The following outlines the mode of operation; the data types used will be specified in
       detail in a subsequent section.

       Initially,  the  client  will  send  RDMA  requests along with a RDS_CMSG_RDMA_MAP control
       message. The control message contains the address and length  of  the  memory  region  for
       which  to obtain a handle, some flags, and a pointer to a memory location (in the caller's
       address space) where the kernel will store the RDMA cookie.

       Alternatively, if the application has already obtained a RDMA cookie for the memory  range
       it   wants   to   RDMA  to/from,  it  can  hand  this  cookie  to  the  kernel  using  the
       RDS_CMSG_RDMA_DEST control message.

       Either way, the kernel will include the resulting RDMA cookie in an extension header  that
       is transmitted as part of the RDMA request to the server.

       When  the  server  receives  the  RDMA request, the kernel will deliver the cookie wrapped
       inside a RDS_CMSG_RDMA_DEST control message.
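
       As an illustration of the server side, the following sketch scans the ancillary data
       returned by recvmsg(2) for the cookie. It is not taken from the RDS sources: the helper
       name find_rdma_dest is hypothetical, and the numeric values given for SOL_RDS and
       RDS_CMSG_RDMA_DEST are assumptions that should come from the RDS header of your
       installation.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; real code should take them from the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_DEST
#define RDS_CMSG_RDMA_DEST 2
#endif

typedef uint64_t rds_rdma_cookie_t;

/*
 * Hypothetical helper: look for an RDS_CMSG_RDMA_DEST control message in
 * a msghdr that recvmsg(2) has filled in.  Returns 1 and stores the
 * cookie if one is present, 0 otherwise.
 */
static int find_rdma_dest(struct msghdr *msg, rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_RDS &&
            c->cmsg_type == RDS_CMSG_RDMA_DEST &&
            c->cmsg_len >= CMSG_LEN(sizeof(*cookie))) {
            /* Copy out instead of casting: CMSG_DATA may be unaligned. */
            memcpy(cookie, CMSG_DATA(c), sizeof(*cookie));
            return 1;
        }
    }
    return 0;
}
```

       The server would call this after every successful recvmsg(2) and treat a return value of
       1 as the arrival of an RDMA request.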

       The server then initiates the data transfer by sending the RDMA ACK message along  with  a
       RDS_CMSG_RDMA_ARGS  control  message. This message contains the RDMA cookie, and the local
       memory to copy to or from.

       The server process may request a notification when an RDMA operation completes.
       Notifications are delivered as RDS_CMSG_RDMA_STATUS control messages. When an
       application calls recvmsg(2), it will either receive a regular RDS message (possibly with
       other RDMA related control messages), or an empty message with one or more status control
       messages.

       In addition, when an RDMA operation fails for some reason and is discarded, the
       application can ask to receive notifications for failed messages as well, regardless
       of whether it asked for success notification of an individual message or not. This
       behavior is turned on by setting the RDS_RECVERR socket option.
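
       A minimal sketch of turning this option on is shown here. The wrapper name
       rds_enable_recverr is hypothetical, the option is assumed to take an int, and the numeric
       values of SOL_RDS and RDS_RECVERR are assumptions to be checked against the RDS header.

```c
#include <sys/socket.h>

/* Assumed values; real code should take them from the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_RECVERR
#define RDS_RECVERR 5
#endif

/*
 * Hypothetical wrapper: ask RDS to deliver a status notification for
 * every RDMA operation that fails on this socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int rds_enable_recverr(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &on, sizeof(on));
}
```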

   Setsockopt interface
       In addition to the control message interface, RDS allows a process to register and release
       memory ranges for RDMA through calls to setsockopt(2).

              To obtain a RDMA  cookie  for  a  given  memory  range,  the  application  can  use
              setsockopt  with  RDS_GET_MR.   This  operates  essentially  the  same  way  as the
              RDS_CMSG_RDMA_MAP control message: the argument contains the address and length  of
              the  memory  range  to  be  registered, and a pointer to a RDMA cookie variable, in
              which the system call will store the cookie for the registered range.

              Memory ranges can be released by calling setsockopt with  RDS_FREE_MR,  giving  the
              RDMA cookie and additional flags as arguments.

              RDS_RECVERR is a boolean option which can be set as well as queried (using
              getsockopt). When enabled, RDS will send RDMA notification messages to the
              application for any RDMA operation that fails. This option defaults to off.

       For all of these calls, the level argument to setsockopt is SOL_RDS.


       RDMA cookie
              typedef u_int64_t       rds_rdma_cookie_t;

              This  encapsulates  a  memory  location  in  the  client  process.  In  the current
              implementation, it contains the R_Key of the remote memory region, and  the  offset
              into it (so that the application does not have to worry about alignment).

              The RDMA cookie is used in several struct types described below. The
              RDS_CMSG_RDMA_DEST control message contains a rds_rdma_cookie_t all by itself as
              payload.

       Mapping arguments
              The  following  data  type is used with RDS_CMSG_RDMA_MAP control messages and with
              the RDS_GET_MR socket option:

              struct rds_iovec {
                      u_int64_t       addr;
                      u_int64_t       bytes;
              };

              struct rds_get_mr_args {
                      struct rds_iovec vec;
                      u_int64_t       cookie_addr;
                      u_int64_t       flags;
              };

              The cookie_addr member specifies the memory location in which to store the RDMA
              cookie.

              The flags value is a bitwise OR of any of the following flags:

              RDS_RDMA_USE_ONCE
                     This tells the kernel that the allocated RDMA cookie is to be used exactly
                     once.  When  the  RDMA  ACK  message  arrives, the kernel will automatically
                     unbind the memory area and release any resources associated with the cookie.

                     If this flag is not set, it is the application's responsibility  to  release
                     the memory region at a later time using the RDS_FREE_MR socket option.

              RDS_RDMA_INVALIDATE
                     Normally, RDMA memory mappings are invalidated lazily, as this requires some
                     relatively costly synchronization with the HCA. However, this means that the
                     server  application  can  continue  to access the registered memory for some
                     indeterminate amount of time.  If this  flag  is  set,  the  RDS  code  will
                     invalidate  the  mapping  at the time it is released (either upon arrival of
                     the RDMA ACK, if USE_ONCE was specified; or when the application destroys it
                     using FREE_MR).
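
              The following sketch shows how a client might attach an RDS_CMSG_RDMA_MAP control
              message to an outgoing RDMA request. It is an illustration only: add_rdma_map is
              a hypothetical helper, the structures are copied from the declarations above, and
              the numeric values of SOL_RDS, RDS_CMSG_RDMA_MAP and RDS_RDMA_USE_ONCE are
              assumptions to be checked against the RDS header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; real code should take them from the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#define RDS_CMSG_RDMA_MAP 3
#define RDS_RDMA_USE_ONCE 0x0008

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Hypothetical helper: point msg at the control buffer ctl and store one
 * RDS_CMSG_RDMA_MAP message in it, asking the kernel to map buf/len for
 * a single use and to write the resulting cookie to *cookie.
 */
static void add_rdma_map(struct msghdr *msg, void *ctl, size_t ctl_len,
                         void *buf, size_t len, rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;
    struct cmsghdr *c;

    assert(ctl_len >= CMSG_SPACE(sizeof(args)));
    memset(ctl, 0, ctl_len);
    msg->msg_control = ctl;
    msg->msg_controllen = CMSG_SPACE(sizeof(args));

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = RDS_RDMA_USE_ONCE;

    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_RDS;
    c->cmsg_type = RDS_CMSG_RDMA_MAP;
    c->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(c), &args, sizeof(args));
}
```

              The caller still has to set msg_name and msg_iov for the RDMA request itself
              before passing the msghdr to sendmsg(2).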

       RDMA Operation
              RDMA  operations  are  initiated by the server using the RDS_CMSG_RDMA_ARGS control
              message, which takes the following data as payload:

              struct rds_rdma_args {
                      rds_rdma_cookie_t cookie;
                      struct rds_iovec remote_vec;
                      u_int64_t       local_vec_addr;
                      u_int64_t       nr_local;
                      u_int64_t       flags;
                      u_int32_t       user_token;
              };

              The cookie argument contains the RDMA cookie received from the client.   The  local
              memory  is  given  via  an  array  of  rds_iovecs.   The  array address is given in
              local_vec_addr, and its number of elements is given in nr_local.

              The struct member remote_vec specifies a  location  relative  to  the  memory  area
              identified  by  the  cookie:  remote_vec.addr  is  an  offset into that region, and
              remote_vec.bytes is the length of the memory window to copy to/from.   This  length
              must  match the size of the local memory area, i.e. the sum of bytes in all members
              of the local iovec.

              The flags field contains the bitwise OR of any of the following flags:

              RDS_RDMA_READWRITE
                     If set, RDS will do an RDMA WRITE from the server's memory to the
                     client's. If not set, RDS will do an RDMA READ from the client's memory to
                     the server's memory.

              RDS_RDMA_FENCE
                     By default, InfiniBand makes no guarantee about the ordering of an RDMA READ
                     with respect to subsequent SEND operations. Setting this flag asks that the
                     RDMA READ be fenced off from the subsequent RDMA ACK message. Setting this
                     flag requires an additional round-trip of the IB fabric, but it is a good
                     idea to set this flag by default, unless you are really sure you do not
                     want it.

              RDS_RDMA_NOTIFY_ME
                     This flag requests a notification upon completion of the RDMA operation
                     (successful or otherwise). The notification will contain the value of the
                     user_token field passed in by the application. This allows the application
                     to release resources (such as buffers) associated with the RDMA transfer.

              The user_token can be used to  pass  an  application  specific  identifier  to  the
              kernel.  This  token  is  returned to the application when a status notification is
              generated (see the following section).
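
              The rules above can be sketched as a small helper that fills in the structure and
              rejects a request whose local and remote sizes differ. This is an illustration
              under stated assumptions: fill_rdma_args is a hypothetical name, the structures
              are copied from the declarations above, and the flag names RDS_RDMA_READWRITE and
              RDS_RDMA_FENCE as well as all numeric flag values are assumed from the Linux RDS
              header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; real code should take them from the RDS header. */
#define RDS_RDMA_READWRITE 0x0001
#define RDS_RDMA_FENCE     0x0002
#define RDS_RDMA_NOTIFY_ME 0x0020

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/*
 * Hypothetical helper: describe one RDMA transfer.  Returns 0 on
 * success, or -1 if the local iovecs do not add up to remote_len
 * (the condition the kernel is documented to reject).
 */
static int fill_rdma_args(struct rds_rdma_args *a, rds_rdma_cookie_t cookie,
                          uint64_t remote_off, uint64_t remote_len,
                          const struct rds_iovec *local, size_t nr_local,
                          int write, uint32_t token)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < nr_local; i++)
        total += local[i].bytes;
    if (total != remote_len)
        return -1;

    memset(a, 0, sizeof(*a));
    a->cookie = cookie;
    a->remote_vec.addr = remote_off;    /* offset into the client region */
    a->remote_vec.bytes = remote_len;
    a->local_vec_addr = (uint64_t)(uintptr_t)local;
    a->nr_local = nr_local;
    a->flags = RDS_RDMA_FENCE | RDS_RDMA_NOTIFY_ME;
    if (write)
        a->flags |= RDS_RDMA_READWRITE; /* server memory -> client memory */
    a->user_token = token;
    return 0;
}
```

              The filled structure would then be sent as the payload of an RDS_CMSG_RDMA_ARGS
              control message together with the RDMA ACK.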

       RDMA Notification
              The RDS kernel code is able to notify the server application when an RDMA operation
              completes. These notifications are delivered via RDS_CMSG_RDMA_STATUS control
              messages.

              By default, no notifications are generated. There are two ways an  application  can
              request  them.  On one hand, status notifications can be enabled on a per-operation
              basis by setting the RDS_RDMA_NOTIFY_ME flag in the RDMA arguments.  On  the  other
              hand,  the  application can request notifications for all RDMA operations that fail
              by setting the RDS_RECVERR socket option (see below).  In both cases, the format of
              the  notification  is  the  same;  and  at  most  one notification will be sent per
              completed operation.

              The message format is this:

              struct rds_rdma_notify {
                      u_int32_t       user_token;
                      int32_t         status;
              };

              The user_token field contains the value previously  given  to  the  kernel  in  the
              RDS_CMSG_RDMA_ARGS  control message. The status field contains a status value, with
              0 indicating success, and non-zero indicating an error.

              The following status codes are currently defined:

              RDS_RDMA_SUCCESS
                     The RDMA operation succeeded.

              RDS_RDMA_REMOTE_ERROR
                     The RDMA operation failed due to a remote access error. This is usually due
                     to an invalid R_key, offset or transfer size.

              RDS_RDMA_CANCELED
                     The RDMA operation was canceled by the application. (This error code is not
                     yet generated).

              RDS_RDMA_DROPPED
                     RDMA operations were discarded after the connection broke and was
                     re-established. The RDMA operation may have been processed partially.

              RDS_RDMA_OTHER_ERROR
                     Any other failure.
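
              A sketch of how a server might collect these notifications after recvmsg(2)
              follows. The helper names are hypothetical, struct rds_rdma_notify is copied from
              the declaration above, and the numeric values of SOL_RDS, RDS_CMSG_RDMA_STATUS and
              the status codes (0 through 4, in the order listed) are assumptions to be checked
              against the RDS header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; real code should take them from the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#define RDS_CMSG_RDMA_STATUS 4

struct rds_rdma_notify {
    uint32_t user_token;
    int32_t status;
};

/* Hypothetical helper: short name for a status value (assumed numbering). */
static const char *rdma_status_str(int32_t status)
{
    switch (status) {
    case 0:  return "success";
    case 1:  return "remote error";
    case 2:  return "canceled";
    case 3:  return "dropped";
    default: return "other error";
    }
}

/*
 * Hypothetical helper: copy up to max status notifications out of the
 * ancillary data of a received message.  Returns the number stored.
 */
static size_t collect_rdma_status(struct msghdr *msg,
                                  struct rds_rdma_notify *out, size_t max)
{
    struct cmsghdr *c;
    size_t n = 0;

    for (c = CMSG_FIRSTHDR(msg); c != NULL && n < max;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level != SOL_RDS || c->cmsg_type != RDS_CMSG_RDMA_STATUS)
            continue;
        if (c->cmsg_len < CMSG_LEN(sizeof(out[n])))
            continue;               /* too short to hold a notification */
        memcpy(&out[n], CMSG_DATA(c), sizeof(out[n]));
        n++;
    }
    return n;
}
```

              The user_token of each entry identifies the transfer whose buffers can now be
              reused or, on a non-zero status, retried.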

       RDMA setsockopt arguments
              When using the RDS_GET_MR socket option to register a memory range, the application
              passes a pointer to a struct rds_get_mr_args variable, described above.

              The RDS_FREE_MR call takes an argument of type struct rds_free_mr_args:

              struct rds_free_mr_args {
                      rds_rdma_cookie_t cookie;
                      u_int64_t       flags;
              };

              cookie specifies the RDMA cookie to be released. RDMA access to the memory range
              will usually not be revoked instantly, because the operation is rather costly.
              However, if the flags argument contains RDS_RDMA_INVALIDATE, RDS will invalidate
              the indicated mapping immediately, as described in section Mapping arguments above.

              If  the  cookie  argument is 0, and RDS_RDMA_INVALIDATE is set, RDS will invalidate
              old memory mappings on all devices.
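
              The two socket options can be wrapped as sketched below. The function names
              rds_register and rds_release are hypothetical, the structures are copied from the
              declarations in this page, and the numeric values of SOL_RDS, RDS_GET_MR,
              RDS_FREE_MR and RDS_RDMA_INVALIDATE are assumptions to be checked against the RDS
              header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; real code should take them from the RDS header. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#define RDS_GET_MR          2
#define RDS_FREE_MR         3
#define RDS_RDMA_INVALIDATE 0x0004

typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    uint64_t flags;
};

/* Hypothetical wrapper: register buf/len, cookie is stored in *cookie. */
static int rds_register(int fd, void *buf, size_t len,
                        rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}

/* Hypothetical wrapper: release a cookie, optionally invalidating now. */
static int rds_release(int fd, rds_rdma_cookie_t cookie, int invalidate)
{
    struct rds_free_mr_args args;

    memset(&args, 0, sizeof(args));
    args.cookie = cookie;
    args.flags = invalidate ? RDS_RDMA_INVALIDATE : 0;
    return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}
```

              Both wrappers return the setsockopt(2) result unchanged, so errno can be examined
              as described in the error list of this page.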


       In addition to the usual error codes returned by  sendmsg,  recvmsg  and  setsockopt,  RDS
       returns the following error codes:

       EAGAIN RDS  was  unable  to map a memory range because the limit was exceeded (returned by
              RDS_CMSG_RDMA_MAP and RDS_GET_MR).

       EINVAL When sending a message, there were conflicting control messages (e.g. two
              RDMA_MAP messages, or a RDMA_MAP and a RDMA_DEST message).

              In a RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the application specified a memory
              range greater than the maximum size supported.

              When setting up an RDMA operation with RDS_CMSG_RDMA_ARGS, the size  of  the  local
              memory (given in the rds_iovec) did not match the size of the remote memory range.

       EBUSY  RDS was unable to obtain a DMA mapping for the indicated memory.
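
       An application may want to tell these cases apart from the usual socket errors. The
       sketch below merely restates the list above as code; the enum and the function name are
       hypothetical. Note that EAGAIN can also be returned by sendmsg(2) and recvmsg(2) for
       unrelated reasons on a non-blocking socket, so the classification is only meaningful
       for calls that carry an RDMA mapping request.

```c
#include <errno.h>

/* Hypothetical classification of the RDS RDMA specific error codes. */
enum rds_rdma_error {
    RDS_ERR_MAP_LIMIT,   /* EAGAIN: too many memory ranges mapped       */
    RDS_ERR_BAD_REQUEST, /* EINVAL: conflicting cmsgs or bad sizes      */
    RDS_ERR_DMA_MAPPING, /* EBUSY:  no DMA mapping could be obtained    */
    RDS_ERR_OTHER        /* anything else: see the underlying syscall   */
};

/* Map an errno value from an RDMA related call to one of the classes. */
static enum rds_rdma_error rds_classify_errno(int err)
{
    switch (err) {
    case EAGAIN: return RDS_ERR_MAP_LIMIT;
    case EINVAL: return RDS_ERR_BAD_REQUEST;
    case EBUSY:  return RDS_ERR_DMA_MAPPING;
    default:     return RDS_ERR_OTHER;
    }
}
```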


       Currently, the following limits apply:

       ·      The  maximum  size  of  a  zerocopy  transfer  is 1MB. This can be adjusted via the
              fmr_message_size module parameter.

       ·      The maximum number of memory ranges that can be mapped is limited to  2048  at  the
              moment.  This  can be adjusted via the fmr_pool_size module parameter. However, the
              actual limit imposed by the hardware may in fact be lower.
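
       An application that needs to move more data than the transfer limit allows has to split
       the work into several RDMA requests. The arithmetic can be sketched as follows, assuming
       the default limit of 1MB quoted above; the constant and function names are hypothetical,
       and a system with a different fmr_message_size setting would need a different value.

```c
#include <stddef.h>

/* Hypothetical constant: the default 1MB transfer limit. */
#define ZEROCOPY_MAX_BYTES ((size_t)1024 * 1024)

/* Number of RDMA requests needed to move total bytes within the limit. */
static size_t zerocopy_chunks(size_t total)
{
    return (total + ZEROCOPY_MAX_BYTES - 1) / ZEROCOPY_MAX_BYTES;
}

/* Length of chunk number idx (0-based), or 0 if idx is past the end. */
static size_t zerocopy_chunk_len(size_t total, size_t idx)
{
    size_t off = idx * ZEROCOPY_MAX_BYTES;

    if (off >= total)
        return 0;
    return (total - off < ZEROCOPY_MAX_BYTES) ? total - off
                                              : ZEROCOPY_MAX_BYTES;
}
```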


       RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.

                                                                                  RDS zerocopy(7)