Provided by: rds-tools_1.4.1-OFED-1.4.2-1ubuntu1_amd64

NAME

       RDS-rdma - Zerocopy Interface for RDMA over RDS

DESCRIPTION

       This  manual page describes the zerocopy interface of RDS, which was added in RDSv3. For a description of
       the basic RDS interface, please refer to rds(7).

       The principal mode of operation for RDS zerocopy is as follows: one participant (the client) wishes to
       initiate a direct transfer to or from some area of memory in its process address space. This memory does
       not have to be aligned.

       The client obtains a handle for this region of memory, and passes it to the other participant (the
       server). This handle is called the RDMA cookie. To the application, the cookie is an opaque 64-bit data
       type.

       The  client  sends  this  handle  to the server application, along with other details of the RDMA request
       (such as which data to transfer to that memory area).  Throughout the following discussion, we will refer
       to this message as the RDMA request.

       The server uses this RDMA cookie to initiate the requested RDMA transfer. The RDMA transfer is combined
       atomically with a normal RDS message, which is delivered to the client. This message is called the RDMA
       ACK in the following discussion. Atomic in this context means that either the RDMA succeeds and the
       RDMA ACK is delivered, or neither happens.

       Thus, when the client receives the RDMA ACK, it knows that the RDMA has completed  successfully.  It  can
       then release the RDMA cookie for this memory region, if it wishes to.

       RDMA operations are not reliable: unlike normal RDS messages, RDS RDMA operations may fail and be
       dropped.

INTERFACE

       The interface is currently based on control messages (ancillary data) sent or received via the sendmsg(2)
       and  recvmsg(2)  system  calls.  Optionally,  an  older  interface  can  be  used  that  is  based on the
       setsockopt(2) system call. However, we recommend using control messages, as this reduces  the  number  of
       system calls required.

   Control message interface
       With  the  control message interface, the RDMA cookie is passed to the server out-of-band, included in an
       extension header attached to the RDS message.

       The following outlines the mode of operation; the data types used are specified in detail in a
       subsequent section.

       Initially, the client will send RDMA requests along with an RDS_CMSG_RDMA_MAP control message. The control
       message contains the address and length of the memory region for which to obtain a  handle,  some  flags,
       and  a  pointer to a memory location (in the caller's address space) where the kernel will store the RDMA
       cookie.

       Alternatively, if the application has already obtained an RDMA cookie for the memory range it wants to
       transfer to or from, it can hand this cookie to the kernel using the RDS_CMSG_RDMA_DEST control message.

       Either  way, the kernel will include the resulting RDMA cookie in an extension header that is transmitted
       as part of the RDMA request to the server.

       When the server receives the RDMA request, the kernel will deliver the cookie wrapped inside an
       RDS_CMSG_RDMA_DEST control message.

       The server then initiates the data transfer by sending the RDMA ACK message along with an
       RDS_CMSG_RDMA_ARGS control message. This message contains the RDMA cookie, and the local memory to copy
       to or from.

       The server process may request a notification when an RDMA operation completes. Notifications are
       delivered as RDS_CMSG_RDMA_STATUS control messages. When an application calls recvmsg(2), it will
       either receive a regular RDS message (possibly with other RDMA related control messages), or an empty
       message with one or more status control messages.

       In addition, when an RDMA operation fails for some reason and is discarded, the application can ask to
       receive notifications for failed messages as well, regardless of whether it asked for success
       notification of an individual message or not. This behavior is turned on by setting the RDS_RECVERR
       socket option.

   Setsockopt interface
       In  addition to the control message interface, RDS allows a process to register and release memory ranges
       for RDMA through calls to setsockopt(2).

       RDS_GET_MR
              To obtain an RDMA cookie for a given memory range, the application can use setsockopt with
              RDS_GET_MR. This operates essentially the same way as the RDS_CMSG_RDMA_MAP control message: the
              argument contains the address and length of the memory range to be registered, and a pointer to an
              RDMA cookie variable, in which the system call will store the cookie for the registered range.

       RDS_FREE_MR
              Memory  ranges  can be released by calling setsockopt with RDS_FREE_MR, giving the RDMA cookie and
              additional flags as arguments.

       RDS_RECVERR
              This is a boolean option which can be set as well as queried (using  getsockopt).   When  enabled,
              RDS  will  send  RDMA  notification messages to the application for any RDMA operation that fails.
              This option defaults to off.

       For all of these calls, the level argument to setsockopt is SOL_RDS.
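
       The setsockopt calls above can be sketched as follows. This is a minimal illustration, not the reference
       implementation: the numeric values of SOL_RDS, RDS_GET_MR and RDS_RECVERR are assumptions and should be
       taken from the RDS header on your system, the struct layout is copied from this page, and the helper
       names rds_enable_recverr and rds_register_range are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_GET_MR
#define RDS_GET_MR 2
#endif
#ifndef RDS_RECVERR
#define RDS_RECVERR 5
#endif

typedef uint64_t rds_rdma_cookie_t;

/* Layout as given in this manual page. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/* Ask RDS to deliver a notification for every RDMA operation that fails. */
static int rds_enable_recverr(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_RDS, RDS_RECVERR, &on, sizeof(on));
}

/*
 * Register buf[0..len) for RDMA. On success the kernel stores the cookie
 * in *cookie. Returns the setsockopt result (0, or -1 with errno set).
 */
static int rds_register_range(int fd, void *buf, size_t len, uint64_t flags,
                              rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;

    assert(cookie != NULL);
    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = flags;
    return setsockopt(fd, SOL_RDS, RDS_GET_MR, &args, sizeof(args));
}
```

       Both helpers only wrap setsockopt, so their error reporting is that of setsockopt(2) plus the RDS
       specific codes listed under ERRORS.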

RDMA MACROS AND TYPES

       RDMA cookie
              typedef u_int64_t       rds_rdma_cookie_t;

              This encapsulates a memory location in the client process. In the current implementation, it
              contains the R_Key of the remote memory region, and the offset into it (so that the application
              does not have to worry about alignment).

              The RDMA cookie is used in several struct types described below.  The  RDS_CMSG_RDMA_DEST  control
              message contains a rds_rdma_cookie_t all by itself as payload.
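
              On the receiving side, the cookie can be picked out of the ancillary data returned by recvmsg(2).
              The following is a sketch under stated assumptions: the values of SOL_RDS and RDS_CMSG_RDMA_DEST
              are assumed and should come from the RDS header on your system, and rds_find_cookie is a
              hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_DEST
#define RDS_CMSG_RDMA_DEST 2
#endif

typedef uint64_t rds_rdma_cookie_t;

/*
 * Walk the control messages of a message filled in by recvmsg(2).
 * Returns 1 and stores the cookie in *cookie if an RDS_CMSG_RDMA_DEST
 * control message is present, 0 otherwise.
 */
static int rds_find_cookie(struct msghdr *msg, rds_rdma_cookie_t *cookie)
{
    struct cmsghdr *cmsg;

    assert(msg != NULL && cookie != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS || cmsg->cmsg_type != RDS_CMSG_RDMA_DEST)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*cookie)))
            continue; /* payload too short to hold a cookie */
        /* CMSG_DATA may be unaligned for a 64-bit load, so copy it out. */
        memcpy(cookie, CMSG_DATA(cmsg), sizeof(*cookie));
        return 1;
    }
    return 0;
}
```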

       Mapping arguments
              The  following  data  type is used with RDS_CMSG_RDMA_MAP control messages and with the RDS_GET_MR
              socket option:

              struct rds_iovec {
                      u_int64_t       addr;
                      u_int64_t       bytes;
              };

              struct rds_get_mr_args {
                      struct rds_iovec vec;
                      u_int64_t       cookie_addr;
                      u_int64_t       flags;
              };

              The cookie_addr member specifies the memory location where the kernel stores the RDMA cookie.

              The flags value is a bitwise OR of any of the following flags:

              RDS_RDMA_USE_ONCE
                     This tells the kernel that the allocated RDMA cookie is to be used exactly once.  When  the
                     RDMA  ACK message arrives, the kernel will automatically unbind the memory area and release
                     any resources associated with the cookie.

                     If this flag is not set, it is the  application's  responsibility  to  release  the  memory
                     region at a later time using the RDS_FREE_MR socket option.

              RDS_RDMA_INVALIDATE
                     Normally,  RDMA  memory  mappings  are invalidated lazily, as this requires some relatively
                     costly synchronization with the HCA. However, this means that the  server  application  can
                     continue  to  access  the registered memory for some indeterminate amount of time.  If this
                     flag is set, the RDS code will invalidate the mapping at the time it  is  released  (either
                     upon  arrival  of the RDMA ACK, if USE_ONCE was specified; or when the application destroys
                     it using FREE_MR).
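
              Attaching the mapping arguments to an outgoing RDMA request might look like the sketch below. It
              assumes the caller has already pointed msg_control at a suitably aligned buffer of at least
              CMSG_SPACE(sizeof(struct rds_get_mr_args)) bytes. The numeric constants are assumptions to be
              replaced by the RDS header on your system, and rds_put_map_cmsg is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_MAP
#define RDS_CMSG_RDMA_MAP 3
#endif
#ifndef RDS_RDMA_INVALIDATE
#define RDS_RDMA_INVALIDATE 0x0004
#endif
#ifndef RDS_RDMA_USE_ONCE
#define RDS_RDMA_USE_ONCE 0x0008
#endif

typedef uint64_t rds_rdma_cookie_t;

/* Layout as given in this manual page. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Write an RDS_CMSG_RDMA_MAP control message into the first slot of
 * msg->msg_control. *cookie is where the kernel will store the cookie,
 * so it has to stay valid while sendmsg(2) runs.
 * Returns 0 on success, -1 if the control buffer is too small.
 */
static int rds_put_map_cmsg(struct msghdr *msg, void *buf, size_t len,
                            uint64_t flags, rds_rdma_cookie_t *cookie)
{
    struct rds_get_mr_args args;
    struct cmsghdr *cmsg;

    assert(msg != NULL && cookie != NULL);
    if (msg->msg_control == NULL || msg->msg_controllen < CMSG_SPACE(sizeof(args)))
        return -1;

    memset(&args, 0, sizeof(args));
    args.vec.addr = (uint64_t)(uintptr_t)buf;
    args.vec.bytes = len;
    args.cookie_addr = (uint64_t)(uintptr_t)cookie;
    args.flags = flags;

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_RDS;
    cmsg->cmsg_type = RDS_CMSG_RDMA_MAP;
    cmsg->cmsg_len = CMSG_LEN(sizeof(args));
    memcpy(CMSG_DATA(cmsg), &args, sizeof(args));
    return 0;
}
```

              Passing RDS_RDMA_USE_ONCE | RDS_RDMA_INVALIDATE as flags would request a single-use mapping that
              is invalidated as soon as it is released.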

       RDMA Operation
              RDMA operations are initiated by the server using the RDS_CMSG_RDMA_ARGS  control  message,  which
              takes the following data as payload:

              struct rds_rdma_args {
                      rds_rdma_cookie_t cookie;
                      struct rds_iovec remote_vec;
                      u_int64_t       local_vec_addr;
                      u_int64_t       nr_local;
                      u_int64_t       flags;
                      u_int32_t       user_token;
              };

              The  cookie argument contains the RDMA cookie received from the client.  The local memory is given
              via an array of rds_iovecs.  The array address is given  in  local_vec_addr,  and  its  number  of
              elements is given in nr_local.

              The  struct  member  remote_vec specifies a location relative to the memory area identified by the
              cookie: remote_vec.addr is an offset into that region, and remote_vec.bytes is the length  of  the
              memory window to copy to/from.  This length must match the size of the local memory area, i.e. the
              sum of bytes in all members of the local iovec.

              The flags field contains the bitwise OR of any of the following flags:

              RDS_RDMA_READWRITE
                     If set, an RDMA WRITE is initiated from the server's memory to the client's. If not set,
                     RDS will do an RDMA READ from the client's memory to the server's memory.

              RDS_RDMA_FENCE
                     By default, InfiniBand makes no guarantee about the ordering of an RDMA READ with respect
                     to subsequent SEND operations. Setting this flag asks that the RDMA READ be fenced off
                     from the subsequent RDMA ACK message. Setting this flag requires an additional round trip
                     over the IB fabric, but it is a good idea to set this flag by default, unless you are
                     really sure you do not want it.

              RDS_RDMA_NOTIFY_ME
                     This flag requests a notification upon completion of the RDMA operation (successful or
                     otherwise). The notification will contain the value of the user_token field passed in by the
                     application. This allows the application to release resources (such as buffers) associated
                     with the RDMA transfer.

              The user_token can be used to pass an application specific identifier to the kernel. This token is
              returned to the application when a status notification is generated (see the following section).
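
              Filling in the RDMA arguments can be sketched as below. To keep the remote window and the local
              memory in agreement, this illustration derives remote_vec.bytes from the sum of the local iovec
              lengths rather than taking it as a separate parameter. The flag values are assumptions, the
              struct layout is copied from this page (the header on a given system may differ), and
              rds_fill_rdma_args is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef RDS_RDMA_READWRITE
#define RDS_RDMA_READWRITE 0x0001
#endif
#ifndef RDS_RDMA_FENCE
#define RDS_RDMA_FENCE 0x0002
#endif
#ifndef RDS_RDMA_NOTIFY_ME
#define RDS_RDMA_NOTIFY_ME 0x0020
#endif

typedef uint64_t rds_rdma_cookie_t;

/* Layouts as given in this manual page. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_rdma_args {
    rds_rdma_cookie_t cookie;
    struct rds_iovec remote_vec;
    uint64_t local_vec_addr;
    uint64_t nr_local;
    uint64_t flags;
    uint32_t user_token;
};

/*
 * Prepare the payload of an RDS_CMSG_RDMA_ARGS control message.
 * remote_off is the offset into the region named by the cookie.
 * The local array must stay valid until sendmsg(2) has been called.
 * Returns 0 on success, -1 if the local iovec describes no memory.
 */
static int rds_fill_rdma_args(struct rds_rdma_args *args, rds_rdma_cookie_t cookie,
                              uint64_t remote_off, const struct rds_iovec *local,
                              uint64_t nr_local, uint64_t flags, uint32_t token)
{
    uint64_t total = 0;
    uint64_t i;

    assert(args != NULL && local != NULL);
    for (i = 0; i < nr_local; i++)
        total += local[i].bytes;
    if (total == 0)
        return -1;

    memset(args, 0, sizeof(*args));
    args->cookie = cookie;
    args->remote_vec.addr = remote_off;
    args->remote_vec.bytes = total; /* equals the sum of the local lengths */
    args->local_vec_addr = (uint64_t)(uintptr_t)local;
    args->nr_local = nr_local;
    args->flags = flags;
    args->user_token = token;
    return 0;
}
```

              A server writing into the client's memory with fencing and a completion notification would pass
              RDS_RDMA_READWRITE | RDS_RDMA_FENCE | RDS_RDMA_NOTIFY_ME as flags.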

       RDMA Notification
              The  RDS  kernel  code  is able to notify the server application when an RDMA operation completes.
              These notifications are delivered via RDS_CMSG_RDMA_STATUS control messages.

              By default, no notifications are generated. There are two ways an application can request them.
              First, status notifications can be enabled on a per-operation basis by setting the
              RDS_RDMA_NOTIFY_ME flag in the RDMA arguments. Second, the application can request
              notifications for all RDMA operations that fail by setting the RDS_RECVERR socket option (see
              the Setsockopt interface section above). In both cases, the format of the notification is the
              same, and at most one notification will be sent per completed operation.

              The message format is this:

              struct rds_rdma_notify {
                      u_int32_t       user_token;
                      int32_t         status;
              };

              The  user_token  field contains the value previously given to the kernel in the RDS_CMSG_RDMA_ARGS
              control message. The status field contains a status value, with 0 indicating success, and non-zero
              indicating an error.

              The following status codes are currently defined:

              RDS_RDMA_SUCCESS
                     The RDMA operation succeeded.

              RDS_RDMA_REMOTE_ERROR
                     The  RDMA  operation failed due to a remote access error. This is usually due to an invalid
                     R_key, offset or transfer size.

              RDS_RDMA_CANCELED
                     The RDMA operation  was  canceled  by  the  application.   (This  error  code  is  not  yet
                     generated).

              RDS_RDMA_DROPPED
                     RDMA  operations were discarded after the connection broke and was re-established. The RDMA
                     operation may have been processed partially.

              RDS_RDMA_OTHER_ERROR
                     Any other failure.
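
              Collecting status notifications from a received message could be sketched as follows. The
              numeric values of the status codes and of RDS_CMSG_RDMA_STATUS are assumptions, the struct
              layout is copied from this page, and rds_status_str and rds_collect_status are hypothetical
              helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_CMSG_RDMA_STATUS
#define RDS_CMSG_RDMA_STATUS 4
#endif
#ifndef RDS_RDMA_SUCCESS
#define RDS_RDMA_SUCCESS      0
#define RDS_RDMA_REMOTE_ERROR 1
#define RDS_RDMA_CANCELED     2
#define RDS_RDMA_DROPPED      3
#define RDS_RDMA_OTHER_ERROR  4
#endif

/* Layout as given in this manual page. */
struct rds_rdma_notify {
    uint32_t user_token;
    int32_t status;
};

/* Short human-readable name for a status code. */
static const char *rds_status_str(int32_t status)
{
    switch (status) {
    case RDS_RDMA_SUCCESS:      return "success";
    case RDS_RDMA_REMOTE_ERROR: return "remote access error";
    case RDS_RDMA_CANCELED:     return "canceled";
    case RDS_RDMA_DROPPED:      return "dropped";
    default:                    return "other error";
    }
}

/*
 * Copy up to max status notifications found in msg into out.
 * Returns the number of notifications copied.
 */
static size_t rds_collect_status(struct msghdr *msg, struct rds_rdma_notify *out,
                                 size_t max)
{
    struct cmsghdr *cmsg;
    size_t n = 0;

    assert(msg != NULL && out != NULL);
    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL && n < max;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_RDS || cmsg->cmsg_type != RDS_CMSG_RDMA_STATUS)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(out[n])))
            continue; /* payload too short */
        memcpy(&out[n], CMSG_DATA(cmsg), sizeof(out[n]));
        n++;
    }
    return n;
}
```

              The user_token in each collected entry identifies which RDMA operation the status belongs to, so
              the application can release the matching buffers.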

       RDMA setsockopt arguments
              When using the RDS_GET_MR socket option to register a  memory  range,  the  application  passes  a
              pointer to a struct rds_get_mr_args variable, described above.

              The RDS_FREE_MR call takes an argument of type struct rds_free_mr_args:

              struct rds_free_mr_args {
                      rds_rdma_cookie_t cookie;
                      u_int64_t       flags;
              };

              cookie specifies the RDMA cookie to be released. RDMA access to the memory range will usually not
              be revoked instantly, because the operation is rather costly. However, if the flags argument
              contains RDS_RDMA_INVALIDATE, RDS will invalidate the indicated mapping immediately, as described
              in section Mapping arguments above.

              If the cookie argument is 0, and RDS_RDMA_INVALIDATE  is  set,  RDS  will  invalidate  old  memory
              mappings on all devices.
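
              Releasing a cookie, or flushing stale mappings, might be sketched as below. The values of
              SOL_RDS, RDS_FREE_MR and RDS_RDMA_INVALIDATE are assumptions to be replaced by the RDS header on
              your system, and the three helper names are hypothetical. Clearing the caller's cookie after a
              successful release is a design choice of this sketch, meant to make accidental reuse easier to
              spot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; prefer the definitions from the RDS header on your system. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#ifndef RDS_FREE_MR
#define RDS_FREE_MR 3
#endif
#ifndef RDS_RDMA_INVALIDATE
#define RDS_RDMA_INVALIDATE 0x0004
#endif

typedef uint64_t rds_rdma_cookie_t;

/* Layout as given in this manual page. */
struct rds_free_mr_args {
    rds_rdma_cookie_t cookie;
    uint64_t flags;
};

/* Build the RDS_FREE_MR argument for one cookie. */
static struct rds_free_mr_args rds_free_args(rds_rdma_cookie_t cookie, int invalidate_now)
{
    struct rds_free_mr_args args;

    memset(&args, 0, sizeof(args));
    args.cookie = cookie;
    args.flags = invalidate_now ? RDS_RDMA_INVALIDATE : 0;
    return args;
}

/* Release *cookie and zero it on success. Returns the setsockopt result. */
static int rds_release_cookie(int fd, rds_rdma_cookie_t *cookie, int invalidate_now)
{
    struct rds_free_mr_args args;
    int rc;

    assert(cookie != NULL);
    args = rds_free_args(*cookie, invalidate_now);
    rc = setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
    if (rc == 0)
        *cookie = 0;
    return rc;
}

/* Cookie 0 with RDS_RDMA_INVALIDATE: invalidate old mappings on all devices. */
static int rds_flush_stale_mappings(int fd)
{
    struct rds_free_mr_args args = rds_free_args(0, 1);

    return setsockopt(fd, SOL_RDS, RDS_FREE_MR, &args, sizeof(args));
}
```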

ERRORS

       In  addition  to  the  usual  error  codes  returned  by sendmsg, recvmsg and setsockopt, RDS returns the
       following error codes:

       EAGAIN RDS was unable to map a memory range because the limit was exceeded (returned by RDS_CMSG_RDMA_MAP
              and RDS_GET_MR).

       EINVAL When sending a message, there were conflicting control messages (e.g. two RDMA_MAP messages,
              or an RDMA_MAP and an RDMA_DEST message).

              In an RDS_CMSG_RDMA_MAP or RDS_GET_MR operation, the application specified a memory range larger
              than the maximum size supported.

              When  setting up an RDMA operation with RDS_CMSG_RDMA_ARGS, the size of the local memory (given in
              the rds_iovec) did not match the size of the remote memory range.

       EBUSY  RDS was unable to obtain a DMA mapping for the indicated memory.

LIMITS

       Currently, the following limits apply:

       •      The maximum size of a zerocopy transfer is 1MB. This can  be  adjusted  via  the  fmr_message_size
              module parameter.

       •      The  maximum number of memory ranges that can be mapped is limited to 2048 at the moment. This can
              be adjusted via the fmr_pool_size module parameter. However,  the  actual  limit  imposed  by  the
              hardware may in fact be lower.
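
       An application that needs to move more data than a single zerocopy transfer allows could split the range
       into several pieces and issue one RDMA request per piece. The sketch below only does the arithmetic. It
       assumes the default 1MB limit (ASSUMED_MAX_XFER is a placeholder, since the real limit depends on the
       fmr_message_size module parameter), and rds_split_range is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as given in this manual page. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/* Assumption: the default limit stated above; the module parameter may change it. */
#define ASSUMED_MAX_XFER (1024u * 1024u)

/*
 * Split the range [addr, addr + bytes) into pieces of at most max_xfer bytes.
 * Up to max_out pieces are written to out (out may be NULL to only count).
 * Returns the number of pieces required, which may exceed max_out.
 */
static size_t rds_split_range(uint64_t addr, uint64_t bytes, uint64_t max_xfer,
                              struct rds_iovec *out, size_t max_out)
{
    size_t n = 0;

    assert(max_xfer > 0);
    while (bytes > 0) {
        uint64_t chunk = bytes < max_xfer ? bytes : max_xfer;

        if (out != NULL && n < max_out) {
            out[n].addr = addr;
            out[n].bytes = chunk;
        }
        addr += chunk;
        bytes -= chunk;
        n++;
    }
    return n;
}
```

       Each piece still counts against the limit on mapped memory ranges, so pieces mapped with
       RDS_RDMA_USE_ONCE, or released promptly, keep the pool from filling up.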

AUTHORS

       RDS was written and is Copyright (C) 2007-2008 by Oracle, Inc.

                                                                                                 RDS zerocopy(7)