Provided by: rds-tools_1.4.1-OFED-1.4.2-1_amd64 bug


       RDS - Reliable Datagram Sockets


       #include <sys/socket.h>
       #include <netinet/in.h>


       This  is  an implementation of the RDS socket API. It provides reliable, in-order datagram
       delivery between sockets over a variety of transports.

       Currently, RDS can be  transported  over  Infiniband,  and  loopback.   RDS  over  TCP  is
       disabled, but will be re-enabled in the near future.

       RDS uses standard AF_INET addresses as described in ip(7) to identify end points.

   Socket Creation
       RDS is still in development and as such does not have a reserved protocol family constant.
       Applications must read the string representation of the protocol  family  value  from  the
       pf_rds sysctl parameter file described below.

       rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);

   Socket Options
       RDS sockets support a number of socket options through the setsockopt(2) and getsockopt(2)
       calls. The following generic options  (with  socket  level  SOL_SOCKET)  are  of  specific

              Specifies  the  size  of  the  receive  buffer. See section on "Congestion Control"

              Specifies the size of the send buffer. See "Message Transmission" below.

              Specifies the send timeout when trying to enqueue a message on a socket with a full
              queue in blocking mode.

       In  addition  to  these,  RDS  supports a number of protocol specific options (with socket
       level SOL_RDS).  Just as with the RDS protocol family, an  official  value  has  not  been
       assigned  yet,  so  the kernel will assign a value dynamically.  The assigned value can be
       retrieved from the sol_rds sysctl parameter file.

       RDS specific socket options will be described in a separate section below.

       A new RDS socket has no local address when it is first returned from socket(2).   It  must
       be  bound  to  a  local  address  by  calling  bind(2)  before any messages can be sent or
       received. This will also attach the socket to a specific transport, based on the  type  of
       interface the local address is attached to.  From that point on, the socket can only reach
       destinations which are available through this transport.

       For instance, when binding to the address of an Infiniband  interface  such  as  ib0,  the
       socket will use the Infiniband transport. If RDS is not able to associate a transport with
       the given address, it will return EADDRNOTAVAIL.

       An RDS socket can only be bound to one address and only one socket can be bound to a given
       address/port  pair. If no port is specified in the binding address then an unbound port is
       selected at random.

       RDS does not allow the application to bind a previously bound socket to  another  address.
       Binding to the wildcard address INADDR_ANY is not permitted either.

       The  default  mode  of  operation  for  RDS  is  to  use unconnected socket, and specify a
       destination address as an  argument  to  sendmsg.   However,  RDS  allows  sockets  to  be
       connected  to  a  remote  end  point  using connect(2).  If a socket is connected, calling
       sendmsg without specifying a destination address will  use  the  previously  given  remote

   Congestion Control
       RDS does not have explicit congestion control like common streaming protocols such as TCP.
       However, sockets have two queue limits associated with them; the send queue size  and  the
       receive queue size.  Messages are accounted based on the number of bytes of payload.

       The  send queue size limits how much data local processes can queue on a local socket (see
       the following section). If that limit is exceeded, the  kernel  will  not  accept  further
       messages  until  the queue is drained and messages have been delivered to and acknowledged
       by the remote host.

       The receive queue size limits how much data RDS will put on the receive queue of a  socket
       before  marking the socket as congested.  When a socket becomes congested, RDS will send a
       congestion map update to the other participating hosts, who  are  then  expected  to  stop
       sending more messages to this port.

       There is a timing window during which a remote host can still continue to send messages to
       a congested port; RDS solves this by accepting these messages even if the socket's receive
       queue is already over the limit.

       As  the  application  pulls  incoming messages off the receive queue using recvmsg(2), the
       number of bytes on the receive queue will eventually drop below the receive queue size, at
       which  point the port is then marked uncongested, and another congestion update is sent to
       all participating hosts. This tells them to allow applications to send additional messages
       to this port.

       The  default values for the send and receive buffer size are controlled by the A given RDS
       socket has limited transmit buffer space. It defaults  to  the  system  wide  socket  send
       buffer  size  set  in the wmem_default and rmem_default sysctls, respectively. They can be
       tuned by the application through the SO_SNDBUF and SO_RCVBUF socket options.

   Blocking Behavior
       The sendmsg(2) and recvmsg(2) calls can block in a variety of situations.  Whether a  call
       blocks or returns with an error depends on the non-blocking setting of the file descriptor
       and the MSG_DONTWAIT message flag. If the file descriptor is set to blocking  mode  (which
       is the default), and the MSG_DONTWAIT flag is not given, the call will block.

       In  addition,  the  SO_SNDTIMEO  and  SO_RCVTIMEO  socket options can be used to specify a
       timeout (in seconds) after which the call will abort waiting, and  return  an  error.  The
       default timeout is 0, which tells RDS to block indefinitely.

   Message Transmission
       Messages  may be sent using sendmsg(2) once the RDS socket is bound. Message length cannot
       exceed 4 gigabytes as the wire protocol uses an unsigned 32 bit  integer  to  express  the
       message length.

       RDS  does  not  support  out  of  band  data.  Applications are allowed to send to unicast
       addresses only; broadcast or multicast are not supported.

       A successful sendmsg(2) call puts the message in the socket's transmit queue where it will
       remain  until  either  the  destination  acknowledges that the message is no longer in the
       network or the application removes the message from the send queue.

       Messages can be removed from the send queue  with  the  RDS_CANCEL_SENT_TO  socket  option
       described below.

       While  a  message  is  in  the  transmit queue its payload bytes are accounted for.  If an
       attempt is made to send a message while there is  not  sufficient  room  on  the  transmit
       queue, the call will either block or return EAGAIN.

       Trying to send to a destination that is marked congested (see above), the call will either
       block or return ENOBUFS.

       A message sent with no payload bytes will not consume any space in the destination's  send
       buffer  but will result in a message receipt on the destination. The receiver will not get
       any payload data but will be able to see the sender's address.

       Messages sent to a port to which no socket is bound will  be  silently  discarded  by  the
       destination host. No error messages are reported to the sender.

   Message Receipt
       Messages  may  be  received  with recvmsg(2) on an RDS socket once it is bound to a source
       address. RDS will return messages in-order, i.e. messages from the same sender will arrive
       in the same order in which they were be sent.

       The  address of the sender will be returned in the sockaddr_in structure pointed to by the
       msg_name field, if set.

       If the MSG_PEEK flag is given, the first  message  on  the  receive  is  returned  without
       removing it from the queue.

       The memory consumed by messages waiting for delivery does not limit the number of messages
       that can be queued for  receive.  RDS  does  attempt  to  perform  congestion  control  as
       described in the section above.

       If  the  length of the message exceeds the size of the buffer provided to recvmsg(2), then
       the remainder of the bytes in the message are discarded and the MSG_TRUNC flag is  set  in
       the  msg_flags  field.  In this truncating case recvmsg(2) will still return the number of
       bytes copied, not the length of entire messge.  If MSG_TRUNC is set in the flags  argument
       to recvmsg(2), then it will return the number of bytes in the entire message. Thus one can
       examine the size of the next message in the receive  queue  without  incurring  a  copying
       overhead by providing a zero length buffer and setting MSG_PEEK and MSG_TRUNC in the flags

       The sending address of a zero-length message will still be provided in the msg_name field.

   Control Messages
       RDS  uses  control  messages  (a.k.a.  ancillary  data)  through   the   msg_control   and
       msg_controllen  fields  in  sendmsg(2)  and recvmsg(2).  Control messages generated by RDS
       have a cmsg_level value of sol_rds.  Most control messages are  related  to  the  zerocopy
       interface added in RDS version 3, and are described in rds-rdma(7).

       The  only  exception  is  the  RDS_CMSG_CONG_UPDATE  message,  which  is  described in the
       following section.

       RDS supports the poll(2) interface in a limited fashion.  POLLIN is returned when there is
       a  message  (either  a  proper  RDS message, or a control message) waiting in the socket's
       receive queue.  POLLOUT is always returned while there is room on the socket's send queue.

       Sending to congested ports requires special handling. When an application tries to send to
       a congested destination, the system call will return ENOBUFS.  However, it cannot poll for
       POLLOUT, as there is probably still room on the transmit queue, so  the  call  to  poll(2)
       would return immediately, even though the destination is still congested.

       There are two ways of dealing with this situation. The first is to simply poll for POLLIN.
       By default, a process sleeping in poll(2) is always woken up when the  congestion  map  is
       updated, and thus the application can retry any previously congested sends.

       The  second  option  is  explicit  congestion monitoring, which gives the application more
       fine-grained control.

       With explicit monitoring, the application polls for POLLIN  as  before,  and  additionally
       uses the RDS_CONG_MONITOR socket option to install a 64bit mask value in the socket, where
       each bit corresponds to a group of ports. When a congestion update arrives, RDS checks the
       set of ports that became uncongested against the bit mask installed in the socket. If they
       overlap, a control messages is enqueued on the socket, and the application  is  woken  up.
       When  it calls recvmsg(2), it will be given the control message containing the bitmap.  on
       the socket.

       The  congestion  monitor  bitmask  can  be  set  and  queried  using  setsockopt(2)   with
       RDS_CONG_MONITOR, and a pointer to the 64bit mask variable.

       Congestion  updates  are  delivered  to  the  application via RDS_CMSG_CONG_UPDATE control
       messages.  These  control  messages  are  always  delivered  by  themselves  (or  possibly
       additional control messages), but never along with a RDS data message. The cmsg_data field
       of the control message is an 8 byte datum containing the 64bit mask value.

       Applications can use the following macros to test for and set bits in the bitmask:

       #define RDS_CONG_MONITOR_SIZE   64
       #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) port) % RDS_CONG_MONITOR_SIZE)
       #define RDS_CONG_MONITOR_MASK(port) (1 << RDS_CONG_MONITOR_BIT(port))

   Canceling Messages
       An  application  can  cancel  (flush)   messages   from   the   send   queue   using   the
       RDS_CANCEL_SENT_TO  socket  option  with  setsockopt(2).   This  call  takes  an  optional
       sockaddr_in address structure as argument. If given,  only  messages  to  the  destination
       specified  by this address are discarded. If no address is given, all pending messages are

       Note that this affects messages that have not yet been transmitted  as  well  as  messages
       that  have been transmitted, but for which no acknowledgment from the remote host has been
       received yet.

       If sendmsg(2) succeeds, RDS guarantees that the  message  will  be visible  to  recvmsg(2)
       on  a  socket  bound to the destination address as long as that destination socket remains

       If there is no socket bound on the  destination,  the  message  is silently  dropped.   If
       the  sending  RDS can't be sure that there is no socket bound then it will try to send the
       message indefinitely until it can be sure or the sent message is canceled.

       If  a socket is closed then all pending sent messages on the socket are canceled  and  may
       or may not be seen by the receiver.

       The  RDS_CANCEL_SENT_TO  socket  option  can  be used to cancel all  pending messages to a
       given destination.

       If  a  receiving socket is closed with pending messages then the  sender  considers  those
       messages as  having  left  the  network   and  will  not retransmit them.

       A   message  will only be seen by recvmsg(2) once, unless MSG_PEEK was specified. Once the
       message has been delivered it is removed from the sending socket's transmit queue.

       All messages sent from the same socket to the same destination will be  delivered  in  the
       order  they're  sent.  Messages sent from different sockets, or to different destinations,
       may be delivered in any order.


       These parameteres may only be accessed through their files in  /proc/sys/net/rds.   Access
       through sysctl(2) is not supported.

       pf_rds This file contains the string representation of the protocol family constant passed
              to socket(2) to create a new RDS socket.

              This file contains the string representation of the socket level parameter that  is
              passed to getsockopt(2) and setsockopt(2) to manipulate RDS socket options.

       max_unacked_bytes and max_unacked_packets
              These  parameters  are used to tune the generation of acknowledgements. By default,
              the system receiving RDS messages does  not  send  back  explicit  acknowledgements
              unless it transmits a message of its own (in which case the ACK is piggybacked onto
              the outgoing message), or when the sending system requests an ACK.

              However, the sender needs to see an ACK from time to time so that it can purge  old
              messages  from  the  send  queue. The unacked bytes and packet counters are used to
              keep track of how much data has been sent without requesting an ACK. The default is
              to  request  an  acknowledgement  every 16 packets, or every 16 MB, whichever comes

       reconnect_delay_min_ms and reconnect_delay_max_ms
              RDS uses host-to-host connections to transport RDS messages (both for the  TCP  and
              the  Infiniband transport). If this connection breaks, RDS will try to re-establish
              the connection. Because this reconnect may be triggered by both hosts at  the  same
              time  and  fail, RDS uses a random backoff before attempting a reconnect. These two
              parameters specify the minimum and  maximum  delay  in  milliseconds.  The  default
              values are 1 and 1000, respectively.


       rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2), setsockopt(2).