Provided by: rds-tools_1.4.1-OFED-1.4.2-1ubuntu1_amd64

NAME

       RDS - Reliable Datagram Sockets

SYNOPSIS

       #include <sys/socket.h>
       #include <netinet/in.h>

DESCRIPTION

       This is an implementation of the RDS socket API. It provides reliable, in-order datagram delivery between
       sockets over a variety of transports.

       Currently, RDS can be transported over Infiniband and loopback. RDS over TCP is disabled, but will be
       re-enabled in the near future.

       RDS uses standard AF_INET addresses as described in ip(7) to identify end points.

   Socket Creation
       RDS  is  still in development and as such does not have a reserved protocol family constant. Applications
       must read the string representation of the protocol family value from the pf_rds  sysctl  parameter  file
       described below.

       rds_socket = socket(pf_rds, SOCK_SEQPACKET, 0);
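
       A minimal sketch of the step above, assuming the parameter file is /proc/sys/net/rds/pf_rds (see
       SYSCTL VALUES) and holds a decimal integer. The helper names rds_parse_int, rds_read_sysctl_int and
       rds_open_socket are hypothetical and not part of any RDS library.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Parse a decimal integer such as "21\n". Returns 0 on success, -1 on error. */
int rds_parse_int(const char *text, int *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || end == text || value < INT_MIN || value > INT_MAX)
        return -1;
    /* Allow trailing whitespace (the file normally ends with a newline). */
    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;
    *out = (int)value;
    return 0;
}

/* Read one integer from a sysctl parameter file. */
int rds_read_sysctl_int(const char *path, int *out)
{
    char line[64];
    FILE *fp = fopen(path, "r");
    int rc = -1;

    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof(line), fp) != NULL)
        rc = rds_parse_int(line, out);
    fclose(fp);
    return rc;
}

/* Create an RDS socket; returns the descriptor, or -1 if the parameter file
 * is missing (for example, RDS is not loaded) or socket(2) fails. */
int rds_open_socket(void)
{
    int pf_rds;

    if (rds_read_sysctl_int("/proc/sys/net/rds/pf_rds", &pf_rds) != 0)
        return -1;
    return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```

       The same reader could be pointed at the sol_rds file to obtain the socket level for the RDS specific
       options.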

   Socket Options
       RDS  sockets  support  a  number of socket options through the setsockopt(2) and getsockopt(2) calls. The
       following generic options (with socket level SOL_SOCKET) are of specific importance:

       SO_RCVBUF
              Specifies the size of the receive buffer. See section on "Congestion Control" below.

       SO_SNDBUF
              Specifies the size of the send buffer. See "Message Transmission" below.

       SO_SNDTIMEO
              Specifies the send timeout when trying to enqueue a message on a  socket  with  a  full  queue  in
              blocking mode.

       In  addition  to  these,  RDS supports a number of protocol specific options (with socket level SOL_RDS).
       Just as with the RDS protocol family, an official value has not been assigned yet,  so  the  kernel  will
       assign a value dynamically.  The assigned value can be retrieved from the sol_rds sysctl parameter file.

       RDS specific socket options will be described in a separate section below.

   Binding
       A  new  RDS  socket has no local address when it is first returned from socket(2).  It must be bound to a
       local address by calling bind(2) before any messages can be sent or received. This will also  attach  the
       socket  to  a  specific transport, based on the type of interface the local address is attached to.  From
       that point on, the socket can only reach destinations which are available through this transport.

       For instance, when binding to the address of an Infiniband interface such as ib0, the socket will use the
       Infiniband  transport. If RDS is not able to associate a transport with the given address, it will return
       EADDRNOTAVAIL.

       An RDS socket can only be bound to one address and only one socket can be bound to a  given  address/port
       pair. If no port is specified in the binding address then an unbound port is selected at random.

       RDS  does  not allow the application to bind a previously bound socket to another address. Binding to the
       wildcard address INADDR_ANY is not permitted either.
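
       The binding rules above can be sketched as follows. This is an illustration only: rds_make_addr and
       rds_bind_local are hypothetical helper names, and the wildcard check is done locally so that the
       mistake is caught before bind(2) is called.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill in an AF_INET address; refuses the wildcard, which RDS does not accept. */
int rds_make_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);          /* port 0: let RDS pick an unbound port */
    if (inet_pton(AF_INET, ip, &out->sin_addr) != 1)
        return -1;                        /* not a valid dotted-quad address */
    if (out->sin_addr.s_addr == htonl(INADDR_ANY))
        return -1;                        /* INADDR_ANY is not permitted */
    return 0;
}

/* Bind the socket to its single local address. */
int rds_bind_local(int fd, const char *ip, uint16_t port)
{
    struct sockaddr_in local;

    if (rds_make_addr(ip, port, &local) != 0)
        return -1;
    /* On failure, errno may be EADDRNOTAVAIL if no transport matches the address. */
    return bind(fd, (struct sockaddr *)&local, sizeof(local));
}
```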

   Connecting
       The default mode of operation for RDS is to use unconnected sockets and specify a destination address as
       an  argument  to  sendmsg.   However,  RDS  allows  sockets  to  be connected to a remote end point using
       connect(2).  If a socket is connected, calling sendmsg without specifying a destination address will  use
       the previously given remote address.
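
       A small sketch of sending on a connected socket, assuming connect(2) has already been called with the
       remote address. The helper name send_connected is hypothetical; it simply leaves msg_name unset so
       that the previously given remote address is used.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one datagram on a connected socket. msg_name stays NULL, so the
 * address given earlier to connect(2) is used as the destination. */
ssize_t send_connected(int fd, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = (void *)buf;   /* sendmsg does not modify the payload */
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, 0);
}
```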

   Congestion Control
       RDS  does  not  have  explicit  congestion  control like common streaming protocols such as TCP. However,
       sockets have two queue limits associated with them; the send queue  size  and  the  receive  queue  size.
       Messages are accounted based on the number of bytes of payload.

       The  send  queue size limits how much data local processes can queue on a local socket (see the following
       section). If that limit is exceeded, the kernel will not accept  further  messages  until  the  queue  is
       drained and messages have been delivered to and acknowledged by the remote host.

       The  receive queue size limits how much data RDS will put on the receive queue of a socket before marking
       the socket as congested.  When a socket becomes congested, RDS will send a congestion map update  to  the
       other participating hosts, who are then expected to stop sending more messages to this port.

       There  is  a  timing window during which a remote host can still continue to send messages to a congested
       port; RDS solves this by accepting these messages even if the socket's receive queue is already over  the
       limit.

       As the application pulls incoming messages off the receive queue using recvmsg(2), the number of bytes on
       the receive queue will eventually drop below the receive queue size, at which  point  the  port  is  then
       marked  uncongested, and another congestion update is sent to all participating hosts. This tells them to
       allow applications to send additional messages to this port.

       The default values for the send and receive buffer sizes are controlled by the system-wide wmem_default
       and rmem_default sysctls, respectively. They can be tuned by the application through the SO_SNDBUF and
       SO_RCVBUF socket options.
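
       Tuning the two queue limits might look like the sketch below. The helper names set_queue_limits and
       get_queue_limit are hypothetical. The value read back may differ from the one requested, because the
       kernel can adjust it (see socket(7)).

```c
#include <sys/socket.h>

/* Set both queue limits, in bytes. Returns 0 on success, -1 with errno set. */
int set_queue_limits(int fd, int send_bytes, int recv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &send_bytes, sizeof(send_bytes)) != 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &recv_bytes, sizeof(recv_bytes)) != 0)
        return -1;
    return 0;
}

/* Read back the limit the kernel applied for SO_SNDBUF or SO_RCVBUF. */
int get_queue_limit(int fd, int optname, int *bytes_out)
{
    socklen_t len = sizeof(*bytes_out);

    return getsockopt(fd, SOL_SOCKET, optname, bytes_out, &len);
}
```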

   Blocking Behavior
       The  sendmsg(2)  and  recvmsg(2)  calls  can  block in a variety of situations.  Whether a call blocks or
       returns with an error depends on the non-blocking setting of the file  descriptor  and  the  MSG_DONTWAIT
       message flag. If the file descriptor is set to blocking mode (which is the default), and the MSG_DONTWAIT
       flag is not given, the call will block.

       In addition, the SO_SNDTIMEO and SO_RCVTIMEO socket options can be used to specify a timeout (passed as a
       struct timeval, see socket(7)) after which the call will abort waiting, and return an error. The default
       timeout is 0, which tells RDS to block indefinitely.
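
       As an illustration, the timeouts could be set like this. The sketch assumes the usual Linux convention
       that both options take a struct timeval; ms_to_timeval and set_io_timeouts are hypothetical helper
       names.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Convert milliseconds to the struct timeval used by SO_SNDTIMEO/SO_RCVTIMEO. */
struct timeval ms_to_timeval(long ms)
{
    struct timeval tv;

    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return tv;
}

/* Apply the same timeout to blocking sends and receives; 0 means no timeout. */
int set_io_timeouts(int fd, long ms)
{
    struct timeval tv = ms_to_timeval(ms);

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```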

   Message Transmission
       Messages may be sent using sendmsg(2) once the RDS socket  is  bound.  Message  length  cannot  exceed  4
       gigabytes as the wire protocol uses an unsigned 32 bit integer to express the message length.

       RDS  does  not  support  out  of  band  data. Applications are allowed to send to unicast addresses only;
       broadcast or multicast are not supported.

       A successful sendmsg(2) call puts the message in the socket's transmit queue where it will  remain  until
       either  the  destination  acknowledges  that  the  message is no longer in the network or the application
       removes the message from the send queue.

       Messages can be removed from the send queue with the RDS_CANCEL_SENT_TO socket option described below.

       While a message is in the transmit queue its payload bytes are accounted for.  If an attempt is  made  to
       send  a  message  while there is not sufficient room on the transmit queue, the call will either block or
       return EAGAIN.

       When trying to send to a destination that is marked congested (see above), the call will either block or
       return ENOBUFS.

       A message sent with no payload bytes will not consume any space in the destination's send buffer but will
       result in a message receipt on the destination. The receiver will not get any payload data  but  will  be
       able to see the sender's address.

       Messages  sent  to a port to which no socket is bound will be silently discarded by the destination host.
       No error messages are reported to the sender.
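
       The send path described in this section can be sketched as follows. The names send_to, send_hint and
       classify_send_error are hypothetical. Rejecting oversized messages locally with EMSGSIZE is an
       assumption of this sketch, not documented RDS behavior.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

enum send_hint { SEND_QUEUE_FULL, SEND_DEST_CONGESTED, SEND_OTHER_ERROR };

/* Map errno from a failed non-blocking send to the cases in the text. */
enum send_hint classify_send_error(int err)
{
    if (err == EAGAIN || err == EWOULDBLOCK)
        return SEND_QUEUE_FULL;       /* no room on the local transmit queue */
    if (err == ENOBUFS)
        return SEND_DEST_CONGESTED;   /* destination port is marked congested */
    return SEND_OTHER_ERROR;
}

/* Send one message to dst without blocking; len may be 0. */
ssize_t send_to(int fd, const struct sockaddr_in *dst, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;

    if ((uint64_t)len > UINT32_MAX) { /* wire format uses a 32 bit length */
        errno = EMSGSIZE;             /* assumption: error chosen by this sketch */
        return -1;
    }
    iov.iov_base = (void *)buf;
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)dst;
    msg.msg_namelen = sizeof(*dst);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, MSG_DONTWAIT);
}
```

       A caller would typically inspect errno with classify_send_error only when send_to returns -1.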

   Message Receipt
       Messages may be received with recvmsg(2) on an RDS socket once it is bound to a source address. RDS  will
       return  messages in-order, i.e. messages from the same sender will arrive in the same order in which they
       were sent.

       The address of the sender will be returned in the sockaddr_in structure pointed to by the msg_name field,
       if set.

       If the MSG_PEEK flag is given, the first message on the receive queue is returned without removing it from the
       queue.

       The memory consumed by messages waiting for delivery does not limit the number of messages  that  can  be
       queued for receive. RDS does attempt to perform congestion control as described in the section above.

       If the length of the message exceeds the size of the buffer provided to recvmsg(2), then the remainder of
       the bytes in the message are discarded and the MSG_TRUNC flag is set in  the  msg_flags  field.  In  this
       truncating case recvmsg(2) will still return the number of bytes copied, not the length of the entire message.
       If MSG_TRUNC is set in the flags argument to recvmsg(2), then it will return the number of bytes  in  the
       entire  message. Thus one can examine the size of the next message in the receive queue without incurring
       a copying overhead by providing a zero length buffer and setting MSG_PEEK  and  MSG_TRUNC  in  the  flags
       argument.
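
       The size probe described above might be written as in this sketch. The helper name peek_next_length
       is hypothetical, and MSG_DONTWAIT is added here only so that the probe does not block on an empty
       queue.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Return the full length of the next queued message without consuming it,
 * or -1 with errno set (EAGAIN when the receive queue is empty). */
ssize_t peek_next_length(int fd)
{
    char dummy;
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = &dummy;
    iov.iov_len = 0;              /* zero length buffer: nothing is copied */

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC | MSG_DONTWAIT);
}
```

       The result can then be used to size the buffer for the real recvmsg(2) call. A return value of 0
       indicates a zero-length message rather than an empty queue.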

       The sending address of a zero-length message will still be provided in the msg_name field.

   Control Messages
       RDS  uses  control  messages (a.k.a. ancillary data) through the msg_control and msg_controllen fields in
       sendmsg(2) and recvmsg(2).  Control messages generated by RDS have a cmsg_level value of  sol_rds.   Most
       control  messages are related to the zerocopy interface added in RDS version 3, and are described in rds-
       rdma(7).

       The only exception is the RDS_CMSG_CONG_UPDATE message, which is described in the following section.

   Polling
       RDS supports the poll(2) interface in a limited fashion.  POLLIN is returned  when  there  is  a  message
       (either  a  proper  RDS message, or a control message) waiting in the socket's receive queue.  POLLOUT is
       always returned while there is room on the socket's send queue.

       Sending to congested ports requires special handling. When an application tries to send  to  a  congested
       destination,  the  system  call  will  return  ENOBUFS.  However, it cannot poll for POLLOUT, as there is
       probably still room on the transmit queue, so the call to poll(2) would return immediately,  even  though
       the destination is still congested.

       There are two ways of dealing with this situation. The first is to simply poll for POLLIN.  By default, a
       process sleeping in poll(2) is always woken  up  when  the  congestion  map  is  updated,  and  thus  the
       application can retry any previously congested sends.
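
       A sketch of the first approach: wait in poll(2) for POLLIN, then retry the sends that previously
       failed with ENOBUFS. The helper name wait_readable is hypothetical, and the retry logic is left to
       the application.

```c
#include <poll.h>

/* Wait until the socket reports POLLIN. Returns 1 if POLLIN was reported,
 * 0 on timeout, -1 on error or if only other events were reported. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;                /* 0: timed out, -1: poll failed */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```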

       The  second  option  is  explicit  congestion  monitoring,  which gives the application more fine-grained
       control.

       With explicit monitoring, the  application  polls  for  POLLIN  as  before,  and  additionally  uses  the
       RDS_CONG_MONITOR socket option to install a 64bit mask value in the socket, where each bit corresponds to
       a group of ports. When a congestion update arrives, RDS checks the set of ports that  became  uncongested
       against the bit mask installed in the socket. If they overlap, a control message is enqueued on the
       socket, and the application is woken up. When it calls recvmsg(2), it will be given the control message
       containing the bitmap.

       The  congestion  monitor  bitmask can be set and queried using setsockopt(2) with RDS_CONG_MONITOR, and a
       pointer to the 64bit mask variable.

       Congestion updates are delivered to the application via RDS_CMSG_CONG_UPDATE control messages. These
       control messages are always delivered by themselves (or possibly with additional control messages), but
       never along with an RDS data message. The cmsg_data field of the control message is an 8 byte datum
       containing the 64bit mask value.
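
       Extracting the mask from a received message could look like the sketch below. The function name
       find_cong_update is hypothetical. The caller passes in the socket level (read from the sol_rds
       parameter file) and the numeric value of RDS_CMSG_CONG_UPDATE from the RDS header, so no constant is
       assumed here.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Scan the control messages of a received msghdr for a congestion update.
 * Returns 1 and stores the 64 bit mask if one is found, 0 otherwise. */
int find_cong_update(struct msghdr *msg, int sol_rds, int cong_update_type,
                     uint64_t *mask_out)
{
    struct cmsghdr *cmsg;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != sol_rds || cmsg->cmsg_type != cong_update_type)
            continue;
        if (cmsg->cmsg_len < CMSG_LEN(sizeof(*mask_out)))
            continue;                 /* too short to hold the 8 byte mask */
        memcpy(mask_out, CMSG_DATA(cmsg), sizeof(*mask_out));
        return 1;
    }
    return 0;
}
```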

       Applications can use the following macros to test for and set bits in the bitmask:

       #define RDS_CONG_MONITOR_SIZE   64
       #define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
       #define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
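
       A sketch of how an application might build and install a monitor mask. It defines the macros locally
       with a 64 bit constant (1ULL) so that shifts up to bit 63 are well defined. The function names are
       hypothetical, and the socket level and the numeric value of RDS_CONG_MONITOR are supplied by the
       caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef RDS_CONG_MONITOR_SIZE
#define RDS_CONG_MONITOR_SIZE 64
#define RDS_CONG_MONITOR_BIT(port)  (((unsigned int) (port)) % RDS_CONG_MONITOR_SIZE)
/* 1ULL keeps the shift in 64 bits; a plain int 1 is too narrow for high bits. */
#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port))
#endif

/* Build the monitor mask covering every port in the list. */
uint64_t cong_mask_for_ports(const uint16_t *ports, size_t count)
{
    uint64_t mask = 0;
    size_t i;

    for (i = 0; i < count; i++)
        mask |= RDS_CONG_MONITOR_MASK(ports[i]);
    return mask;
}

/* Does an update mask include the group that this port belongs to? */
int cong_mask_has_port(uint64_t mask, uint16_t port)
{
    return (mask & RDS_CONG_MONITOR_MASK(port)) != 0;
}

/* Install the mask; sol_rds and cong_monitor_opt are supplied by the caller. */
int set_cong_monitor(int fd, int sol_rds, int cong_monitor_opt, uint64_t mask)
{
    return setsockopt(fd, sol_rds, cong_monitor_opt, &mask, sizeof(mask));
}
```

       Because ports are folded modulo 64, ports 63 and 127 share a bit, so an update for one can wake an
       application that only monitors the other.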

   Canceling Messages
       An application can cancel (flush) messages from the send queue using the RDS_CANCEL_SENT_TO socket option
       with setsockopt(2).  This call takes an optional sockaddr_in address structure  as  argument.  If  given,
       only  messages  to  the  destination specified by this address are discarded. If no address is given, all
       pending messages are discarded.

       Note that this affects messages that have not yet been transmitted as well as  messages  that  have  been
       transmitted, but for which no acknowledgment from the remote host has been received yet.
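
       A hedged sketch of the cancel call. The function name cancel_sent_to is hypothetical, and the socket
       level and the numeric value of RDS_CANCEL_SENT_TO are supplied by the caller. Passing no address
       follows the description above, but a given implementation may reject that form.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Flush pending sends. With dst set, only messages to that destination are
 * dropped; with dst == NULL no address is passed, which the text describes
 * as discarding all pending messages (an implementation may reject this). */
int cancel_sent_to(int fd, int sol_rds, int cancel_opt, const struct sockaddr_in *dst)
{
    if (dst == NULL)
        return setsockopt(fd, sol_rds, cancel_opt, NULL, 0);
    return setsockopt(fd, sol_rds, cancel_opt, dst, sizeof(*dst));
}
```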

   Reliability
       If  sendmsg(2)  succeeds,  RDS  guarantees that the  message  will  be visible  to recvmsg(2) on a socket
       bound to the destination address as long as that destination socket remains open.

       If there is no socket bound on the  destination,  the  message  is silently  dropped.   If   the  sending
       RDS  can't  be sure that there is no socket bound then it will try to send the message indefinitely until
       it can be sure or the sent message is canceled.

       If  a socket is closed then all pending sent messages on the socket are canceled and may or  may  not  be
       seen by the receiver.

       The RDS_CANCEL_SENT_TO socket option can be used to cancel all  pending messages to a given destination.

       If   a   receiving  socket  is  closed  with pending messages then the sender considers those messages as
       having  left  the  network and  will  not retransmit them.

       A  message will only be seen by recvmsg(2) once, unless MSG_PEEK was specified. Once the message has been
       delivered it is removed from the sending socket's transmit queue.

       All  messages  sent  from  the same socket to the same destination will be delivered in the order they're
       sent. Messages sent from different sockets, or to different destinations, may be delivered in any order.

SYSCTL VALUES

       These parameters may only be accessed through their files in /proc/sys/net/rds. Access through
       sysctl(2) is not supported.

       pf_rds This  file  contains the string representation of the protocol family constant passed to socket(2)
              to create a new RDS socket.

       sol_rds
              This file contains the string representation of the socket  level  parameter  that  is  passed  to
              getsockopt(2) and setsockopt(2) to manipulate RDS socket options.

       max_unacked_bytes and max_unacked_packets
              These  parameters  are  used  to  tune  the generation of acknowledgements. By default, the system
              receiving RDS messages does not send back explicit acknowledgements unless it transmits a  message
              of  its  own (in which case the ACK is piggybacked onto the outgoing message), or when the sending
              system requests an ACK.

              However, the sender needs to see an ACK from time to time so that it can purge old  messages  from
              the  send queue. The unacked bytes and packet counters are used to keep track of how much data has
              been sent without requesting an ACK. The  default  is  to  request  an  acknowledgement  every  16
              packets, or every 16 MB, whichever comes first.

       reconnect_delay_min_ms and reconnect_delay_max_ms
              RDS  uses  host-to-host connections to transport RDS messages (both for the TCP and the Infiniband
              transport). If this connection breaks, RDS will try to re-establish the connection.  Because  this
              reconnect  may  be  triggered  by  both hosts at the same time and fail, RDS uses a random backoff
              before attempting a reconnect. These two parameters specify  the  minimum  and  maximum  delay  in
              milliseconds. The default values are 1 and 1000, respectively.
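
       The acknowledgement tuning described under max_unacked_bytes and max_unacked_packets can be modeled
       with two counters. The sketch below is an illustrative model only, not the kernel's code: the struct
       and function names are hypothetical, 16 MB is taken to mean 16 * 1024 * 1024 bytes, and resetting
       both counters after a request is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

struct ack_tracker {
    uint64_t max_unacked_bytes;    /* assumed default: 16 MB */
    uint32_t max_unacked_packets;  /* assumed default: 16 */
    uint64_t unacked_bytes;        /* sent since the last ACK request */
    uint32_t unacked_packets;
};

void ack_tracker_init(struct ack_tracker *t)
{
    t->max_unacked_bytes = 16ULL * 1024 * 1024;
    t->max_unacked_packets = 16;
    t->unacked_bytes = 0;
    t->unacked_packets = 0;
}

/* Account for one outgoing packet; returns true if it should request an ACK. */
bool ack_tracker_on_send(struct ack_tracker *t, uint64_t payload_bytes)
{
    t->unacked_bytes += payload_bytes;
    t->unacked_packets += 1;
    if (t->unacked_packets >= t->max_unacked_packets ||
        t->unacked_bytes >= t->max_unacked_bytes) {
        t->unacked_bytes = 0;      /* start a new window after requesting */
        t->unacked_packets = 0;
        return true;
    }
    return false;
}
```

       With the default limits, small messages trigger a request on every 16th packet, while a single very
       large message can trigger one immediately through the byte limit.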

SEE ALSO

       rds-rdma(7), socket(2), bind(2), sendmsg(2), recvmsg(2), getsockopt(2), setsockopt(2).

                                                                                                          RDS(7)