Provided by: libfabric-dev_1.17.0-3ubuntu1_amd64 bug

NAME

       fi_cq - Completion queue operations

       fi_cq_open / fi_close
              Open/close a completion queue

       fi_control
              Control CQ operation or attributes.

       fi_cq_read / fi_cq_readfrom / fi_cq_readerr
              Read a completion from a completion queue

       fi_cq_sread / fi_cq_sreadfrom
              A  synchronous  (blocking) read that waits until a specified condition has been met
              before reading a completion from a completion queue.

       fi_cq_signal
              Unblock any thread waiting in fi_cq_sread or fi_cq_sreadfrom.

       fi_cq_strerror
              Converts provider specific error information into a printable string

SYNOPSIS

              #include <rdma/fi_domain.h>

              int fi_cq_open(struct fid_domain *domain, struct fi_cq_attr *attr,
                  struct fid_cq **cq, void *context);

              int fi_close(struct fid *cq);

              int fi_control(struct fid *cq, int command, void *arg);

              ssize_t fi_cq_read(struct fid_cq *cq, void *buf, size_t count);

              ssize_t fi_cq_readfrom(struct fid_cq *cq, void *buf, size_t count,
                  fi_addr_t *src_addr);

              ssize_t fi_cq_readerr(struct fid_cq *cq, struct fi_cq_err_entry *buf,
                  uint64_t flags);

              ssize_t fi_cq_sread(struct fid_cq *cq, void *buf, size_t count,
                  const void *cond, int timeout);

              ssize_t fi_cq_sreadfrom(struct fid_cq *cq, void *buf, size_t count,
                  fi_addr_t *src_addr, const void *cond, int timeout);

              int fi_cq_signal(struct fid_cq *cq);

              const char * fi_cq_strerror(struct fid_cq *cq, int prov_errno,
                    const void *err_data, char *buf, size_t len);

ARGUMENTS

       domain Open resource domain

       cq     Completion queue

       attr   Completion queue attributes

       context
              User specified context associated with the completion queue.

       buf    For read calls, the data buffer to write completions  into.   For  write  calls,  a
              completion  to  insert  into the completion queue.  For fi_cq_strerror, an optional
              buffer that receives printable error information.

       count  Number of CQ entries.

       len    Length of data buffer

       src_addr
              Source address of a completed receive operation

       flags  Additional flags to apply to the operation

       command
              Command of control operation to perform on CQ.

       arg    Optional control argument

       cond   Condition that must be met before a completion is generated

       timeout
              Time in milliseconds to wait.  A negative value indicates infinite timeout.

       prov_errno
              Provider specific error value

       err_data
              Provider specific error data related to a completion

DESCRIPTION

       Completion queues are used to report events associated  with  data  transfers.   They  are
       associated  with  message  sends and receives, RMA, atomic, tagged messages, and triggered
       events.  Reported events are usually associated with a fabric endpoint, but may also refer
       to memory regions used as the target of an RMA or atomic operation.

   fi_cq_open
       fi_cq_open  allocates  a new completion queue.  Unlike event queues, completion queues are
       associated with a resource domain and may be offloaded entirely in provider hardware.

       The properties and behavior of a completion queue are defined by struct fi_cq_attr.

              struct fi_cq_attr {
                  size_t               size;      /* # entries for CQ */
                  uint64_t             flags;     /* operation flags */
                  enum fi_cq_format    format;    /* completion format */
                  enum fi_wait_obj     wait_obj;  /* requested wait object */
                  int                  signaling_vector; /* interrupt affinity */
                  enum fi_cq_wait_cond wait_cond; /* wait condition format */
                  struct fid_wait     *wait_set;  /* optional wait set */
              };

       size   Specifies the minimum size of a completion queue.  A value of 0 indicates that  the
              provider may choose a default value.

       flags  Flags that control the configuration of the CQ.

       - FI_AFFINITY
              Indicates that the signaling_vector field (see below) is valid.

       format Completion queues allow the application to select the amount of detail that it must
              store and report.  The format attribute allows the application  to  select  one  of
              several  completion  formats,  indicating  the  structure  of  the  data  that  the
              completion queue should return when read.  Supported  formats  and  the  structures
              that  correspond  to each are listed below.  The meaning of the CQ entry fields are
              defined in the Completion Fields section.

       - FI_CQ_FORMAT_UNSPEC
              If an unspecified format is requested, then the CQ will  use  a  provider  selected
              default format.

       - FI_CQ_FORMAT_CONTEXT
              Provides only user specified context that was associated with the completion.

              struct fi_cq_entry {
                  void     *op_context; /* operation context */
              };
              • .RS 2

       FI_CQ_FORMAT_MSG
              Provides  minimal  data  for  processing  completions,  with  expanded  support for
              reporting information about received messages.

              struct fi_cq_msg_entry {
                  void     *op_context; /* operation context */
                  uint64_t flags;       /* completion flags */
                  size_t   len;         /* size of received data */
              };
              • .RS 2

       FI_CQ_FORMAT_DATA
              Provides data associated with a completion.  Includes support for received  message
              length, remote CQ data, and multi-receive buffers.

              struct fi_cq_data_entry {
                  void     *op_context; /* operation context */
                  uint64_t flags;       /* completion flags */
                  size_t   len;         /* size of received data */
                  void     *buf;        /* receive data buffer */
                  uint64_t data;        /* completion data */
              };
              • .RS 2

       FI_CQ_FORMAT_TAGGED
              Expands completion data to include support for the tagged message interfaces.

              struct fi_cq_tagged_entry {
                  void     *op_context; /* operation context */
                  uint64_t flags;       /* completion flags */
                  size_t   len;         /* size of received data */
                  void     *buf;        /* receive data buffer */
                  uint64_t data;        /* completion data */
                  uint64_t tag;         /* received tag */
              };

       wait_obj
              CQ’s   may  be  associated  with  a  specific  wait  object.   Wait  objects  allow
              applications to block  until  the  wait  object  is  signaled,  indicating  that  a
              completion  is  available  to  be  read.   Users may use fi_control to retrieve the
              underlying wait object associated with a CQ, in order to use  it  in  other  system
              calls.   The  following  values  may  be  used  to  specify the type of wait object
              associated  with  a  CQ:  FI_WAIT_NONE,  FI_WAIT_UNSPEC,  FI_WAIT_SET,  FI_WAIT_FD,
              FI_WAIT_MUTEX_COND, and FI_WAIT_YIELD.  The default is FI_WAIT_NONE.

       - FI_WAIT_NONE
              Used  to  indicate  that  the user will not block (wait) for completions on the CQ.
              When FI_WAIT_NONE is  specified,  the  application  may  not  call  fi_cq_sread  or
              fi_cq_sreadfrom.

       - FI_WAIT_UNSPEC
              Specifies that the user will only wait on the CQ using fabric interface calls, such
              as fi_cq_sread or fi_cq_sreadfrom.  In  this  case,  the  underlying  provider  may
              select  the most appropriate or highest performing wait object available, including
              custom wait mechanisms.  Applications that select FI_WAIT_UNSPEC are not guaranteed
              to retrieve the underlying wait object.

       - FI_WAIT_SET
              Indicates  that  the  completion  queue  should  use  a wait set object to wait for
              completions.  If specified, the wait_set field must reference an existing wait  set
              object.

       - FI_WAIT_FD
              Indicates  that  the CQ should use a file descriptor as its wait mechanism.  A file
              descriptor wait object  must  be  usable  in  select,  poll,  and  epoll  routines.
              However,  a  provider  may  signal  an  FD  wait  object by marking it as readable,
              writable, or with an error.

       - FI_WAIT_MUTEX_COND
              Specifies that the CQ should use a pthread  mutex  and  cond  variable  as  a  wait
              object.

       - FI_WAIT_YIELD
              Indicates  that  the  CQ will wait without a wait object but instead yield on every
              wait.  Allows usage of fi_cq_sread and fi_cq_sreadfrom through a spin.

       signaling_vector
              If the FI_AFFINITY flag is set, this indicates the logical cpu number (0..max cpu -
              1)  that  interrupts  associated  with  the CQ should target.  This field should be
              treated as a hint to the provider and may be  ignored  if  the  provider  does  not
              support interrupt affinity.

       wait_cond
              By  default,  when  a completion is inserted into a CQ that supports blocking reads
              (fi_cq_sread/fi_cq_sreadfrom), the corresponding wait object  is  signaled.   Users
              may  specify a condition that must first be met before the wait is satisfied.  This
              field indicates how the provider should interpret the cond field,  which  describes
              the condition needed to signal the wait object.

       A wait condition should be treated as an optimization.  Providers are not required to meet
       the requirements of the condition before signaling the wait object.   Applications  should
       not rely on the condition necessarily being true when a blocking read call returns.

       If  wait_cond  is set to FI_CQ_COND_NONE, then no additional conditions are applied to the
       signaling of the CQ wait object, and the insertion of any new entry will trigger the  wait
       condition.   If  wait_cond  is  set  to  FI_CQ_COND_THRESHOLD,  then  the  cond  field  is
       interpreted as a size_t threshold value.  The threshold indicates the  number  of  entries
       that are to be queued before at the CQ before the wait is satisfied.

       This field is ignored if wait_obj is set to FI_WAIT_NONE.

       wait_set
              If  wait_obj  is  FI_WAIT_SET,  this  field  references  a wait object to which the
              completion queue should attach.  When an event  is  inserted  into  the  completion
              queue,  the corresponding wait set will be signaled if all necessary conditions are
              met.  The use of a wait_set enables an  optimized  method  of  waiting  for  events
              across  multiple event and completion queues.  This field is ignored if wait_obj is
              not FI_WAIT_SET.

   fi_close
       The fi_close call  releases  all  resources  associated  with  a  completion  queue.   Any
       completions which remain on the CQ when it is closed are lost.

       When  closing  the  CQ,  there  must be no opened endpoints, transmit contexts, or receive
       contexts associated with the CQ.  If resources are  still  associated  with  the  CQ  when
       attempting to close, the call will return -FI_EBUSY.

   fi_control
       The  fi_control  call is used to access provider or implementation specific details of the
       completion queue.  Access to the CQ should be serialized across all calls when  fi_control
       is invoked, as it may redirect the implementation of CQ operations.  The following control
       commands are usable with a CQ.

       FI_GETWAIT (void **)
              This command allows the user to retrieve the low-level wait object associated  with
              the CQ.  The format of the wait-object is specified during CQ creation, through the
              CQ attributes.  The fi_control arg parameter should be an address where  a  pointer
              to  the  returned  wait  object  will be written.  See fi_eq.3 for addition details
              using fi_control with FI_GETWAIT.

   fi_cq_read
       The fi_cq_read operation performs a non-blocking read of completion data from the CQ.  The
       format  of  the  completion  event  is  determined  using the fi_cq_format option that was
       specified when the CQ was opened.  Multiple completions may be retrieved from a  CQ  in  a
       single  call.   The  maximum number of entries to return is limited to the specified count
       parameter, with the number of entries successfully read from the CQ returned by the  call.
       (See  return  values  section  below.) A count value of 0 may be used to drive progress on
       associated endpoints when manual progress is enabled.

       CQs are optimized to report operations  which  have  completed  successfully.   Operations
       which  fail  are  reported  `out  of  band'.   Such  operations  are  retrieved  using the
       fi_cq_readerr function.  When an operation that has completed with an unexpected error  is
       encountered,  it  is  placed  into  a temporary error queue.  Attempting to read from a CQ
       while an item is in the error queue results in fi_cq_read failing with a  return  code  of
       -FI_EAVAIL.    Applications   may   use  this  return  code  to  determine  when  to  call
       fi_cq_readerr.

   fi_cq_readfrom
       The fi_cq_readfrom call behaves identical to fi_cq_read, with the exception that it allows
       the  CQ  to  return  source address information to the user for any received data.  Source
       address data is only available for those endpoints configured with  FI_SOURCE  capability.
       If  fi_cq_readfrom  is  called  on  an  endpoint  for  which source addressing data is not
       available, the source address will be  set  to  FI_ADDR_NOTAVAIL.   The  number  of  input
       src_addr entries must be the same as the count parameter.

       Returned  source  addressing  data  is  converted  from  the  native  address  used by the
       underlying fabric into an fi_addr_t, which may be used in transmit operations.  Under most
       circumstances,  returning  fi_addr_t  requires  that  the source address already have been
       inserted into the address vector associated with the receiving endpoint.  This is true for
       address  vectors  of type FI_AV_TABLE.  In select providers when FI_AV_MAP is used, source
       addresses may be converted algorithmically into a usable fi_addr_t, even though the source
       address  has  not been inserted into the address vector.  This is permitted by the API, as
       it allows the provider to avoid address look-up as part of receive message processing.  In
       no  case  do  providers  insert  addresses into an AV separate from an application calling
       fi_av_insert or similar call.

       For endpoints allocated using the FI_SOURCE_ERR capability, if the source  address  cannot
       be  converted into a valid fi_addr_t value, fi_cq_readfrom will return -FI_EAVAIL, even if
       the data were received  successfully.   The  completion  will  then  be  reported  through
       fi_cq_readerr with error code -FI_EADDRNOTAVAIL.  See fi_cq_readerr for details.

       If  FI_SOURCE  is specified without FI_SOURCE_ERR, source addresses which cannot be mapped
       to a usable fi_addr_t will be reported as FI_ADDR_NOTAVAIL.

   fi_cq_sread / fi_cq_sreadfrom
       The fi_cq_sread and fi_cq_sreadfrom  calls  are  the  blocking  equivalent  operations  to
       fi_cq_read  and fi_cq_readfrom.  Their behavior is similar to the non-blocking calls, with
       the exception that the calls will not return until either a completion has been read  from
       the CQ or an error or timeout occurs.

       Threads  blocking  in this function will return to the caller if they are signaled by some
       external source.  This is true even if the timeout has not occurred or  was  specified  as
       infinite.

       It  is invalid for applications to call these functions if the CQ has been configured with
       a wait object of FI_WAIT_NONE or FI_WAIT_SET.

   fi_cq_readerr
       The read error function, fi_cq_readerr, retrieves information regarding  any  asynchronous
       operation  which  has completed with an unexpected error.  fi_cq_readerr is a non-blocking
       call, returning immediately whether an error completion was found or not.

       Error information is reported to the user through struct fi_cq_err_entry.  The  format  of
       this structure is defined below.

              struct fi_cq_err_entry {
                  void     *op_context; /* operation context */
                  uint64_t flags;       /* completion flags */
                  size_t   len;         /* size of received data */
                  void     *buf;        /* receive data buffer */
                  uint64_t data;        /* completion data */
                  uint64_t tag;         /* message tag */
                  size_t   olen;        /* overflow length */
                  int      err;         /* positive error code */
                  int      prov_errno;  /* provider error code */
                  void    *err_data;    /*  error data */
                  size_t   err_data_size; /* size of err_data */
              };

       The  general  reason  for  the error is provided through the err field.  Provider specific
       error information may also be available through the prov_errno and err_data fields.  Users
       may  call  fi_cq_strerror  to convert provider specific error information into a printable
       string for debugging purposes.  See field details below for more information on the use of
       err_data and err_data_size.

       Note  that error completions are generated for all operations, including those for which a
       completion was not requested (e.g. an endpoint is configured with FI_SELECTIVE_COMPLETION,
       but  the  request did not have the FI_COMPLETION flag set).  In such cases, providers will
       return as much information as made available by the underlying software and hardware about
       the  failure,  other fields will be set to NULL or 0.  This includes the op_context value,
       which may not have been provided or was ignored on input as part of the transfer.

       Notable completion error codes are given below.

       FI_EADDRNOTAVAIL
              This error code is used by CQs configured with FI_SOURCE_ERR to report  completions
              for  which  a usable fi_addr_t source address could not be found.  An error code of
              FI_EADDRNOTAVAIL indicates that the data transfer  was  successfully  received  and
              processed,  with  the  fi_cq_err_entry  fields  containing  information  about  the
              completion.  The err_data field will be set to the source address data.  The source
              address  will  be  in  the same format as specified through the fi_info addr_format
              field for the opened domain.  This may be passed directly into an fi_av_insert call
              to add the source address to the address vector.

   fi_cq_signal
       The  fi_cq_signal  call will unblock any thread waiting in fi_cq_sread or fi_cq_sreadfrom.
       This may be used to wake-up a  thread  that  is  blocked  waiting  to  read  a  completion
       operation.   The  fi_cq_signal operation is only available if the CQ was configured with a
       wait object.

COMPLETION FIELDS

       The CQ entry data structures share many of the same fields.  The meanings of these  fields
       are the same for all CQ entry structure formats.

       op_context
              The  operation context is the application specified context value that was provided
              with an asynchronous operation.  The op_context field is valid for all  completions
              that are associated with an asynchronous operation.

       For  completion events that are not associated with a posted operation, this field will be
       set to NULL.  This includes completions generated at the target in response to  RMA  write
       operations  that  carry  CQ data (FI_REMOTE_WRITE | FI_REMOTE_CQ_DATA flags set), when the
       FI_RX_CQ_DATA mode bit is not required.

       flags  This specifies flags associated with the completed operation.  The Completion Flags
              section below lists valid flag values.  Flags are set for all relevant completions.

       len    This  len  field  only  applies  to  completed  receive  operations  (e.g. fi_recv,
              fi_trecv, etc.).  It indicates the size of received message data  –  i.e. how  many
              data  bytes  were  placed  into  the  associated  receive buffer by a corresponding
              fi_send/fi_tsend/et  al  call.   If  an  endpoint  has  been  configured  with  the
              FI_MSG_PREFIX mode, the len also reflects the size of the prefix buffer.

       buf    The buf field is only valid for completed receive operations, and only applies when
              the receive buffer was posted with the  FI_MULTI_RECV  flag.   In  this  case,  buf
              points to the starting location where the receive data was placed.

       data   The  data  field is only valid if the FI_REMOTE_CQ_DATA completion flag is set, and
              only applies to receive completions.  If FI_REMOTE_CQ_DATA is set, this field  will
              contain the completion data provided by the peer as part of their transmit request.
              The completion data will be given in host byte order.

       tag    A tag applies only to received messages that occur  using  the  tagged  interfaces.
              This  field  contains the tag that was included with the received message.  The tag
              will be in host byte order.

       olen   The olen field applies to received  messages.   It  is  used  to  indicate  that  a
              received  message  has  overrun  the available buffer space and has been truncated.
              The olen specifies the amount of data that did not fit into the  available  receive
              buffer and was discarded.

       err    This  err  code  is  a positive fabric errno associated with a completion.  The err
              value indicates the general reason for an error, if one occurred.   See  fi_errno.3
              for a list of possible error codes.

       prov_errno
              On  an  error,  prov_errno  may contain a provider specific error code.  The use of
              this field and its meaning is provider specific.  It is intended to be  used  as  a
              debugging  aid.  See fi_cq_strerror for additional details on converting this error
              value into a human readable string.

       err_data
              The err_data field is used to return provider specific information,  if  available,
              about  the  error.   On  input,  err_data  should  reference  a data buffer of size
              err_data_size.  On output, the provider will fill in this buffer with any  provider
              specific  data which may help identify the cause of the error.  The contents of the
              err_data field and its meaning is provider specific.  It is intended to be used  as
              a  debugging  aid.   See  fi_cq_strerror  for additional details on converting this
              error data into a human readable string.  See the compatibility note below  on  how
              this field is used for older libfabric releases.

       err_data_size
              On  input,  err_data_size  indicates  the size of the err_data buffer in bytes.  On
              output, err_data_size will be set to the number of bytes  copied  to  the  err_data
              buffer.   The err_data information is typically used with fi_cq_strerror to provide
              details about the type of error that occurred.

       For compatibility purposes, the behavior of the err_data and err_data_size fields  is  may
       be  modified  from  that  listed above.  If err_data_size is 0 on input, or the fabric was
       opened with release < 1.5, then any buffer referenced  by  err_data  will  be  ignored  on
       input.   In  this  situation, on output err_data will be set to a data buffer owned by the
       provider.  The contents of the buffer will remain  valid  until  a  subsequent  read  call
       against  the  CQ.   Applications must serialize access to the CQ when processing errors to
       ensure that the buffer referenced by err_data does not change.

COMPLETION FLAGS

       Completion flags provide  additional  details  regarding  the  completed  operation.   The
       following completion flags are defined.

       FI_SEND
              Indicates  that the completion was for a send operation.  This flag may be combined
              with an FI_MSG or FI_TAGGED flag.

       FI_RECV
              Indicates that the completion was for  a  receive  operation.   This  flag  may  be
              combined with an FI_MSG or FI_TAGGED flag.

       FI_RMA Indicates  that  an  RMA  operation  completed.   This flag may be combined with an
              FI_READ, FI_WRITE, FI_REMOTE_READ, or FI_REMOTE_WRITE flag.

       FI_ATOMIC
              Indicates that an atomic operation completed.  This flag may be  combined  with  an
              FI_READ, FI_WRITE, FI_REMOTE_READ, or FI_REMOTE_WRITE flag.

       FI_MSG Indicates that a message-based operation completed.  This flag may be combined with
              an FI_SEND or FI_RECV flag.

       FI_TAGGED
              Indicates that a tagged message operation completed.  This  flag  may  be  combined
              with an FI_SEND or FI_RECV flag.

       FI_MULTICAST
              Indicates  that  a  multicast  operation completed.  This flag may be combined with
              FI_MSG and relevant flags.  This flag is only guaranteed to be valid  for  received
              messages if the endpoint has been configured with FI_SOURCE.

       FI_READ
              Indicates  that  a  locally  initiated  RMA or atomic read operation has completed.
              This flag may be combined with an FI_RMA or FI_ATOMIC flag.

       FI_WRITE
              Indicates that a locally initiated RMA or atomic  write  operation  has  completed.
              This flag may be combined with an FI_RMA or FI_ATOMIC flag.

       FI_REMOTE_READ
              Indicates  that  a  remotely  initiated RMA or atomic read operation has completed.
              This flag may be combined with an FI_RMA or FI_ATOMIC flag.

       FI_REMOTE_WRITE
              Indicates that a remotely initiated RMA or atomic write  operation  has  completed.
              This flag may be combined with an FI_RMA or FI_ATOMIC flag.

       FI_REMOTE_CQ_DATA
              This indicates that remote CQ data is available as part of the completion.

       FI_MULTI_RECV
              This  flag  applies to receive buffers that were posted with the FI_MULTI_RECV flag
              set.  This completion flag indicates that the original receive buffer referenced by
              the  completion  has been consumed and was released by the provider.  Providers may
              set this flag on the last message that is received into the multi- recv buffer,  or
              may  generate  a  separate  completion  that  indicates  that  the  buffer has been
              released.

       Applications can distinguish between these two cases by  examining  the  completion  entry
       flags  field.  If additional flags, such as FI_RECV, are set, the completion is associated
       with a received message.  In this case, the buf field will reference  the  location  where
       the  received  message  was  placed  into  the  multi-recv  buffer.   Other  fields in the
       completion entry will be determined based on the received message.  If other flag bits are
       zero,  the  provider  is  reporting  that the multi-recv buffer has been released, and the
       completion entry is not associated with a received message.

       FI_MORE
              See the `Buffered Receives' section in fi_msg(3) for more details.   This  flag  is
              associated  with  receive  completions on endpoints that have FI_BUFFERED_RECV mode
              enabled.  When set  to  one,  it  indicates  that  the  buffer  referenced  by  the
              completion  is  limited  by  the  FI_OPT_BUFFERED_LIMIT  threshold,  and additional
              message data must be retrieved by the application using an FI_CLAIM operation.

       FI_CLAIM
              See the `Buffered Receives' section in fi_msg(3) for more details.   This  flag  is
              set  on  completions associated with receive operations that claim buffered receive
              data.   Note  that  this  flag  only  applies  to  endpoints  configured  with  the
              FI_BUFFERED_RECV mode bit.

COMPLETION EVENT SEMANTICS

       Libfabric  defines  several completion `levels', identified using operational flags.  Each
       flag indicates the soonest that a completion event may be generated by a provider, and the
       assumptions  that  an  application may make upon processing a completion.  The operational
       flags are defined below, along with an example of  how  a  provider  might  implement  the
       semantic.   Note  that  only  meeting the semantic is required of the provider and not the
       implementation.  Providers may implement stronger completion semantics than necessary  for
       a given operation, but only the behavior defined by the completion level is guaranteed.

       To  help  understand  the  conceptual differences in completion levels, consider mailing a
       letter.  Placing the letter into the local mailbox  for  pick-up  is  similar  to  `inject
       complete'.   Having  the  letter  picked  up and dropped off at the destination mailbox is
       equivalent to `transmit  complete'.   The  `delivery  complete'  semantic  is  a  stronger
       guarantee,  with  a person at the destination signing for the letter.  However, the person
       who signed for the letter is not necessarily the intended recipient.  The `match complete'
       option  is  similar  to delivery complete, but requires the intended recipient to sign for
       the letter.

       The `commit complete' level has different semantics than the previously mentioned  levels.
       Commit complete would be closer to the letter arriving at the destination and being placed
       into a fire proof safe.

       The operational flags for the described completion levels are defined below.

       FI_INJECT_COMPLETE
              Indicates that a completion should be generated when the source  buffer(s)  may  be
              reused.   A  completion guarantees that the buffers will not be read from again and
              the application may reclaim them.  No other guarantees are made with respect to the
              state of the operation.

       Example:  A  provider  may  generate this completion event after copying the source buffer
       into a network buffer, either in host memory or on the NIC.  An inject completion does not
       indicate  that  the  data  has  been transmitted onto the network, and a local error could
       occur after the completion event has been generated  that  could  prevent  it  from  being
       transmitted.

       Inject  complete  allows  for the fastest completion reporting (and, hence, buffer reuse),
       but provides the weakest guarantees against network errors.

       Note: This flag is used to control when a completion entry is inserted into  a  completion
       queue.   It  does  not  apply to operations that do not generate a completion queue entry,
       such as the fi_inject operation, and is not  subject  to  the  inject_size  message  limit
       restriction.

       FI_TRANSMIT_COMPLETE
              Indicates  that  a  completion  should be generated when the transmit operation has
              completed relative to the local provider.  The exact behavior is dependent  on  the
              endpoint type.

       For reliable endpoints:

       Indicates  that  a completion should be generated when the operation has been delivered to
       the peer endpoint.  A completion guarantees that the operation is no longer  dependent  on
       the  fabric  or  local  resources.  The state of the operation at the peer endpoint is not
       defined.

       Example: A provider may generate a transmit complete event upon receiving an ack from  the
       peer endpoint.  The state of the message at the peer is unknown and may be buffered in the
       target NIC at the time the ack has been generated.

       For unreliable endpoints:

       Indicates that a completion should be generated when the operation has been  delivered  to
       the  fabric.   A  completion guarantees that the operation is no longer dependent on local
       resources.  The state of the operation within the fabric is not defined.

       FI_DELIVERY_COMPLETE
              Indicates that a completion should not be generated until  an  operation  has  been
              processed  by the destination endpoint(s).  A completion guarantees that the result
              of the operation is available; however, additional steps may need to  be  taken  at
              the  destination  to retrieve the results.  For example, an application may need to
              provide a receive buffers in order to retrieve messages that were buffered  by  the
              provider.

       Delivery  complete  indicates  that  the  message  has  been processed by the peer.  If an
       application buffer was ready to receive the results of the message when it  arrived,  then
       delivery complete indicates that the data was placed into the application’s buffer.

       This  completion mode applies only to reliable endpoints.  For operations that return data
       to the initiator, such as RMA read or atomic-fetch, the source endpoint is also considered
       a destination endpoint.  This is the default completion mode for such operations.

       FI_MATCH_COMPLETE
              Indicates  that  a completion should be generated only after the operation has been
              matched with an application specified buffer.   Operations  using  this  completion
              semantic  are  dependent  on  the application at the target claiming the message or
              results.  As a  result,  match  complete  may  involve  additional  provider  level
              acknowledgements  or  lengthy  delays.  However, this completion model enables peer
              applications to synchronize their execution.  Many providers may not  support  this
              semantic.

       FI_COMMIT_COMPLETE
              Indicates  that a completion should not be generated (locally or at the peer) until
              the result of an operation have been made persistent.  A completion guarantees that
              the result is both available and durable, in the case of power failure.

       This completion mode applies only to operations that target persistent memory regions over
       reliable endpoints.  This completion mode is experimental.

       FI_FENCE
              This is not a completion level, but plays a role in the completion ordering between
              operations  that  would  not normally be ordered.  An operation that is marked with
              the FI_FENCE flag and all operations posted after the fenced operation are deferred
              until  all  previous  operations  targeting  the same peer endpoint have completed.
              Additionally,  the  completion  of  the  fenced  operation  indicates  that   prior
              operations  have  met  the  same  completion  level  as  the fenced operation.  For
              example, if an operation is posted as FI_DELIVERY_COMPLETE  |  FI_FENCE,  then  its
              completion   indicates   prior  operations  have  met  the  semantic  required  for
              FI_DELIVERY_COMPLETE.  This is true even if the prior operation was posted  with  a
              lower completion level, such as FI_TRANSMIT_COMPLETE or FI_INJECT_COMPLETE.

       Note  that  a  completion  generated for an operation posted prior to the fenced operation
       only guarantees that the completion level that was originally requested has been met.   It
       is  the  completion  of the fenced operation that guarantees that the additional semantics
       have been met.

       The above completion semantics are defined with respect to the initiator of the operation.
       The  different  semantics  are  useful for describing when the initiator may re-use a data
       buffer, and guarantees what state a transfer  must  reach  prior  to  a  completion  being
       generated.   This  allows  applications to determine appropriate error handling in case of
       communication failures.

TARGET COMPLETION SEMANTICS

       The completion semantic at the target is used to determine when  data  at  the  target  is
       visible  to  the  peer  application.   Visibility indicates that a memory read to the same
       address that was the target of a data transfer will return the results  of  the  transfer.
       The  target  of  a transfer can be identified by the initiator, as may be the case for RMA
       and atomic operations, or determined by the target, for example by  providing  a  matching
       receive  buffer.  Global visibility indicates that the results are available regardless of
       where the memory read originates.  For example, the read could come from a process running
       on  a  host  CPU,  it may be accessed by subsequent data transfer over the fabric, or read
       from a peer device such as a GPU.

       In terms of completion semantics, visibility usually indicates that the transfer meets the
       FI_DELIVERY_COMPLETE  requirements  from  the  perspective  of  the  target.   The  target
       completion semantic may be, but is not necessarily, linked with  the  completion  semantic
       specified by the initiator of the transfer.

       Often,  target processes do not explicitly state a desired completion semantic and instead
       rely on the  default  semantic.   The  default  behavior  is  based  on  several  factors,
       including:

       • whether a completion even is generated at the target

       • the type of transfer involved (e.g. msg vs RMA)

       • endpoint data and message ordering guarantees

       • properties of the targeted memory buffer

       • the initiator’s specified completion semantic

       Broadly,  target  completion  semantics  are  grouped based on whether or not the transfer
       generates a completion event at the target.  This includes writing a CQ entry or  updating
       a completion counter.  In common use cases, transfers that use a message interface (FI_MSG
       or FI_TAGGED) typically generate target events, while transfers involving an RMA interface
       (FI_RMA  or  FI_ATOMIC) often do not.  There are exceptions to both these cases, depending
       on endpoint to CQ and counter bindings and operational flags.   For  example,  RMA  writes
       that  carry  remote  CQ  data  will  generate  a  completion  event at the target, and are
       frequently used to convey visibility to the target application.   The  general  guidelines
       for  target  side  semantics  are described below, followed by exceptions that modify that
       behavior.

       By default, completions generated at the target indicate  that  the  transferred  data  is
       immediately  available  to  be  read  from  the  target  buffer.  That is, the target sees
       FI_DELIVERY_COMPLETE  (or  better)  semantics,  even  if  the  initiator  requested  lower
       semantics.   For  applications using only data buffers allocated from host memory, this is
       often sufficient.

       For operations that do not generate a completion event at the target,  the  visibility  of
       the  data  at  the  target  may need to be inferred based on subsequent operations that do
       generate target completions.   Absent  a  target  completion,  when  a  completion  of  an
       operation  is  written  at  the initiator, the visibility semantic of the operation at the
       target aligns with the initiator completion semantic.  For instance, if an  RMA  operation
       completes  at the initiator as either FI_INJECT_COMPLETE or FI_TRANSMIT_COMPLETE, the data
       visibility at the target is not guaranteed.

       One or more of the following mechanisms can be used by the  target  process  to  guarantee
       that  the  results  of a data transfer that did not generate a completion at the target is
       now visible.  This list is not inclusive of all options, but defines common uses.  In  the
       descriptions  below,  the  first  transfer  does  not  result in a completion event at the
       target, but is eventually followed by a transfer which does.

       • If the endpoint guarantees message ordering between two transfers, the target completion
         of  a  second transfer will indicate that the data from the first transfer is available.
         For example, if the endpoint supports send after write ordering (FI_ORDER_SAW),  then  a
         receive  completion  corresponding  to  the  send  will  indicate that the write data is
         available.  This holds independent of the initiator’s completion semantic for either the
         write  or send.  When ordering is guaranteed, the second transfer can be queued with the
         provider immediately after queuing the first.

       • If the endpoint does not guarantee message ordering, the initiator must take  additional
         steps  to  ensure  visibility.  If initiator requests FI_DELIVERY_COMPLETE semantics for
         the first operation, the initiator can wait for the operation to complete locally.  Once
         the  completion  has been read, the target completion of a second transfer will indicate
         that the first transfer’s data is visible.

       • Alternatively, if message ordering is not guaranteed by the endpoint, the initiator  can
         use the FI_FENCE and FI_DELIVERY_COMPLETE flags on the second data transfer to force the
         first transfers to meet the FI_DELIVERY_COMPLETE  semantics.   If  the  second  transfer
         generates  a  completion  at  the  target,  that will indicate that the data is visible.
         Otherwise, a target completion for any transfer after the fenced operation will indicate
         that the data is visible.

       The  above  semantics  apply  for  transfers  targeting  traditional  host memory buffers.
       However, the behavior may differ when device memory and/or persistent memory  is  involved
       (FI_HMEM  and FI_PMEM capability bits).  When heterogenous memory is involved, the concept
       of memory domains come into play.  Memory domains  identify  the  physical  separation  of
       memory,  which  may  or may not be accessible through the same virtual address space.  See
       the fi_mr(3) man page for further details on memory domains.

       Completion ordering and data visibility are only well-defined for  transfers  that  target
       the  same  memory  domain.   Applications  need  to  be  aware  of ordering and visibility
       differences when transfers target different memory  domains.   Additionally,  applications
       also  need  to be concerned with the memory domain that completions themselves are written
       and if it differs from the memory domain targeted by  a  transfer.   In  some  situations,
       either the provider or application may need to call device specific APIs to synchronize or
       flush device memory caches in order to achieve the desired data visibility.

       When heterogenous memory is in use, the default target completion semantic  for  transfers
       that  generate  a  completion  at  the  target  is  still  FI_DELIVERY_COMPLETE,  however,
       applications should be aware that there may be a negative impact  on  overall  performance
       for providers to meet this requirement.

       For  example,  a  target  process may be using a GPU to accelerate computations.  A memory
       region mapping to memory on the GPU may be exposed to peers as either  an  RMA  target  or
       posted  locally  as a receive buffer.  In this case, the application is concerned with two
       memory domains – system and GPU memory.  Completions are written to system memory.

       Continuing the example, a peer process sends a tagged message.  That  message  is  matched
       with  the  receive buffer located in GPU memory.  The NIC copies the data from the network
       into the receive buffer and writes an entry into the completion  queue.   Note  that  both
       memory  domains  were accessed as part of this transfer.  The message data was directed to
       the GPU memory, but the completion went to host memory.  Because separate  memory  domains
       may  not  be  synchronized  with  each  other,  it is possible for the host CPU to see and
       process the completion entry before the transfer to the GPU memory is  visible  to  either
       the  host  GPU or even software running on the GPU.  From the perspective of the provider,
       visibility of the completion does not imply visibility of data written to the GPU’s memory
       domain.

       The  default  completion  semantic  at  the  target  application for message operations is
       FI_DELIVERY_COMPLETE.  An anticipated provider implementation in this situation is for the
       provider  software running on the host CPU to intercept the CQ entry, detect that the data
       landed in heterogenous memory, and perform the necessary device synchronization  or  flush
       operation  before  reporting  the completion up to the application.  This ensures that the
       data is visible  to  CPU  and  GPU  software  prior  to  the  application  processing  the
       completion.

       In  addition  to  the cost of provider software intercepting completions and checking if a
       transfer  targeted  heterogenous  memory,  device  synchronization   itself   may   impact
       performance.   As  a  result,  applications  can  request a lower completion semantic when
       posting receives.  That indicates to the provider that the application will be responsible
       for  handling  any  device  specific flush operations that might be needed.  See fi_msg(3)
       FLAGS.

       For data transfers that do not generate a  completion  at  the  target,  such  as  RMA  or
       atomics,  it  is  the  responsibility of the application to ensure that all target buffers
       meet the necessary visibility requirements of the application.  The  previously  mentioned
       bulleted  methods for notifying the target that the data is visible may not be sufficient,
       as the provider software at the target could lack the context needed to ensure visibility.
       This  implies  that  the  application  may  need to call device synchronization/flush APIs
       directly.

       For example, a peer application could perform several RMA writes that  target  GPU  memory
       buffers.   If  the provider offloads RMA operations into the NIC, the provider software at
       the target will be unaware that the RMA operations have occurred.  If  the  peer  sends  a
       message  to  the  target  application that indicates that the RMA operations are done, the
       application must ensure that the RMA data is visible to the  host  CPU  or  GPU  prior  to
       executing  code that accesses the data.  The target completion of having received the sent
       message is not sufficient, even if send-after-write ordering is supported.

       Most target heterogenous  memory  completion  semantics  map  to  FI_TRANSMIT_COMPLETE  or
       FI_DELIVERY_COMPLETE.  Persistent memory (FI_PMEM capability), however, is often used with
       FI_COMMIT_COMPLETE semantics.  Heterogenous completion concepts still apply.

       For transfers flagged by the initiator with FI_COMMIT_COMPLETE, a completion at the target
       indicates  that  the  results are visible and durable.  For transfers targeting persistent
       memory, but using a different completion semantic at the initiator, the visibility at  the
       target  is  similar to that described above.  Durability is only associated with transfers
       marked with FI_COMMIT_COMPLETE.

       For transfers targeting  persistent  memory  that  request  FI_DELIVERY_COMPLETE,  then  a
       completion,  at  either  the  initiator  or  target,  indicates  that the data is visible.
       Visibility at the target can be conveyed using one  of  the  above  describe  mechanism  –
       generating  a target completion, sending a message from the initiator, etc.  Similarly, if
       the initiator requested FI_TRANSMIT_COMPLETE, then additional steps are needed  to  ensure
       visibility  at  the  target.   For  example, the transfer can generate a completion at the
       target, which would indicate visibility, but  not  durability.   The  initiator  can  also
       follow  the transfer with another operation that forces visibility, such as using FI_FENCE
       in conjunction with FI_DELIVERY_COMPLETE.

NOTES

       A completion queue must be bound to at least one enabled  endpoint  before  any  operation
       such  as  fi_cq_read,  fi_cq_readfrom, fi_cq_sread, fi_cq_sreadfrom etc.  can be called on
       it.

       Completion flags may be suppressed if the FI_NOTIFY_FLAGS_ONLY  mode  bit  has  been  set.
       When  enabled,  only  the following flags are guaranteed to be set in completion data when
       they are valid: FI_REMOTE_READ and FI_REMOTE_WRITE (when FI_RMA_EVENT capability  bit  has
       been set), FI_REMOTE_CQ_DATA, and FI_MULTI_RECV.

       If  a  completion queue has been overrun, it will be placed into an `overrun' state.  Read
       operations will continue to return any valid,  non-corrupted  completions,  if  available.
       After all valid completions have been retrieved, any attempt to read the CQ will result in
       it returning an FI_EOVERRUN error event.  Overrun completion queues are  considered  fatal
       and may not be used to report additional completions once the overrun occurs.

RETURN VALUES

   fi_cq_open / fi_cq_signal
       : Returns 0 on success.  On error, returns a negative fabric errno.

   fi_cq_read / fi_cq_readfrom
       :  On  success, returns the number of completions retrieved from the completion queue.  On
       error, returns a negative fabric errno, with these two errors explicitly identified: If no
       completions  are  available  to  read  from  the  CQ,  returns -FI_EAGAIN.  If the topmost
       completion is for a failed transfer (an error entry), returns -FI_EAVAIL.

   fi_cq_sread / fi_cq_sreadfrom
       : On success, returns the number of completions retrieved from the completion  queue.   On
       error,  returns  a  negative fabric errno, with these two errors explicitly identified: If
       the timeout expires or the calling thread is signaled and no data is available to be  read
       from  the completion queue, returns -FI_EAGAIN.  If the topmost completion is for a failed
       transfer (an error entry), returns -FI_EAVAIL.

   fi_cq_readerr
       : On success, returns the positive value 1 (number of error entries returned).  On  error,
       returns  a  negative  fabric  errno,  with  this  error explicitly identified: If no error
       completions are available to read from the CQ, returns -FI_EAGAIN.

   fi_cq_strerror
       : Returns a character string interpretation of the provider specific error returned with a
       completion.

       Fabric errno values are defined in rdma/fi_errno.h.

SEE ALSO

       fi_getinfo(3), fi_endpoint(3), fi_domain(3), fi_eq(3), fi_cntr(3), fi_poll(3)

AUTHORS

       OpenFabrics.