Provided by: liburing-dev_2.5-1build1_amd64

NAME

       io_uring - Asynchronous I/O facility

SYNOPSIS

       #include <linux/io_uring.h>

DESCRIPTION

       io_uring  is  a  Linux-specific  API  for asynchronous I/O.  It allows the user to submit one or more I/O
       requests, which are processed asynchronously without blocking the calling  process.   io_uring  gets  its
       name  from ring buffers which are shared between user space and kernel space. This arrangement allows for
       efficient I/O, while avoiding the overhead  of  copying  buffers  between  them,  where  possible.   This
       interface makes io_uring different from other UNIX I/O APIs: rather than communicating between kernel
       and user space solely through system calls, ring buffers are used as the main mode of communication.
       This  arrangement has various performance benefits which are discussed in a separate section below.  This
       man page uses the terms shared buffers, shared ring buffers and queues interchangeably.

       The general programming model you need to follow for io_uring is outlined below (a minimal sketch of
       the same flow, using the liburing helpers, follows the list):

       •      Set up shared buffers with io_uring_setup(2) and mmap(2), mapping into user space  shared  buffers
              for  the  submission queue (SQ) and the completion queue (CQ).  You place I/O requests you want to
              make on the SQ, while the kernel places the results of those operations on the CQ.

       •      For every I/O request you need to make (like to read  a  file,  write  a  file,  accept  a  socket
              connection, etc), you create a submission queue entry, or SQE, describe the I/O operation you need
              to get done and add it to the tail of the submission  queue  (SQ).   Each  I/O  operation  is,  in
              essence,  the  equivalent  of  a  system call you would have made otherwise, if you were not using
              io_uring.  You can add more than one SQE to the queue depending on the number  of  operations  you
              want to request.

       •      After  you  add one or more SQEs, you need to call io_uring_enter(2) to tell the kernel to dequeue
              your I/O requests off the SQ and begin processing them.

       •      For each SQE you submit, once it is done processing the request, the kernel  places  a  completion
              queue  event  or  CQE  at  the  tail of the completion queue or CQ.  The kernel places exactly one
              matching CQE in the CQ for every SQE you submit on the SQ.  After you retrieve a  CQE,  minimally,
              you  might  be interested in checking the res field of the CQE structure, which corresponds to the
              return value of the system call's equivalent, had you used it  directly  without  using  io_uring.
              For  instance,  a read operation under io_uring, started with the IORING_OP_READ operation, issues
              the equivalent of the read(2) system call. In practice, it mixes the  semantics  of  pread(2)  and
              preadv2(2)  in  that it takes an explicit offset, and supports using -1 for the offset to indicate
              that the current file position should be used instead of passing in an explicit  offset.  See  the
              opcode  documentation  for more details. Given that io_uring is an async interface, errno is never
              used for passing back error information. Instead, res will contain what the equivalent system call
              would have returned in case of success, and in case of error res will contain -errno.  For
              example, if the normal read system call would have returned -1 and set errno to EINVAL, then
              res would contain -EINVAL.  If the normal system call would have returned a read size of 1024, then
              res would contain 1024.

       •      Optionally, io_uring_enter(2) can also wait for a specified number of requests to be processed  by
              the kernel before it returns.  If you specified a certain number of completions to wait for,
              the kernel will have placed at least that many CQEs on the CQ, which you can then readily read
              right after the return from io_uring_enter(2).

       •      It  is  important to remember that I/O requests submitted to the kernel can complete in any order.
              It is not necessary for the kernel to process one request after another, in the order  you  placed
              them.   Given  that  the  interface  is  a ring, the requests are attempted in order, however that
              doesn't imply any sort of ordering on their completion.  When more than one request is in  flight,
              it  is not possible to determine which one will complete first.  When you dequeue CQEs off the CQ,
              you should always check which submitted request it corresponds to.  The  most  common  method  for
              doing  so  is utilizing the user_data field in the request, which is passed back on the completion
              side.

       Adding to and reading from the queues:

       •      You add SQEs to the tail of the SQ.  The kernel reads SQEs off the head of the queue.

       •      The kernel adds CQEs to the tail of the CQ.  You read CQEs off the head of the queue.
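       If you use the liburing library (the package that provides this man page), the same model can be
       expressed through its helper functions, which hide the ring bookkeeping described above.  The
       following is a minimal sketch, assuming an already opened file descriptor fd and ignoring error
       checking; see the io_uring_* man pages shipped with liburing for the full interface.

           #include <liburing.h>

           struct io_uring ring;
           struct io_uring_sqe *sqe;
           struct io_uring_cqe *cqe;
           char buf[4096];

           io_uring_queue_init(8, &ring, 0);         /* set up and map both rings */

           sqe = io_uring_get_sqe(&ring);            /* grab a free SQE from the SQ */
           io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
           sqe->user_data = 1;                       /* passed back in cqe->user_data */

           io_uring_submit(&ring);                   /* calls io_uring_enter(2) for us */

           io_uring_wait_cqe(&ring, &cqe);           /* wait for one completion */
           /* cqe->res holds what read(2) would have returned, or -errno */
           io_uring_cqe_seen(&ring, cqe);            /* mark the CQE as consumed */

           io_uring_queue_exit(&ring);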

   Submission queue polling
       One of the goals of io_uring is to provide a means for efficient I/O.  To this end, io_uring  supports  a
       polling  mode  that lets you avoid the call to io_uring_enter(2), which you use to inform the kernel that
       you have queued SQEs on to the SQ.  With SQ Polling, io_uring starts  a  kernel  thread  that  polls  the
       submission  queue  for  any I/O requests you submit by adding SQEs.  With SQ Polling enabled, there is no
       need for you to call io_uring_enter(2), letting you avoid the overhead of  system  calls.   A  designated
       kernel thread dequeues SQEs off the SQ as you add them and dispatches them for asynchronous processing.
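       Submission queue polling is requested at ring setup time by setting the IORING_SETUP_SQPOLL flag in
       struct io_uring_params.  The sketch below, assuming the io_uring_setup() wrapper shown in the EXAMPLES
       section, asks for a polling thread that goes idle after 2000 milliseconds without new SQEs; note that
       older kernels require elevated privileges for this mode (see io_uring_setup(2)).

           struct io_uring_params p;

           memset(&p, 0, sizeof(p));
           p.flags = IORING_SETUP_SQPOLL;    /* kernel thread polls the SQ for us */
           p.sq_thread_idle = 2000;          /* thread sleeps after 2000 ms of idleness */

           int ring_fd = io_uring_setup(8, &p);
           if (ring_fd < 0)
               perror("io_uring_setup");

       When the polling thread does go idle, the kernel sets IORING_SQ_NEED_WAKEUP in the flags field of the
       SQ ring; the application then has to wake it up again with an io_uring_enter(2) call that passes the
       IORING_ENTER_SQ_WAKEUP flag.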

   Setting up io_uring
       The  main  steps  in setting up io_uring consist of mapping in the shared buffers with mmap(2) calls.  In
       the example program included in this man page, the function app_setup_uring() sets  up  io_uring  with  a
       QUEUE_DEPTH  deep  submission  queue.   Pay  attention  to  the  2  mmap(2)  calls that set up the shared
       submission and completion queues.  If your kernel is older than version  5.4,  three  mmap(2)  calls  are
       required.

   Submitting I/O requests
       The  process  of submitting a request consists of describing the I/O operation you need to get done using
       an io_uring_sqe  structure  instance.   These  details  describe  the  equivalent  system  call  and  its
       parameters.  Because the I/O operations Linux supports are very varied and the io_uring_sqe
       structure needs to be able to describe them, it has several fields, some packed  into  unions  for  space
       efficiency.  Here is a simplified version of struct io_uring_sqe with some of the most often used fields:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __s32   fd;             /* file descriptor to do IO on */
                   __u64   off;            /* offset into file */
                   __u64   addr;           /* pointer to buffer or iovecs */
                   __u32   len;            /* buffer size or number of iovecs */
                   __u64   user_data;      /* data to be passed back at completion time */
                   __u8    flags;          /* IOSQE_ flags */
                   ...
           };

       Here is struct io_uring_sqe in full:

           struct io_uring_sqe {
                   __u8    opcode;         /* type of operation for this sqe */
                   __u8    flags;          /* IOSQE_ flags */
                   __u16   ioprio;         /* ioprio for the request */
                   __s32   fd;             /* file descriptor to do IO on */
                   union {
                           __u64   off;    /* offset into file */
                           __u64   addr2;
                   };
                   union {
                           __u64   addr;   /* pointer to buffer or iovecs */
                           __u64   splice_off_in;
                   };
                   __u32   len;            /* buffer size or number of iovecs */
                   union {
                           __kernel_rwf_t  rw_flags;
                           __u32           fsync_flags;
                           __u16           poll_events;    /* compatibility */
                           __u32           poll32_events;  /* word-reversed for BE */
                           __u32           sync_range_flags;
                           __u32           msg_flags;
                           __u32           timeout_flags;
                           __u32           accept_flags;
                           __u32           cancel_flags;
                           __u32           open_flags;
                           __u32           statx_flags;
                           __u32           fadvise_advice;
                           __u32           splice_flags;
                   };
                   __u64   user_data;      /* data to be passed back at completion time */
                   union {
                           struct {
                                   /* pack this to avoid bogus arm OABI complaints */
                                   union {
                                           /* index into fixed buffers, if used */
                                           __u16   buf_index;
                                           /* for grouped buffer selection */
                                           __u16   buf_group;
                                   } __attribute__((packed));
                                   /* personality to use, if used */
                                   __u16   personality;
                                   __s32   splice_fd_in;
                           };
                           __u64   __pad2[3];
                   };
           };

       To  submit  an  I/O  request  to  io_uring,  you  need to acquire a submission queue entry (SQE) from the
       submission queue (SQ),  fill  it  up  with  details  of  the  operation  you  want  to  submit  and  call
       io_uring_enter(2).   There are helper functions of the form io_uring_prep_X to enable proper setup of the
       SQE. If you want to avoid calling io_uring_enter(2), you have the option of setting up  Submission  Queue
       Polling.

       SQEs  are  added  to  the tail of the submission queue.  The kernel picks up SQEs off the head of the SQ.
       The general algorithm to get the next available SQE and update the tail is as follows.

           struct io_uring_sqe *sqe;
           unsigned tail, index;
           tail = *sqring->tail;
           index = tail & (*sqring->ring_mask);
           sqe = &sqring->sqes[index];
           /* fill up details about this I/O request */
           describe_io(sqe);
           /* fill the sqe index into the SQ ring array */
           sqring->array[index] = index;
           tail++;
           atomic_store_explicit(sqring->tail, tail, memory_order_release);

       To get the index of an entry, the application must mask the current tail index with the size mask of  the
       ring.   This  holds true for both SQs and CQs.  Once the SQE is acquired, the necessary fields are filled
       in, describing the request.  While the CQ ring directly indexes the shared array of CQEs, the  submission
       side has an indirection array between them.  The submission side ring buffer is an index into this array,
       which in turn contains the index into the SQEs.

       The following code snippet demonstrates how a read operation, the equivalent of a preadv2(2) system
       call, is described by filling in an SQE with the necessary parameters.

           struct iovec iovecs[16];
            ...
           sqe->opcode = IORING_OP_READV;
           sqe->fd = fd;
           sqe->addr = (unsigned long) iovecs;
           sqe->len = 16;
           sqe->off = offset;
           sqe->flags = 0;

       Memory ordering
              Modern  compilers and CPUs freely reorder reads and writes without affecting the program's outcome
              to optimize performance.  Some aspects of this need to be  kept  in  mind  on  SMP  systems  since
              io_uring  involves  buffers  shared between kernel and user space.  These buffers are both visible
              and modifiable from kernel and user space.  As heads and tails belonging to these  shared  buffers
              are  updated  by  kernel  and  user  space,  changes need to be coherently visible on either side,
              irrespective of whether a CPU switch took place after the kernel-user mode  switch  happened.   We
              use  memory  barriers to enforce this coherency.  Being significantly large subjects on their own,
              memory barriers are out of scope for further discussion on this man page.
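              The essential pattern is small, however.  A minimal sketch using the C11 atomics that the
              macros in the EXAMPLES section below wrap (sring_tail and cring_tail are the shared tail
              pointers mapped in that program): publish a new SQ tail with a release store only after the
              SQE contents have been written, and load the CQ tail with acquire semantics before inspecting
              new CQEs.

                   /* after filling in the SQE and the SQ array slot: */
                   atomic_store_explicit((_Atomic unsigned *) sring_tail, new_tail,
                                         memory_order_release);

                   /* before reading CQEs at the current head: */
                   unsigned tail = atomic_load_explicit((_Atomic unsigned *) cring_tail,
                                                        memory_order_acquire);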

       Letting the kernel know about I/O submissions
              Once you place one or more SQEs on to the SQ, you need to let the kernel know that you've done so.
              You can do this by calling the io_uring_enter(2) system call.  This system call is also capable of
              waiting for a specified count of events to complete.  This way, you can be sure to find completion
              events in the completion queue without having to poll it for events later.
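              For instance, assuming the io_uring_enter() wrapper from the EXAMPLES section and a ring file
              descriptor ring_fd, the call below submits to_submit queued SQEs and, because of
              IORING_ENTER_GETEVENTS, returns only after at least one of them has completed.

                   int ret = io_uring_enter(ring_fd, to_submit, 1,
                                            IORING_ENTER_GETEVENTS);
                   if (ret < 0)
                       perror("io_uring_enter");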

   Reading completion events
       Similar to the submission queue (SQ), the completion queue (CQ) is a shared buffer between the kernel and
       user space.  Whereas you placed submission queue entries on the tail of the SQ and the  kernel  read  off
       the  head,  when it comes to the CQ, the kernel places completion queue events or CQEs on the tail of the
       CQ and you read off its head.

       Submission is flexible (and thus a bit more complicated) since it needs to be able  to  encode  different
       types of system calls that take various parameters.  Completion, on the other hand is simpler since we're
       looking only for a return value back from the kernel.  This  is  easily  understood  by  looking  at  the
       completion queue event structure, struct io_uring_cqe:

           struct io_uring_cqe {
                __u64     user_data;  /* sqe->data submission passed back */
                __s32     res;        /* result code for this event */
                __u32     flags;
           };

       Here,  user_data  is  custom  data that is passed unchanged from submission to completion.  That is, from
       SQEs to CQEs.  This field can  be  used  to  set  context,  uniquely  identifying  submissions  that  got
       completed.   Given  that  I/O  requests  can complete in any order, this field can be used to correlate a
       submission with a completion.  res is the result from the system call that was performed as part  of  the
       submission; its return value.
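       A common pattern, sketched below with a hypothetical per-request structure, is to store a pointer to
       the request's context in user_data when preparing the SQE and to recover it from the CQE on
       completion.

           struct my_request {                  /* hypothetical per-request context */
               int   fd;
               char *buf;
           };

           /* at submission time: tag the SQE with the request's address */
           void tag_sqe(struct io_uring_sqe *sqe, struct my_request *req)
           {
               sqe->user_data = (unsigned long) req;
           }

           /* at completion time: recover the request from the CQE */
           struct my_request *request_of(struct io_uring_cqe *cqe)
           {
               return (struct my_request *) (unsigned long) cqe->user_data;
           }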

       The  flags  field  carries  request-specific  information.  As of the 6.0 kernel, the following flags are
       defined:

       IORING_CQE_F_BUFFER
              If set, the upper 16 bits of the flags field carry the buffer ID that was chosen for this
              request.  The request must have been issued with IOSQE_BUFFER_SELECT set, and used with a request
              type that supports buffer selection.  Additionally, buffers must have been provided upfront either
              via the IORING_OP_PROVIDE_BUFFERS or the IORING_REGISTER_PBUF_RING methods (see the snippet
              after this list for how to extract the buffer ID).

       IORING_CQE_F_MORE
              If set, the application should expect more completions from the request. This is used for requests
              that can generate multiple completions, such as multi-shot requests, receive, or accept.

       IORING_CQE_F_SOCK_NONEMPTY
              If set, the socket still had data left to read when this request completed; that is, more data
              was available beyond what the current request received.

       IORING_CQE_F_NOTIF
              Set for notification CQEs, as seen with the zero-copy networking send and receive support.
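       For example, when IORING_CQE_F_BUFFER is set, the ID of the provided buffer the kernel picked can be
       recovered from the upper bits of flags using the IORING_CQE_BUFFER_SHIFT constant from
       <linux/io_uring.h>:

           if (cqe->flags & IORING_CQE_F_BUFFER) {
               unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
               /* the data for this request is in the provided buffer buf_id */
           }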

       The general sequence to read completion events off the completion queue is as follows:

           unsigned head;
           head = *cqring->head;
           if (head != atomic_load_acquire(cqring->tail)) {
               struct io_uring_cqe *cqe;
               unsigned index;
               index = head & (cqring->mask);
               cqe = &cqring->cqes[index];
               /* process completed CQE */
               process_cqe(cqe);
               /* CQE consumption complete */
               head++;
           }
           atomic_store_explicit(cqring->head, head, memory_order_release);

       It  helps  to be reminded that the kernel adds CQEs to the tail of the CQ, while you need to dequeue them
       off the head.  To get the index of an entry at the head, the application must mask the current head index
       with  the  size  mask  of  the  ring.   Once the CQE has been consumed or processed, the head needs to be
       updated to reflect the consumption of the CQE.  Attention should be paid to the read and  write  barriers
       to ensure successful read and update of the head.

   io_uring performance
       Because  of  the  shared  ring buffers between kernel and user space, io_uring can be a zero-copy system.
       Copying buffers to and from becomes necessary when system calls that transfer  data  between  kernel  and
       user  space  are  involved.   But  since  the bulk of the communication in io_uring is via buffers shared
       between the kernel and user space, this huge performance overhead is completely avoided.

       While system calls may not seem like a significant overhead, in high performance applications,  making  a
       lot of them will begin to matter.  Unfortunately, some of the workarounds the operating system has in
       place for Spectre and Meltdown sit around the system call interface, making system calls not as cheap
       as before on affected hardware.  While newer
       hardware should not need these workarounds, hardware with these vulnerabilities can be expected to be  in
       the wild for a long time.  While using synchronous programming interfaces or even when using asynchronous
       programming interfaces under Linux, there is at least one system call involved in the submission of  each
       request.  In io_uring, on the other hand, you can batch several requests in one go, simply by queueing up
       multiple SQEs, each describing an I/O operation you want and make a  single  call  to  io_uring_enter(2).
       This is possible due to io_uring's shared buffers based design.
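       As an illustrative sketch, assuming a hypothetical get_next_sqe() helper that acquires and indexes
       SQEs the way the earlier snippet does, several operations can be queued up and then handed to the
       kernel with a single io_uring_enter(2) call:

           for (unsigned i = 0; i < nr_requests; i++) {
               struct io_uring_sqe *sqe = get_next_sqe();    /* assumed helper */

               sqe->opcode = IORING_OP_READ;
               sqe->fd = fds[i];
               sqe->addr = (unsigned long) bufs[i];
               sqe->len = BLOCK_SZ;
               sqe->off = 0;
               sqe->user_data = i;
           }

           /* one system call submits everything and waits for all completions */
           io_uring_enter(ring_fd, nr_requests, nr_requests, IORING_ENTER_GETEVENTS);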

       While  this  batching  in itself can avoid the overhead associated with potentially multiple and frequent
       system calls, you can reduce even this overhead further with Submission  Queue  Polling,  by  having  the
       kernel poll and pick up your SQEs for processing as you add them to the submission queue. This avoids the
       io_uring_enter(2) call you need to make to tell  the  kernel  to  pick  SQEs  up.   For  high-performance
       applications, this means even fewer system call overheads.

CONFORMING TO

       io_uring is Linux-specific.

EXAMPLES

       The following example uses io_uring to copy stdin to stdout.  Using shell redirection, you should be able
       to copy files with this example.  Because it uses a queue depth of only one, this example  processes  I/O
       requests  one  after  the  other.   It is purposefully kept this way to aid understanding.  In real-world
       scenarios however, you'll want to have a larger queue depth to parallelize I/O request processing  so  as
       to gain the kind of performance benefits io_uring provides with its asynchronous processing of requests.

       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/stat.h>
       #include <sys/ioctl.h>
       #include <sys/syscall.h>
       #include <sys/mman.h>
       #include <sys/uio.h>
       #include <linux/fs.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <string.h>
       #include <stdatomic.h>

       #include <linux/io_uring.h>

       #define QUEUE_DEPTH 1
       #define BLOCK_SZ    1024

       /* Macros for barriers needed by io_uring */
       #define io_uring_smp_store_release(p, v)            \
           atomic_store_explicit((_Atomic typeof(*(p)) *)(p), (v), \
                         memory_order_release)
       #define io_uring_smp_load_acquire(p)                \
           atomic_load_explicit((_Atomic typeof(*(p)) *)(p),   \
                        memory_order_acquire)

       int ring_fd;
       unsigned *sring_tail, *sring_mask, *sring_array,
                   *cring_head, *cring_tail, *cring_mask;
       struct io_uring_sqe *sqes;
       struct io_uring_cqe *cqes;
       char buff[BLOCK_SZ];
       off_t offset;

       /*
        * System call wrappers provided since glibc does not yet
        * provide wrappers for io_uring system calls.
         */

       int io_uring_setup(unsigned entries, struct io_uring_params *p)
       {
           return (int) syscall(__NR_io_uring_setup, entries, p);
       }

       int io_uring_enter(int ring_fd, unsigned int to_submit,
                          unsigned int min_complete, unsigned int flags)
       {
           return (int) syscall(__NR_io_uring_enter, ring_fd, to_submit,
                       min_complete, flags, NULL, 0);
       }

       int app_setup_uring(void) {
           struct io_uring_params p;
           void *sq_ptr, *cq_ptr;

           /* See io_uring_setup(2) for io_uring_params.flags you can set */
           memset(&p, 0, sizeof(p));
           ring_fd = io_uring_setup(QUEUE_DEPTH, &p);
           if (ring_fd < 0) {
               perror("io_uring_setup");
               return 1;
           }

           /*
            * io_uring communication happens via 2 shared kernel-user space ring
            * buffers, which can be jointly mapped with a single mmap() call in
            * kernels >= 5.4.
            */

           int sring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
           int cring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);

           /* Rather than check for kernel version, the recommended way is to
            * check the features field of the io_uring_params structure, which is a
            * bitmask. If IORING_FEAT_SINGLE_MMAP is set, we can do away with the
            * second mmap() call to map in the completion ring separately.
            */
           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               if (cring_sz > sring_sz)
                   sring_sz = cring_sz;
               cring_sz = sring_sz;
           }

           /* Map in the submission and completion queue ring buffers.
            *  Kernels < 5.4 only map in the submission queue, though.
            */
           sq_ptr = mmap(0, sring_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_SQ_RING);
           if (sq_ptr == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           if (p.features & IORING_FEAT_SINGLE_MMAP) {
               cq_ptr = sq_ptr;
           } else {
               /* Map in the completion queue ring buffer in older kernels separately */
               cq_ptr = mmap(0, cring_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_POPULATE,
                             ring_fd, IORING_OFF_CQ_RING);
               if (cq_ptr == MAP_FAILED) {
                   perror("mmap");
                   return 1;
               }
           }
           /* Save useful fields for later easy reference */
           sring_tail = sq_ptr + p.sq_off.tail;
           sring_mask = sq_ptr + p.sq_off.ring_mask;
           sring_array = sq_ptr + p.sq_off.array;

           /* Map in the submission queue entries array */
           sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
                          PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                          ring_fd, IORING_OFF_SQES);
           if (sqes == MAP_FAILED) {
               perror("mmap");
               return 1;
           }

           /* Save useful fields for later easy reference */
           cring_head = cq_ptr + p.cq_off.head;
           cring_tail = cq_ptr + p.cq_off.tail;
           cring_mask = cq_ptr + p.cq_off.ring_mask;
           cqes = cq_ptr + p.cq_off.cqes;

           return 0;
       }

        /*
         * Read from completion queue.
         * In this function, we read completion events from the completion queue.
         * We dequeue the CQE, update the head and return the result of the operation.
         */

       int read_from_cq() {
           struct io_uring_cqe *cqe;
           unsigned head;

            /*
             * The application is the only writer of the CQ head, so it can be
             * read directly; the tail is written by the kernel, so it must be
             * loaded with acquire semantics (a read barrier).
             */
            head = *cring_head;
            /*
             * Remember, this is a ring buffer. If head == tail, it means that the
             * buffer is empty.
             */
            if (head == io_uring_smp_load_acquire(cring_tail))
               return -1;

           /* Get the entry */
           cqe = &cqes[head & (*cring_mask)];
           if (cqe->res < 0)
               fprintf(stderr, "Error: %s\n", strerror(abs(cqe->res)));

            /* Save the result before publishing the updated head; once the head
             * is advanced, the kernel is free to reuse this CQE slot. */
            int res = cqe->res;

            head++;

            /* Write barrier so that updates to the head are made visible */
            io_uring_smp_store_release(cring_head, head);

            return res;
       }

        /*
         * Submit a read or a write request to the submission queue.
         */

       int submit_to_sq(int fd, int op) {
           unsigned index, tail;

           /* Add our submission queue entry to the tail of the SQE ring buffer */
           tail = *sring_tail;
           index = tail & *sring_mask;
           struct io_uring_sqe *sqe = &sqes[index];
           /* Fill in the parameters required for the read or write operation */
           sqe->opcode = op;
           sqe->fd = fd;
           sqe->addr = (unsigned long) buff;
           if (op == IORING_OP_READ) {
               memset(buff, 0, sizeof(buff));
               sqe->len = BLOCK_SZ;
           }
           else {
               sqe->len = strlen(buff);
           }
           sqe->off = offset;

           sring_array[index] = index;
           tail++;

           /* Update the tail */
           io_uring_smp_store_release(sring_tail, tail);

            /*
             * Tell the kernel we have submitted events with the io_uring_enter()
             * system call. We also pass in the IORING_ENTER_GETEVENTS flag which
             * causes the io_uring_enter() call to wait until min_complete
             * (the 3rd param) events complete.
             */
            int ret = io_uring_enter(ring_fd, 1, 1,
                                     IORING_ENTER_GETEVENTS);
            if (ret < 0) {
               perror("io_uring_enter");
               return -1;
           }

           return ret;
       }

       int main(int argc, char *argv[]) {
           int res;

           /* Setup io_uring for use */
           if(app_setup_uring()) {
               fprintf(stderr, "Unable to setup uring!\n");
               return 1;
           }

            /*
             * A while loop that reads from stdin and writes to stdout.
             * Breaks on EOF.
             */
           while (1) {
               /* Initiate read from stdin and wait for it to complete */
               submit_to_sq(STDIN_FILENO, IORING_OP_READ);
               /* Read completion queue entry */
               res = read_from_cq();
               if (res > 0) {
                   /* Read successful. Write to stdout. */
                   submit_to_sq(STDOUT_FILENO, IORING_OP_WRITE);
                   read_from_cq();
               } else if (res == 0) {
                   /* reached EOF */
                   break;
               }
               else if (res < 0) {
                   /* Error reading file */
                   fprintf(stderr, "Error: %s\n", strerror(abs(res)));
                   break;
               }
               offset += res;
           }

           return 0;
       }

SEE ALSO

       io_uring_enter(2) io_uring_register(2) io_uring_setup(2)