Provided by: liburing-dev_2.1-2build1_amd64


NAME
       io_uring_setup - set up a context for performing asynchronous I/O


SYNOPSIS
       #include <linux/io_uring.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);


DESCRIPTION
       The io_uring_setup() system call sets up a submission queue (SQ) and completion queue (CQ)
       with at least entries entries, and returns a file descriptor which can be used to  perform
       subsequent  operations on the io_uring instance.  The submission and completion queues are
       shared between userspace and the kernel, which eliminates  the  need  to  copy  data  when
       initiating and completing I/O.

       p (called params in the text below) is used by the application to pass options to the
       kernel, and by the kernel to convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The flags, sq_thread_cpu, and sq_thread_idle fields are used  to  configure  the  io_uring
       instance.  flags is a bit mask of 0 or more of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform busy-waiting for an I/O completion, as opposed to getting notifications via
              an asynchronous IRQ (Interrupt Request).  The file system (if any) and block device
              must  support  polling  in  order  for  this  to work.  Busy-waiting provides lower
              latency, but may consume more CPU resources than interrupt driven I/O.   Currently,
              this  feature  is  usable only on a file descriptor opened using the O_DIRECT flag.
              When a read or write is submitted to a polled context, the  application  must  poll
              for  completions on the CQ ring by calling io_uring_enter(2).  It is illegal to mix
              and match polled and non-polled I/O on an io_uring instance.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to perform submission queue
              polling.   An  io_uring  instance  configured in this way enables an application to
              issue I/O without ever context switching into the kernel.  By using the  submission
              queue  to  fill in new submission queue entries and watching for completions on the
              completion queue, the application can submit and reap I/Os without doing  a  single
              system call.

              If the kernel thread is idle for more than sq_thread_idle milliseconds, it will set
              the IORING_SQ_NEED_WAKEUP bit in the flags field of the  struct  io_sq_ring.   When
              this  happens,  the  application  must  call  io_uring_enter(2)  to wake the kernel
              thread.  If I/O is kept busy, the kernel thread will never sleep.   An  application
              making  use  of this feature will need to guard the io_uring_enter(2) call with the
              following code sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
                  unsigned flags = atomic_load_explicit(sq_ring->flags, memory_order_acquire);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);

              where sq_ring is a submission queue ring set up using the struct io_sqring_offsets
              described below.

              Before version 5.11 of the Linux kernel, to successfully use this feature, the
              application had to register a set of files to be used for I/O through
              io_uring_register(2) using the IORING_REGISTER_FILES opcode. Failure to do so
              resulted in submitted I/O failing with EBADF. In version 5.11 and later, it is no
              longer necessary to register files to use this feature; the kernel signals this
              with the IORING_FEAT_SQPOLL_NONFIXED feature flag. 5.11 also allows using this
              flag as a non-root user, if the user has the CAP_SYS_NICE capability.

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to the CPU set in the
              sq_thread_cpu field of the struct io_uring_params. This flag is only meaningful
              when IORING_SETUP_SQPOLL is specified. When the cgroup setting cpuset.cpus changes
              (typically in a container environment), the CPU the thread is bound to may change
              as well.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct  io_uring_params.cq_entries  entries.   The
              value must be greater than entries, and may be rounded up to the next power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES, then entries
              will be clamped at IORING_MAX_ENTRIES. If the flag IORING_SETUP_CQSIZE is set, and
              if the value of struct io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES,
              then it will be clamped at IORING_MAX_CQ_ENTRIES.

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct io_uring_params.wq_fd being  set
              to an existing io_uring ring file descriptor. When set, the io_uring instance being
              created will share the asynchronous worker thread backend of the specified io_uring
              ring, rather than create a new separate thread pool.

       IORING_SETUP_R_DISABLED
              If  this  flag is specified, the io_uring ring starts in a disabled state.  In this
              state, restrictions can be  registered,  but  submissions  are  not  allowed.   See
              io_uring_register(2) for details on how to enable the ring. Available since 5.10.
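
       As a minimal sketch of how the configuration fields fit together, the helper below
       fills in a struct io_uring_params for submission queue polling pinned to one CPU, with
       an explicitly sized completion queue. The name init_sqpoll_params is hypothetical and
       not part of any API, and the sketch assumes kernel headers recent enough to define
       IORING_SETUP_CLAMP.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper (not a kernel or liburing API): prepare params for
 * an SQPOLL ring whose poll thread is bound to `cpu` and is considered
 * idle after `idle_ms` milliseconds without work. */
static void init_sqpoll_params(struct io_uring_params *p, unsigned cpu,
                               unsigned idle_ms, unsigned cq_entries)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));           /* reserved fields must be zero */

    p->flags = IORING_SETUP_SQPOLL      /* kernel thread polls the SQ */
             | IORING_SETUP_SQ_AFF      /* honor sq_thread_cpu */
             | IORING_SETUP_CQSIZE      /* honor cq_entries */
             | IORING_SETUP_CLAMP;      /* clamp oversized values */
    p->sq_thread_cpu = cpu;
    p->sq_thread_idle = idle_ms;
    p->cq_entries = cq_entries;
}
```

       The prepared structure would then be passed to io_uring_setup() together with the
       requested number of submission queue entries.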

       If no flags are specified, the io_uring instance is set up for interrupt driven I/O. I/O
       may be submitted using io_uring_enter(2) and can be reaped by polling the completion
       queue.

       The resv array must be initialized to zero.
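
       The following is a minimal sketch of a setup call with no flags. It assumes the C
       library provides no io_uring_setup() wrapper, so the call goes through syscall(2);
       the name ring_setup is hypothetical. Zeroing the whole structure also satisfies the
       requirement on the reserved fields.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall(2) */
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Hypothetical wrapper: create an interrupt driven ring with at least
 * `entries` submission queue entries. Returns the ring file descriptor,
 * or -1 with errno set. On success *p holds the kernel-filled values. */
static int ring_setup(unsigned entries, struct io_uring_params *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));   /* no flags; reserved fields all zero */
    return (int)syscall(__NR_io_uring_setup, entries, p);
}
```

       A caller would check the return value, read p->sq_entries and p->cq_entries to learn
       the sizes actually allocated, and eventually close(2) the descriptor.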

       features is filled in by the kernel and specifies the features supported by the running
       kernel version. It is a bit mask of the following values:

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the SQ and CQ rings can be mapped with a single mmap(2)
              call. The SQEs must still be mapped separately. This brings the necessary
              mmap(2) calls down from three to two. Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring never drops completion events. If a completion
              event occurs and the CQ ring is full, the kernel stores the event internally
              until the CQ ring has room for more entries. While this overflow condition
              persists, attempting to submit more I/O fails with the -EBUSY error value if the
              kernel cannot flush the overflowed events to the CQ ring. If this happens, the
              application must reap events from the CQ ring and attempt the submit again.
              Available since kernel 5.5.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any data  for  async  offload
              has been consumed when the kernel has consumed the SQE. Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1 with
              IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
              IORING_OP_{READ,WRITE} to mean the current file position, which behaves like
              preadv2(2) and pwritev2(2) with offset == -1. The request uses (and updates) the
              current file position. The caveat is that if the application has multiple reads
              or writes in flight, the end result will not be as expected. This is similar to
              threads sharing a file descriptor and doing I/O using the current file position.
              Available since kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both sync and async execution
              of a request assumes the credentials of the task that called io_uring_enter(2)
              to queue the requests. If this flag isn't set, then requests are issued with the
              credentials of the task that originally registered the io_uring. If only one
              task is using a ring, then this flag doesn't matter as the credentials will
              always be the same. Note that this is only the default behavior; tasks can still
              register different personalities through io_uring_register(2) with
              IORING_REGISTER_PERSONALITY and specify the personality to use in the sqe.
              Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an internal poll mechanism to
              drive data/space readiness. This means that requests that cannot read or write
              data to a file no longer need to be punted to an async thread for handling.
              Instead, they begin operation when the file is ready. This is similar to doing
              poll + read/write in userspace, but eliminates the need to do so. If this flag
              is set, requests waiting on space/data consume far fewer resources, as they are
              not blocking a thread. Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If this flag is set, the IORING_OP_POLL_ADD command accepts the full 32-bit range
              of epoll based flags, most notably EPOLLEXCLUSIVE, which allows exclusive (waking
              single waiters) behavior. Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If this flag is set, the IORING_SETUP_SQPOLL feature no longer requires the use  of
              fixed files. Any normal file descriptor can be used for IO commands without needing
              registration. Available since kernel 5.11.

       IORING_FEAT_EXT_ARG
              If this flag is set, then the io_uring_enter(2) system call supports passing in an
              extended argument instead of just the sigset_t of earlier kernels. This extended
              argument is of type struct io_uring_getevents_arg and allows the caller to pass in
              both a sigset_t and a timeout argument for waiting on events. The struct layout is
              as follows:

               struct io_uring_getevents_arg {
                  __u64 sigmask;
                  __u32 sigmask_sz;
                  __u32 pad;
                  __u64 ts;
               };

              and a pointer to this struct must be passed in if IORING_ENTER_EXT_ARG  is  set  in
              the flags for the enter system call. Available since kernel 5.11.

       IORING_FEAT_NATIVE_WORKERS
              If  this  flag  is  set,  io_uring  is  using native workers for its async helpers.
              Previous kernels used kernel threads that assumed  the  identity  of  the  original
              io_uring  owning  task, but later kernels will actively create what looks more like
              regular process threads instead. Available since kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If this flag is set, then io_uring supports a variety of features related to  fixed
              files  and  buffers.  In  particular,  it  indicates that registered buffers can be
              updated in-place, whereas before the full set would have to be unregistered  first.
              Available since kernel 5.13.

       The  rest  of  the  fields  in the struct io_uring_params are filled in by the kernel, and
       provide the information necessary to memory map the submission  queue,  completion  queue,
       and  the array of submission queue entries.  sq_entries specifies the number of submission
       queue entries allocated.  sq_off describes the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv[3];
           };

       Taken together, sq_entries and  sq_off  provide  all  of  the  information  necessary  for
       accessing  the  submission  queue  ring  buffer and the submission queue entry array.  The
       submission queue can be mapped with a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets  structure,  and  ring_fd  is  the  file  descriptor
       returned from io_uring_setup(2). The addition of sq_off.array to the length of the region
       accounts for the fact that the ring is located at the end of the data structure. As an
       example, the ring buffer head pointer can be accessed by adding sq_off.head to the address
       returned from mmap(2):

           head = ptr + sq_off.head;
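
       The length calculation and the pointer arithmetic above can be sketched as two small
       helpers. Both names, sq_ring_bytes and ring_field, are hypothetical. The cast through
       char * is an assumption made for portability, since arithmetic on a void * pointer is
       a compiler extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Length of the SQ ring mapping: everything up to the index array, plus
 * one __u32 slot per submission queue entry. */
static size_t sq_ring_bytes(const struct io_uring_params *p)
{
    assert(p != NULL);
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Address of a 32-bit ring field: base of the mapping plus the field's
 * byte offset taken from sq_off (or cq_off). */
static unsigned *ring_field(void *ring_base, __u32 offset)
{
    return (unsigned *)((char *)ring_base + offset);
}
```

       With these, the head pointer from the example above would be obtained as
       ring_field(ptr, sq_off.head).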

       The flags field is used by the kernel to communicate state information to the application.
       Currently,  it  is  used  to  inform  the  application when a call to io_uring_enter(2) is
       necessary.  See the documentation for the IORING_SETUP_SQPOLL  flag  above.   The  dropped
       member  is  incremented  for  each  invalid submission queue entry encountered in the ring
       buffer.

       The head and tail track the ring buffer state.  The tail is incremented by the application
       when  submitting  new I/O, and the head is incremented by the kernel when the I/O has been
       successfully submitted.  Determining the index of the  head  or  tail  into  the  ring  is
       accomplished by applying a mask:

           index = tail & ring_mask;
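
       The masking step works because head and tail are free-running counters that are never
       reset. The sketch below, with the hypothetical helpers ring_index and ring_pending,
       illustrates this under the assumption that the ring size is a power of two, so that
       ring_mask equals the number of entries minus one.

```c
#include <assert.h>
#include <stdint.h>

/* Slot in the ring that a free-running position counter refers to. */
static uint32_t ring_index(uint32_t pos, uint32_t ring_mask)
{
    /* A valid mask has the form 2^n - 1 (assumed, see lead-in). */
    assert((ring_mask & (ring_mask + 1)) == 0);
    return pos & ring_mask;
}

/* Entries queued by the producer that the consumer has not yet taken. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    /* Unsigned subtraction stays correct when the counters wrap. */
    return tail - head;
}
```

       For an 8-entry ring the mask is 7, so position 9 maps to slot 1, and the mapping
       remains continuous when the 32-bit counter wraps from 0xFFFFFFFF to 0.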

       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv[3];
           };

       The  completion  queue  is  simpler,  since  the  entries are not separated from the queue
       itself, and can be mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);

       Closing the  file  descriptor  returned  by  io_uring_setup(2)  will  free  all  resources
       associated with the io_uring context.
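
       Putting the three mappings and the teardown together, a minimal sketch might look
       like the following. The struct ring_maps type and the functions ring_map, ring_unmap,
       and ring_destroy are hypothetical names introduced for illustration. The sketch maps
       the SQ and CQ rings separately and does not take advantage of
       IORING_FEAT_SINGLE_MMAP.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_POPULATE */
#endif
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

/* Hypothetical bookkeeping for the three shared regions of one ring. */
struct ring_maps {
    void *sq_ring, *cq_ring, *sqes;
    size_t sq_len, cq_len, sqes_len;
};

/* Unmap whichever regions were mapped successfully. */
static void ring_unmap(struct ring_maps *m)
{
    if (m->sq_ring != MAP_FAILED) munmap(m->sq_ring, m->sq_len);
    if (m->cq_ring != MAP_FAILED) munmap(m->cq_ring, m->cq_len);
    if (m->sqes != MAP_FAILED) munmap(m->sqes, m->sqes_len);
    m->sq_ring = m->cq_ring = m->sqes = MAP_FAILED;
}

/* Map the SQ ring, CQ ring, and SQE array of ring_fd as described by *p.
 * Returns 0 on success, or -1 with nothing left mapped. */
static int ring_map(int ring_fd, const struct io_uring_params *p,
                    struct ring_maps *m)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_SHARED | MAP_POPULATE;

    assert(p != NULL && m != NULL);
    m->sq_len = p->sq_off.array + p->sq_entries * sizeof(__u32);
    m->cq_len = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
    m->sqes_len = p->sq_entries * sizeof(struct io_uring_sqe);

    m->sq_ring = mmap(NULL, m->sq_len, prot, flags, ring_fd, IORING_OFF_SQ_RING);
    m->cq_ring = mmap(NULL, m->cq_len, prot, flags, ring_fd, IORING_OFF_CQ_RING);
    m->sqes = mmap(NULL, m->sqes_len, prot, flags, ring_fd, IORING_OFF_SQES);

    if (m->sq_ring == MAP_FAILED || m->cq_ring == MAP_FAILED ||
        m->sqes == MAP_FAILED) {
        ring_unmap(m);
        return -1;
    }
    return 0;
}

/* Release everything: drop the mappings, then close the descriptor. */
static void ring_destroy(int ring_fd, struct ring_maps *m)
{
    ring_unmap(m);
    close(ring_fd);
}
```

       Unmapping before closing is a choice made here for tidiness; the text above only
       states that closing the descriptor frees the resources of the io_uring context.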


RETURN VALUE
       io_uring_setup(2)  returns  a  new  file  descriptor on success.  The application may then
       provide the file descriptor in a  subsequent  mmap(2)  call  to  map  the  submission  and
       completion queues, or to the io_uring_register(2) or io_uring_enter(2) system calls.

       On error, -1 is returned and errno is set appropriately.


ERRORS
       EFAULT params is outside your accessible address space.

       EINVAL The  resv  array  contains  non-zero  data,  p.flags  contains an unsupported flag,
              entries   is   out   of   bounds,   IORING_SETUP_SQ_AFF    was    specified,    but
              IORING_SETUP_SQPOLL   was   not,   or   IORING_SETUP_CQSIZE   was   specified,  but
              io_uring_params.cq_entries was invalid.

       EMFILE The per-process limit on the number of open file descriptors has been reached  (see
              the description of RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL  was specified, but the effective user ID of the caller did not
              have sufficient privileges.


SEE ALSO
       io_uring_register(2), io_uring_enter(2)