Provided by: liburing-dev_2.4-1_amd64 bug

NAME

       io_uring_setup - setup a context for performing asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The  io_uring_setup(2)  system  call  sets up a submission queue (SQ) and completion queue
       (CQ) with at least entries entries, and returns a file descriptor which  can  be  used  to
       perform  subsequent  operations  on  the io_uring instance.  The submission and completion
       queues are shared between userspace and the kernel, which eliminates the need to copy data
       when initiating and completing I/O.

       params  is  used  by  the  application to pass options to the kernel, and by the kernel to
       convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The flags, sq_thread_cpu, and sq_thread_idle fields are used  to  configure  the  io_uring
       instance.  flags is a bit mask of 0 or more of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform busy-waiting for an I/O completion, as opposed to getting notifications via
              an asynchronous IRQ (Interrupt Request).  The file system (if any) and block device
              must  support  polling  in  order  for  this  to work.  Busy-waiting provides lower
              latency, but may consume more CPU resources than interrupt driven I/O.   Currently,
              this  feature  is  usable only on a file descriptor opened using the O_DIRECT flag.
              When a read or write is submitted to a polled context, the  application  must  poll
              for  completions on the CQ ring by calling io_uring_enter(2).  It is illegal to mix
              and match polled and non-polled I/O on an io_uring instance.

              This is only applicable for storage devices for now, and the storage device must be
              configured  for polling. How to do that depends on the device type in question. For
              NVMe devices, the nvme driver must be loaded with the poll_queues parameter set  to
              the   desired  number  of  polling  queues.  The  polling  queues  will  be  shared
              appropriately between the CPUs in the system, if the number is less than the number
              of online CPU threads.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to perform submission queue
              polling.  An io_uring instance configured in this way  enables  an  application  to
              issue  I/O without ever context switching into the kernel.  By using the submission
              queue to fill in new submission queue entries and watching for completions  on  the
              completion  queue,  the application can submit and reap I/Os without doing a single
              system call.

              If the kernel thread is idle for more than sq_thread_idle milliseconds, it will set
              the  IORING_SQ_NEED_WAKEUP  bit  in the flags field of the struct io_sq_ring.  When
              this happens, the application  must  call  io_uring_enter(2)  to  wake  the  kernel
              thread.   If  I/O is kept busy, the kernel thread will never sleep.  An application
              making use of this feature will need to guard the io_uring_enter(2) call  with  the
              following code sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
                  unsigned flags = atomic_load_relaxed(sq_ring->flags);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

              where  sq_ring  is a submission queue ring setup using the struct io_sqring_offsets
              described below.

              Note that, when using a ring setup with  IORING_SETUP_SQPOLL,  you  never  directly
              call the io_uring_enter(2) system call. That is usually taken care of by liburing's
              io_uring_submit(3) function. It automatically determines if you are  using  polling
              mode  or  not  and  deals  with  when  your program needs to call io_uring_enter(2)
              without you having to bother about it.

              Before version 5.11 of the Linux kernel, to  successfully  use  this  feature,  the
              application   must   register   a   set   of  files  to  be  used  for  IO  through
              io_uring_register(2) using the IORING_REGISTER_FILES opcode. Failure to do so  will
              result  in submitted IO being errored with EBADF.  The presence of this feature can
              be detected by the IORING_FEAT_SQPOLL_NONFIXED feature flag.  In version  5.11  and
              later,  it  is no longer necessary to register files to use this feature. 5.11 also
              allows using this as non-root, if the user has the CAP_SYS_NICE capability. In 5.13
              this  requirement was also relaxed, and no special privileges are needed for SQPOLL
              in newer  kernels.  Certain  stable  kernels  older  than  5.13  may  also  support
              unprivileged SQPOLL.

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to the cpu set in the
              sq_thread_cpu field of the struct io_uring_params.  This flag  is  only  meaningful
              when  IORING_SETUP_SQPOLL  is  specified.  When  cgroup setting cpuset.cpus changes
              (typically in container environment), the bounded cpu set may be changed as well.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct  io_uring_params.cq_entries  entries.   The
              value must be greater than entries, and may be rounded up to the next power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES , then entries
              will be clamped at IORING_MAX_ENTRIES .  If the flag  IORING_SETUP_SQPOLL  is  set,
              and if the value of struct io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES
              , then it will be clamped at IORING_MAX_CQ_ENTRIES .

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct io_uring_params.wq_fd being  set
              to an existing io_uring ring file descriptor. When set, the io_uring instance being
              created will share the asynchronous worker thread backend of the specified io_uring
              ring, rather than create a new separate thread pool.

       IORING_SETUP_R_DISABLED
              If  this  flag is specified, the io_uring ring starts in a disabled state.  In this
              state, restrictions can be  registered,  but  submissions  are  not  allowed.   See
              io_uring_register(2) for details on how to enable the ring. Available since 5.10.

       IORING_SETUP_SUBMIT_ALL
              Normally  io_uring  stops  submitting  a batch of request, if one of these requests
              results in an error. This can cause submission of less than what is expected, if  a
              request ends in error while being submitted. If the ring is created with this flag,
              io_uring_enter(2) will continue submitting requests even if it encounters an  error
              submitting  a  request.  CQEs  are  still  posted for errored request regardless of
              whether or not this flag is set at ring creation time, the only  difference  is  if
              the  submit  sequence  is  halted or continued when an error is observed. Available
              since 5.18.

       IORING_SETUP_COOP_TASKRUN
              By default, io_uring will interrupt a task running in userspace when  a  completion
              event  comes  in.  This is to ensure that completions run in a timely manner. For a
              lot of use cases, this is overkill and can cause reduced performance from both  the
              inter-processor interrupt used to do this, the kernel/user transition, the needless
              interruption of the tasks userspace activities, and reduced batching if completions
              come in at a rapid rate. Most applications don't need the forceful interruption, as
              the events are processed at any kernel/user transition. The  exception  are  setups
              where  the  application uses multiple threads operating on the same ring, where the
              application waiting on completions isn't the one  that  submitted  them.  For  most
              other use cases, setting this flag will improve performance. Available since 5.19.

       IORING_SETUP_TASKRUN_FLAG
              Used   in   conjunction  with  IORING_SETUP_COOP_TASKRUN,  this  provides  a  flag,
              IORING_SQ_TASKRUN, which is set in the  SQ  ring  flags  whenever  completions  are
              pending that should be processed. liburing will check for this flag even when doing
              io_uring_peek_cqe(3) and enter the kernel to process them, and applications can  do
              the  same.  This makes IORING_SETUP_TASKRUN_FLAG safe to use even when applications
              rely on a peek style operation on the CQ ring to see if anything might  be  pending
              to reap. Available since 5.19.

       IORING_SETUP_SQE128
              If  set,  io_uring  will  use  128-byte  SQEs  rather than the normal 64-byte sized
              variant. This is a requirement for using certain request types, as of 5.19 only the
              IORING_OP_URING_CMD  passthrough command for NVMe passthrough needs this. Available
              since 5.19.

       IORING_SETUP_CQE32
              If set, io_uring will use  32-byte  CQEs  rather  than  the  normal  16-byte  sized
              variant. This is a requirement for using certain request types, as of 5.19 only the
              IORING_OP_URING_CMD passthrough command for NVMe passthrough needs this.  Available
              since 5.19.

       IORING_SETUP_SINGLE_ISSUER
              A  hint  to  the  kernel  that only a single task (or thread) will submit requests,
              which is used for internal optimisations. The submission task is  either  the  task
              that  created  the  ring, or if IORING_SETUP_R_DISABLED is specified then it is the
              task that enables the ring through io_uring_register(2).  The kernel enforces  this
              rule, failing requests with -EEXIST if the restriction is violated.  Note that when
              IORING_SETUP_SQPOLL is set it is considered that the  polling  task  is  doing  all
              submissions  on  behalf  of  the  userspace and so it always complies with the rule
              disregarding how many userspace tasks do io_uring_enter(2).  Available since 6.0.

       IORING_SETUP_DEFER_TASKRUN
              By default, io_uring will process all outstanding work at the  end  of  any  system
              call  or  thread  interrupt.  This  can  delay  the  application  from making other
              progress.  Setting this flag will hint to io_uring that it should defer work  until
              an io_uring_enter(2) call with the IORING_ENTER_GETEVENTS flag set. This allows the
              application to request work to run just before it  wants  to  process  completions.
              This flag requires the IORING_SETUP_SINGLE_ISSUER flag to be set, and also enforces
              that the call to io_uring_enter(2) is called from the same  thread  that  submitted
              requests.    Note   that  if  this  flag  is  set  then  it  is  the  application's
              responsibility to periodically trigger work (for example via any of the CQE waiting
              functions) or else completions may not be delivered.  Available since 6.1.

       If  no  flags are specified, the io_uring instance is setup for interrupt driven I/O.  I/O
       may be submitted using io_uring_enter(2) and can  be  reaped  by  polling  the  completion
       queue.

       The resv array must be initialized to zero.

       features is filled in by the kernel, which specifies various features supported by current
       kernel version.

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the two SQ and CQ rings can be mapped with  a  single  mmap(2)
              call.  The  SQEs  must  still  be  allocated  separately. This brings the necessary
              mmap(2) calls down from three to two. Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports almost never dropping completion events.  If
              a  completion  event  occurs  and  the CQ ring is full, the kernel stores the event
              internally until such a time that the CQ ring has room for more  entries.  If  this
              overflow  condition  is  entered,  attempting  to submit more IO will fail with the
              -EBUSY error value, if it can't flush the overflown events to the CQ ring. If  this
              happens,  the  application must reap events from the CQ ring and attempt the submit
              again. If the kernel has no free memory to store the event internally  it  will  be
              visible by an increase in the overflow value on the cqring.  Available since kernel
              5.5. Additionally io_uring_enter(2) will return  -EBADR  the  next  time  it  would
              otherwise sleep waiting for completions (since kernel 5.19).

       IORING_FEAT_SUBMIT_STABLE
              If  this  flag  is set, applications can be certain that any data for async offload
              has been consumed when the kernel has consumed the SQE. Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If  this  flag   is   set,   applications   can   specify   offset   ==   -1   with
              IORING_OP_{READV,WRITEV}       ,       IORING_OP_{READ,WRITE}_FIXED      ,      and
              IORING_OP_{READ,WRITE} to mean current file position, which behaves like preadv2(2)
              and  pwritev2(2)  with  offset  ==  -1.   It'll  use  (and update) the current file
              position. This obviously comes with the caveat that if the application has multiple
              reads  or  writes  in  flight, then the end result will not be as expected. This is
              similar to threads sharing a file descriptor and doing IO using  the  current  file
              position. Available since kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both sync and async execution of
              a request assumes the credentials of the  task  that  called  io_uring_enter(2)  to
              queue  the  requests.  If  this  flag  isn't set, then requests are issued with the
              credentials of the task that originally registered the io_uring. If only  one  task
              is  using  a  ring, then this flag doesn't matter as the credentials will always be
              the same. Note that  this  is  the  default  behavior,  tasks  can  still  register
              different        personalities        through       io_uring_register(2)       with
              IORING_REGISTER_PERSONALITY  and  specify  the  personality  to  use  in  the  sqe.
              Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If  this  flag  is  set, then io_uring supports using an internal poll mechanism to
              drive data/space readiness. This means that requests that cannot read or write data
              to a file no longer need to be punted to an async thread for handling, instead they
              will begin operation when the file is ready.  This  is  similar  to  doing  poll  +
              read/write  in  userspace,  but  eliminates the need to do so. If this flag is set,
              requests waiting on space/data consume a lot less resources doing so  as  they  are
              not blocking a thread. Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If  this  flag is set, the IORING_OP_POLL_ADD command accepts the full 32-bit range
              of epoll based flags. Most notably EPOLLEXCLUSIVE which  allows  exclusive  (waking
              single waiters) behavior. Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If  this flag is set, the IORING_SETUP_SQPOLL feature no longer requires the use of
              fixed files. Any normal file descriptor can be used for IO commands without needing
              registration. Available since kernel 5.11.

       IORING_FEAT_ENTER_EXT_ARG
              If  this flag is set, then the io_uring_enter(2) system call supports passing in an
              extended argument instead of just the sigset_t of earlier kernels. This.   extended
              argument  is of type struct io_uring_getevents_arg and allows the caller to pass in
              both a sigset_t and a timeout argument for waiting on events. The struct layout  is
              as follows:

               struct io_uring_getevents_arg {
                  __u64 sigmask;
                  __u32 sigmask_sz;
                  __u32 pad;
                  __u64 ts;
              };

              and  a  pointer  to this struct must be passed in if IORING_ENTER_EXT_ARG is set in
              the flags for the enter system call. Available since kernel 5.11.

       IORING_FEAT_NATIVE_WORKERS
              If this flag is set, io_uring is  using  native  workers  for  its  async  helpers.
              Previous  kernels  used  kernel  threads  that assumed the identity of the original
              io_uring owning task, but later kernels will actively create what looks  more  like
              regular process threads instead. Available since kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If  this flag is set, then io_uring supports a variety of features related to fixed
              files and buffers. In particular, it  indicates  that  registered  buffers  can  be
              updated  in-place, whereas before the full set would have to be unregistered first.
              Available since kernel 5.13.

       IORING_FEAT_CQE_SKIP
              If this flag is set, then io_uring supports setting IOSQE_CQE_SKIP_SUCCESS  in  the
              submitted  SQE,  indicating  that  no  CQE  should  be generated for this SQE if it
              executes normally. If  an  error  happens  processing  the  SQE,  a  CQE  with  the
              appropriate error value will still be generated. Available since kernel 5.17.

       IORING_FEAT_LINKED_FILE
              If  this flag is set, then io_uring supports sane assignment of files for SQEs that
              have  dependencies.  For  example,  if  a  chain  of  SQEs   are   submitted   with
              IOSQE_IO_LINK,  then  kernels without this flag will prepare the file for each link
              upfront.  If a previous link opens  a  file  with  a  known  index,  eg  if  direct
              descriptors are used with open or accept, then file assignment needs to happen post
              execution of that SQE. If this flag  is  set,  then  the  kernel  will  defer  file
              assignment  until  execution  of a given request is started. Available since kernel
              5.17.

       IORING_FEAT_REG_REG_RING
              If this flag is set, then io_uring supports calling  io_uring_register(2)  using  a
              registered  ring  fd,  via  IORING_REGISTER_USE_REGISTERED_RING.   Available  since
              kernel 6.3.

       The rest of the fields in the struct io_uring_params are filled  in  by  the  kernel,  and
       provide  the  information  necessary to memory map the submission queue, completion queue,
       and the array of submission queue entries.  sq_entries specifies the number of  submission
       queue entries allocated.  sq_off describes the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv[3];
           };

       Taken  together,  sq_entries  and  sq_off  provide  all  of  the information necessary for
       accessing the submission queue ring buffer and the  submission  queue  entry  array.   The
       submission queue can be mapped with a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where  sq_off  is  the  io_sqring_offsets  structure,  and  ring_fd is the file descriptor
       returned from io_uring_setup(2).  The addition of sq_off.array to the length of the region
       accounts  for  the  fact that the ring is located at the end of the data structure.  As an
       example, the ring buffer head pointer can be accessed by adding sq_off.head to the address
       returned from mmap(2):

           head = ptr + sq_off.head;

       The flags field is used by the kernel to communicate state information to the application.
       Currently, it is used to inform the  application  when  a  call  to  io_uring_enter(2)  is
       necessary.   See  the  documentation  for the IORING_SETUP_SQPOLL flag above.  The dropped
       member is incremented for each invalid submission queue  entry  encountered  in  the  ring
       buffer.

       The head and tail track the ring buffer state.  The tail is incremented by the application
       when submitting new I/O, and the head is incremented by the kernel when the I/O  has  been
       successfully  submitted.   Determining  the  index  of  the  head or tail into the ring is
       accomplished by applying a mask:

           index = tail & ring_mask;

       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv[3];
           };

       The completion queue is simpler, since the  entries  are  not  separated  from  the  queue
       itself, and can be mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);

       Closing  the  file  descriptor  returned  by  io_uring_setup(2)  will  free  all resources
       associated with the io_uring context. Note that this may happen asynchronously within  the
       kernel, so it is not guaranteed that resources are freed immediately.

RETURN VALUE

       io_uring_setup(2)  returns  a  new  file  descriptor on success.  The application may then
       provide the file descriptor in a  subsequent  mmap(2)  call  to  map  the  submission  and
       completion queues, or to the io_uring_register(2) or io_uring_enter(2) system calls.

       On error, a negative error code is returned. The caller should not rely on errno variable.

ERRORS

       EFAULT params is outside your accessible address space.

       EINVAL The  resv  array  contains  non-zero  data,  p.flags  contains an unsupported flag,
              entries   is   out   of   bounds,   IORING_SETUP_SQ_AFF    was    specified,    but
              IORING_SETUP_SQPOLL   was   not,   or   IORING_SETUP_CQSIZE   was   specified,  but
              io_uring_params.cq_entries was invalid.

       EMFILE The per-process limit on the number of open file descriptors has been reached  (see
              the description of RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL  was specified, but the effective user ID of the caller did not
              have sufficient privileges.

SEE ALSO

       io_uring_register(2), io_uring_enter(2)