Provided by: liburing-dev_2.6-1_amd64

NAME

       io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The io_uring_setup(2) system call sets up a submission queue (SQ) and completion queue (CQ) with at least
       entries entries, and returns a file descriptor which can be used to perform subsequent operations on  the
       io_uring  instance.   The  submission  and completion queues are shared between userspace and the kernel,
       which eliminates the need to copy data when initiating and completing I/O.

       The p argument (a struct io_uring_params) is used by the application to pass options to the kernel,
       and by the kernel to convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };

       The  flags,  sq_thread_cpu, and sq_thread_idle fields are used to configure the io_uring instance.  flags
       is a bit mask of 0 or more of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform  busy-waiting  for  an  I/O  completion,  as  opposed  to  getting  notifications  via  an
              asynchronous  IRQ  (Interrupt  Request).   The  file system (if any) and block device must support
              polling in order for this to work.  Busy-waiting provides lower latency, but may consume more  CPU
              resources  than interrupt driven I/O.  Currently, this feature is usable only on a file descriptor
              opened using the O_DIRECT flag.  When a read or write  is  submitted  to  a  polled  context,  the
              application  must poll for completions on the CQ ring by calling io_uring_enter(2).  It is illegal
              to mix and match polled and non-polled I/O on an io_uring instance.

              This is only applicable for storage devices for now, and the storage device must be configured for
              polling.  How to do that depends on the device type in question. For NVMe devices, the nvme driver
              must be loaded with the poll_queues parameter set to the desired number  of  polling  queues.  The
              polling  queues will be shared appropriately between the CPUs in the system, if the number is less
              than the number of online CPU threads.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to perform submission queue  polling.   An
              io_uring  instance configured in this way enables an application to issue I/O without ever context
              switching into the kernel.  By using the submission queue to fill in new submission queue  entries
              and  watching  for  completions  on the completion queue, the application can submit and reap I/Os
              without doing a single system call.

              If the kernel thread  is  idle  for  more  than  sq_thread_idle  milliseconds,  it  will  set  the
              IORING_SQ_NEED_WAKEUP  bit  in  the  flags field of the struct io_sq_ring.  When this happens, the
              application must call io_uring_enter(2) to wake the kernel thread.   If  I/O  is  kept  busy,  the
              kernel  thread will never sleep.  An application making use of this feature will need to guard the
              io_uring_enter(2) call with the following code sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. A store followed by a load is only
                   * ordered by a full memory barrier, so issue one before the
                   * flags read, as otherwise the application and the kernel
                   * might not agree on the consistency of the wakeup flag.
                   */
                  atomic_thread_fence(memory_order_seq_cst);
                  unsigned flags = atomic_load_relaxed(sq_ring->flags);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

              where sq_ring is a submission queue ring set up using the struct io_sqring_offsets described
              below.

              Note that, when using liburing with a ring set up with IORING_SETUP_SQPOLL, you normally do not
              call the io_uring_enter(2) system call directly. That is usually taken care of by liburing's
              io_uring_submit(3) function, which automatically determines whether polling mode is in use and
              calls io_uring_enter(2) when your program needs it.

              Before version 5.11 of the Linux kernel, to successfully use this feature, the application must
              register a set of files to be used for IO through io_uring_register(2) using the
              IORING_REGISTER_FILES opcode. Failure to do so will result in submitted IO being errored with
              EBADF. In version 5.11 and later, it is no longer necessary to register files to use this
              feature; this relaxation can be detected by the IORING_FEAT_SQPOLL_NONFIXED feature flag.
              5.11 also allows using SQPOLL as non-root, if the user has the CAP_SYS_NICE capability. In 5.13
              this requirement was relaxed as well, and no special privileges are needed for SQPOLL in newer
              kernels. Certain stable kernels older than 5.13 may also support unprivileged SQPOLL.
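
              The wakeup guard described above might look as follows with C11 atomics. This is only a sketch:
              ktail and kflags are assumed to point at the mapped ring fields at sq_off.tail and
              sq_off.flags, the function name is made up for illustration, and the flag value is repeated
              locally so the example compiles without kernel headers (real code should use
              IORING_SQ_NEED_WAKEUP from <linux/io_uring.h>).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as IORING_SQ_NEED_WAKEUP; repeated here only for the sketch. */
#define SQ_NEED_WAKEUP_BIT (1U << 0)

/*
 * Publish a new SQ tail and report whether the SQPOLL thread must be woken.
 * ktail and kflags stand for the mapped fields at sq_off.tail / sq_off.flags.
 */
bool publish_tail_needs_wakeup(_Atomic unsigned *ktail,
                               _Atomic unsigned *kflags,
                               unsigned new_tail)
{
    assert(ktail != NULL && kflags != NULL);
    /* Release store: the SQE contents become visible before the new tail. */
    atomic_store_explicit(ktail, new_tail, memory_order_release);
    /* Full fence: order the tail store before the flags load. */
    atomic_thread_fence(memory_order_seq_cst);
    unsigned flags = atomic_load_explicit(kflags, memory_order_relaxed);
    return (flags & SQ_NEED_WAKEUP_BIT) != 0;
}
```

              When the function returns true, the application would call io_uring_enter(2) with
              IORING_ENTER_SQ_WAKEUP, as in the sequence shown earlier.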

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to the CPU set in the sq_thread_cpu
              field of the struct io_uring_params. This flag is only meaningful when IORING_SETUP_SQPOLL is
              specified. When the cgroup setting cpuset.cpus changes (typically in a container environment),
              the bound CPU set may change as well.
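
              As an illustration, the parameters for a pinned poll thread could be prepared roughly like
              this. The helper name is hypothetical, and the sketch assumes the <linux/io_uring.h> header is
              installed.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Prepare params for an SQPOLL ring whose poll thread is pinned to one CPU. */
void prepare_sqpoll_params(struct io_uring_params *p, unsigned cpu,
                           unsigned idle_ms)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));      /* the resv array must be zero */
    p->flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p->sq_thread_cpu = cpu;        /* only used together with IORING_SETUP_SQ_AFF */
    p->sq_thread_idle = idle_ms;   /* idle time in milliseconds before NEED_WAKEUP */
}
```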

       IORING_SETUP_CQSIZE
              Create  the  completion  queue  with struct io_uring_params.cq_entries entries.  The value must be
              greater than entries, and may be rounded up to the next power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES, then entries will be clamped
              at  IORING_MAX_ENTRIES.   If  the  flag  IORING_SETUP_CQSIZE  is  set,  and if the value of struct
              io_uring_params.cq_entries  exceeds  IORING_MAX_CQ_ENTRIES,   then   it   will   be   clamped   at
              IORING_MAX_CQ_ENTRIES.
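
              The sizing rules can be modelled in userspace as below. This is a sketch, not the kernel's
              code: the function names are hypothetical, the limit is passed in as a parameter because
              IORING_MAX_ENTRIES and IORING_MAX_CQ_ENTRIES are not exported to applications, and rounding up
              to a power of two is assumed to apply to the requested size.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Round v up to the next power of two (v must be in 1..2^31). */
unsigned roundup_pow2(unsigned v)
{
    assert(v != 0 && v <= (1U << 31));
    unsigned p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/* Model of how a requested ring size is clamped and rounded. */
int effective_ring_entries(unsigned requested, unsigned limit, bool clamp,
                           unsigned *out)
{
    if (requested == 0)
        return -EINVAL;
    if (requested > limit) {
        if (!clamp)
            return -EINVAL;   /* out of bounds without IORING_SETUP_CLAMP */
        requested = limit;    /* clamped with IORING_SETUP_CLAMP */
    }
    *out = roundup_pow2(requested);
    return 0;
}
```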

       IORING_SETUP_ATTACH_WQ
              This  flag should be set in conjunction with struct io_uring_params.wq_fd being set to an existing
              io_uring ring file descriptor. When set, the  io_uring  instance  being  created  will  share  the
              asynchronous  worker  thread  backend  of  the  specified  io_uring ring, rather than create a new
              separate thread pool.

       IORING_SETUP_R_DISABLED
              If this flag is specified, the  io_uring  ring  starts  in  a  disabled  state.   In  this  state,
              restrictions  can  be  registered,  but submissions are not allowed.  See io_uring_register(2) for
              details on how to enable the ring. Available since 5.10.

       IORING_SETUP_SUBMIT_ALL
              Normally io_uring stops submitting a batch of requests if one of these requests results in an
              error. This can cause fewer requests to be submitted than expected if a request ends in error
              while being submitted. If the ring is created with this flag, io_uring_enter(2) will continue
              submitting requests even if it encounters an error submitting a request. CQEs are still posted
              for errored requests regardless of whether or not this flag is set at ring creation time; the
              only difference is whether the submit sequence is halted or continued when an error is
              observed. Available since 5.18.

       IORING_SETUP_COOP_TASKRUN
              By default, io_uring will interrupt a task running in userspace when a completion event comes in.
              This is to ensure that completions run in a timely manner. For a lot of use cases, this is
              overkill and can cause reduced performance from the inter-processor interrupt used to do this,
              the kernel/user transition, the needless interruption of the task's userspace activities, and
              reduced batching if completions come in at a rapid rate. Most applications don't need the
              forceful interruption, as the events are processed at any kernel/user transition. The exception
              is setups where the application uses multiple threads operating on the same ring, where the
              thread waiting on completions isn't the one that submitted them. For most other use cases,
              setting this flag will improve performance. Available since 5.19.

       IORING_SETUP_TASKRUN_FLAG
              Used in conjunction with IORING_SETUP_COOP_TASKRUN, this provides a flag, IORING_SQ_TASKRUN, which
              is set in the SQ ring flags whenever completions are pending that should  be  processed.  liburing
              will  check  for  this  flag  even when doing io_uring_peek_cqe(3) and enter the kernel to process
              them, and applications can do the same. This makes IORING_SETUP_TASKRUN_FLAG safe to use even when
              applications  rely on a peek style operation on the CQ ring to see if anything might be pending to
              reap. Available since 5.19.

       IORING_SETUP_SQE128
              If set, io_uring will use 128-byte SQEs rather than the normal 64-byte sized variant. This is a
              requirement for using certain request types; as of 5.19, only the IORING_OP_URING_CMD
              passthrough command for NVMe passthrough needs this. Available since 5.19.

       IORING_SETUP_CQE32
              If set, io_uring will use 32-byte CQEs rather than the normal 16-byte sized variant. This is a
              requirement for using certain request types; as of 5.19, only the IORING_OP_URING_CMD
              passthrough command for NVMe passthrough needs this. Available since 5.19.

       IORING_SETUP_SINGLE_ISSUER
              A hint to the kernel that only a single task (or thread) will submit requests, which is  used  for
              internal  optimisations.  The  submission  task  is  either  the task that created the ring, or if
              IORING_SETUP_R_DISABLED  is  specified  then  it  is  the  task  that  enables  the  ring  through
              io_uring_register(2).   The  kernel  enforces  this  rule,  failing  requests  with -EEXIST if the
              restriction is violated.  Note that when IORING_SETUP_SQPOLL is set, the polling task is
              considered to be doing all submissions on behalf of userspace, so it always complies with the
              rule regardless of how many userspace tasks call io_uring_enter(2).  Available since 6.0.

       IORING_SETUP_DEFER_TASKRUN
              By default, io_uring will process all outstanding work at the end of any  system  call  or  thread
              interrupt. This can delay the application from making other progress.  Setting this flag will hint
              to  io_uring  that  it  should   defer   work   until   an   io_uring_enter(2)   call   with   the
              IORING_ENTER_GETEVENTS flag set. This allows the application to request work to run just before it
              wants to process completions.  This flag requires the IORING_SETUP_SINGLE_ISSUER flag to  be  set,
              and also enforces that io_uring_enter(2) is called from the same thread that submitted
              requests.  Note that if  this  flag  is  set  then  it  is  the  application's  responsibility  to
              periodically  trigger  work (for example via any of the CQE waiting functions) or else completions
              may not be delivered.  Available since 6.1.

       IORING_SETUP_NO_MMAP
              By default, io_uring allocates kernel memory that callers must subsequently mmap(2).  If this flag
              is  set,  io_uring  instead  uses  caller-allocated buffers; p->cq_off.user_addr must point to the
              memory for the sq/cq rings, and p->sq_off.user_addr must point to the memory for the  sqes.   Each
              allocation  must  be  contiguous  memory.  Typically, callers should allocate this memory by using
              mmap(2) to allocate a huge page.  If this flag  is  set,  a  subsequent  attempt  to  mmap(2)  the
              io_uring file descriptor will fail.  Available since 6.5.

       IORING_SETUP_REGISTERED_FD_ONLY
              If  this  flag  is set, io_uring will register the ring file descriptor, and return the registered
              descriptor index, without ever allocating an unregistered file descriptor. The caller will need to
              use  IORING_REGISTER_USE_REGISTERED_RING  when calling io_uring_register(2).  This flag only makes
              sense when used alongside IORING_SETUP_NO_MMAP, which also needs to be set.  Available  since
              6.5.

       IORING_SETUP_NO_SQARRAY
              If  this  flag is set, entries in the submission queue will be submitted in order, wrapping around
              to the first entry after reaching the end of the queue. In other words,  there  will  be  no  more
              indirection  via  the  array  of submission entries, and the queue will be indexed directly by the
              submission queue tail and the range of indexes represented by it modulo queue size.  Consequently,
              the  user  should  not  map the array of submission queue entries, and the corresponding offset in
              struct io_sqring_offsets will be set to zero. Available since 6.6.
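
       Several of the flags above depend on one another. A pre-flight check along the following lines can
       catch bad combinations before calling io_uring_setup(2). It is a sketch that covers only the
       dependencies stated on this page; the function and enum names are hypothetical, and the bit values
       are repeated from <linux/io_uring.h> so that the example also builds against older headers.

```c
#include <errno.h>

/* Bit values mirror the IORING_SETUP_* definitions in <linux/io_uring.h>. */
enum {
    SETUP_SQPOLL             = 1U << 1,
    SETUP_SQ_AFF             = 1U << 2,
    SETUP_SINGLE_ISSUER      = 1U << 12,
    SETUP_DEFER_TASKRUN      = 1U << 13,
    SETUP_NO_MMAP            = 1U << 14,
    SETUP_REGISTERED_FD_ONLY = 1U << 15,
};

/* Return 0 if the documented flag dependencies hold, -EINVAL otherwise. */
int check_setup_flags(unsigned flags)
{
    /* IORING_SETUP_SQ_AFF is only meaningful with IORING_SETUP_SQPOLL. */
    if ((flags & SETUP_SQ_AFF) && !(flags & SETUP_SQPOLL))
        return -EINVAL;
    /* IORING_SETUP_DEFER_TASKRUN requires IORING_SETUP_SINGLE_ISSUER. */
    if ((flags & SETUP_DEFER_TASKRUN) && !(flags & SETUP_SINGLE_ISSUER))
        return -EINVAL;
    /* IORING_SETUP_REGISTERED_FD_ONLY requires IORING_SETUP_NO_MMAP. */
    if ((flags & SETUP_REGISTERED_FD_ONLY) && !(flags & SETUP_NO_MMAP))
        return -EINVAL;
    return 0;
}
```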

       If no flags are specified, the io_uring instance is set up for interrupt driven I/O.  I/O may be submitted
       using io_uring_enter(2) and can be reaped by polling the completion queue.

       The resv array must be initialized to zero.
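
       Without liburing, a default ring could be created through syscall(2) roughly as follows. This is a
       minimal sketch: the wrapper names are hypothetical, and it assumes the installed headers define
       SYS_io_uring_setup and provide <linux/io_uring.h>. Because syscall(2) reports failures through errno,
       the wrapper converts them to the negative error code convention used on this page.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Returns the ring fd on success or a negative errno value on failure. */
int ring_setup(unsigned entries, struct io_uring_params *p)
{
    assert(p != NULL);
    long ret = syscall(SYS_io_uring_setup, entries, p);
    return ret < 0 ? -errno : (int)ret;
}

/* Create a plain interrupt driven ring: no flags, everything else zeroed. */
int ring_setup_default(unsigned entries, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));   /* clears flags and the resv array */
    return ring_setup(entries, p);
}
```

       On success the kernel has filled in sq_entries, cq_entries, features, sq_off and cq_off, and the
       returned descriptor should eventually be released with close(2).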

       The features field is filled in by the kernel and specifies various features supported by the
       current kernel version:

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the two SQ and CQ rings can be mapped with a single mmap(2)  call.  The  SQEs
              must  still  be  allocated  separately. This brings the necessary mmap(2) calls down from three to
              two. Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports almost never dropping completion events. A dropped event
              can only occur if the kernel runs out of memory, in which case the application and others will
              likely be OOM killed anyway. If a completion event occurs and the CQ ring is full, the kernel
              stores the event internally until the CQ ring has room for more entries. In earlier kernels, if
              this overflow condition is entered, attempting to submit more IO fails with the -EBUSY error
              value if the kernel can't flush the overflown events to the CQ ring. If this happens, the
              application must reap events from the CQ ring and attempt the submit again. If the kernel has
              no free memory to store the event internally, this is visible as an increase in the overflow
              value on the cqring. Additionally, since kernel 5.19, io_uring_enter(2) will return -EBADR the
              next time it would otherwise sleep waiting for completions. Available since kernel 5.5.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any data for async offload has been consumed
              when the kernel has consumed the SQE. Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1 with IORING_OP_{READV,WRITEV},
              IORING_OP_{READ,WRITE}_FIXED, and IORING_OP_{READ,WRITE} to mean current file position, which
              behaves  like  preadv2(2)  and  pwritev2(2) with offset == -1.  It'll use (and update) the current
              file position. This obviously comes with the caveat that if the application has multiple reads  or
              writes  in flight, then the end result will not be as expected. This is similar to threads sharing
              a file descriptor and doing IO using the current file position. Available since kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If this flag is set, then io_uring guarantees that both sync and  async  execution  of  a  request
              assumes  the  credentials of the task that called io_uring_enter(2) to queue the requests. If this
              flag isn't set, then requests are  issued  with  the  credentials  of  the  task  that  originally
              registered  the  io_uring.  If only one task is using a ring, then this flag doesn't matter as the
              credentials will always be the same. Note that this is the default behavior; tasks can still
              register different personalities through io_uring_register(2) with IORING_REGISTER_PERSONALITY and
              specify the personality to use in the sqe. Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an internal poll mechanism to  drive  data/space
              readiness.  This means that requests that cannot read or write data to a file no longer need to be
              punted to an async thread for handling, instead they will begin operation when the file is  ready.
              This is similar to doing poll + read/write in userspace, but eliminates the need to do so. If this
              flag is set, requests waiting on space/data consume a lot less resources doing so as they are  not
              blocking a thread. Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If  this  flag is set, the IORING_OP_POLL_ADD command accepts the full 32-bit range of epoll based
              flags. Most notably EPOLLEXCLUSIVE  which  allows  exclusive  (waking  single  waiters)  behavior.
              Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If  this  flag  is set, the IORING_SETUP_SQPOLL feature no longer requires the use of fixed files.
              Any normal file descriptor can be used for IO commands  without  needing  registration.  Available
              since kernel 5.11.

       IORING_FEAT_ENTER_EXT_ARG
              If  this  flag  is  set,  then  the  io_uring_enter(2) system call supports passing in an extended
              argument instead of just the sigset_t of earlier kernels. This extended argument is of type
              struct  io_uring_getevents_arg  and  allows  the  caller  to pass in both a sigset_t and a timeout
              argument for waiting on events. The struct layout is as follows:

                  struct io_uring_getevents_arg {
                      __u64 sigmask;
                      __u32 sigmask_sz;
                      __u32 pad;
                      __u64 ts;
                  };

              and a pointer to this struct must be passed in if IORING_ENTER_EXT_ARG is set in the flags for the
              enter system call. Available since kernel 5.11.
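
              Filling in the extended argument for a wait with a timeout and no signal mask might look like
              the sketch below. It assumes kernel headers from 5.11 or later, and that ts carries the
              address of a struct __kernel_timespec; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/* Fill an extended argument that carries only a timeout (no signal mask). */
void prepare_timeout_arg(struct io_uring_getevents_arg *arg,
                         struct __kernel_timespec *ts,
                         long long sec, long long nsec)
{
    assert(arg != NULL && ts != NULL);
    ts->tv_sec = sec;
    ts->tv_nsec = nsec;
    memset(arg, 0, sizeof(*arg));        /* sigmask = 0, sigmask_sz = 0, pad = 0 */
    arg->ts = (uint64_t)(uintptr_t)ts;   /* pointer passed as a 64-bit value */
}
```

              The timespec must stay valid until the io_uring_enter(2) call that receives the pointer to arg
              has returned.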

       IORING_FEAT_NATIVE_WORKERS
              If  this  flag  is  set, io_uring is using native workers for its async helpers.  Previous kernels
              used kernel threads that assumed the identity of the original  io_uring  owning  task,  but  later
              kernels will actively create what looks more like regular process threads instead. Available since
              kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If this flag is set, then io_uring supports a variety of  features  related  to  fixed  files  and
              buffers.  In  particular,  it  indicates  that registered buffers can be updated in-place, whereas
              before the full set would have to be unregistered first. Available since kernel 5.13.

       IORING_FEAT_CQE_SKIP
              If this flag is set, then io_uring supports setting IOSQE_CQE_SKIP_SUCCESS in the  submitted  SQE,
              indicating  that  no  CQE  should  be  generated for this SQE if it executes normally. If an error
              happens processing the SQE, a CQE with the  appropriate  error  value  will  still  be  generated.
              Available since kernel 5.17.

       IORING_FEAT_LINKED_FILE
              If  this  flag  is  set,  then  io_uring  supports  sane  assignment  of  files for SQEs that have
              dependencies. For example, if a chain of SQEs is submitted with IOSQE_IO_LINK, then kernels
              without this flag will prepare the file for each link upfront. If a previous link opens a file
              with a known index, e.g. if direct descriptors are used with open or accept, then file assignment
              needs  to  happen post execution of that SQE. If this flag is set, then the kernel will defer file
              assignment until execution of a given request is started. Available since kernel 5.17.

       IORING_FEAT_REG_REG_RING
              If this flag is set, then io_uring supports calling io_uring_register(2) using a  registered  ring
              fd, via IORING_REGISTER_USE_REGISTERED_RING.  Available since kernel 6.3.
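
       An application would typically test these bits after io_uring_setup(2) returns and adapt its
       behaviour. The two helpers below are a sketch of that pattern for two of the flags described above;
       their names are hypothetical, and <linux/io_uring.h> is assumed to be available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Number of mmap(2) calls needed to map both rings and the SQE array. */
int mmap_calls_needed(const struct io_uring_params *p)
{
    assert(p != NULL);
    return (p->features & IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/* True if SQPOLL on this kernel still requires registered (fixed) files. */
bool sqpoll_needs_fixed_files(const struct io_uring_params *p)
{
    assert(p != NULL);
    return !(p->features & IORING_FEAT_SQPOLL_NONFIXED);
}
```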

       The  rest  of  the  fields  in  the  struct  io_uring_params are filled in by the kernel, and provide the
       information necessary to memory map the submission queue, completion queue, and the array  of  submission
       queue  entries.  sq_entries specifies the number of submission queue entries allocated.  sq_off describes
       the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv1;
               __u64 user_addr;
           };

       Taken together, sq_entries and sq_off  provide  all  of  the  information  necessary  for  accessing  the
       submission  queue  ring  buffer and the submission queue entry array.  The submission queue can be mapped
       with a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets structure,  and  ring_fd  is  the  file  descriptor  returned  from
       io_uring_setup(2).   The  addition of sq_off.array to the length of the region accounts for the fact that
       the ring is located at the end of the data structure.  As an example, the ring buffer head pointer can be
       accessed by adding sq_off.head to the address returned from mmap(2):

           head = ptr + sq_off.head;
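
       Arithmetic on a void pointer, as in the line above, is a compiler extension rather than standard C.
       A portable variant casts to a character pointer first; the helper name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the 32-bit ring field located off bytes into the mapping. */
unsigned *ring_field(void *ring_ptr, unsigned off)
{
    assert(ring_ptr != NULL);
    return (unsigned *)((char *)ring_ptr + off);
}
```

       With this helper, the head pointer would be obtained as ring_field(ptr, sq_off.head).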

       The flags field is used by the kernel to communicate state information to the application.  Currently, it
       is used to inform the application when a call to io_uring_enter(2) is necessary.  See  the  documentation
       for  the  IORING_SETUP_SQPOLL  flag above.  The dropped member is incremented for each invalid submission
       queue entry encountered in the ring buffer.

       The head and tail track the ring  buffer  state.   The  tail  is  incremented  by  the  application  when
       submitting  new  I/O,  and  the  head  is  incremented  by  the kernel when the I/O has been successfully
       submitted.  Determining the index of the head or tail into the ring is accomplished by applying a mask:

           index = tail & ring_mask;
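
       The head and tail are free-running 32-bit counters, so they are only masked when a slot is looked
       up. A small sketch of that arithmetic follows; the function names are hypothetical, and the mask is
       assumed to be the ring size minus one.

```c
#include <assert.h>

/* Slot in the ring for a free-running head or tail counter. */
unsigned ring_index(unsigned counter, unsigned ring_mask)
{
    /* ring_mask + 1 is the ring size and must be a power of two. */
    assert(((ring_mask + 1) & ring_mask) == 0);
    return counter & ring_mask;
}

/* Entries currently queued; unsigned subtraction stays correct across wraparound. */
unsigned ring_used(unsigned head, unsigned tail)
{
    return tail - head;
}
```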

       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv1;
               __u64 user_addr;
           };

       The completion queue is simpler, since the entries are not separated from the queue itself,  and  can  be
       mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);
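
       The two mapping lengths can be computed up front, as in the sketch below. The struct and function
       names are hypothetical. The single_mmap branch reflects the assumption that, when
       IORING_FEAT_SINGLE_MMAP is set, one mapping serves both rings and therefore has to cover the larger
       of the two sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ring_sizes {
    size_t sq_bytes;   /* length for the IORING_OFF_SQ_RING mapping */
    size_t cq_bytes;   /* length for the IORING_OFF_CQ_RING mapping */
};

struct ring_sizes ring_map_sizes(unsigned sq_array_off, unsigned sq_entries,
                                 unsigned cq_cqes_off, unsigned cq_entries,
                                 size_t cqe_size, bool single_mmap)
{
    /* 16-byte CQEs by default, 32-byte with IORING_SETUP_CQE32. */
    assert(cqe_size == 16 || cqe_size == 32);
    struct ring_sizes s;
    s.sq_bytes = sq_array_off + (size_t)sq_entries * sizeof(unsigned);
    s.cq_bytes = cq_cqes_off + (size_t)cq_entries * cqe_size;
    if (single_mmap) {
        /* One mapping backs both rings, so it must cover the larger one. */
        size_t max = s.sq_bytes > s.cq_bytes ? s.sq_bytes : s.cq_bytes;
        s.sq_bytes = s.cq_bytes = max;
    }
    return s;
}
```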

       Closing  the  file  descriptor  returned by io_uring_setup(2) will free all resources associated with the
       io_uring context. Note that this may happen asynchronously within the kernel, so  it  is  not  guaranteed
       that resources are freed immediately.

RETURN VALUE

       io_uring_setup(2)  returns  a  new file descriptor on success.  The application may then provide the file
       descriptor in a subsequent mmap(2)  call  to  map  the  submission  and  completion  queues,  or  to  the
       io_uring_register(2) or io_uring_enter(2) system calls.

       On error, a negative error code is returned. The caller should not rely on the errno variable.
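
       Since the error arrives as a negative return value, it has to be negated before being handed to
       strerror(3). A minimal sketch, with a hypothetical helper name:

```c
#include <errno.h>
#include <string.h>

/* Describe the result of io_uring_setup(): "ok" or the error message. */
const char *setup_result_string(int ret)
{
    return ret >= 0 ? "ok" : strerror(-ret);
}
```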

ERRORS

       EFAULT p points outside your accessible address space.

       EINVAL The resv array contains non-zero data, p.flags contains an unsupported flag, entries is out of
              bounds, IORING_SETUP_SQ_AFF was specified but IORING_SETUP_SQPOLL was not, IORING_SETUP_CQSIZE
              was specified but io_uring_params.cq_entries was invalid, or IORING_SETUP_REGISTERED_FD_ONLY
              was specified but IORING_SETUP_NO_MMAP was not.

       EMFILE The per-process limit on the number of open file descriptors has been reached (see the description
              of RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user ID of the caller did not have sufficient
              privileges.

       EPERM  /proc/sys/kernel/io_uring_disabled has the value 2, or it has the value 1 and the calling  process
              does not hold the CAP_SYS_ADMIN capability or is not a member of /proc/sys/kernel/io_uring_group.
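
       To tell the second EPERM case apart from the first, an application can inspect the sysctl itself.
       The sketch below uses a hypothetical function name and treats a missing or unreadable file as
       "unknown", which is what would be expected on kernels that predate the setting.

```c
#include <stdio.h>

/*
 * Read /proc/sys/kernel/io_uring_disabled.
 * Returns 0, 1 or 2 as stored in the file, or -1 if it cannot be read or parsed.
 */
int read_io_uring_disabled(void)
{
    FILE *f = fopen("/proc/sys/kernel/io_uring_disabled", "r");
    if (f == NULL)
        return -1;
    int value = -1;
    if (fscanf(f, "%d", &value) != 1 || value < 0 || value > 2)
        value = -1;
    fclose(f);
    return value;
}
```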

SEE ALSO

       io_uring_register(2), io_uring_enter(2)