Ubuntu Manpage: io_uring_setup - setup a context for performing asynchronous I/O

NAME

       io_uring_setup - setup a context for performing asynchronous I/O

SYNOPSIS

       #include <liburing.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

The io_uring_setup(2) system call sets up a submission queue (SQ) and completion queue
(CQ) with at least entries entries, and returns a file descriptor which can be used to
perform subsequent operations on the io_uring instance. The submission and completion
queues are shared between userspace and the kernel, which eliminates the need to copy data
when initiating and completing I/O.

The p argument is used by the application to pass options to the kernel, and by the
kernel to convey information about the ring buffers.

struct io_uring_params {
    __u32 sq_entries;
    __u32 cq_entries;
    __u32 flags;
    __u32 sq_thread_cpu;
    __u32 sq_thread_idle;
    __u32 features;
    __u32 wq_fd;
    __u32 resv[3];
    struct io_sqring_offsets sq_off;
    struct io_cqring_offsets cq_off;
};

The flags, sq_thread_cpu, and sq_thread_idle fields are used to configure the io_uring
instance. flags is a bit mask of 0 or more of the following values ORed together:

IORING_SETUP_IOPOLL
Perform busy-waiting for an I/O completion, as opposed to getting notifications via
an asynchronous IRQ (Interrupt Request). The file system (if any) and block device
must support polling in order for this to work. Busy-waiting provides lower
latency, but may consume more CPU resources than interrupt driven I/O. Currently,
this feature is usable only on a file descriptor opened using the O_DIRECT flag.
When a read or write is submitted to a polled context, the application must poll
for completions on the CQ ring by calling io_uring_enter(2). It is illegal to mix
and match polled and non-polled I/O on an io_uring instance.

This is only applicable for storage devices for now, and the storage device must be
configured for polling. How to do that depends on the device type in question. For
NVMe devices, the nvme driver must be loaded with the poll_queues parameter set to
the desired number of polling queues. The polling queues will be shared
appropriately between the CPUs in the system, if the number is less than the number
of online CPU threads.

IORING_SETUP_SQPOLL
When this flag is specified, a kernel thread is created to perform submission queue
polling. An io_uring instance configured in this way enables an application to
issue I/O without ever context switching into the kernel. By using the submission
queue to fill in new submission queue entries and watching for completions on the
completion queue, the application can submit and reap I/Os without doing a single
system call.

If the kernel thread is idle for more than sq_thread_idle milliseconds, it will set
the IORING_SQ_NEED_WAKEUP bit in the flags field of the struct io_sq_ring. When
this happens, the application must call io_uring_enter(2) to wake the kernel
thread. If I/O is kept busy, the kernel thread will never sleep. An application
making use of this feature will need to guard the io_uring_enter(2) call with the
following code sequence:

/*
 * Ensure that the wakeup flag is read after the tail pointer
 * has been written. It's important to use memory load acquire
 * semantics for the flags read, as otherwise the application
 * and the kernel might not agree on the consistency of the
 * wakeup flag.
 */
unsigned flags = atomic_load_acquire(sq_ring->flags);
if (flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);

where sq_ring is a submission queue ring set up using the struct io_sqring_offsets
described below.

Note that an application using liburing with a ring set up with IORING_SETUP_SQPOLL
normally does not call the io_uring_enter(2) system call directly. That is taken care
of by liburing's io_uring_submit(3) function, which determines whether the ring uses
polling mode and calls io_uring_enter(2) on the application's behalf when that is
required.

Before version 5.11 of the Linux kernel, to successfully use this feature, the
application must register a set of files to be used for IO through
io_uring_register(2) using the IORING_REGISTER_FILES opcode. Failure to do so will
result in submitted IO being errored with EBADF. The presence of this feature can
be detected by the IORING_FEAT_SQPOLL_NONFIXED feature flag. In version 5.11 and
later, it is no longer necessary to register files to use this feature. 5.11 also
allows using this as non-root, if the user has the CAP_SYS_NICE capability. In 5.13
this requirement was also relaxed, and no special privileges are needed for SQPOLL
in newer kernels. Certain stable kernels older than 5.13 may also support
unprivileged SQPOLL.

IORING_SETUP_SQ_AFF
If this flag is specified, then the poll thread will be bound to the CPU set in the
sq_thread_cpu field of the struct io_uring_params. This flag is only meaningful
when IORING_SETUP_SQPOLL is specified. When the cgroup setting cpuset.cpus changes
(typically in a container environment), the bound CPU set may be changed as well.
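
The following is a minimal sketch of how the SQPOLL-related fields fit together. The
helper name configure_sqpoll and its arguments are illustrative only and are not part
of any API; the sketch assumes the kernel header <linux/io_uring.h> is installed.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/*
 * Hypothetical helper: prepare params for a ring whose submission queue is
 * polled by a kernel thread bound to `cpu`. The thread goes idle after
 * `idle_ms` milliseconds without work.
 */
void configure_sqpoll(struct io_uring_params *p,
                      unsigned int cpu, unsigned int idle_ms)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));           /* resv must be zero */
    p->flags = IORING_SETUP_SQPOLL      /* kernel thread polls the SQ */
             | IORING_SETUP_SQ_AFF;     /* only meaningful with SQPOLL */
    p->sq_thread_cpu = cpu;
    p->sq_thread_idle = idle_ms;
}
```

The filled-in structure would then be passed to io_uring_setup(2) together with the
requested number of entries.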

IORING_SETUP_CQSIZE
Create the completion queue with struct io_uring_params.cq_entries entries. The
value must be greater than entries, and may be rounded up to the next power-of-two.

IORING_SETUP_CLAMP
If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES, then entries
will be clamped at IORING_MAX_ENTRIES. If the flag IORING_SETUP_CQSIZE is set, and
if the value of struct io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES,
then it will be clamped at IORING_MAX_CQ_ENTRIES.
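
As a rough illustration of IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP, the sketch below
requests an explicit completion queue size and shows the power-of-two rounding an
application might expect. Both helper names are hypothetical; the size actually chosen
is decided by the kernel and should be read back from cq_entries after
io_uring_setup(2) returns.

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Round v up to the next power of two (illustrative, for 1 <= v <= 2^31). */
unsigned int round_up_pow2(unsigned int v)
{
    unsigned int r = 1;

    assert(v >= 1 && v <= (1u << 31));
    while (r < v)
        r <<= 1;
    return r;
}

/*
 * Hypothetical helper: ask for a CQ with `cq_entries` entries and let the
 * kernel clamp oversized values instead of failing.
 */
void configure_cq_size(struct io_uring_params *p, unsigned int cq_entries)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP;
    p->cq_entries = cq_entries;
}
```

For example, a request for 100 completion entries would be expected to yield a ring of
round_up_pow2(100) entries if the kernel rounds the value up.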

IORING_SETUP_ATTACH_WQ
This flag should be set in conjunction with struct io_uring_params.wq_fd being set
to an existing io_uring ring file descriptor. When set, the io_uring instance being
created will share the asynchronous worker thread backend of the specified io_uring
ring, rather than create a new separate thread pool.

IORING_SETUP_R_DISABLED
If this flag is specified, the io_uring ring starts in a disabled state. In this
state, restrictions can be registered, but submissions are not allowed. See
io_uring_register(2) for details on how to enable the ring. Available since 5.10.

IORING_SETUP_SUBMIT_ALL
Normally io_uring stops submitting a batch of requests if one of these requests
results in an error. This can cause fewer requests to be submitted than expected if a
request ends in error while being submitted. If the ring is created with this flag,
io_uring_enter(2) will continue submitting requests even if it encounters an error
submitting a request. CQEs are still posted for errored requests regardless of
whether or not this flag is set at ring creation time; the only difference is whether
the submit sequence is halted or continued when an error is observed. Available
since 5.18.

IORING_SETUP_COOP_TASKRUN
By default, io_uring will interrupt a task running in userspace when a completion
event comes in. This is to ensure that completions run in a timely manner. For a
lot of use cases, this is overkill and can reduce performance through the
inter-processor interrupt used to do this, the kernel/user transition, the needless
interruption of the task's userspace activities, and reduced batching if completions
come in at a rapid rate. Most applications don't need the forceful interruption, as
the events are processed at any kernel/user transition. The exception is setups
where the application uses multiple threads operating on the same ring, and the
thread waiting on completions isn't the one that submitted them. For most
other use cases, setting this flag will improve performance. Available since 5.19.

IORING_SETUP_TASKRUN_FLAG
Used in conjunction with IORING_SETUP_COOP_TASKRUN, this provides a flag,
IORING_SQ_TASKRUN, which is set in the SQ ring flags whenever completions are
pending that should be processed. liburing will check for this flag even when doing
io_uring_peek_cqe(3) and enter the kernel to process them, and applications can do
the same. This makes IORING_SETUP_TASKRUN_FLAG safe to use even when applications
rely on a peek style operation on the CQ ring to see if anything might be pending
to reap. Available since 5.19.

IORING_SETUP_SQE128
If set, io_uring will use 128-byte SQEs rather than the normal 64-byte sized
variant. This is a requirement for using certain request types; as of 5.19, only the
IORING_OP_URING_CMD passthrough command for NVMe passthrough needs this. Available
since 5.19.

IORING_SETUP_CQE32
If set, io_uring will use 32-byte CQEs rather than the normal 16-byte sized
variant. This is a requirement for using certain request types; as of 5.19, only the
IORING_OP_URING_CMD passthrough command for NVMe passthrough needs this. Available
since 5.19.

IORING_SETUP_SINGLE_ISSUER
A hint to the kernel that only a single task (or thread) will submit requests,
which is used for internal optimisations. The submission task is either the task
that created the ring, or if IORING_SETUP_R_DISABLED is specified then it is the
task that enables the ring through io_uring_register(2). The kernel enforces this
rule, failing requests with -EEXIST if the restriction is violated. Note that when
IORING_SETUP_SQPOLL is set, the polling task is considered to be doing all
submissions on behalf of userspace, so the rule is always satisfied regardless of
how many userspace tasks call io_uring_enter(2). Available since 6.0.

IORING_SETUP_DEFER_TASKRUN
By default, io_uring will process all outstanding work at the end of any system
call or thread interrupt. This can delay the application from making other
progress. Setting this flag will hint to io_uring that it should defer work until
an io_uring_enter(2) call with the IORING_ENTER_GETEVENTS flag set. This allows the
application to request work to run just before it wants to process completions.
This flag requires the IORING_SETUP_SINGLE_ISSUER flag to be set, and also enforces
that the call to io_uring_enter(2) is called from the same thread that submitted
requests. Note that if this flag is set then it is the application's
responsibility to periodically trigger work (for example via any of the CQE waiting
functions) or else completions may not be delivered. Available since 6.1.

IORING_SETUP_NO_MMAP
By default, io_uring allocates kernel memory that callers must subsequently
mmap(2). If this flag is set, io_uring instead uses caller-allocated buffers;
p->cq_off.user_addr must point to the memory for the sq/cq rings, and
p->sq_off.user_addr must point to the memory for the sqes. Each allocation must be
contiguous memory. Typically, callers should allocate this memory by using mmap(2)
to allocate a huge page. If this flag is set, a subsequent attempt to mmap(2) the
io_uring file descriptor will fail. Available since 6.5.

IORING_SETUP_REGISTERED_FD_ONLY
If this flag is set, io_uring will register the ring file descriptor, and return
the registered descriptor index, without ever allocating an unregistered file
descriptor. The caller will need to use IORING_REGISTER_USE_REGISTERED_RING when
calling io_uring_register(2). This flag only makes sense when used together with
IORING_SETUP_NO_MMAP, which also needs to be set. Available since 6.5.

IORING_SETUP_NO_SQARRAY
If this flag is set, entries in the submission queue will be submitted in order,
wrapping around to the first entry after reaching the end of the queue. In other
words, there will be no more indirection via the array of submission entries, and
the queue will be indexed directly by the submission queue tail and the range of
indexes represented by it modulo queue size. Consequently, the user should not map
the array of submission queue entries, and the corresponding offset in struct
io_sqring_offsets will be set to zero. Available since 6.6.
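
Several of the flags above are only valid in combination with another flag. The sketch
below checks three such dependencies taken from the descriptions above before the
system call is attempted. The function name is hypothetical, and the fallback
definitions are an assumption for systems with older kernel headers; the kernel
remains the final authority and may reject other combinations with EINVAL.

```c
#include <stdbool.h>
#include <linux/io_uring.h>

/* Fallbacks for older kernel headers (assumed to match the kernel UAPI values). */
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER      (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN      (1U << 13)
#endif
#ifndef IORING_SETUP_NO_MMAP
#define IORING_SETUP_NO_MMAP            (1U << 14)
#endif
#ifndef IORING_SETUP_REGISTERED_FD_ONLY
#define IORING_SETUP_REGISTERED_FD_ONLY (1U << 15)
#endif

/* Hypothetical pre-flight check of documented flag dependencies. */
bool setup_flags_consistent(unsigned int flags)
{
    /* SQ_AFF is only meaningful together with SQPOLL. */
    if ((flags & IORING_SETUP_SQ_AFF) && !(flags & IORING_SETUP_SQPOLL))
        return false;
    /* DEFER_TASKRUN requires SINGLE_ISSUER. */
    if ((flags & IORING_SETUP_DEFER_TASKRUN) &&
        !(flags & IORING_SETUP_SINGLE_ISSUER))
        return false;
    /* REGISTERED_FD_ONLY requires NO_MMAP. */
    if ((flags & IORING_SETUP_REGISTERED_FD_ONLY) &&
        !(flags & IORING_SETUP_NO_MMAP))
        return false;
    return true;
}
```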

If no flags are specified, the io_uring instance is set up for interrupt driven I/O. I/O
may be submitted using io_uring_enter(2) and can be reaped by polling the completion
queue.

The resv array must be initialized to zero.
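
A minimal sketch of creating a ring with no flags follows. It invokes the raw system
call through syscall(2) rather than liburing, so the wrapper name ring_setup is
hypothetical. Because syscall(2) reports failure as -1 with errno set, the wrapper
converts the result to the negative error code convention described under RETURN
VALUE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/*
 * Hypothetical wrapper: zero the params (which also clears flags and the
 * resv array) and ask the kernel for a ring with at least `entries` entries.
 * Returns the ring file descriptor, or a negative error code.
 */
int ring_setup(unsigned int entries, struct io_uring_params *p)
{
    long ret;

    memset(p, 0, sizeof(*p));
    ret = syscall(__NR_io_uring_setup, entries, p);
    return ret < 0 ? -errno : (int)ret;
}
```

On success the kernel has filled in sq_entries, cq_entries, features, sq_off, and
cq_off, and the descriptor should eventually be released with close(2).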

features is filled in by the kernel and is a bit mask of the features supported by the
current kernel version:

IORING_FEAT_SINGLE_MMAP
If this flag is set, the two SQ and CQ rings can be mapped with a single mmap(2)
call. The SQEs must still be allocated separately. This brings the necessary
mmap(2) calls down from three to two. Available since kernel 5.4.

IORING_FEAT_NODROP
If this flag is set, io_uring almost never drops completion events. A dropped event
can only occur if the kernel runs out of memory, in which case a lost event is the
least of the application's problems, as it and others will likely be OOM killed
anyway. If a completion event occurs and the CQ ring is full, the kernel stores the
event internally until the CQ ring has room for more entries. In earlier kernels,
once this overflow condition was entered, attempting to submit more IO would fail
with the -EBUSY error value if the kernel could not flush the overflown events to the
CQ ring. If this happens, the application must reap events from the CQ ring and
attempt the submit again. If the kernel has no free memory to store the event
internally, the loss is visible as an increase in the overflow value on the cqring.
Additionally, since kernel 5.19, io_uring_enter(2) will return -EBADR the next time
it would otherwise sleep waiting for completions. Available since kernel 5.5.

IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data for async offload
has been consumed when the kernel has consumed the SQE. Available since kernel 5.5.

IORING_FEAT_RW_CUR_POS
If this flag is set, applications can specify offset == -1 with
IORING_OP_{READV,WRITEV}, IORING_OP_{READ,WRITE}_FIXED, and
IORING_OP_{READ,WRITE} to mean current file position, which behaves like preadv2(2)
and pwritev2(2) with offset == -1. It'll use (and update) the current file
position. This obviously comes with the caveat that if the application has multiple
reads or writes in flight, then the end result will not be as expected. This is
similar to threads sharing a file descriptor and doing IO using the current file
position. Available since kernel 5.6.

IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and async execution of
a request assumes the credentials of the task that called io_uring_enter(2) to
queue the requests. If this flag isn't set, then requests are issued with the
credentials of the task that originally registered the io_uring. If only one task
is using a ring, then this flag doesn't matter as the credentials will always be
the same. Note that this is the default behavior; tasks can still register
different personalities through io_uring_register(2) with
IORING_REGISTER_PERSONALITY and specify the personality to use in the sqe.
Available since kernel 5.6.

IORING_FEAT_FAST_POLL
If this flag is set, then io_uring supports using an internal poll mechanism to
drive data/space readiness. This means that requests that cannot read or write data
to a file no longer need to be punted to an async thread for handling, instead they
will begin operation when the file is ready. This is similar to doing poll +
read/write in userspace, but eliminates the need to do so. If this flag is set,
requests waiting on space/data consume a lot less resources doing so as they are
not blocking a thread. Available since kernel 5.7.

IORING_FEAT_POLL_32BITS
If this flag is set, the IORING_OP_POLL_ADD command accepts the full 32-bit range
of epoll based flags, most notably EPOLLEXCLUSIVE, which allows exclusive (waking
single waiters) behavior. Available since kernel 5.9.

IORING_FEAT_SQPOLL_NONFIXED
If this flag is set, the IORING_SETUP_SQPOLL feature no longer requires the use of
fixed files. Any normal file descriptor can be used for IO commands without needing
registration. Available since kernel 5.11.

IORING_FEAT_ENTER_EXT_ARG
If this flag is set, then the io_uring_enter(2) system call supports passing in an
extended argument instead of just the sigset_t of earlier kernels. This extended
argument is of type struct io_uring_getevents_arg and allows the caller to pass in
both a sigset_t and a timeout argument for waiting on events. The struct layout is
as follows:

struct io_uring_getevents_arg {
    __u64 sigmask;
    __u32 sigmask_sz;
    __u32 pad;
    __u64 ts;
};

and a pointer to this struct must be passed in if IORING_ENTER_EXT_ARG is set in
the flags for the enter system call. Available since kernel 5.11.
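
A sketch of filling in this structure for a wait with a timeout and no signal mask is
shown below. The helper name make_wait_arg is hypothetical, and the sketch assumes
kernel headers recent enough (5.11 or later) to define struct io_uring_getevents_arg.
Using _NSIG / 8 as the signal mask size is an assumption about what the kernel
expects.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>
#include <linux/time_types.h>

/*
 * Hypothetical helper: build the extended argument for io_uring_enter(2).
 * Pointers are carried in 64-bit integer fields, hence the casts.
 */
void make_wait_arg(struct io_uring_getevents_arg *arg,
                   const struct __kernel_timespec *timeout)
{
    assert(arg != NULL && timeout != NULL);
    memset(arg, 0, sizeof(*arg));           /* clears the padding field too */
    arg->sigmask = 0;                       /* no signal mask is passed */
    arg->sigmask_sz = _NSIG / 8;            /* size of the kernel sigset in bytes */
    arg->ts = (__u64)(uintptr_t)timeout;    /* timeout for the wait */
}
```

The address of the structure and its size would then be passed as the last two
arguments of the io_uring_enter(2) system call, with IORING_ENTER_GETEVENTS and
IORING_ENTER_EXT_ARG set in its flags.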

IORING_FEAT_NATIVE_WORKERS
If this flag is set, io_uring is using native workers for its async helpers.
Previous kernels used kernel threads that assumed the identity of the original
io_uring owning task, but later kernels will actively create what looks more like
regular process threads instead. Available since kernel 5.12.

IORING_FEAT_RSRC_TAGS
If this flag is set, then io_uring supports a variety of features related to fixed
files and buffers. In particular, it indicates that registered buffers can be
updated in-place, whereas before the full set would have to be unregistered first.
Available since kernel 5.13.

IORING_FEAT_CQE_SKIP
If this flag is set, then io_uring supports setting IOSQE_CQE_SKIP_SUCCESS in the
submitted SQE, indicating that no CQE should be generated for this SQE if it
executes normally. If an error happens processing the SQE, a CQE with the
appropriate error value will still be generated. Available since kernel 5.17.

IORING_FEAT_LINKED_FILE
If this flag is set, then io_uring supports sane assignment of files for SQEs that
have dependencies. For example, if a chain of SQEs is submitted with
IOSQE_IO_LINK, then kernels without this flag will prepare the file for each link
upfront. If a previous link opens a file with a known index, e.g. if direct
descriptors are used with open or accept, then file assignment needs to happen post
execution of that SQE. If this flag is set, then the kernel will defer file
assignment until execution of a given request is started. Available since kernel
5.17.

IORING_FEAT_REG_REG_RING
If this flag is set, then io_uring supports calling io_uring_register(2) using a
registered ring fd, via IORING_REGISTER_USE_REGISTERED_RING. Available since
kernel 6.3.
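
After a successful io_uring_setup(2), an application can test bits in features before
relying on a capability. The sketch below shows two such checks based on the
descriptions above. Both function names are hypothetical, and the fallback definition
is an assumption for older kernel headers.

```c
#include <stdbool.h>
#include <linux/io_uring.h>

/* Fallback for older kernel headers (assumed to match the kernel UAPI value). */
#ifndef IORING_FEAT_SQPOLL_NONFIXED
#define IORING_FEAT_SQPOLL_NONFIXED (1U << 7)
#endif

/* Number of mmap(2) calls needed to map both rings and the SQE array. */
int mmap_calls_needed(const struct io_uring_params *p)
{
    return (p->features & IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}

/* True if files must be registered before they can be used with SQPOLL. */
bool sqpoll_needs_fixed_files(const struct io_uring_params *p)
{
    return !(p->features & IORING_FEAT_SQPOLL_NONFIXED);
}
```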

The rest of the fields in the struct io_uring_params are filled in by the kernel, and
provide the information necessary to memory map the submission queue, completion queue,
and the array of submission queue entries. sq_entries specifies the number of submission
queue entries allocated. sq_off describes the offsets of various ring buffer fields:

struct io_sqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 flags;
    __u32 dropped;
    __u32 array;
    __u32 resv1;
    __u64 user_addr;
};

Taken together, sq_entries and sq_off provide all of the information necessary for
accessing the submission queue ring buffer and the submission queue entry array. The
submission queue can be mapped with a call like:

ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
           ring_fd, IORING_OFF_SQ_RING);

where sq_off is the io_sqring_offsets structure, and ring_fd is the file descriptor
returned from io_uring_setup(2). The addition of sq_off.array to the length of the region
accounts for the fact that the ring is located at the end of the data structure. As an
example, the ring buffer head pointer can be accessed by adding sq_off.head to the address
returned from mmap(2):

head = ptr + sq_off.head;
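
The same pattern applies to the other fields. The sketch below computes the mapping
length from the formula above and resolves several field pointers from the base
address returned by mmap(2). struct sq_view and both function names are illustrative
and not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/io_uring.h>

/* Illustrative view of the SQ ring fields inside one mapping. */
struct sq_view {
    unsigned int *head;
    unsigned int *tail;
    unsigned int *ring_mask;
    unsigned int *flags;
    unsigned int *array;
};

/* Length to pass to mmap(2) for the SQ ring. */
size_t sq_ring_bytes(const struct io_uring_params *p)
{
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Resolve field pointers by adding each offset to the mapping's base address. */
struct sq_view sq_view_at(void *ptr, const struct io_uring_params *p)
{
    char *base = ptr;
    struct sq_view v;

    assert(ptr != NULL && p != NULL);
    v.head      = (unsigned int *)(base + p->sq_off.head);
    v.tail      = (unsigned int *)(base + p->sq_off.tail);
    v.ring_mask = (unsigned int *)(base + p->sq_off.ring_mask);
    v.flags     = (unsigned int *)(base + p->sq_off.flags);
    v.array     = (unsigned int *)(base + p->sq_off.array);
    return v;
}
```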

The flags field is used by the kernel to communicate state information to the application.
Currently, it is used to inform the application when a call to io_uring_enter(2) is
necessary. See the documentation for the IORING_SETUP_SQPOLL flag above. The dropped
member is incremented for each invalid submission queue entry encountered in the ring
buffer.

The head and tail track the ring buffer state. The tail is incremented by the application
when submitting new I/O, and the head is incremented by the kernel when the I/O has been
successfully submitted. Determining the index of the head or tail into the ring is
accomplished by applying a mask:

index = tail & ring_mask;
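
The masking works because head and tail are free-running counters that are never
reduced to the ring size when stored. A small sketch of the arithmetic follows; the
function names are illustrative, and it assumes ring_mask equals ring_entries - 1 with
ring_entries a power of two.

```c
#include <stdbool.h>

/* Slot index for a free-running counter. */
unsigned int ring_index(unsigned int counter, unsigned int ring_mask)
{
    return counter & ring_mask;
}

/*
 * Number of entries between head and tail. Unsigned subtraction gives the
 * right answer even after the counters wrap around.
 */
unsigned int ring_pending(unsigned int head, unsigned int tail)
{
    return tail - head;
}

/* True if no free slot is left for a new entry. */
bool ring_full(unsigned int head, unsigned int tail, unsigned int ring_entries)
{
    return ring_pending(head, tail) == ring_entries;
}
```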

The array of submission queue entries is mapped with:

sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                 ring_fd, IORING_OFF_SQES);

The completion queue is described by cq_entries and cq_off shown here:

struct io_cqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 overflow;
    __u32 cqes;
    __u32 flags;
    __u32 resv1;
    __u64 user_addr;
};

The completion queue is simpler, since the entries are not separated from the queue
itself, and can be mapped with:

ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
           IORING_OFF_CQ_RING);
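
The mapping lengths can be computed from the filled-in params, as in the sketch below.
The function names are illustrative. Mapping the larger of the two lengths when
IORING_FEAT_SINGLE_MMAP is reported is an assumption about how an application might
cover both rings with one mmap(2) call, and the CQ length assumes that
IORING_SETUP_CQE32 is not in use.

```c
#include <stddef.h>
#include <linux/io_uring.h>

/* Length of the SQ ring mapping. */
size_t sq_ring_bytes(const struct io_uring_params *p)
{
    return p->sq_off.array + p->sq_entries * sizeof(__u32);
}

/* Length of the CQ ring mapping (normal-sized CQEs). */
size_t cq_ring_bytes(const struct io_uring_params *p)
{
    return p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
}

/* Length that covers both rings when they share a single mapping. */
size_t single_mmap_bytes(const struct io_uring_params *p)
{
    size_t sq = sq_ring_bytes(p);
    size_t cq = cq_ring_bytes(p);

    return sq > cq ? sq : cq;
}
```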

Closing the file descriptor returned by io_uring_setup(2) will free all resources
associated with the io_uring context. Note that this may happen asynchronously within the
kernel, so it is not guaranteed that resources are freed immediately.

RETURN VALUE

       io_uring_setup(2) returns a new file descriptor on success. The application may then
       provide the file descriptor in a subsequent mmap(2) call to map the submission and
       completion queues, or to the io_uring_register(2) or io_uring_enter(2) system calls.

       On error, a negative error code is returned. The caller should not rely on the errno
       variable.

ERRORS

       EFAULT p is outside your accessible address space.

       EINVAL The resv array contains non-zero data, p.flags contains an unsupported flag,
              entries is out of bounds, IORING_SETUP_SQ_AFF was specified but
              IORING_SETUP_SQPOLL was not, IORING_SETUP_CQSIZE was specified but
              io_uring_params.cq_entries was invalid, or IORING_SETUP_REGISTERED_FD_ONLY
              was specified but IORING_SETUP_NO_MMAP was not.

       EMFILE The per-process limit on the number of open file descriptors has been reached (see
              the description of RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user ID of the caller did not
              have sufficient privileges.

       EPERM  /proc/sys/kernel/io_uring_disabled has the value 2, or it has the value 1 and the
              calling process does not hold the CAP_SYS_ADMIN capability or is not a member of
              /proc/sys/kernel/io_uring_group.
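
As an illustration of handling these errors, the sketch below maps a negative return
value to a short hint derived from the list above. The function name and the hint
strings are hypothetical.

```c
#include <errno.h>

/* Hypothetical helper: translate a negative error code into a short hint. */
const char *setup_error_hint(int ret)
{
    switch (-ret) {
    case EFAULT: return "params pointer is not accessible";
    case EINVAL: return "bad flags, entries out of bounds, or non-zero resv";
    case EMFILE: return "per-process file descriptor limit reached";
    case ENFILE: return "system-wide open file limit reached";
    case ENOMEM: return "insufficient kernel resources";
    case EPERM:  return "insufficient privileges or io_uring disabled";
    default:     return "unexpected error";
    }
}
```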
