Commit f7338fd

man/io_uring_internal: Add man page about relevant internals for users
Adds a man page with details about the inner workings of io_uring that are likely to be useful for users as they relate to frequently misused flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This mostly describes what needs to be done on the kernel side for each request, who does the work and most notably what the async punt is. Signed-off-by: Constantin Pestka <constantin.pestka@c-pestka.de>
1 parent 206650f commit f7338fd

1 file changed

+282
-0
lines changed


man/io_uring_internals.7

@@ -0,0 +1,282 @@
.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring_internals \- overview of the kernel-side internals of io_uring
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific, asynchronous API that allows the submission of requests to
the kernel. Applications pass requests to the kernel via a shared ring buffer,
the
.I Submission Queue
(SQ), and receive notifications of the completion of these requests via the
.I Completion Queue
(CQ). An important detail is that after a request has been submitted to
the kernel, some CPU time has to be spent in kernel space to perform the
required submission- and completion-related work.
The mechanism used to provide this CPU time, as well as which process does so
and when, differs in
.I io_uring
from the traditional API provided by regular syscalls.
.PP
.SH Traditional Syscall Driven I/O
.PP
For regular syscalls the CPU time for this work is directly provided by the
process issuing the syscall, with the submission-side work in kernel space
being executed directly after the context switch. In the case of polled I/O,
the CPU time for completion-related work is likewise provided directly by the
submitting process. In the case of interrupt-driven I/O the CPU time is
provided, depending on the driver in question, by either the traditional top
and bottom half IRQ approach or via threaded IRQ handling. The CPU time for
completion work is thus provided by the CPU on which the hardware interrupt
arrives, as well as by the CPU to which the dedicated kernel worker thread for
threaded IRQ handling gets scheduled, if that is used.
.PP
.SH The Submission Side Work
.PP
The work required in kernel space on the submission side mostly consists of
checking the SQ for newly arrived SQEs, parsing them, checking them for
validity and permissions, and then passing them on to the responsible
subsystem, such as a block device driver, the networking stack, etc. An
important note here is that
.I io_uring
guarantees that the process of submitting a request to the responsible
subsystem, and thus the
.IR io_uring_enter (2)
syscall made to submit the new requests,
.B will never
.BR block .
However, how io_uring achieves this generally depends on the
capabilities of the file a request operates on. While the mechanism
.I io_uring
ends up utilizing for this is not directly observable to the application, it
does have significant performance implications.
There are generally four scenarios:
.PP
1. The operation is finished in its entirety immediately. Examples of this
are reads from or writes to a pipe or socket, or reads and writes to regular
files not using direct I/O that can be served via the page cache. In this
scenario the corresponding CQE is posted inline as well and will thus be
visible to the application even before the
.IR io_uring_enter (2)
call returns.

2. The operation is not finished inline, but can be submitted fully
asynchronously. How
.I io_uring
handles the asynchronous completion depends on whether interrupt-driven or
polled I/O is used (see the section on completion-side work). An example of a
backend capable of this fully asynchronous operation is the NVMe driver.

3. The operation is not finished inline, but the file can signal readiness,
i.e. when the operation can be retried. Examples of such files are all
pollable files, including sockets, pipes, etc. It should be noted that these
retry operations are performed during subsequent
.IR io_uring_enter (2)
calls if SQ polling is not used. The operation is thus performed in the
context of the submitting thread and no additional threads are involved. If
SQ polling is used, the retries are performed by the SQ poll thread.

4. The operation is not finished inline and the file is incapable of signaling
when it is ready to do I/O. This is the only case in which
.I io_uring
will
.I async punt
the request, i.e. offload the potentially blocking execution of the request to
an asynchronous worker thread (see the IO Work Queue section below).
.PP
.PP
.SH The Completion Side Work
.PP
The work required in kernel space on the completion side mostly comes in the
form of various request-type-dependent obligations, such as copying buffers,
parsing packet headers, etc., as well as posting a CQE to the CQ to inform the
application of the completion of the request.
.PP
.SH Who does the work
.PP
One of the primary motivations behind
.I io_uring
was to reduce or entirely avoid the overhead of syscalls used to provide the
required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration, with
different trade-offs between configurations with respect to e.g. CPU
efficiency and latency.
.PP
With the default configuration the primary mechanism to provide the kernel
space CPU time in
.I io_uring
is also a syscall:
.IR io_uring_enter (2).
This still differs from requests made via their respective syscall directly,
such as
.IR read (2),
in the sense that it allows for batching in a more flexible way than is e.g.
possible via
.IR readv (2),
as different request types can be freely mixed and matched, and chains of
dependent requests, such as a
.IR send (2)
followed by a
.IR recv (2),
can be submitted with one syscall. Furthermore, it is possible to both process
requests for submission and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS
to, in addition to processing any pending submissions, process any arrived
completions and optionally wait until a specified number of completions has
arrived before returning.
If polled I/O is used, all completion-related work is performed during the
.IR io_uring_enter (2)
call. For interrupt-driven I/O, the CPU receiving the hardware interrupt
schedules the remaining work to be performed, including posting the CQE, via
task work. Any outstanding task work is performed during any transition
between user and kernel space. By default, the CPU that received the hardware
interrupt will, after scheduling the task work, interrupt a user space process
via an inter-processor interrupt (IPI), which will cause it to enter the
kernel and thus perform the scheduled work. While this ensures a timely
delivery of the CQE, it is a relatively disruptive and high-overhead
operation. To avoid this, applications can configure
.I io_uring
via
.I IORING_SETUP_COOP_TASKRUN
to elide the IPI. Applications must then ensure that they perform a syscall
every so often to be able to observe new completions, but benefit from eliding
the overhead of the IPIs. Additionally,
.I io_uring
can be configured to inform an application that it should now perform a
syscall to reap new completions by setting
.IR IORING_SETUP_TASKRUN_FLAG .
This will result in
.I io_uring
setting
.I IORING_SQ_TASKRUN
in the SQ flags once the application should do so. This mechanism can be
restricted further via
.IR IORING_SETUP_DEFER_TASKRUN ,
which results in the task work only being executed when
.IR io_uring_enter (2)
is called with
.I IORING_ENTER_GETEVENTS
set, rather than at any context switch. This gives the application more
agency over when the work is executed, thus enabling e.g. more opportunities
for batching.
.PP
.SH IO Threads
.PP
For SQ polling and the IO WQ (see below)
.I io_uring
utilizes special threads called
.I IO
.IR Threads .
These are threads that only run in kernel space and never exit to user space,
but they are notably different from
.I kernel
.IR threads ,
which are e.g. used for threaded interrupt handling. While kernel threads are
not associated with any user space thread, IO Threads, like pthreads,
inherit the file table, memory mappings, credentials, etc. from their parent.
In the case of
.I io_uring
any IO thread of an instance is a child of the process that created that
.I io_uring
instance. This has many of the usual implications of such a relation, e.g.
one can profile the threads and measure their resource consumption via the
children-specific options of
.IR getrusage (2)
and
.IR perf_event_open (2).
.PP
.SH Submission Queue Polling
.PP
SQ polling introduces a dedicated IO thread that performs essentially all
submission- and completion-related work, from fetching SQEs from the SQ and
submitting requests, to polling requests, if configured for I/O polling, and
posting CQEs. Notably, async punted requests are still processed by the IO
WQ, so as not to hinder the progress of other requests (see the Submission
Side Work section for when the async punt will occur). If the SQ thread does
not have any work to do for a user-supplied timeout, it goes to sleep. SQ
polling removes the need for any syscall during operation, besides waking up
the SQ thread after long periods of inactivity, and thus reduces per-request
overhead at the cost of a high constant upkeep cost.
.PP
.SH IO Work Queue
.PP
The IO WQ is a pool of IO threads used to execute any requests that can not
be submitted in a non-blocking way (see the Submission Side Work section for
when this is the case). After either the SQ poll thread or a user space
thread calling
.IR io_uring_enter (2)
fails the initial attempt to submit a request without blocking, it passes the
request on to an IO WQ thread that then performs the blocking submission.
This mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths. However, the
blocking nature of the submission, the passing of the request to another
thread, as well as the scheduling of the IO WQ threads are all overheads
that are ideally avoided. Significant IO WQ activity can thus be seen as an
indicator that something is very likely going wrong. Similarly, the flag
.I IOSQE_ASYNC
should only be used if the user knows that a request will always, or is very
likely to, async punt, and not to ensure that the submission will not block,
as
.I io_uring
guarantees to never block in any case.
.PP
.SH Kernel Thread Management
.PP
Each user space process utilizing
.I io_uring
possesses an
.I io_uring
context, which manages all
.I io_uring
instances created within said process via
.IR io_uring_setup (2).
By default, both the SQ poll thread and the IO WQ thread pool are dedicated
to each
.I io_uring
instance and are thus not shared within a process, and they are never shared
between different processes. However, sharing these between two or more
instances can be achieved during setup via
.IR IORING_SETUP_ATTACH_WQ .
The threads of the IO WQ are created lazily in response to requests being
async punted and fall into two accounts: the bounded account, responsible for
requests with a generally bounded execution time, such as block I/O, and the
unbounded account, for requests with unbounded execution time, such as e.g.
recv operations.
The maximum thread count of the accounts is by default 2 * NPROC and can be
adjusted via
.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
Their CPU affinity can be adjusted via
.IR IORING_REGISTER_IOWQ_AFF .
.SH SEE ALSO
.BR io_uring (7),
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)
