.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring_internals \- how io_uring provides the CPU time for kernel-side work
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific, asynchronous API that allows the submission of requests to
the kernel. Applications pass requests to the kernel via a shared ring buffer,
the
.I Submission Queue
(SQ), and receive notifications of the completion of these requests via the
.I Completion Queue
(CQ). An important detail here is that after a request has been submitted to
the kernel, some CPU time has to be spent in kernel space to perform the
required submission- and completion-related work.
The mechanism used to provide this CPU time, as well as which process provides
it and when, differs between
.I io_uring
and the traditional API provided by regular syscalls.

.PP
.SH Traditional Syscall Driven I/O
.PP
For regular syscalls the CPU time for this work is provided directly by the
process issuing the syscall, with the submission-side work in kernel space
being executed directly after the context switch. In the case of polled I/O,
the CPU time for completion-related work is subsequently provided directly as
well. In the case of interrupt-driven I/O the CPU time is provided, depending
on the driver in question, either via the traditional top- and bottom-half IRQ
approach or via threaded IRQ handling. The CPU time for completion work is
thus provided by the CPU on which the hardware interrupt arrives, and, if
threaded IRQ handling is used, by the CPU to which the dedicated kernel worker
thread gets scheduled.

.PP
.SH The Submission Side Work
.PP

The work required in kernel space on the submission side mostly consists of
checking the SQ for newly arrived SQEs, parsing them and checking them for
validity and permissions, and then passing them on to the responsible
subsystem, such as a block device driver, the networking stack, etc. An
important note here is that
.I io_uring
guarantees that the process of submitting the request to the responsible
subsystem, and thus the
.IR io_uring_enter (2)
syscall made to submit the new requests,
.B will never
.BR block .
However, the mechanism by which io_uring achieves this generally depends on
the capabilities of the file a request operates on. While the mechanism
.I io_uring
ends up utilizing for this is not directly observable to the application, it
does have significant performance implications.
There are generally four scenarios:
.PP
1. The operation is finished in its entirety immediately. Examples of this
are reads or writes to a pipe or socket, or reads and writes to regular
files not using direct I/O that can be served via the page cache. In this
scenario the corresponding CQE is posted inline as well and will thus be
visible to the application even before the
.IR io_uring_enter (2)
call returns (see the example after this list).

2. The operation is not finished inline, but can be submitted fully
asynchronously. How
.I io_uring
handles the asynchronous completion depends on whether interrupt-driven or
polled I/O is used (see the section on the completion side work). An example
of a backend capable of this fully asynchronous operation is the NVMe driver.

3. The operation is not finished inline, but the file can signal readiness,
indicating when the operation can be retried. Examples of such files are any
pollable files, including sockets, pipes, etc. It should be noted that these
retry operations are performed during subsequent
.IR io_uring_enter (2)
calls, if SQ polling is not used. The operation is thus performed in the
context of the submitting thread and no additional threads are involved. If
SQ polling is used, the retries are performed by the SQ poll thread.

4. The operation is not finished inline and the file is incapable of signaling
when it is ready to do I/O. This is the only case in which
.I io_uring
will
.I async punt
the request, i.e. offload the potentially blocking execution of the request to
an asynchronous worker thread (see the IO WQ section below).
.PP

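.PP
The inline completion of scenario 1 can be observed with the following
minimal sketch. It uses the liburing helper library rather than the raw
syscalls; the file path is an arbitrary example and error handling is
omitted for brevity.
.PP
.EX
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd = open("/etc/hostname", O_RDONLY); /* arbitrary example */

    io_uring_queue_init(8, &ring, 0);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring); /* issues io_uring_enter(2) */

    /* A buffered read served from the page cache may complete
     * inline; its CQE is then already visible without waiting. */
    if (io_uring_peek_cqe(&ring, &cqe) == 0) {
        printf("completed inline, res=%d\en", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    } else {
        printf("completion will arrive asynchronously\en");
    }
    io_uring_queue_exit(&ring);
    return 0;
}
.EE
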
.PP
.SH The Completion Side Work
.PP

The work required in kernel space on the completion side mostly comes in the
form of various request-type dependent obligations, such as copying buffers,
parsing packet headers, etc., as well as posting a CQE to the CQ to inform the
application of the completion of the request.

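.PP
As a sketch of the application's side of this, the following liburing-based
fragment reaps a single completion; if no CQE is ready yet,
.IR io_uring_wait_cqe (3)
enters the kernel and thereby provides CPU time for pending completion work.
.PP
.EX
#include <liburing.h>

/* Wait for one completion and return the request's result. */
static int reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    int ret = io_uring_wait_cqe(ring, &cqe);

    if (ret < 0)
        return ret;              /* e.g. interrupted by a signal */
    ret = cqe->res;              /* request result, e.g. bytes read */
    io_uring_cqe_seen(ring, cqe); /* mark the CQE as consumed */
    return ret;
}
.EE
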
.PP
.SH Who Does the Work
.PP

One of the primary motivations behind
.I io_uring
was to reduce or entirely avoid the overhead of the syscalls used to provide
the required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration, with
different trade-offs between configurations with respect to e.g. CPU
efficiency and latency.

With the default configuration, the primary mechanism to provide the kernel
space CPU time in
.I io_uring
is also a syscall:
.IR io_uring_enter (2).
This still differs from requests made directly via their respective syscall,
such as
.IR read (2),
in that it allows for batching in a more flexible way than is e.g.
possible via
.IR readv (2):
different request types can be freely mixed and matched, and chains of
dependent requests, such as a
.IR send (2)
followed by a
.IR recv (2),
can be submitted with one syscall. Furthermore it is possible to both process
new submissions and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS
to, in addition to processing any pending submissions, process any arrived
completions and optionally wait until a specified number of completions have
arrived before returning.

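.PP
The following sketch illustrates such a chain, using the liburing helpers
(which wrap the raw syscalls). It assumes an already initialized ring and a
connected socket; error handling is omitted.
.PP
.EX
#include <liburing.h>

/* Submit a send(2)-like request linked to a recv(2)-like request
 * and wait for both completions, all with a single
 * io_uring_enter(2) call. */
static int ping_pong(struct io_uring *ring, int sock)
{
    static char out[] = "ping";
    static char in[64];
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, sock, out, sizeof(out), 0);
    sqe->flags |= IOSQE_IO_LINK;    /* recv starts only after send */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, sock, in, sizeof(in), 0);

    /* One syscall: submits both SQEs and, with
     * IORING_ENTER_GETEVENTS set, waits for two completions. */
    return io_uring_submit_and_wait(ring, 2);
}
.EE
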
If polled I/O is used, all completion related work is performed during the
.IR io_uring_enter (2)
call. For interrupt driven I/O, the CPU receiving the hardware interrupt
schedules the remaining work, including posting the CQE, to be performed via
task work. Any outstanding task work is performed during any user-kernel
space transition. By default, the CPU that received the hardware interrupt
will, after scheduling the task work, interrupt the user space process via an
inter-processor interrupt (IPI), which will cause it to enter the kernel and
thus perform the scheduled work. While this ensures a timely delivery of
the CQE, it is a relatively disruptive and high overhead operation. To avoid
this, applications can configure
.I io_uring
via
.I IORING_SETUP_COOP_TASKRUN
to elide the IPI. Applications must then ensure that they perform some
syscall every so often to be able to observe new completions, but benefit
from eliding the overhead of the IPIs. Additionally,
.I io_uring
can be configured to inform an application that it should now perform a
syscall to reap new completions by setting
.IR IORING_SETUP_TASKRUN_FLAG .
This will result in
.I io_uring
setting
.I IORING_SQ_TASKRUN
in the SQ flags once the application should do so (see the sketch below).
This mechanism can be restricted further via
.IR IORING_SETUP_DEFER_TASKRUN ,
which results in the task work only being executed when
.IR io_uring_enter (2)
is called with
.I IORING_ENTER_GETEVENTS
set, rather than at any context switch. This gives the application more
control over when the work is executed, enabling e.g. more opportunities for
batching.

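.PP
A sketch of this configuration via liburing; the
.IR io_uring_get_events (3)
helper (available in recent liburing versions) performs a plain
.IR io_uring_enter (2)
call to run pending task work, and the flag check mirrors what liburing does
internally.
.PP
.EX
#include <liburing.h>

/* Set up a ring that elides completion IPIs and raises
 * IORING_SQ_TASKRUN when the application should enter the kernel. */
static int setup_coop(struct io_uring *ring)
{
    return io_uring_queue_init(8, ring,
                               IORING_SETUP_COOP_TASKRUN |
                               IORING_SETUP_TASKRUN_FLAG);
}

/* Check the shared SQ flags and, if requested, perform a syscall
 * to run task work and make new CQEs visible. */
static void maybe_reap(struct io_uring *ring)
{
    if (*(volatile unsigned *)ring->sq.kflags & IORING_SQ_TASKRUN)
        io_uring_get_events(ring);
}
.EE
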
.PP
.SH IO Threads
.PP

For SQ polling and the IO WQ (see below)
.I io_uring
utilizes special threads called
.I IO
.IR Threads .
These are threads that only run in kernel space and never exit to user space,
but they are notably different from
.I kernel
.IR threads ,
which are e.g. used for threaded interrupt handling. While kernel threads are
not associated with any user space thread, IO threads, like pthreads,
inherit the file table, memory mappings, credentials, etc. from their parent.
In the case of
.I io_uring
any IO thread of an instance is a child of the process that created that
.I io_uring
instance. This has many of the usual implications of such a relation, e.g. one
can profile the threads and measure their resource consumption via the
child-specific options of
.IR getrusage (2)
and
.IR perf_event_open (2).

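.PP
A minimal sketch relying on the parent/child relation described above, using
the child-specific option of
.IR getrusage (2):
.PP
.EX
#include <sys/resource.h>
#include <stdio.h>

/* Report CPU time accumulated by terminated children which, per
 * the relation described above, includes io_uring IO threads. */
static void report_child_cpu(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_CHILDREN, &ru) == 0)
        printf("children stime: %lds\en", (long)ru.ru_stime.tv_sec);
}
.EE
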
.PP
.SH Submission Queue Polling
.PP

SQ polling introduces a dedicated IO thread that performs essentially all
submission and completion related work: fetching SQEs from the SQ,
submitting requests, polling requests if configured for I/O polling, and
posting CQEs. Notably, async punted requests are still processed by the IO
WQ, so as not to hinder the progress of other requests (see the submission
side work section for when an async punt occurs). If the SQ poll thread does
not have any work to do for a user supplied timeout, it goes to sleep. SQ
polling removes the need for any syscall during operation, besides waking up
the SQ poll thread after long periods of inactivity, and thus reduces per
request overheads at the cost of a high constant upkeep.

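.PP
A sketch of such a setup via liburing; the idle timeout is the user-supplied
value after which the SQ poll thread goes to sleep (liburing transparently
wakes it again on the next
.IR io_uring_submit (3)):
.PP
.EX
#include <liburing.h>
#include <string.h>

/* Create a ring with a dedicated SQ poll thread that sleeps after
 * two seconds of inactivity. */
static int setup_sqpoll(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = 2000;    /* milliseconds */

    return io_uring_queue_init_params(8, ring, &p);
}
.EE
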
.PP
.SH IO Work Queue
.PP

The IO WQ is a pool of IO threads used to execute any requests that can not
be submitted in a non-blocking way (see the submission side work section for
when this is the case). After either the SQ poll thread or a user space
thread calling
.IR io_uring_enter (2)
fails the initial attempt to submit a request without blocking, it passes the
request on to an IO WQ thread that then performs the blocking submission.
This mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths. However, the
blocking nature of the submission, the passing of the request to another
thread, as well as the scheduling of the IO WQ threads are all overheads that
are ideally avoided. Significant IO WQ activity can thus be seen as an
indicator that something is very likely going wrong. Similarly, the flag
.I IOSQE_ASYNC
should only be used if the user knows that a request will always, or is very
likely to, be async punted, and not to ensure that the submission will not
block, as
.I io_uring
guarantees never to block in any case.

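.PP
For illustration, marking a request with
.I IOSQE_ASYNC
looks as follows (a liburing-based sketch; fd, buf and len are assumed to
exist):
.PP
.EX
#include <liburing.h>

/* Force a request straight to the IO WQ. Only sensible when it is
 * known to (almost) always be punted anyway; it is never needed to
 * keep the submission from blocking. */
static void queue_forced_async(struct io_uring *ring, int fd,
                               void *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_read(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_ASYNC;
}
.EE
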
.PP
.SH Kernel Thread Management
.PP

Each user space process utilizing
.I io_uring
possesses an
.I io_uring
context, which manages all
.I io_uring
instances created within said process via
.IR io_uring_setup (2).
By default, both the SQ poll thread and the IO WQ thread pool are
dedicated to each
.I io_uring
instance: they are not shared within a process and are never shared between
different processes. However, sharing them between two or more instances can
be achieved during setup via
.IR IORING_SETUP_ATTACH_WQ .
The threads of the IO WQ are created lazily in response to requests being
async punted and fall into two accounts: the bounded account, responsible for
requests with a generally bounded execution time, such as block I/O, and the
unbounded account, for requests with unbounded execution time, such as recv
operations.
The maximum thread count of each account is by default 2 * NPROC and can be
adjusted via
.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
Their CPU affinity can be adjusted via
.IR IORING_REGISTER_IOWQ_AFF .

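.PP
A sketch of adjusting both settings via the corresponding liburing helpers;
the worker limits are ordered {bounded, unbounded}, and a value of 0 leaves
the current limit unchanged while reporting it back:
.PP
.EX
#define _GNU_SOURCE
#include <liburing.h>
#include <sched.h>

/* Cap both IO WQ accounts and pin the workers to CPUs 0 and 1. */
static int tune_iowq(struct io_uring *ring)
{
    unsigned int max_workers[2] = { 4, 8 };  /* bounded, unbounded */
    cpu_set_t mask;
    int ret;

    ret = io_uring_register_iowq_max_workers(ring, max_workers);
    if (ret < 0)
        return ret;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    CPU_SET(1, &mask);
    return io_uring_register_iowq_aff(ring, sizeof(mask), &mask);
}
.EE
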
.SH SEE ALSO
.BR io_uring (7),
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)