Spin the Advent Wreath as Fast as You can

Git-Repository: Template Solution Solution-Diff (Solution is posted at 18:00 CET)
Workload: 174 lines of code
Important System-Calls: io_uring_setup(2), io_uring_register(2)
Recommended Reads:

Until this year, the ELF workshop was modeled more after an old-school German Behörde than a modern service-oriented service desk. When one of the ELF workers would require multiple tools to prepare a certain gift, he goes to his ELF supervisor and ask for the first tool. The supervisor then walks into the equipment-storage facility, fetches the requested tool, and delivers it to the worker ELF, who then asks for the second tool. As you might have noticed, this is less than optimal, so it is your job to build a new workflow, where a worker ELF can ask for all tools at once.


In our initial discussion of system calls, we said that many syscalls are synchronous: The user thread issues the system call and is blocked until the system call completes. With select() and epoll(), we've already discussed two system calls that help us to break with this synchronous paradigm as we can perform non-blocking I/O: We can use them to detect if new data arrived on some file descriptor (EPOLLOUT) or if the file-descriptor is ready to accept more data (EPOLLIN). However, we still have to come back regularly to read or write data. For example, to avoid blocking when writing to an FD, we have to call write() (with O_NONBLOCK) over and over again until all data is written out, since the kernel accepts only a certain amount of data without blocking.

Wouldn't it be great, if we could just send a bunch of operations to the kernel and the kernel would signal us, at some later point, when the operation has completed. And this is exactly what asynchronous I/O is all about. We separate issue time and completion of an OS operation for the user thread.

Actually, POSIX also specifies an asynchronous I/O interface with aio(7) that Linux also supports. However, aio has several shortcomings like the need to explicitly poll for completed operations with a system call, which led to development of a new asynchronous system-call interface for Linux: io_uring(7) For a detailed discussion of the existing shortcomings and for a detailed introduction to io_uring, I refer you to "Efficient IO with io_uring" by Jens Axboe, the main author of io_uring. In today's article, we will not explain io_uring in detail, but only give a very coarse overview. Others, including Axboe and Shuveb Hussain have written extensive and excellent introductions.

The most important aspect of io_uring is the usage of memory-mapped ring buffers for the communication between user- and kernel space. The user writes its system call requests as submission queue entries (SQE) and the kernel answers with completion queue entries (CQE). Thereby, the user (and with the correct settings also the kernel) can poll those queues for new entries without crossing the privilege boundary but by only looking at the head and tail pointers of the queue.

   int io_uring_setup(u32 entries, struct io_uring_params *p);
   int io_uring_enter(unsigned int fd, unsigned int to_submit,
                      unsigned int min_complete, unsigned int flags, sigset_t *sig);

The two important system calls for io_uring(7) are: io_uring_setup(2), which we use to create a new uring object in the kernel, and io_uring_enter(2), which we use to explicitly submit to_submit SQEs and (optionally) wait for min_complete CQE entries.

For normal uses, you should use liburing, which is a thin abstraction layer that spares you the gory details of setting up an uring instance and managing SQEs. However, for today, we want to use the low-level interface, which is detailed (with examples to copy from) in io_uring(7).


Complete the given example program that uses io_uring to read arbitrary 4096-byte blocks from a given FILE, while keeping SQ_SIZE requests in flight:

usage: ./iouring SQ_SIZE FILE

The idea is to allow you to experiment with different levels of I/O parallelism: modern SSDs can process multiple requests in parallel. So if you only have one request in flight, you cannot saturate your cool SSD completely. On my machine, I see the following difference between 1 in-flight request and 12 in-flight requests:

$ ./iouring 1 test.jpg
init_ring: sq_size=1
SQ: 1 entries (0x7f5653a7c000), ring: 0x7f5653a7d180
CQ: 2 entries ring: 0x7f5653a7b140
in_flight: 1, read_blocks/s: 4.28K, read_bytes: 16.72 MiB/s
in_flight: 1, read_blocks/s: 9.72K, read_bytes: 37.96 MiB/s
in_flight: 1, read_blocks/s: 9.69K, read_bytes: 37.87 MiB/s
in_flight: 1, read_blocks/s: 9.75K, read_bytes: 38.08 MiB/s
in_flight: 1, read_blocks/s: 9.73K, read_bytes: 38.02 MiB/s

$ ./iouring 12 test.jpg 
init_ring: sq_size=12
SQ: 16 entries (0x7f04304db000), ring: 0x7f04304dc340
CQ: 32 entries ring: 0x7f04304da140
in_flight: 12, read_blocks/s: 93.58K, read_bytes: 365.56 MiB/s
in_flight: 12, read_blocks/s: 121.23K, read_bytes: 473.56 MiB/s
in_flight: 12, read_blocks/s: 121.27K, read_bytes: 473.70 MiB/s
in_flight: 12, read_blocks/s: 121.11K, read_bytes: 473.09 MiB/s
in_flight: 12, read_blocks/s: 121.50K, read_bytes: 474.60 MiB/s
in_flight: 12, read_blocks/s: 121.29K, read_bytes: 473.79 MiB/s

With 128 in-flight requests, I could even read more than 1300 MiB/s. Please note, that the kernel rounds up the submission and completion queue sizes to the next power-of-two.