Pipes Full of Gravy

Git-Repository: Template Solution Solution-Diff (Solution is posted at 18:00 CET)
Workload: 131 lines of code
Important System-Calls: epoll_create(2), epoll_ctl(2), epoll_wait(2), pipe(2), splice(2)
Recommended Reads:

Christmas is also the season when the kitchen in the ELF village is a very busy place. All the food and all the cookies for Christmas Eve are prepared ahead of time and stored for the big feast when Santa comes home from his tour. And there is one thing that Santa is especially fond of, the only thing that both quenches the thirst and fills up the stomach: Gravy! Immense amounts of gravy to drench meat and vegetables alike. But making this special ELF gravy is a very peculiar process, as it involves multiple processing and filter steps to ensure the delicious clarity of the brown molten gold.

However, last year, while the individual steps of the process were already optimized, there were a few difficulties with the overall gravy production: the transfers stalled and blocked, as there was only a single ELF with a single cart that moved the proto-gravy around. For this year, the ELF senate decided to come up with a better gravy-transportation process.

Of Pipes and Polls

Today, we want to look at two different system call interfaces, combine them, and help our elves with the gravy problem. We will start with epoll(7), which is a more modern variant of select(2), and then look at splice(2), which allows the user to redirect data directly from one pipe to another. In essence, our ELF gets a mechanism to determine which process of the gravy production is producing proto-gravy at its output (epoll), and a mechanism to directly establish pipelines between the output and the input buckets of the gravy production without loading or unloading the cart.


Let's first have a look at epoll() and the problem that it solves: With select, we passed a bit-set of queried file descriptors to the kernel and got back the same set where only the bits of the "ready" descriptors are set. So, we have to rebuild the poll set for every select invocation, and we have to iterate over the set to determine which descriptors are actually ready. Both steps require time and are inefficient, as usually the set of polled descriptors is very large and does not change that often.

To solve this, epoll(7) was created in Linux 2.5.44 (2002): In a nutshell, we create an epoll descriptor with epoll_create(2), which is a file descriptor itself, and register the file descriptors that are interesting for us with epoll_ctl(2). With epoll_wait(2), we query the set of registered descriptors and the kernel returns one or multiple "events" (becoming ready is an event) that we can work on. This is fundamentally different from select in two respects: (1) Unless we register a file descriptor explicitly as EPOLLONESHOT, it remains registered even if it already produced an event. (2) The return value is a dense stream of events with much fewer elements than the number of registered fds. In its functionality, epoll is similar to the poll(2) system call, which already solves (1) and allows us to reuse the set of interesting fds. Nevertheless, poll still forces us to iterate over the fd set to find the interesting ones.

In essence, an epoll event loop looks like this:

// Create epoll object
int epoll_fd = epoll_create1(0);

// Register interesting FDs
struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);

// The event loop
while (true) {
    struct epoll_event events[10];
    int n = epoll_wait(epoll_fd, events, 10, /* timeout */ -1);
    for (int i = 0; i < n; i++) {
        // ... handle events[i] ...
    }
}

However, one thing is special about epoll: It does not tell us the file descriptor that produced an event, but only hands back the user-defined data (the epoll_data_t union) that we gave it when registering the fd. This could be the file descriptor number, but it also could be a pointer to some fancy context object.
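To make this concrete, here is a minimal sketch of the context-pointer idea. The struct filter_ctx type and the await_ready() helper are hypothetical names for this example; the point is only that the pointer we store in epoll_data comes back to us unchanged in the event:

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical context object: epoll hands us back whatever pointer
 * (or integer) we stored in epoll_data when registering the fd. */
struct filter_ctx {
    int fd;
    const char *name;
};

/* Register ctx->fd for readability and wait for one event.
 * Returns the context pointer the kernel hands back, or NULL. */
struct filter_ctx *await_ready(struct filter_ctx *ctx, int timeout_ms) {
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return NULL;

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.ptr = ctx;                      /* our pointer, not the fd */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, ctx->fd, &ev) < 0) {
        close(epfd);
        return NULL;
    }

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return (n == 1) ? out.data.ptr : NULL;  /* same pointer comes back */
}

In the gravy program, such a context object could hold everything we need to handle a ready filter stage without any fd-to-state lookup table.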

Pipes and splice

The other part of today are Unix pipes (pipe(7)) that we want to splice(2) together. For this, we first have to understand that a Unix pipe, which was championed by Doug McIlroy, is a connection between two file descriptors. These fds are not connected to a real file, but if we write some data into one fd, the data can be read from the other fd. This comes with the constraint that pipes are unidirectional: we have a "write end" and a "read end". And this is reflected in the API to create a pipe object with pipe(2) (or, with flags, pipe2(2)):

int pipe(int pipefd[2]);
int P[2];
// P[0]: read end
// P[1]: write end
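A minimal sketch of this round trip, wrapped in a small helper (the name pipe_roundtrip is made up for this example): whatever goes into the write end P[1] comes back out of the read end P[0]:

#include <string.h>
#include <unistd.h>

/* Send msg through a fresh pipe and read it back.
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t buflen) {
    int P[2];
    if (pipe(P) < 0)
        return -1;
    if (write(P[1], msg, strlen(msg)) < 0)   /* into the write end */
        return -1;
    close(P[1]);                             /* EOF for the reader */
    ssize_t n = read(P[0], buf, buflen);     /* out of the read end */
    close(P[0]);
    return n;
}

Closing the write end is important: a reader only sees end-of-file once all write ends of the pipe are closed.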

If you use your shell to spawn a command with pipes, this system call is used by the shell to establish the connection between the processes. For example, for cat file | grep bar, the shell invokes pipe2() once, passes the write end (P[1]) as stdout to the cat process and the read end (P[0]) as stdin to the grep process. Thereby, both processes communicate directly with each other and the shell is no longer involved.
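The shell's side of this wiring can be sketched as a fork() followed by dup2(). The helper spawn_with_stdout() below is a hypothetical name; it spawns one command whose stdout feeds a fresh pipe, and hands the read end back to the caller:

#include <sys/types.h>
#include <unistd.h>

/* Sketch of what the shell does for the producer in `cat file | grep bar`:
 * create a pipe, fork, and in the child make the pipe's write end the
 * child's stdout before exec'ing. The parent keeps the read end. */
pid_t spawn_with_stdout(char *const argv[], int *out_fd) {
    int P[2];
    if (pipe(P) < 0)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                 /* child: becomes the producer */
        dup2(P[1], STDOUT_FILENO);  /* stdout now feeds the pipe */
        close(P[0]);
        close(P[1]);
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }
    close(P[1]);                    /* parent keeps only the read end */
    *out_fd = P[0];
    return pid;
}

For the second command of the pipeline, the shell would do the mirror image: dup2() the read end onto the child's stdin.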

However, sometimes we want finer-grained control over how and when data is moved between two processes, for example, if we want to measure the amount of gravy that flows between two production steps. For this, the shell must become an intermediary, for example by creating two pipe pairs: one from cat to the shell and one from the shell to grep. Nevertheless, it is not desirable to read() the data into the shell's address space first, just to write() it out to the grep pipe. And this is where splice(2) comes in handy:

   ssize_t splice(int fd_in,  off64_t *off_in, 
                  int fd_out, off64_t *off_out, 
                  size_t len, unsigned int flags);

splice() moves up to len bytes between two file descriptors (one or both of them must be a pipe). Furthermore, we can add some flags and set input and output file offsets. As a result, we get the number of bytes that were actually moved.
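For the pipe-to-pipe case that our intermediary needs, both offset pointers stay NULL. A minimal sketch (forward_chunk is a made-up name for this example); note that the return value is exactly the byte count an intermediary would add to its throughput statistics:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from one pipe's read end straight to another
 * pipe's write end, without copying through our address space.
 * Returns the number of bytes actually moved, or -1 on error. */
ssize_t forward_chunk(int from_pipe, int to_pipe, size_t len) {
    /* Both fds are pipes here, so both offset pointers must be NULL.
     * SPLICE_F_MOVE is only a hint; the kernel may still copy. */
    return splice(from_pipe, NULL, to_pipe, NULL, len, SPLICE_F_MOVE);
}

Like read(), splice() returns 0 once the source pipe reaches end-of-file, which is how the gravy program can tell that a filter stage has finished.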


Write a single-threaded program that takes multiple filter commands as separate arguments:

seq 1 100000000000000 | ./epoll "grep [0-5]" "grep 1$"  > /dev/null

The program spawns both filters, splices its stdin data with splice() to the first filter, splices the first filter's output to the second filter's input, and so on, until the output of the last filter is written to stdout of the ./epoll program. Meanwhile, the epoll program should record and regularly print the throughput of the filter pipeline to determine how much gravy is flowing. This is similar to the behavior of the pv(1) program. As an output, you should expect something like:

[grep [0-5]] Started filter as pid 458367
[grep 1$] Started filter as pid 458368
 73.63MiB/s 72.70MiB/s 7.29MiB/s
 85.22MiB/s 85.06MiB/s 8.52MiB/s
 92.25MiB/s 92.25MiB/s 9.22MiB/s
 92.44MiB/s 92.46MiB/s 9.25MiB/s