
Asynchronous I/O Overview

16 May 2025

Abstract image with broken clocks, evoking asynchronicity

In this post I wish to give an overview of the different asynchronous I/O solutions, primarily on Linux.

We’ll start with blocking I/O, then explore a few different ways to perform non-blocking I/O, such as busy loops, select() and epoll(), and finally cover the library libevent.

File Descriptors

As we discussed in [1], file descriptors are numerical identifiers for resources such as files, network sockets and named pipes. The terminology “file” is confusing, but it follows the Unix philosophy that “everything is a file”.

The name probably stuck for historical reasons: the mechanism was initially designed for files and later extended to other resources, with the realization that these resources can be interacted with in a similar manner and can therefore share an abstraction.

The general interface is essentially read(), write(), open() and close(), so the primary way we interact with file descriptors is via Input/Output or I/O (or simply IO).

That’s why file descriptors are central to IO and hence to async IO.
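To make this concrete, here is a minimal sketch of that interface for a regular file; the file name example.txt is just a placeholder and error handling is mostly omitted:

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    // open() returns a small integer: the file descriptor
    int fd = open("example.txt", O_RDONLY);
    if (fd < 0) return 1;

    char buf[512];
    // read() and write() work the same way on files, pipes and sockets
    ssize_t n = read(fd, buf, sizeof(buf));
    (void)n;

    close(fd);
    return 0;
}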

Blocking I/O

In our post explaining network sockets [2], we used APIs such as getaddrinfo(), connect(), send() and recv().

As mentioned in [3], these APIs are all blocking: getaddrinfo() does not return until it has succeeded or failed in resolving the URL; connect() does not return until it has connected; recv() calls do not return until they have received data or a close; and send() calls do not return until they have at least flushed their output to the kernel’s write buffers.

This can be a problem if we want to write a server that supports multiple inbound connections or a client that talks to many servers (e.g. a browser).

One way to work around this is to handle each connection in a child process (via fork()) or in a thread. However, this doesn’t scale well to a large number of connections: if the number of connections is much higher than the number of CPU cores, the kernel has to do a lot of context switching.
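As a rough illustration, here is a sketch of the process-per-connection approach; listen_fd and handle_connection() are hypothetical names for a listening socket and an application-defined handler:

// Sketch: one child process per accepted connection
while (1) {
    int conn_fd = accept(listen_fd, NULL, NULL);
    if (conn_fd < 0) continue;

    if (fork() == 0) {
        // Child: serve this connection and exit
        close(listen_fd);
        handle_connection(conn_fd);
        close(conn_fd);
        _exit(0);
    }

    // Parent: close its copy of the connection and keep accepting
    close(conn_fd);
}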

Busy Loops

One option is to make the socket operations non-blocking and keep polling them in an infinite loop. We can do this by marking the socket’s file descriptor as non-blocking via the fcntl() (file control) system call:

fcntl(fd, F_SETFL, O_NONBLOCK);
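One caveat, not specific to this example: F_SETFL overwrites all file status flags, so a slightly more careful variant first reads the current flags and ORs in O_NONBLOCK:

// Preserve any flags already set on the descriptor
int flags = fcntl(fd, F_GETFL, 0);
fcntl(fd, F_SETFL, flags | O_NONBLOCK);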

And then do an infinite loop (not all branches shown):

// Assume fd is an array of n_sockets non-blocking socket file
// descriptors and buf is a byte buffer
while (1) {
    for (int i = 0; i < n_sockets; ++i) {
        int n = recv(fd[i], buf, sizeof(buf), 0);
        if (n < 0) {
            if (errno != EAGAIN) // EAGAIN: no data yet, keep trying
                handle_error(fd[i], errno);
        } else {
            handle_input(fd[i], buf, n);
        }
    }
}

The problem with this approach is that, most of the time, none of the sockets may be receiving data, so we’re just burning CPU in a busy loop.

select()

A better alternative is to use the select() API, in which the thread is put to sleep instead of looping and only awoken when there is an event.

From the caller side, the idea is to use a bitmap to indicate which file descriptors we’re interested in. The index of the bitmap is the numerical value of the file descriptor.

Then we call select(), which blocks until at least one of the fds of interest is ready to be consumed. The earlier example can be modified as follows:

fd_set readset;
while (1) {
    int maxfd = -1;
    // Clears the bitmask
    FD_ZERO(&readset); // (1)

    for (int i = 0; i < n_sockets; ++i) {
        if (fd[i] > maxfd) maxfd = fd[i];
        FD_SET(fd[i], &readset);
    }

    // Block until one or more fds are ready to read
    select(maxfd+1, &readset, NULL, NULL, NULL);

    // Process all of the fds that are still set in readset
    for (int i = 0; i < n_sockets; ++i) {
        if (!FD_ISSET(fd[i], &readset)) continue;

        int n = recv(fd[i], buf, sizeof(buf), 0);
        if (n < 0) {
            if (errno != EAGAIN) // (2)
                handle_error(fd[i], errno);
        } else {
            handle_input(fd[i], buf, n);
        }
    }
}

Some comments on the numbered lines:

(1) select() overwrites readset with the subset of fds that are actually ready, so the bitmask has to be rebuilt on every iteration. It also highlights a problem: fd_set is a fixed-size bitmask (FD_SETSIZE, typically 1024 bits), so all file descriptor values must be under 1024, which naturally limits the number of connections. If this is a problem, the poll() API can be used, which works in a similar way at a high level.

(2) EAGAIN means the socket had no data available after all, so we simply skip it and try again on the next iteration.
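For illustration, here is a sketch of the same loop written with poll(); it assumes the same fd array, buf, handle_error() and handle_input() as in the earlier examples:

// poll() takes an array of struct pollfd instead of a bitmask,
// so fd values are not limited to FD_SETSIZE
struct pollfd pfds[n_sockets];
for (int i = 0; i < n_sockets; ++i) {
    pfds[i].fd = fd[i];
    pfds[i].events = POLLIN;
}

while (1) {
    // Blocks until at least one fd is ready (-1 = no timeout)
    poll(pfds, n_sockets, -1);

    for (int i = 0; i < n_sockets; ++i) {
        if (!(pfds[i].revents & POLLIN)) continue;

        int n = recv(pfds[i].fd, buf, sizeof(buf), 0);
        if (n < 0) {
            if (errno != EAGAIN)
                handle_error(pfds[i].fd, errno);
        } else {
            handle_input(pfds[i].fd, buf, n);
        }
    }
}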

Why is select() more efficient than busy polling? When we call select(), the kernel blocks the calling thread by putting it to sleep [4]. When data is received on any of the file descriptors being monitored, the kernel wakes the thread up.

A more detailed flow for the case of network sockets is the following: when a packet arrives, a hardware interrupt triggers a callback in the kernel, which then transfers the data to the socket queue and also wakes up the thread waiting on select().

The kernel doesn’t wait for more file descriptors to become ready, which means that most of the time only one fd will have FD_ISSET() return true. This is pretty inefficient if we’re monitoring many fds, since each time an fd becomes ready we have to scan $O(N)$ file descriptors to find which one it is.

epoll()

To work around the problems of select() and poll(), the epoll API was introduced on Linux. Other OSes have analogous implementations.

The key idea is to move the set of monitored fds into kernel space, so the kernel can keep it up to date continuously instead of only starting to do so when we call select(). In addition, readiness is reported as a list rather than a bitmask, so we only need to traverse the fds that are actually ready. This new kernel-side structure is called an epoll instance.

The general steps are:

1. Create an epoll instance via epoll_create1(), which returns a file descriptor.
2. Register the file descriptors we want to monitor (and which events) via epoll_ctl().
3. Call epoll_wait(), which blocks until one or more of them is ready.

The key thing is that epoll_wait() returns only events that are ready, so you don’t need to traverse all fds. Our ongoing example can be expressed as follows:

// Creates an epoll instance, returns a file descriptor.
int efd = epoll_create1(0); // (1)

// Registers fds
for (int i = 0; i < n_sockets; ++i) {
    struct epoll_event evt;
    // This will let us know later which fd was modified
    evt.data.fd = fd[i];
    // Bitmask indicating which events we're interested in
    evt.events = EPOLLIN;
    epoll_ctl(efd, EPOLL_CTL_ADD, fd[i], &evt);
}

// Will be populated by epoll_wait()
struct epoll_event events[MAX_EVENTS]; // (2)
while (1) {
    // Blocks until one or more fds are ready to read,
    // returns the number of events added to events.
    int n = epoll_wait(efd, events, MAX_EVENTS, -1);

    // Process only events that are ready
    for (int i = 0; i < n; ++i) {
        int fd = events[i].data.fd;
        assert(events[i].events == EPOLLIN); // (3)
        int nbytes = recv(fd, buf, sizeof(buf), 0);
        if (nbytes < 0) {
            if (errno != EAGAIN)
                handle_error(fd, errno);
        } else {
            handle_input(fd, buf, nbytes);
        }
    }
}

Some comments on the numbered lines:

(1) The epoll instance lives in the kernel and is itself referenced by a file descriptor, which we pass to the other epoll calls. epoll_create1() takes a flags argument; 0 means no special behavior.

(2) MAX_EVENTS caps how many ready events a single epoll_wait() call can return; any remaining ones are reported on subsequent calls.

(3) Since we only registered EPOLLIN, that’s the only event we expect here. Real code would also check for EPOLLERR and EPOLLHUP instead of asserting.

Level-triggered vs. Edge-triggered

There are two notification modes supported by epoll. Level-triggered is the default mode: it keeps triggering notifications as long as there’s unread data in the file descriptor.

Edge-triggered can be enabled by including the flag EPOLLET:

evt.events = EPOLLIN | EPOLLET;

In this mode a notification is only triggered when there’s a change of state. If the client doesn’t read all the data in the channel and no new data arrives, the next time it calls epoll_wait() the corresponding file descriptor won’t be marked as ready.
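A practical consequence is that under edge-triggered mode the reader must drain the socket before going back to epoll_wait(). A sketch, assuming a non-blocking fd and the same buf and handlers as before:

// Keep reading until the kernel reports there is no more data
while (1) {
    int n = recv(fd, buf, sizeof(buf), 0);
    if (n > 0) {
        handle_input(fd, buf, n);
    } else if (n == 0) {
        break; // peer closed the connection
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
        break; // no more data for now; wait for the next edge
    } else {
        handle_error(fd, errno);
        break;
    }
}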

The terms level and edge come from electronics. An analogy: a car horn is a level trigger because it stays on as long as it’s being pressed. The button of a stopwatch is an edge trigger because the watch starts once you press the button, but nothing changes if you keep it pressed.

libevent

As we mentioned in the previous section, different OSes came up with different solutions to the limitations of select() and poll(), which makes it hard to write cross-platform code.

libevent is a library that aims to abstract over these different implementations and also to hide some of the boilerplate, such as the event loop common to all of them.

#include <event2/event.h>

// Similar to epoll_create1()
struct event_base *base = event_base_new();

// Associates a callback, handle_input(), with
// the file descriptor fd
struct event *listener_event = event_new(
    base,
    fd,
    EV_READ | EV_PERSIST,
    handle_input, // custom callback
    (void*)base
);

// Similar to epoll_ctl()
event_add(listener_event, NULL);

// Run the loop (abstracts the loop for earlier examples)
event_base_dispatch(base);
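One detail worth noting: libevent invokes callbacks with the signature void (*)(evutil_socket_t, short, void *), so handle_input here has to match it. A hypothetical sketch (not an actual implementation from the sockets post):

#include <event2/event.h>
#include <sys/socket.h>

static void handle_input(evutil_socket_t fd, short events, void *arg) {
    char buf[1024];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);
    if (n > 0) {
        // process the n bytes read into buf
    }
    (void)events;
    (void)arg;
}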

Conclusion

In this post we covered the progression of asynchronous I/O mechanisms, from busy loops to select()/poll() and then epoll(). We also learned about libevent, a library that aims to abstract the underlying implementation of these different flavors.

We didn’t dive into great detail, but I feel I now have enough understanding of how these work at a high level to be able to reason about them.

I found it fascinating to think about the hardware-software interface: what happens when electrical signals reach the NIC (Network Interface Card), get re-interpreted as logical bits and eventually trigger callbacks in the kernel.

One option I left out of the discussion but am looking forward to learning about is io_uring, which aims to improve performance even further over epoll().

In Python Coroutines we used the select library, which provides an API for select() but also supports epoll().

References