Summary
Investigate using EPOLLEXCLUSIVE to reduce thundering herd overhead when multiple threads call epoll_wait() on the same epoll file descriptor.
Background
Current Implementation
Corosio's epoll scheduler allows multiple threads to call run() on the same io_context, resulting in multiple threads blocking in epoll_wait() on a shared epoll fd. This provides natural load balancing for I/O events since the kernel typically wakes only one thread per event.
However, the wakeup() mechanism writes to an eventfd to signal waiting threads when:
- Work is posted via post() or dispatch()
- The scheduler is stopped
- A timer deadline changes
When the eventfd becomes readable, all threads blocked in epoll_wait() wake up simultaneously, but only one thread actually has work to do. The others acquire the mutex, find no work, and return to epoll_wait(). This is the classic thundering herd problem.
Current Mitigation
The existing implementation accepts this overhead because:
- Thundering herd only occurs on explicit wakeup() calls, not on every I/O event
- The mutex ensures correct behavior (only one thread processes work)
- Modern kernels handle spurious wakeups efficiently
However, in high-throughput scenarios with frequent post() calls and many worker threads, this can cause measurable overhead from:
- Context switches for all threads
- Cache line contention on the mutex
- Increased CPU utilization from spurious wakeups
Proposed Solution: EPOLLEXCLUSIVE
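Linux 4.5 introduced EPOLLEXCLUSIVE, a flag that changes wakeup behavior:

    struct epoll_event ev = {};
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;  // request exclusive wakeups for this fd
    ev.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &ev);

When EPOLLEXCLUSIVE is set:
- The kernel wakes one or more waiters for an event on that fd instead of all of them. epoll_ctl(2) documents this for multiple epoll fds attached to the same target fd, so the effect with several threads sharing one epoll fd needs to be verified
- If multiple fds become ready, different threads may be woken for different fds
- Which waiter is woken (round-robin or LIFO) is implementation-defined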
Application to Corosio
The eventfd used for wakeup signaling is registered at scheduler.cpp:94:

    ev.events = EPOLLIN;
    ev.data.ptr = nullptr;
    if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
        // error handling
    }

Adding EPOLLEXCLUSIVE here would ensure only one thread wakes on each wakeup() call:

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;

Technical Considerations
Kernel Version Detection
EPOLLEXCLUSIVE requires Linux 4.5+. Options for detection:
- Compile-time: Check for EPOLLEXCLUSIVE macro definition
- Runtime: Attempt registration and fall back on EINVAL

    #ifdef EPOLLEXCLUSIVE
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    #else
    ev.events = EPOLLIN;
    #endif

Or with runtime fallback:

    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
        if (errno == EINVAL) {
            // Fallback for older kernels: retry without the flag
            ev.events = EPOLLIN;
            if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
                // error handling
            }
        } else {
            // error handling (failure unrelated to EPOLLEXCLUSIVE)
        }
    }

Socket Accept Operations
EPOLLEXCLUSIVE is also relevant for accept operations on listening sockets (sockets.hpp:462). When multiple threads wait to accept on the same socket, EPOLLEXCLUSIVE prevents all threads from waking on each incoming connection.
However, this requires careful consideration:
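- Socket registration currently uses edge-triggered mode (EPOLLIN | EPOLLET)
- EPOLLEXCLUSIVE may be combined with EPOLLET, but per epoll_ctl(2) it is rejected together with EPOLLONESHOT, and an exclusive registration cannot be changed via EPOLL_CTL_MOD
- Need to verify correct behavior with the one-shot unregister pattern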
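
The flag restrictions that matter for socket registration can be checked on the target kernel before touching the scheduler. The following is a minimal sketch, not Corosio code: `exclusive_probe`, `ctl_errno`, and `probe_exclusive_rules` are hypothetical names, and eventfds stand in for listening sockets. It assumes the build headers define EPOLLEXCLUSIVE.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical result type: 0 means the call succeeded, otherwise the errno.
// All fields stay -1 if the probe could not create its descriptors.
struct exclusive_probe {
    int add_et_errno;       // ADD with EPOLLIN | EPOLLET | EPOLLEXCLUSIVE
    int mod_errno;          // MOD on the registration made above
    int add_oneshot_errno;  // ADD with EPOLLEXCLUSIVE | EPOLLONESHOT
};

// Runs one epoll_ctl call and reports 0 or the resulting errno.
static int ctl_errno(int epfd, int op, int fd, unsigned events)
{
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fd;
    return ::epoll_ctl(epfd, op, fd, &ev) == -1 ? errno : 0;
}

// Probes how the running kernel treats EPOLLEXCLUSIVE combinations.
exclusive_probe probe_exclusive_rules()
{
    exclusive_probe r{-1, -1, -1};
    int epfd = ::epoll_create1(EPOLL_CLOEXEC);
    int a = ::eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    int b = ::eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (epfd != -1 && a != -1 && b != -1) {
        // Edge-triggered plus exclusive is a documented combination.
        r.add_et_errno = ctl_errno(epfd, EPOLL_CTL_ADD, a,
                                   EPOLLIN | EPOLLET | EPOLLEXCLUSIVE);
        // Re-arming an exclusive registration via MOD is documented to fail.
        r.mod_errno = ctl_errno(epfd, EPOLL_CTL_MOD, a, EPOLLIN | EPOLLET);
        // EPOLLONESHOT is not among the flags allowed with EPOLLEXCLUSIVE.
        r.add_oneshot_errno = ctl_errno(epfd, EPOLL_CTL_ADD, b,
                                        EPOLLIN | EPOLLEXCLUSIVE | EPOLLONESHOT);
    }
    if (b != -1) ::close(b);
    if (a != -1) ::close(a);
    if (epfd != -1) ::close(epfd);
    return r;
}
```

If `mod_errno` comes back as EINVAL, any code path that re-arms a descriptor with EPOLL_CTL_MOD would have to use EPOLL_CTL_DEL followed by EPOLL_CTL_ADD for exclusive registrations.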
Interaction with Edge-Triggered Mode
The current implementation uses EPOLLET (edge-triggered) for all socket operations. When combining EPOLLEXCLUSIVE with EPOLLET:
- Wakeup occurs on edge (transition to ready state)
- Only one thread receives the notification
- If that thread doesn't fully drain the fd, subsequent data won't trigger another wakeup until the fd returns to non-ready state
This should be compatible with Corosio's one-shot pattern where fds are unregistered immediately after epoll_wait() returns.
Level-Triggered Eventfd
The eventfd used for wakeup is currently level-triggered (no EPOLLET). With EPOLLEXCLUSIVE:
- One thread wakes per epoll_wait() cycle
- If multiple wakeup() calls occur, the accumulated value is read once
- This matches desired behavior (wake one thread to process queue)
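
The accumulation behavior relied on above comes from eventfd itself: each write adds to a 64-bit counter, and a single read returns the sum and resets it. A minimal sketch, assuming a non-blocking eventfd without EFD_SEMAPHORE; `signal_wakeup` and `drain_wakeups` are hypothetical helper names, not Corosio functions.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper: adds 1 to the eventfd counter, as a wakeup() call might.
bool signal_wakeup(int efd)
{
    std::uint64_t one = 1;
    return ::write(efd, &one, sizeof one) == static_cast<ssize_t>(sizeof one);
}

// Hypothetical helper: reads and resets the counter in one call.
// Returns the number of signals accumulated since the last drain, or 0 if
// nothing was pending (the read fails with EAGAIN on a non-blocking eventfd).
std::uint64_t drain_wakeups(int efd)
{
    std::uint64_t count = 0;
    if (::read(efd, &count, sizeof count) != static_cast<ssize_t>(sizeof count))
        return 0;
    return count;
}
```

After a successful drain the fd is no longer readable, so a level-triggered registration stops reporting it until the next signal.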
Benchmarking Strategy
To measure the impact, create a benchmark that:
- Spawns N worker threads calling io_context::run()
- Has a producer thread calling post() at high frequency
- Measures:
  - Total throughput (posts/second)
  - CPU utilization
  - Context switch rate
  - Latency distribution
Compare results with and without EPOLLEXCLUSIVE.
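
Before building the full benchmark, the wakeup count itself can be observed in isolation. The sketch below does not use Corosio: it is a standalone approximation in which `count_wakeups` (a hypothetical name) parks several threads in epoll_wait() on one shared epoll fd, signals a level-triggered eventfd once, and counts how many threads return with an event. The 50 ms settle delay and 500 ms timeout are arbitrary assumptions.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Returns how many of `threads` waiters saw the event produced by a single
// eventfd write, or -1 on setup failure. If the build headers lack
// EPOLLEXCLUSIVE, the `exclusive` argument is ignored.
int count_wakeups(int threads, bool exclusive)
{
    int epfd = ::epoll_create1(EPOLL_CLOEXEC);
    int efd = ::eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    epoll_event ev{};
    ev.events = EPOLLIN;  // level-triggered, like the wakeup eventfd
#ifdef EPOLLEXCLUSIVE
    if (exclusive) ev.events |= EPOLLEXCLUSIVE;
#else
    (void)exclusive;
#endif
    ev.data.fd = efd;
    if (epfd == -1 || efd == -1 ||
        ::epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == -1) {
        if (efd != -1) ::close(efd);
        if (epfd != -1) ::close(epfd);
        return -1;
    }

    std::atomic<int> ready{0};
    std::atomic<int> woken{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&] {
            epoll_event out{};
            ++ready;
            // Threads that are not woken give up after the timeout.
            if (::epoll_wait(epfd, &out, 1, 500) == 1) {
                ++woken;
                std::uint64_t v = 0;
                // Drain the counter; only the first reader gets a value.
                if (::read(efd, &v, sizeof v) == -1) { /* already drained */ }
            }
        });
    }
    while (ready.load() < threads) std::this_thread::yield();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let waiters block

    std::uint64_t one = 1;
    bool sent = ::write(efd, &one, sizeof one) == static_cast<ssize_t>(sizeof one);
    for (auto& t : pool) t.join();
    ::close(efd);
    ::close(epfd);
    return sent ? woken.load() : -1;
}
```

Calling `count_wakeups(8, false)` and `count_wakeups(8, true)` repeatedly and comparing the distributions gives a first indication of whether the flag changes anything for threads that share one epoll fd. The result depends on scheduling, so single runs are not conclusive.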
Compatibility
| Requirement | Version |
|---|---|
| Linux Kernel | 4.5+ |
| glibc | 2.24+ |
| musl | 1.1.18+ |
For older systems, the library should gracefully fall back to standard behavior.
Alternatives Considered
1. Single-Threaded Wakeup Consumer
Designate one thread as the "wakeup handler" that distributes work to others. Rejected because:
- Adds complexity
- Creates a bottleneck
- Doesn't leverage kernel-level load balancing
2. Per-Thread Eventfds
Give each thread its own eventfd and wake threads round-robin. Rejected because:
- Requires tracking which threads are blocked
- Adds memory overhead (one eventfd per thread)
- Complicates the scheduler implementation
3. Condition Variable Signaling
Replace eventfd with pthread condition variables for wakeup. Rejected because:
- Requires restructuring the event loop
- Loses the unified epoll-based wait
- May not integrate well with timer handling