Investigate `EPOLLEXCLUSIVE`

## Summary

Investigate using `EPOLLEXCLUSIVE` to reduce thundering herd overhead when multiple threads call `epoll_wait()` on the same epoll file descriptor.

## Background

### Current Implementation

Corosio's epoll scheduler allows multiple threads to call `run()` on the same `io_context`, resulting in multiple threads blocking in `epoll_wait()` on a shared epoll fd. This provides natural load balancing for I/O events since the kernel typically wakes only one thread per event.

However, the `wakeup()` mechanism writes to an eventfd to signal waiting threads when:
- Work is posted via `post()` or `dispatch()`
- The scheduler is stopped
- A timer deadline changes

When the eventfd becomes readable, **all** threads blocked in `epoll_wait()` wake up simultaneously, but only one thread actually has work to do. The others acquire the mutex, find no work, and return to `epoll_wait()`. This is the classic thundering herd problem.

### Current Mitigation

The existing implementation accepts this overhead because:
1. Thundering herd only occurs on explicit `wakeup()` calls, not on every I/O event
2. The mutex ensures correct behavior (only one thread processes work)
3. Modern kernels handle spurious wakeups efficiently

However, in high-throughput scenarios with frequent `post()` calls and many worker threads, this can cause measurable overhead from:
- Context switches for all threads
- Cache line contention on the mutex
- Increased CPU utilization from spurious wakeups

## Proposed Solution: EPOLLEXCLUSIVE

Linux 4.5 introduced `EPOLLEXCLUSIVE`, a flag that changes wakeup behavior:

```c
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLEXCLUSIVE;
ev.data.fd = event_fd;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &ev);
```

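When `EPOLLEXCLUSIVE` is set:
- The kernel avoids waking every waiter for an event on that fd. The `epoll_ctl(2)` man page promises that "one or more" waiters receive the event, so a single wakeup is the common case but not a guarantee
- If multiple fds become ready, different threads may be woken for different fds
- The order in which waiters are chosen is not documented and should not be relied on
- The man page describes the flag in terms of multiple epoll file descriptors attached to the same target fd. Whether it also changes behavior for several threads sharing one epoll fd, as in Corosio, is the first thing this investigation needs to confirm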

### Application to Corosio

The eventfd used for wakeup signaling is registered at `scheduler.cpp:94`:

```cpp
ev.events = EPOLLIN;
ev.data.ptr = nullptr;
if (::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev) == -1) {
    // error handling
}
```

Adding `EPOLLEXCLUSIVE` here is intended to limit each `wakeup()` call to waking a single thread:

```cpp
ev.events = EPOLLIN | EPOLLEXCLUSIVE;
```

## Technical Considerations

### Kernel Version Detection

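`EPOLLEXCLUSIVE` requires Linux 4.5+. Options for detection:

1. **Compile-time**: Check for the `EPOLLEXCLUSIVE` macro definition. This reflects the headers used at build time, not the kernel the binary runs on
2. **Runtime**: Attempt registration and fall back on `EINVAL`. Whether pre-4.5 kernels actually reject the unknown flag, rather than silently ignore it, still needs to be verified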

```cpp
#ifdef EPOLLEXCLUSIVE
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
#else
    ev.events = EPOLLIN;
#endif
```

Or with runtime fallback:

```cpp
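ev.events = EPOLLIN | EPOLLEXCLUSIVE;
int rc = ::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev);
if (rc == -1 && errno == EINVAL) {
    // Flag rejected: retry with a plain level-triggered registration
    ev.events = EPOLLIN;
    rc = ::epoll_ctl(epoll_fd_, EPOLL_CTL_ADD, event_fd_, &ev);
}
if (rc == -1) {
    // error handling (covers both the first attempt and the retry)
}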
```
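The two detection options can be combined in one helper. The following is a minimal sketch, not Corosio's code: `add_wakeup_fd` and its `exclusive` out-parameter are hypothetical names introduced here for illustration. It only retries when the failure is `EINVAL`, so unrelated errors are not masked.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: registers efd on epfd for readability, preferring
// EPOLLEXCLUSIVE when both the build headers and the kernel accept it.
// Returns 0 on success and -1 on failure (errno set by epoll_ctl).
// `exclusive` reports which registration succeeded.
inline int add_wakeup_fd(int epfd, int efd, bool& exclusive)
{
    epoll_event ev{};        // zero-initialized, so ev.data.ptr == nullptr
    exclusive = false;

#ifdef EPOLLEXCLUSIVE
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;
    if (::epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0) {
        exclusive = true;
        return 0;
    }
    if (errno != EINVAL)
        return -1;           // unrelated failure: do not hide it behind a retry
#endif

    ev.events = EPOLLIN;     // plain level-triggered registration
    return ::epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
}
```

Note that `exclusive == true` only means the kernel accepted the registration. If an old kernel ignores the flag instead of rejecting it, this helper cannot tell the difference.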

### Socket Accept Operations

`EPOLLEXCLUSIVE` is also relevant for accept operations on listening sockets (`sockets.hpp:462`). When multiple threads wait to accept on the same socket, `EPOLLEXCLUSIVE` prevents all threads from waking on each incoming connection.

However, this requires careful consideration:
- Socket registration currently uses edge-triggered mode (`EPOLLIN | EPOLLET`)
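- `epoll_ctl(2)` permits `EPOLLET` together with `EPOLLEXCLUSIVE`, but `EPOLLONESHOT` is not among the allowed flags, and an fd registered with `EPOLLEXCLUSIVE` cannot be changed later with `EPOLL_CTL_MOD`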
- Need to verify correct behavior with the one-shot unregister pattern

### Interaction with Edge-Triggered Mode

The current implementation uses `EPOLLET` (edge-triggered) for all socket operations. When combining `EPOLLEXCLUSIVE` with `EPOLLET`:
- Wakeup occurs on edge (transition to ready state)
- Only one thread receives the notification
- If that thread doesn't fully drain the fd, the data left behind does not by itself trigger another wakeup, so the thread must keep reading until it gets `EAGAIN`

This should be compatible with Corosio's one-shot pattern where fds are unregistered immediately after `epoll_wait()` returns.

### Level-Triggered Eventfd

The eventfd used for wakeup is currently level-triggered (no `EPOLLET`). With `EPOLLEXCLUSIVE`:
- One thread wakes per `epoll_wait()` cycle
- If multiple `wakeup()` calls occur, the accumulated value is read once
- This matches desired behavior (wake one thread to process queue)
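The accumulation behavior can be checked in isolation. This is a small sketch assuming a default (non-semaphore) eventfd opened with `EFD_NONBLOCK`. `signal_wakeup` and `consume_wakeups` are hypothetical stand-ins for what `wakeup()` and the woken thread do, not Corosio functions.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>

// Adds 1 to the eventfd counter, as a wakeup() call would.
inline bool signal_wakeup(int efd)
{
    std::uint64_t one = 1;
    return ::write(efd, &one, sizeof one) == static_cast<ssize_t>(sizeof one);
}

// Reads and resets the counter. Returns the number of accumulated signals,
// or 0 if the eventfd was not readable (EAGAIN on a non-blocking eventfd).
inline std::uint64_t consume_wakeups(int efd)
{
    std::uint64_t count = 0;
    if (::read(efd, &count, sizeof count) != static_cast<ssize_t>(sizeof count))
        return 0;
    return count;
}
```

Three `signal_wakeup()` calls followed by one `consume_wakeups()` yield 3, and the eventfd is no longer readable afterwards. The single woken thread therefore has to process everything queued so far, because no further wakeup is pending.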

### Benchmarking Strategy

To measure the impact, create a benchmark that:
1. Spawns N worker threads calling `io_context::run()`
2. Has a producer thread calling `post()` at high frequency
3. Measures:
   - Total throughput (posts/second)
   - CPU utilization
   - Context switch rate
   - Latency distribution

Compare results with and without `EPOLLEXCLUSIVE`.
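As a starting point, the herd itself can be measured without `io_context` at all. The sketch below uses raw epoll and eventfd calls as a stand-in for the scheduler and counts how many `epoll_wait()` calls return per eventfd write. `measure_wakeups` and `herd_result` are hypothetical names, and the harness assumes `n_threads >= 1`. It does not cover latency or CPU utilization.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct herd_result {
    int  posts;    // eventfd writes that were issued and consumed
    long wakeups;  // epoll_wait() calls that returned an event
};

inline herd_result measure_wakeups(int n_threads, int posts, bool exclusive)
{
    int epfd = ::epoll_create1(0);
    int efd  = ::eventfd(0, EFD_NONBLOCK);
    epoll_event ev{};
    ev.events = EPOLLIN;
#ifdef EPOLLEXCLUSIVE
    if (exclusive) ev.events |= EPOLLEXCLUSIVE;
#else
    (void)exclusive;
#endif
    ev.data.fd = efd;
    if (epfd < 0 || efd < 0 || ::epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) != 0) {
        if (efd >= 0) ::close(efd);
        if (epfd >= 0) ::close(epfd);
        return {0, 0};
    }

    std::atomic<bool> stop{false};
    std::atomic<long> wakeups{0};
    std::atomic<long> consumed{0};

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&] {
            epoll_event out{};
            while (!stop.load()) {
                // Short timeout so workers notice `stop` without another wakeup.
                if (::epoll_wait(epfd, &out, 1, 10) <= 0) continue;
                wakeups.fetch_add(1);
                std::uint64_t v = 0;
                // Only one woken thread wins the read; the rest see EAGAIN.
                if (::read(efd, &v, sizeof v) == static_cast<ssize_t>(sizeof v))
                    consumed.fetch_add(static_cast<long>(v));
            }
        });
    }

    int done = 0;
    for (; done < posts; ++done) {
        std::uint64_t one = 1;
        if (::write(efd, &one, sizeof one) != static_cast<ssize_t>(sizeof one)) break;
        // Wait until this post is consumed so each post is measured in isolation.
        while (consumed.load() <= done) std::this_thread::yield();
    }

    stop.store(true);
    for (auto& t : workers) t.join();
    ::close(efd);
    ::close(epfd);
    return {done, wakeups.load()};
}
```

The ratio `wakeups / posts` approximates the number of threads woken per post. Running `measure_wakeups(8, 100000, false)` and `measure_wakeups(8, 100000, true)` side by side gives a first indication of whether the flag has any effect on a shared epoll fd before the full `io_context` benchmark is built.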

## Compatibility

| Requirement | Version |
|-------------|---------|
| Linux Kernel | 4.5+ |
| glibc | 2.24+ |
| musl | 1.1.18+ |

For older systems, the library should gracefully fall back to standard behavior.

## Alternatives Considered

### 1. Single-Threaded Wakeup Consumer

Designate one thread as the "wakeup handler" that distributes work to others. Rejected because:
- Adds complexity
- Creates a bottleneck
- Doesn't leverage kernel-level load balancing

### 2. Per-Thread Eventfds

Give each thread its own eventfd and wake threads round-robin. Rejected because:
- Requires tracking which threads are blocked
- Adds memory overhead (one eventfd per thread)
- Complicates the scheduler implementation

### 3. Condition Variable Signaling

Replace eventfd with pthread condition variables for wakeup. Rejected because:
- Requires restructuring the event loop
- Loses the unified epoll-based wait
- May not integrate well with timer handling

## References

- [epoll_ctl(2) man page](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html)
- [LWN: EPOLLEXCLUSIVE and the thundering herd](https://lwn.net/Articles/632590/)
- [Linux kernel commit adding EPOLLEXCLUSIVE](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df0108c5da561c66c333bb46bfe3c1fc65905898)

