
http_server: harden listener and architecture#11742

Open
edsiper wants to merge 7 commits into master from harden-http-server

Conversation


@edsiper edsiper commented Apr 23, 2026

This PR hardens Fluent Bit’s shared HTTP server against stalled clients and partially-sent requests by introducing a default per-connection idle timeout and by cleaning up session lifecycle handling around dropped or timed-out sockets.

It also propagates the timeout setting to HTTP services that construct server options manually, adds internal coverage for timeout/config behavior, and adds integration tests that verify blocked connections are evicted and normal traffic recovers correctly for both HTTP/1.1 and HTTP/2.
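For context, the new key is exercised by the integration tests through the in_http scenario config. The snippet below is illustrative only: plugin properties, listener address, and port are placeholders, and key placement follows the test configuration under tests/integration/scenarios/in_http_max_connections/config/.

```yaml
pipeline:
  inputs:
    - name: http
      listen: 0.0.0.0
      port: 9880
      # Key introduced by this PR: evict connections idle for 2 seconds.
      # Per the review discussion, 0 is documented to disable the timeout.
      http_server.idle_timeout: 2s
  outputs:
    - name: stdout
      match: '*'
```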


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • HTTP server supports configurable idle timeout via http_server.idle_timeout (default 10s)
    • Optional per-connection drop-notification hook for connection lifecycle events
  • Improvements / Bug Fixes

    • More robust session handling, stale-session reaping, and safer connection teardown
    • Idle-timeout consistently applied across workers and networking paths
  • Tests

    • Integration and unit tests added for idle-timeout behavior and connection cleanup (HTTP/1.1 and HTTP/2)

@edsiper edsiper requested a review from cosmo0920 as a code owner April 23, 2026 18:11

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Adds a per-connection drop-notification callback and an HTTP server idle-timeout option; wires idle_timeout through config/options/workers/network setup, introduces connection-drop handling and stale-session reaping, and adds integration/unit tests for idle eviction and session cleanup.

Changes

  • Connection API (include/fluent-bit/flb_connection.h): adds the flb_connection_drop_notification_callback typedef and a drop_notification_callback field to struct flb_connection.
  • Downstream teardown & EOF fix (src/flb_downstream.c, src/flb_connection.c): calls drop_notification_callback during prepare_destroy_conn() before event deregistration/close; adds a trailing newline to flb_connection.c.
  • HTTP server API & config (include/fluent-bit/http_server/flb_http_server.h, src/http_server/flb_http_server_config_map.c): adds HTTP_SERVER_DEFAULT_IDLE_TIMEOUT, adds idle_timeout to the config/options/runtime structs, and registers the http_server.idle_timeout config key (default 10s).
  • HTTP server core (src/http_server/flb_http_server.c): wires idle_timeout into per-worker networking setup, adds flb_http_server_connection_drop(), adds a stale-session reaper, strengthens NULL/invalid-connection checks, adjusts session destruction/cleanup, and exposes flb_http_server_runtime_worker_net_setup_get().
  • Plugin integrations (plugins/out_prometheus_exporter/prom_http.c, plugins/out_vivo_exporter/vivo_http.c): propagate the configured idle_timeout into flb_http_server_options when creating HTTP servers.
  • Tests, integration & unit (tests/integration/.../in_http_max_connections/*, tests/internal/http_server.c): add/modify tests for idle-timeout eviction (HTTP/1.1 and HTTP/2), config propagation, net_setup.io_timeout mapping, multi-worker behavior, and session destruction/drop_pending semantics; refactor test helpers and startup/skip logic.

Sequence Diagram

sequenceDiagram
    participant Client
    participant HTTPServer as HTTP Server
    participant Connection
    participant Session
    participant Downstream as Downstream

    Client->>HTTPServer: Establish connection
    HTTPServer->>Connection: create connection, set drop_notification_callback
    HTTPServer->>Session: create session linked to connection
    Note over HTTPServer,Session: idle_timeout tracking

    Note over Connection: idle timeout expires / server enforces max_connections
    Connection->>Downstream: prepare_destroy_conn()
    Downstream->>Connection: invoke drop_notification_callback
    Connection->>Session: mark drop_pending / detach connection refs
    Downstream->>Downstream: deregister events, close socket, cleanup
    HTTPServer->>HTTPServer: reap stale sessions (destroy sessions with invalid/NULL connection)

    Client->>HTTPServer: New connection attempt succeeds after eviction

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I nibbled idle seconds down to none,
I tugged the hook when sockets come undone,
Sessions unlinked, queues breathe free,
New hops arrive — another tasty spree! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 5.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'http_server: harden listener and architecture' directly reflects the main objective: hardening the HTTP server against stalled connections and improving session lifecycle handling through idle timeouts.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0cd7e4faa


Comment on lines 368 to +370
    session = (struct flb_http_server_session *) connection->user_data;
    if (session == NULL) {
        return -1;

P2: Destroy timed-out sessions after user_data is cleared

When a timeout drops a connection, prepare_destroy_conn() now invokes the drop callback, which clears connection->user_data before the injected cleanup event is processed. This early return path then skips flb_http_server_session_destroy(), so the session can remain linked in server->clients (and keep its buffers) until another accept path happens to sweep it. In single-worker/caller-event-loop mode, flb_http_server_worker_maintenance() is never run, so these timed-out sessions are retained unnecessarily.
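The leak this comment describes, and the sweep that recovers from it, can be illustrated with a simplified stand-in for the session list. The real code uses cfl_list and flb_http_server_session; everything below is hypothetical and only models the shape of the problem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define INVALID_SOCKET (-1)

/* Hypothetical, simplified session node. */
struct session {
    int connection_fd;    /* INVALID_SOCKET once the connection dropped */
    int has_connection;   /* 0 once the drop callback cleared the ref   */
    struct session *next;
};

/* Sweep the list: free sessions whose connection is gone and return
 * the number of live sessions that remain linked. */
static size_t reap_stale_sessions(struct session **head)
{
    size_t live = 0;
    struct session **cursor = head;

    while (*cursor != NULL) {
        struct session *s = *cursor;

        if (!s->has_connection || s->connection_fd == INVALID_SOCKET) {
            *cursor = s->next;   /* unlink and destroy the stale session */
            free(s);
        }
        else {
            live++;
            cursor = &s->next;
        }
    }

    return live;
}

/* Helper to build a test list. */
static struct session *push_session(struct session *head, int fd, int has_conn)
{
    struct session *s = malloc(sizeof(*s));

    s->connection_fd = fd;
    s->has_connection = has_conn;
    s->next = head;

    return s;
}
```

Without such a sweep running in single-worker mode, a session whose connection reference was cleared by the drop callback would simply stay linked, which is the retention the comment points out.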


Comment on lines +938 to +941
    if (options->networking_setup != NULL &&
        options->networking_setup->io_timeout <= 0 &&
        options->idle_timeout > 0) {
        options->networking_setup->io_timeout = options->idle_timeout;

P2: Preserve disabled idle timeout in multi-worker servers

This assignment is correct for normal initialization, but worker servers are initialized with fresh default options (idle_timeout=10) and a copied net_setup. If users configure http_server.idle_timeout to 0 (documented to disable), net_setup.io_timeout stays 0 in the parent and this block re-applies 10 during worker initialization, so disabling idle timeout does not work when http_server.workers > 1.
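One conventional way to keep an explicit 0 ("disabled") from being clobbered during worker re-initialization is to reserve a sentinel for "not configured". The helper name and sentinel convention below are illustrative only, not taken from the PR.

```c
#include <assert.h>

#define HTTP_SERVER_DEFAULT_IDLE_TIMEOUT 10  /* seconds */
#define IDLE_TIMEOUT_UNSET               (-1)

/* Hypothetical resolver: apply the default only when the user never set
 * the key, so an explicit 0 (disable) survives into worker options. */
static int effective_idle_timeout(int configured)
{
    if (configured == IDLE_TIMEOUT_UNSET) {
        return HTTP_SERVER_DEFAULT_IDLE_TIMEOUT;  /* unset: use default */
    }

    return configured;  /* 0 means explicitly disabled; keep it */
}
```

With this shape, worker initialization copies the resolved value instead of re-deriving it from a fresh default, so http_server.workers > 1 behaves the same as a single worker.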


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
include/fluent-bit/flb_connection.h (1)

61-72: Document the drop callback contract.

Consider a brief comment on drop_notification_callback describing the invariants: it's invoked from prepare_destroy_conn before the event is deregistered and the fd is closed, while the connection is still linked in busy_queue. Callers (e.g. the HTTP server session) must not release/destroy the connection from within the callback, and must not unlink it from the queue, since prepare_destroy_conn will do that right after. A short comment here will prevent future misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/fluent-bit/flb_connection.h` around lines 61 - 72, Add a doc comment
for the drop_notification_callback field in struct flb_connection explaining its
contract: state that flb_connection_drop_notification_callback is invoked from
prepare_destroy_conn while the connection is still linked in busy_queue and
before the event is deregistered and the fd is closed; callers (e.g., HTTP
server session) must not free/destroy or unlink the flb_connection inside the
callback because prepare_destroy_conn will handle unlinking and final teardown
immediately after the callback returns.
src/flb_connection.c (1)

20-20: Nit: redundant explicit NULL after memset.

The memset(connection, 0, sizeof(...)) on line 16 already zero-initializes drop_notification_callback. The explicit = NULL is harmless and matches the pattern of the surrounding fields, so feel free to keep it for readability or drop it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/flb_connection.c` at line 20, Remove the redundant explicit NULL
assignment to connection->drop_notification_callback since the preceding
memset(connection, 0, sizeof(*connection)) already zero-initializes that field;
locate the assignment "connection->drop_notification_callback = NULL;" and
delete that line (or if you prefer to keep for readability, add a comment
stating it's intentional), leaving memset as the single initializer.
src/http_server/flb_http_server.c (1)

120-146: Minor: duplicated stale-session reaping logic.

flb_http_server_client_count and flb_http_server_worker_maintenance now both walk server->clients and destroy sessions whose connection is NULL or whose fd is FLB_INVALID_SOCKET. Consider extracting into a single helper (e.g. flb_http_server_reap_stale_sessions()) to keep the two code paths from drifting — especially since client_count is on the connection-accept hot path.

♻️ Proposed extraction
+static void flb_http_server_reap_stale_sessions(struct flb_http_server *server)
+{
+    struct cfl_list                *iterator_backup;
+    struct cfl_list                *iterator;
+    struct flb_http_server_session *session;
+
+    cfl_list_foreach_safe(iterator, iterator_backup, &server->clients) {
+        session = cfl_list_entry(iterator,
+                                 struct flb_http_server_session,
+                                 _head);
+
+        if (session->connection == NULL ||
+            session->connection->fd == FLB_INVALID_SOCKET) {
+            flb_http_server_session_destroy(session);
+        }
+    }
+}

Then have flb_http_server_client_count call it before counting, and flb_http_server_worker_maintenance call it after the downstream timeout sweep.

Also applies to: 509-537

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/http_server/flb_http_server.c` around lines 120 - 146, Extract the
duplicated stale-session reaping loop into a new helper
flb_http_server_reap_stale_sessions(struct flb_http_server *server) that
iterates server->clients, calls flb_http_server_session_destroy(session) for
sessions where session->connection == NULL or session->connection->fd ==
FLB_INVALID_SOCKET, and returns void; then replace the inline loops in
flb_http_server_client_count and flb_http_server_worker_maintenance to call
flb_http_server_reap_stale_sessions(server) (call it at the start of
client_count as suggested and after the downstream timeout sweep in
worker_maintenance) so both paths share the same logic and avoid drift.
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)

99-101: Consider mirroring the FluentBitStartupError handling from the existing test.

test_in_http_max_connections_blocks_and_recovers (L49–56) gracefully skips when the running Fluent Bit binary does not support http_server.max_connections. The two new tests call service.start() without that guard, so on older binaries (or if the new http_server.idle_timeout key is missing) they will hard-fail instead of skipping. Reusing the same pattern would keep the suite resilient.

Also applies to: 146-151

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 99 - 101, The new tests call Service().start() unguarded and will
hard-fail on binaries that lack the http_server.max_connections or
http_server.idle_timeout keys; mirror the existing FluentBitStartupError
handling found in test_in_http_max_connections_blocks_and_recovers by wrapping
service.start() in a try/except that catches FluentBitStartupError and calls
pytest.skip with a descriptive message, and apply the same change to the other
occurrence around the second test (the block at the later location that also
calls service.start()) so both tests skip gracefully on older/unsupported Fluent
Bit binaries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`:
- Around line 110-143: Initialize response = None before the try block and move
the recovery run_curl_request (the second call that posts
"idle-timeout-recovered") out of the try/finally so it always runs after
held_connection is closed and service.stop() is called; ensure you still close
held_connection in the finally and then perform the recovery request and set
response from its return value before the final assertions that inspect response
and forwarded_payloads (references: response, run_curl_request, held_connection,
service.stop, service.service.wait_for_condition).
- Around line 99-143: The test
test_in_http_idle_timeout_evicts_partial_request_connection can leave response
unassigned if the second run_curl_request() raises, so initialize or guarantee
assignment before the finally/asserts: add a default assignment (e.g. response =
{"status_code": 0}) before the try/finally or wrap the recovery
run_curl_request(...) in its own try/except and set response to a suitable
failure dict on exception, and likewise ensure forwarded_payloads is initialized
or set to an empty list on error; this keeps references to response and
forwarded_payloads safe for the final assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d178d8e-a82d-4827-962e-6b30b626d772

📥 Commits

Reviewing files that changed from the base of the PR and between 29deec9 and a0cd7e4.

📒 Files selected for processing (11)
  • include/fluent-bit/flb_connection.h
  • include/fluent-bit/http_server/flb_http_server.h
  • plugins/out_prometheus_exporter/prom_http.c
  • plugins/out_vivo_exporter/vivo_http.c
  • src/flb_connection.c
  • src/flb_downstream.c
  • src/http_server/flb_http_server.c
  • src/http_server/flb_http_server_config_map.c
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
  • tests/internal/http_server.c

edsiper added 2 commits April 23, 2026 12:35
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)

65-69: Swallowed cleanup exception — acceptable, optional debug log.

Ruff flags S110/BLE001 on the try: service.stop() / except Exception: pass. On a skip path where startup already failed, silently swallowing a secondary teardown error is reasonable. If you want to clear the lint warning and retain observability, log at debug level (logging.getLogger(__name__).debug("stop after skip failed", exc_info=True)) rather than restructuring the flow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 65 - 69, The except block currently swallows all exceptions from
service.stop() before calling pytest.skip; replace the silent pass with a
debug-level log to satisfy linting and preserve observability: inside the except
for service.stop() log the failure using logging.getLogger(__name__).debug with
a short message like "stop after skip failed" and include exc_info=True, then
proceed to call pytest.skip as before so teardown errors are visible in debug
logs but do not change test flow.
include/fluent-bit/http_server/flb_http_server.h (1)

41-41: Nit: document the unit of HTTP_SERVER_DEFAULT_IDLE_TIMEOUT.

The value 10 is only meaningful once you trace it through to net_setup.io_timeout (seconds). A short /* seconds */ comment next to the macro (and the idle_timeout fields below) would make the contract obvious at the header level.

Proposed tweak
-#define HTTP_SERVER_DEFAULT_IDLE_TIMEOUT       10
+#define HTTP_SERVER_DEFAULT_IDLE_TIMEOUT       10 /* seconds */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/fluent-bit/http_server/flb_http_server.h` at line 41, The macro
HTTP_SERVER_DEFAULT_IDLE_TIMEOUT lacks a unit comment and the idle_timeout
fields are undocumented; add a brief comment indicating the unit (seconds) next
to the macro definition and add matching comments for any struct fields named
idle_timeout in flb_http_server.h (or related header declarations) so the
timeout contract is explicit (reference HTTP_SERVER_DEFAULT_IDLE_TIMEOUT and the
idle_timeout fields).
tests/internal/http_server.c (1)

25-44: Shadow struct is fragile — any drift in the private definition silently invalidates these assertions.

test_http_server_runtime_shadow / test_http_server_worker_context_shadow are layout-copies of private structs in src/http_server/flb_http_server.c. If someone reorders, inserts, or resizes a field there (e.g., adds another int), these tests will read garbage without any compile-time signal.

At minimum, add a sentinel assertion. Better: expose a tiny test-only accessor (flb_http_server_runtime_get_worker_net_setup(server, idx)) behind #ifdef FLB_TESTS so the tests don't need to duplicate the layout.

Minimal guard (in the test file)
/* If the private struct ever changes, this catches it at compile time. */
_Static_assert(sizeof(struct test_http_server_worker_context_shadow)
               == /* keep in sync with flb_http_server_worker_context */ ...,
               "shadow struct drifted from flb_http_server_worker_context");

(Requires exposing sizeof via a getter, since the real struct is file-local — reinforcing that an accessor is the cleaner fix.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/internal/http_server.c` around lines 25 - 44, The shadow structs
test_http_server_runtime_shadow and test_http_server_worker_context_shadow are
brittle; either add a compile-time sentinel or—preferably—remove the layout
duplication and add a small test-only accessor in the real implementation (e.g.,
flb_http_server_runtime_get_worker_net_setup(server, idx) or
flb_http_server_runtime_worker_size_get()) guarded by `#ifdef` FLB_TESTS in
src/http_server/flb_http_server.c; update tests to call that accessor to obtain
the worker's net_setup (or sizeof the real worker) instead of relying on copied
struct layout, and delete the duplicated private-layout structs from tests (or
replace them with a _Static_assert that compares sizeof via the accessor) so
future private-field changes won’t silently break tests.
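As a sketch of that sentinel idea, under the assumption that the implementation exports a size getter next to its private struct (all names here are hypothetical stand-ins for the file-local types in src/http_server/flb_http_server.c):

```c
#include <assert.h>
#include <stddef.h>

/* "real" plays the role of the private struct in the implementation;
 * "shadow" is the copy a unit test would otherwise have to maintain. */
struct real_worker_context {
    int idle_timeout;
    int worker_index;
};

struct shadow_worker_context {
    int idle_timeout;
    int worker_index;
};

/* In the real code a getter like this would sit beside the private
 * definition (e.g. guarded by #ifdef FLB_TESTS), so tests can compare
 * sizes without duplicating the layout. */
static size_t worker_context_size(void)
{
    return sizeof(struct real_worker_context);
}

/* Compile-time sentinel: if the shadow drifts in size, the build fails
 * instead of the test silently reading garbage. */
_Static_assert(sizeof(struct real_worker_context) ==
               sizeof(struct shadow_worker_context),
               "shadow struct drifted from worker context");
```

A size check only catches resizing, not same-size reorderings, which is why the accessor approach remains the cleaner fix.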
src/http_server/flb_http_server.c (1)

144-149: Side-effecting accessor: flb_http_server_client_count now mutates state.

Reaping on every count works, but it makes a read-shaped helper silently destroy sessions, and the cost is O(N) per incoming connection accept (line 468). Two low-risk options:

  • Rename to make intent explicit (e.g. flb_http_server_active_client_count with a short comment), or
  • Hoist the reap call into the caller(s) so client_count stays a pure accessor.

Not a blocker — the periodic maintenance reap (line 535) already covers the common case — but worth a note for future readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/http_server/flb_http_server.c` around lines 144 - 149,
flb_http_server_client_count currently calls flb_http_server_reap_stale_sessions
and thus mutates state; make the accessor pure by removing the reap call from
flb_http_server_client_count so it simply returns
cfl_list_size(&server->clients), then hoist explicit calls to
flb_http_server_reap_stale_sessions into the callers that require stale-session
cleanup (for example the connection-accept path that currently invokes
client_count) so the mutation is explicit; alternatively, if you prefer to keep
the combined behavior, rename flb_http_server_client_count to
flb_http_server_active_client_count and add a brief comment stating it reaps
stale sessions as a side effect so intent is clear.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c2e2121-0484-44e6-8301-7c0e60966bbe

📥 Commits

Reviewing files that changed from the base of the PR and between a0cd7e4 and 3b96d65.

📒 Files selected for processing (11)
  • include/fluent-bit/flb_connection.h
  • include/fluent-bit/http_server/flb_http_server.h
  • plugins/out_prometheus_exporter/prom_http.c
  • plugins/out_vivo_exporter/vivo_http.c
  • src/flb_connection.c
  • src/flb_downstream.c
  • src/http_server/flb_http_server.c
  • src/http_server/flb_http_server_config_map.c
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
  • tests/internal/http_server.c
✅ Files skipped from review due to trivial changes (3)
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • src/flb_connection.c
  • src/http_server/flb_http_server_config_map.c
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/flb_downstream.c
  • plugins/out_vivo_exporter/vivo_http.c
  • plugins/out_prometheus_exporter/prom_http.c

Comment on lines +150 to +172
            time.sleep(3)
        finally:
            if held_connection:
                held_connection.close()
                held_connection = None

        response = run_curl_request(
            f"http://127.0.0.1:{service.flb_listener_port}/",
            payload='{"message":"idle-timeout-recovered"}',
            headers=["Content-Type: application/json"],
            http_mode="http1.1",
        )
        forwarded_payloads = service.service.wait_for_condition(
            lambda: data_storage["payloads"] if data_storage["payloads"] else None,
            timeout=10,
            interval=0.5,
            description="forwarded idle-timeout payload",
        )
    finally:
        service.stop()

    assert response["status_code"] == 201
    assert forwarded_payloads[0][0]["message"] == "idle-timeout-recovered"

⚠️ Potential issue | 🟡 Minor

Replace time.sleep(3) with a short polling loop to reduce CI flake.

The config uses http_server.idle_timeout: 2s and the test sleeps 3s before retrying. On a loaded CI worker (especially with thread scheduling jitter plus the 250ms maintenance tick in flb_http_server_worker_maintenance), that margin is thin and will occasionally produce a status that is still != 201 before the eviction completes. Polling the recovery request until it succeeds (or a generous deadline like timeout=10) is more robust and matches the wait_for_condition style already used in the same test.

Sketch
-            time.sleep(3)
-        finally:
+        finally:
             if held_connection:
                 held_connection.close()
                 held_connection = None

-        response = run_curl_request(
-            f"http://127.0.0.1:{service.flb_listener_port}/",
-            payload='{"message":"idle-timeout-recovered"}',
-            headers=["Content-Type: application/json"],
-            http_mode="http1.1",
-        )
+        deadline = time.monotonic() + 10
+        response = {"status_code": 0}
+        while time.monotonic() < deadline:
+            response = run_curl_request(
+                f"http://127.0.0.1:{service.flb_listener_port}/",
+                payload='{"message":"idle-timeout-recovered"}',
+                headers=["Content-Type: application/json"],
+                http_mode="http1.1",
+            )
+            if response["status_code"] == 201:
+                break
+            time.sleep(0.25)

Apply the same change around line 206 for the HTTP/2 variant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 150 - 172, Replace the brittle fixed sleep call (time.sleep(3))
with a short polling loop that repeatedly calls run_curl_request (using
service.flb_listener_port and the same payload/headers/http_mode) until the
response["status_code"] == 201 or a generous timeout (e.g., 10s) is reached with
a small sleep interval (e.g., 0.5s); keep held_connection cleanup and the use of
service.service.wait_for_condition for forwarded_payloads; apply the identical
polling change to the HTTP/2 variant mentioned around line ~206 so both http1.1
and http2 tests use the same robust retry pattern.

edsiper added 4 commits April 23, 2026 20:32
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)

106-149: ⚠️ Potential issue | 🟡 Minor

service.stop() can be skipped if the held‑connection block raises.

service.start() happens at line 111, but service.stop() only lives in the second try/finally (lines 148–149). If anything inside the first try body (e.g., socket.create_connection at line 115 on a busy CI box) raises before control reaches line 135, the first finally closes the socket, the exception propagates, and the Fluent Bit service is never stopped — leaking the worker process and its listener port for subsequent tests in the same session.

The two new idle‑timeout tests already use a single outer try/finally around service.stop(), so aligning this test with that shape keeps the suite consistent and leak‑free.

♻️ Proposed shape
     _start_service_or_skip(service, ["http_server.max_connections"])

     held_connection = None
     try:
-        held_connection = socket.create_connection(("127.0.0.1", service.flb_listener_port), timeout=2)
-        held_connection.settimeout(2)
-
-        overflow_rejected = False
         try:
-            response = run_curl_request(
-                f"http://127.0.0.1:{service.flb_listener_port}/",
-                payload='{"message":"max-connections"}',
-                headers=["Content-Type: application/json"],
-                http_mode="http1.1",
-            )
-            overflow_rejected = response["status_code"] != 201
-        except Exception:
-            overflow_rejected = True
-
-        assert overflow_rejected
-    finally:
-        if held_connection:
-            held_connection.close()
+            held_connection = socket.create_connection(("127.0.0.1", service.flb_listener_port), timeout=2)
+            held_connection.settimeout(2)

-    try:
+            overflow_rejected = False
+            try:
+                response = run_curl_request(
+                    f"http://127.0.0.1:{service.flb_listener_port}/",
+                    payload='{"message":"max-connections"}',
+                    headers=["Content-Type: application/json"],
+                    http_mode="http1.1",
+                )
+                overflow_rejected = response["status_code"] != 201
+            except Exception:
+                overflow_rejected = True
+
+            assert overflow_rejected
+        finally:
+            if held_connection:
+                held_connection.close()
+                held_connection = None
+
         accepted = run_curl_request(
             f"http://127.0.0.1:{service.flb_listener_port}/",
             payload='{"message":"max-connections"}',
             headers=["Content-Type: application/json"],
             http_mode="http1.1",
         )
         forwarded_payloads = service.service.wait_for_condition(
             lambda: data_storage["payloads"] if data_storage["payloads"] else None,
             timeout=10,
             interval=0.5,
             description="forwarded max-connections payload",
         )
     finally:
         service.stop()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 106 - 149, The test can leak the Service if an exception occurs
before the second finally; wrap the entire test runtime (after calling
_start_service_or_skip/service.start) in a single outer try/finally that always
calls service.stop() so the Fluent Bit worker is stopped regardless of where an
exception is raised (reference Service, _start_service_or_skip, held_connection,
run_curl_request and service.stop); remove/adjust the inner finally so
service.stop() is only invoked from the outer finally to guarantee cleanup on
any failure.
🧹 Nitpick comments (2)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (2)

77-103: Helper silently returns a non‑201 response after the deadline — intentional but worth a short docstring.

When the deadline expires without ever observing a 201, _wait_for_accepted_request returns the last observed response (which may be e.g. 4xx/5xx) rather than raising. That is actually the behavior the callers rely on (so assert response["status_code"] == 201 fails with the real status code rather than a timeout), but it's non‑obvious from the name alone. A one‑line docstring clarifying "returns the last response on deadline; only raises if no response was ever received" would prevent future misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 77 - 103, Add a one-line docstring to the helper function
_wait_for_accepted_request clarifying its deadline behavior: state that it polls
until a 201 is seen, returns the last observed HTTP response if the deadline
expires (so callers can assert on non-201 status codes), and only raises if no
response was ever received; place this docstring immediately under the def
_wait_for_accepted_request(...) signature to make the intent explicit to future
readers.
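For illustration, a minimal helper with that contract spelled out in its docstring. This is a hedged sketch, not the test's actual helper: `send_request`, the dict response shape, and the parameter defaults are assumptions standing in for the real URL/payload/headers plumbing.

```python
import time

def wait_for_accepted_request(send_request, timeout=10.0, interval=0.5):
    """Poll send_request() until it returns a 201 response.

    Returns the last observed response when the deadline expires (so callers
    can assert on the real status code instead of a timeout); raises
    RuntimeError only if no response was ever received.
    """
    deadline = time.monotonic() + timeout
    last_response = None
    while time.monotonic() < deadline:
        try:
            last_response = send_request()
        except Exception:
            # Connection refused/reset while the server evicts sessions:
            # keep polling until the deadline.
            time.sleep(interval)
            continue
        if last_response.get("status_code") == 201:
            return last_response
        time.sleep(interval)
    if last_response is None:
        raise RuntimeError("no response was ever received before the deadline")
    return last_response
```

With this shape, `assert response["status_code"] == 201` in a caller fails with the last real status code rather than an opaque timeout.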

59-74: Property‑name match in startup error string is substring‑based and could misfire.

The check property_name in error_message will also match if the property name happens to appear as a substring of an unrelated error (e.g., a different property whose name contains http_server.max_connections — unlikely today but fragile). Matching the more specific log form (unknown configuration property '<name>') in both branches would make the skip logic harder to trigger accidentally. Optional — the current form is good enough for the known error shapes.

♻️ Proposed tightening
         for property_name in required_properties:
-            if (property_name in error_message or
-                    f"unknown configuration property '{property_name}'" in log_contents):
+            unknown_marker = f"unknown configuration property '{property_name}'"
+            if unknown_marker in error_message or unknown_marker in log_contents:
                 try:
                     service.stop()
                 except Exception:
                     LOGGER.debug("stop after skip failed", exc_info=True)
                 pytest.skip(f"{property_name} is not supported by this Fluent Bit binary")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 59 - 74, The skip logic in _start_service_or_skip is using a
substring check (property_name in error_message) which can false-positive;
tighten it by matching the specific unknown-property form in both places: check
for the exact phrase f"unknown configuration property '{property_name}'" (or an
equivalent anchored regex) against error_message and log_contents instead of
using a bare substring containment, so the function only skips when the error
explicitly reports the given property as unknown; update references to
error_message and log_contents accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97e44797-e0ee-4f41-8105-f31406bfba33

📥 Commits

Reviewing files that changed from the base of the PR and between 3b96d65 and 6f6bf60.

📒 Files selected for processing (8)
  • include/fluent-bit/http_server/flb_http_server.h
  • plugins/out_prometheus_exporter/prom_http.c
  • plugins/out_vivo_exporter/vivo_http.c
  • src/http_server/flb_http_server.c
  • src/http_server/flb_http_server_config_map.c
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
  • tests/internal/http_server.c
✅ Files skipped from review due to trivial changes (2)
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • tests/internal/http_server.c
🚧 Files skipped from review as they are similar to previous changes (4)
  • plugins/out_prometheus_exporter/prom_http.c
  • plugins/out_vivo_exporter/vivo_http.c
  • src/http_server/flb_http_server_config_map.c
  • src/http_server/flb_http_server.c

Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
@edsiper edsiper force-pushed the harden-http-server branch from 6f6bf60 to acb63fb on April 24, 2026 02:45

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (3)

59-75: Non-skip error path does not stop the service.

When FluentBitStartupError is raised but the error does not match any of required_properties, raise re-throws without calling service.stop(), so the pre-start receiver (started by _start_receiver) stays running until pytest tears down the process. Mirror the skip path's cleanup to keep behavior symmetric:

♻️ Suggested tweak
         for property_name in required_properties:
             unknown_property_error = f"unknown configuration property '{property_name}'"
             if (unknown_property_error in error_message or
                     unknown_property_error in log_contents):
                 try:
                     service.stop()
                 except Exception:
                     LOGGER.debug("stop after skip failed", exc_info=True)
                 pytest.skip(f"{property_name} is not supported by this Fluent Bit binary")
+        try:
+            service.stop()
+        except Exception:
+            LOGGER.debug("stop after startup failure failed", exc_info=True)
         raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 59 - 75, The error path in _start_service_or_skip currently
re-raises FluentBitStartupError without stopping the started service; modify
_start_service_or_skip so that before re-raising the exception (in the final
raise when no required_properties matched) it attempts to stop the service (call
service.stop()) and log any stop failure (use LOGGER.debug with exc_info=True)
similar to the skip branch, then re-raise the original FluentBitStartupError;
keep the existing checks that inspect error_message and _read_fluent_bit_log and
ensure the stop attempt happens for the non-skip path as well.

207-258: LGTM; consider parameterizing to dedupe with the HTTP/1.1 variant.

The partial preface (PRI * HTTP/2.0\r\n\r\nSM\r\n, missing the trailing \r\n\r\n) correctly simulates a stalled prior-knowledge HTTP/2 client, and the curl_supports_http2() guard is right. The only nit is that this test is ~95% identical to test_in_http_idle_timeout_evicts_partial_request_connection; a pytest.mark.parametrize over (http_mode, partial_bytes, extra_asserts) would cut the duplication without losing readability. Optional — fine to leave as-is for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 207 - 258, The two nearly identical tests
test_in_http_idle_timeout_evicts_partial_http2_preface_connection and
test_in_http_idle_timeout_evicts_partial_request_connection should be
consolidated via pytest.mark.parametrize to remove duplication: extract the
differing pieces (http_mode value, the partial preface bytes sent, and any
variant-specific assertions) and parametrize the test function to iterate over
those tuples; update the test body to use the parameters where run_curl_request,
socket.sendall, and _wait_for_accepted_request are invoked and preserve existing
asserts (e.g., response["http_version"] checks) via an extra_asserts lambda or
flag so behavior for both HTTP/1.1 and HTTP/2 is retained.
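A sketch of that consolidation. The case table is what a pytest.mark.parametrize decorator would consume; shown here with a plain driver so the shape is self-contained. The tuple contents mirror the two tests described above, but send_partial, recover, and the extra-assert lambdas are assumptions standing in for the socket and curl plumbing.

```python
# Each tuple: (test id, bytes a stalled client sends, extra check on the recovery response)
IDLE_TIMEOUT_CASES = [
    ("http1.1-partial-request",
     b"POST / HTTP/1.1\r\nHost: x\r\nContent-Length: 64\r\n\r\n",
     lambda response: True),
    ("http2-partial-preface",
     b"PRI * HTTP/2.0\r\n\r\nSM\r\n",  # preface deliberately missing its trailing \r\n\r\n
     lambda response: response["http_version"] == "2"),
]

def run_idle_timeout_case(partial_bytes, extra_assert, send_partial, recover):
    """Shared body: hold a half-sent request open, then verify recovery succeeds."""
    send_partial(partial_bytes)   # stall the connection with a partial request
    response = recover()          # retried request after the idle eviction
    assert response["status_code"] == 201
    assert extra_assert(response)
    return response
```

The HTTP/2 entry would additionally be gated on curl_supports_http2(), e.g. via pytest.param(..., marks=pytest.mark.skipif(...)), so the variant-specific pieces stay in the table rather than in duplicated test bodies.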

78-105: Fallback return shape is inconsistent with the HTTP/2 test's expectations.

The degenerate fallback at line 105 returns {"status_code": 0}, but the HTTP/2 test reads response["http_version"] unconditionally at line 257. This path is only reachable if the loop body never runs (e.g., timeout <= 0), so it's effectively dead with the current defaults — still worth making defensive to avoid a future surprise:

🛡️ Suggested tweak
-    return {"status_code": 0}
+    return {"status_code": 0, "http_version": ""}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`
around lines 78 - 105, The fallback return in _wait_for_accepted_request
currently returns {"status_code": 0} which is missing keys the HTTP/2 test
expects (e.g., "http_version"); update the degenerate return value in
_wait_for_accepted_request to include the same response shape used by
run_curl_request (at minimum add "http_version": None or a sensible default like
"HTTP/2" along with "status_code": 0) so callers that read
response["http_version"] won't raise on the fallback path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4f933fa7-747e-4ba1-b675-99442d8386db

📥 Commits

Reviewing files that changed from the base of the PR and between 6f6bf60 and acb63fb.

📒 Files selected for processing (2)
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
  • tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
