http_server: harden listener and architecture #11742
Conversation
📝 Walkthrough

Adds a per-connection drop-notification callback and an HTTP server idle-timeout option; wires idle_timeout through config/options/workers/network setup, introduces connection-drop handling and stale-session reaping, and adds integration/unit tests for idle eviction and session cleanup.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant HTTPServer as HTTP Server
    participant Connection
    participant Session
    participant Downstream

    Client->>HTTPServer: Establish connection
    HTTPServer->>Connection: create connection, set drop_notification_callback
    HTTPServer->>Session: create session linked to connection
    Note over HTTPServer,Session: idle_timeout tracking
    Note over Connection: idle timeout expires / server enforces max_connections
    Connection->>Downstream: prepare_destroy_conn()
    Downstream->>Connection: invoke drop_notification_callback
    Connection->>Session: mark drop_pending / detach connection refs
    Downstream->>Downstream: deregister events, close socket, cleanup
    HTTPServer->>HTTPServer: reap stale sessions (destroy sessions with invalid/NULL connection)
    Client->>HTTPServer: New connection attempt succeeds after eviction
```
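The flow above can be sketched as a small simulation. `Connection`, `Session`, and `Server` here are illustrative stand-ins, not Fluent Bit APIs; the real implementation lives in src/flb_connection.c and src/http_server/flb_http_server.c:

```python
# Illustrative simulation of the drop-notification + reaping lifecycle.
# All names are hypothetical; only the sequence of events mirrors the PR.

INVALID_SOCKET = -1

class Connection:
    def __init__(self, fd):
        self.fd = fd
        self.user_data = None               # back-pointer to the session
        self.drop_notification_callback = None

    def prepare_destroy(self):
        # Notify the owner *before* tearing down the socket, so the
        # session can detach its reference to this connection.
        if self.drop_notification_callback:
            self.drop_notification_callback(self)
        self.fd = INVALID_SOCKET            # deregister + close in the real code

class Session:
    def __init__(self, connection):
        self.connection = connection
        connection.user_data = self
        connection.drop_notification_callback = self._on_drop

    def _on_drop(self, connection):
        # Detach only; never destroy or unlink the connection from here.
        connection.user_data = None
        self.connection = None

class Server:
    def __init__(self):
        self.clients = []

    def reap_stale_sessions(self):
        # Destroy sessions whose connection is gone or whose fd is invalid.
        self.clients = [s for s in self.clients
                        if s.connection is not None
                        and s.connection.fd != INVALID_SOCKET]

server = Server()
conn = Connection(fd=7)
session = Session(conn)
server.clients.append(session)

conn.prepare_destroy()        # idle timeout fires
server.reap_stale_sessions()  # maintenance sweep destroys the stale session
print(len(server.clients))    # → 0
```

After eviction the client list is empty again, so a new connection attempt is accepted, matching the last step of the diagram.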
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0cd7e4faa
```c
session = (struct flb_http_server_session *) connection->user_data;
if (session == NULL) {
    return -1;
```
Destroy timed-out sessions after user_data is cleared
When a timeout drops a connection, prepare_destroy_conn() now invokes the drop callback, which clears connection->user_data before the injected cleanup event is processed. This early return path then skips flb_http_server_session_destroy(), so the session can remain linked in server->clients (and keep its buffers) until another accept path happens to sweep it. In single-worker/caller-event-loop mode, flb_http_server_worker_maintenance() is never run, so these timed-out sessions are retained unnecessarily.
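A minimal simulation of the leak path described above (illustrative Python, not the actual C code): once the drop callback clears `user_data`, the injected cleanup event can no longer find the session and returns early:

```python
# Hypothetical names; this only demonstrates the ordering hazard the
# review describes, not Fluent Bit's real event machinery.

server_clients = []

class Conn:
    def __init__(self):
        self.user_data = None

def drop_callback(conn):
    conn.user_data = None          # detach before socket teardown

def cleanup_event(conn):
    session = conn.user_data
    if session is None:
        return -1                  # early return: session never destroyed
    server_clients.remove(session)
    return 0

conn = Conn()
session = object()
conn.user_data = session
server_clients.append(session)

drop_callback(conn)                # timeout path fires first
rc = cleanup_event(conn)           # then the injected cleanup event runs
print(rc, len(server_clients))     # → -1 1  (session leaked until a sweep)
```

In single-worker mode no maintenance sweep ever runs, so the leaked entry persists; capturing the session before clearing `user_data` (or destroying it from the callback path) would close the gap.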
```c
if (options->networking_setup != NULL &&
    options->networking_setup->io_timeout <= 0 &&
    options->idle_timeout > 0) {
    options->networking_setup->io_timeout = options->idle_timeout;
```
Preserve disabled idle timeout in multi-worker servers
This assignment is correct for normal initialization, but worker servers are initialized with fresh default options (idle_timeout=10) and a copied net_setup. If users configure http_server.idle_timeout to 0 (documented to disable), net_setup.io_timeout stays 0 in the parent and this block re-applies 10 during worker initialization, so disabling idle timeout does not work when http_server.workers > 1.
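One hedged fix is to distinguish "unset" from an explicit 0 when applying the default, for instance with a sentinel, so that workers inherit the parent's resolved value instead of re-defaulting. The names below are illustrative, not Fluent Bit's actual option plumbing:

```python
# Sketch: default idle_timeout only when the user left it unset, so an
# explicit 0 ("disabled") survives worker re-initialization.
UNSET = -1
DEFAULT_IDLE_TIMEOUT = 10  # seconds

def resolve_idle_timeout(configured):
    if configured == UNSET:
        return DEFAULT_IDLE_TIMEOUT
    return configured          # 0 stays 0: idle timeout disabled

def init_worker(parent_idle_timeout):
    # Workers take the parent's already-resolved value; they never
    # re-apply the default on their own.
    return resolve_idle_timeout(parent_idle_timeout)

parent = resolve_idle_timeout(0)   # user disabled the idle timeout
print(init_worker(parent))         # → 0 (stays disabled with workers > 1)
```

With the current code shape, the worker path would instead see 0 and re-apply 10, which is exactly the regression this comment points out.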
Actionable comments posted: 2
🧹 Nitpick comments (4)

include/fluent-bit/flb_connection.h (1)

61-72: Document the drop callback contract. Consider a brief comment on `drop_notification_callback` describing the invariants: it is invoked from `prepare_destroy_conn` before the event is deregistered and the fd is closed, while the connection is still linked in `busy_queue`. Callers (e.g. the HTTP server session) must not release/destroy the connection from within the callback, and must not unlink it from the queue, since `prepare_destroy_conn` will do that right after. A short comment here will prevent future misuse.

src/flb_connection.c (1)

20-20: Nit: redundant explicit NULL after memset. The `memset(connection, 0, sizeof(...))` on line 16 already zero-initializes `drop_notification_callback`. The explicit `= NULL` is harmless and matches the pattern of the surrounding fields, so feel free to keep it for readability or drop it.

src/http_server/flb_http_server.c (1)

120-146: Minor: duplicated stale-session reaping logic. `flb_http_server_client_count` and `flb_http_server_worker_maintenance` now both walk `server->clients` and destroy sessions whose connection is NULL or whose fd is `FLB_INVALID_SOCKET`. Consider extracting into a single helper (e.g. `flb_http_server_reap_stale_sessions()`) to keep the two code paths from drifting, especially since `client_count` is on the connection-accept hot path.

♻️ Proposed extraction

```c
static void flb_http_server_reap_stale_sessions(struct flb_http_server *server)
{
    struct cfl_list *iterator_backup;
    struct cfl_list *iterator;
    struct flb_http_server_session *session;

    cfl_list_foreach_safe(iterator, iterator_backup, &server->clients) {
        session = cfl_list_entry(iterator,
                                 struct flb_http_server_session,
                                 _head);

        if (session->connection == NULL ||
            session->connection->fd == FLB_INVALID_SOCKET) {
            flb_http_server_session_destroy(session);
        }
    }
}
```

Then have `flb_http_server_client_count` call it before counting, and `flb_http_server_worker_maintenance` call it after the downstream timeout sweep.

Also applies to: 509-537

tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)

99-101: Consider mirroring the `FluentBitStartupError` handling from the existing test. `test_in_http_max_connections_blocks_and_recovers` (L49–56) gracefully skips when the running Fluent Bit binary does not support `http_server.max_connections`. The two new tests call `service.start()` without that guard, so on older binaries (or if the new `http_server.idle_timeout` key is missing) they will hard-fail instead of skipping. Reusing the same pattern would keep the suite resilient.

Also applies to: 146-151
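The guard pattern this comment asks for can be sketched as follows; `FluentBitStartupError` and the skip mechanism here are stand-ins for the harness and pytest equivalents, since the suite's real helpers are not shown in this diff:

```python
# Sketch of the skip-on-unsupported-option guard. FluentBitStartupError
# and SkipTest stand in for the harness exception and pytest.skip().

class FluentBitStartupError(Exception):
    pass

class SkipTest(Exception):
    """Stand-in for pytest.skip(); real tests would call pytest.skip()."""

def start_or_skip(service, feature="http_server.idle_timeout"):
    try:
        service.start()
    except FluentBitStartupError:
        raise SkipTest(f"{feature} is not supported by this Fluent Bit binary")

class OldBinaryService:
    """Simulates an older binary that rejects the new config key."""
    def start(self):
        raise FluentBitStartupError("unknown configuration property")

try:
    start_or_skip(OldBinaryService())
    outcome = "ran"
except SkipTest:
    outcome = "skipped"
print(outcome)   # → skipped
```

In the real suite the same shape with `pytest.skip()` turns a hard startup failure into a clean skip on unsupported binaries.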
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py`:

- Around lines 110–143: Initialize `response = None` before the try block and move the recovery `run_curl_request` (the second call, which posts "idle-timeout-recovered") out of the try/finally so it always runs after `held_connection` is closed and `service.stop()` is called; still close `held_connection` in the finally, then perform the recovery request and set `response` from its return value before the final assertions that inspect `response` and `forwarded_payloads`.
- Around lines 99–143: The test `test_in_http_idle_timeout_evicts_partial_request_connection` can leave `response` unassigned if the second `run_curl_request()` raises, so initialize or guarantee assignment before the finally/asserts: add a default assignment (e.g. `response = {"status_code": 0}`) before the try/finally, or wrap the recovery `run_curl_request(...)` in its own try/except and set `response` to a suitable failure dict on exception; likewise ensure `forwarded_payloads` is initialized or set to an empty list on error. This keeps references to `response` and `forwarded_payloads` safe for the final assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2d178d8e-a82d-4827-962e-6b30b626d772
📒 Files selected for processing (11)

- include/fluent-bit/flb_connection.h
- include/fluent-bit/http_server/flb_http_server.h
- plugins/out_prometheus_exporter/prom_http.c
- plugins/out_vivo_exporter/vivo_http.c
- src/flb_connection.c
- src/flb_downstream.c
- src/http_server/flb_http_server.c
- src/http_server/flb_http_server_config_map.c
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
- tests/internal/http_server.c
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Force-pushed a0cd7e4 to 3b96d65
Actionable comments posted: 1
🧹 Nitpick comments (4)

tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)

65-69: Swallowed cleanup exception: acceptable, optional debug log. Ruff flags S110/BLE001 on the `try: service.stop() / except Exception: pass`. On a skip path where startup already failed, silently swallowing a secondary teardown error is reasonable. If you want to clear the lint warning and retain observability, log at debug level (`logging.getLogger(__name__).debug("stop after skip failed", exc_info=True)`) rather than restructuring the flow.

include/fluent-bit/http_server/flb_http_server.h (1)

41-41: Nit: document the unit of `HTTP_SERVER_DEFAULT_IDLE_TIMEOUT`. The value `10` is only meaningful once you trace it through to `net_setup.io_timeout` (seconds). A short `/* seconds */` comment next to the macro (and the `idle_timeout` fields below) would make the contract obvious at the header level.

Proposed tweak

```diff
-#define HTTP_SERVER_DEFAULT_IDLE_TIMEOUT 10
+#define HTTP_SERVER_DEFAULT_IDLE_TIMEOUT 10 /* seconds */
```

tests/internal/http_server.c (1)

25-44: Shadow struct is fragile: any drift in the private definition silently invalidates these assertions. `test_http_server_runtime_shadow` / `test_http_server_worker_context_shadow` are layout-copies of private structs in src/http_server/flb_http_server.c. If someone reorders, inserts, or resizes a field there (e.g., adds another `int`), these tests will read garbage without any compile-time signal.

At minimum, add a sentinel assertion. Better: expose a tiny test-only accessor (`flb_http_server_runtime_get_worker_net_setup(server, idx)`) behind `#ifdef FLB_TESTS` so the tests don't need to duplicate the layout.

Minimal guard (in the test file)

```c
/* If the private struct ever changes, this catches it at compile time. */
_Static_assert(sizeof(struct test_http_server_worker_context_shadow) ==
               /* keep in sync with flb_http_server_worker_context */ ...,
               "shadow struct drifted from flb_http_server_worker_context");
```

(Requires exposing the `sizeof` via a getter, since the real struct is file-local, reinforcing that an accessor is the cleaner fix.)

src/http_server/flb_http_server.c (1)

144-149: Side-effecting accessor: `flb_http_server_client_count` now mutates state. Reaping on every count works, but it makes a read-shaped helper silently destroy sessions, and the cost is O(N) per incoming connection accept (line 468). Two low-risk options:

- Rename to make intent explicit (e.g. `flb_http_server_active_client_count` with a short comment), or
- Hoist the reap call into the caller(s) so `client_count` stays a pure accessor.

Not a blocker, since the periodic maintenance reap (line 535) already covers the common case, but worth a note for future readers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9c2e2121-0484-44e6-8301-7c0e60966bbe
📒 Files selected for processing (11)

- include/fluent-bit/flb_connection.h
- include/fluent-bit/http_server/flb_http_server.h
- plugins/out_prometheus_exporter/prom_http.c
- plugins/out_vivo_exporter/vivo_http.c
- src/flb_connection.c
- src/flb_downstream.c
- src/http_server/flb_http_server.c
- src/http_server/flb_http_server_config_map.c
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
- tests/internal/http_server.c
✅ Files skipped from review due to trivial changes (3)
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- src/flb_connection.c
- src/http_server/flb_http_server_config_map.c
🚧 Files skipped from review as they are similar to previous changes (3)
- src/flb_downstream.c
- plugins/out_vivo_exporter/vivo_http.c
- plugins/out_prometheus_exporter/prom_http.c
```python
            time.sleep(3)
        finally:
            if held_connection:
                held_connection.close()
                held_connection = None

        response = run_curl_request(
            f"http://127.0.0.1:{service.flb_listener_port}/",
            payload='{"message":"idle-timeout-recovered"}',
            headers=["Content-Type: application/json"],
            http_mode="http1.1",
        )
        forwarded_payloads = service.service.wait_for_condition(
            lambda: data_storage["payloads"] if data_storage["payloads"] else None,
            timeout=10,
            interval=0.5,
            description="forwarded idle-timeout payload",
        )
    finally:
        service.stop()

    assert response["status_code"] == 201
    assert forwarded_payloads[0][0]["message"] == "idle-timeout-recovered"
```
Replace time.sleep(3) with a short polling loop to reduce CI flake.
The config uses http_server.idle_timeout: 2s and the test sleeps 3s before retrying. On a loaded CI worker (especially with thread scheduling jitter plus the 250ms maintenance tick in flb_http_server_worker_maintenance), that margin is thin and will occasionally produce a status that is still != 201 before the eviction completes. Polling the recovery request until it succeeds (or a generous deadline like timeout=10) is more robust and matches the wait_for_condition style already used in the same test.
Sketch

```diff
-        time.sleep(3)
-    finally:
+    finally:
         if held_connection:
             held_connection.close()
             held_connection = None

-        response = run_curl_request(
-            f"http://127.0.0.1:{service.flb_listener_port}/",
-            payload='{"message":"idle-timeout-recovered"}',
-            headers=["Content-Type: application/json"],
-            http_mode="http1.1",
-        )
+        deadline = time.monotonic() + 10
+        response = {"status_code": 0}
+        while time.monotonic() < deadline:
+            response = run_curl_request(
+                f"http://127.0.0.1:{service.flb_listener_port}/",
+                payload='{"message":"idle-timeout-recovered"}',
+                headers=["Content-Type: application/json"],
+                http_mode="http1.1",
+            )
+            if response["status_code"] == 201:
+                break
+            time.sleep(0.25)
```

Apply the same change around line 206 for the HTTP/2 variant.
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Force-pushed 3b96d65 to 6f6bf60
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (1)
106-149: ⚠️ Potential issue | 🟡 Minor

`service.stop()` can be skipped if the held-connection block raises. `service.start()` happens at line 111, but `service.stop()` only lives in the second try/finally (lines 148–149). If anything inside the first try body (e.g., `socket.create_connection` at line 115 on a busy CI box) raises before control reaches line 135, the first finally closes the socket, the exception propagates, and the Fluent Bit service is never stopped, leaking the worker process and its listener port for subsequent tests in the same session.

The two new idle-timeout tests already use a single outer try/finally around `service.stop()`, so aligning this test with that shape keeps the suite consistent and leak-free.

♻️ Proposed shape
```diff
 _start_service_or_skip(service, ["http_server.max_connections"])

 held_connection = None
 try:
-    held_connection = socket.create_connection(("127.0.0.1", service.flb_listener_port), timeout=2)
-    held_connection.settimeout(2)
-
-    overflow_rejected = False
     try:
-        response = run_curl_request(
-            f"http://127.0.0.1:{service.flb_listener_port}/",
-            payload='{"message":"max-connections"}',
-            headers=["Content-Type: application/json"],
-            http_mode="http1.1",
-        )
-        overflow_rejected = response["status_code"] != 201
-    except Exception:
-        overflow_rejected = True
-
-    assert overflow_rejected
-finally:
-    if held_connection:
-        held_connection.close()
-
-try:
+        held_connection = socket.create_connection(("127.0.0.1", service.flb_listener_port), timeout=2)
+        held_connection.settimeout(2)
+
+        overflow_rejected = False
+        try:
+            response = run_curl_request(
+                f"http://127.0.0.1:{service.flb_listener_port}/",
+                payload='{"message":"max-connections"}',
+                headers=["Content-Type: application/json"],
+                http_mode="http1.1",
+            )
+            overflow_rejected = response["status_code"] != 201
+        except Exception:
+            overflow_rejected = True
+
+        assert overflow_rejected
+    finally:
+        if held_connection:
+            held_connection.close()
+            held_connection = None
+
     accepted = run_curl_request(
         f"http://127.0.0.1:{service.flb_listener_port}/",
         payload='{"message":"max-connections"}',
         headers=["Content-Type: application/json"],
         http_mode="http1.1",
     )
     forwarded_payloads = service.service.wait_for_condition(
         lambda: data_storage["payloads"] if data_storage["payloads"] else None,
         timeout=10,
         interval=0.5,
         description="forwarded max-connections payload",
     )
 finally:
     service.stop()
```
🧹 Nitpick comments (2)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (2)
77-103: Helper silently returns a non-201 response after the deadline; intentional but worth a short docstring. When the deadline expires without ever observing a 201, `_wait_for_accepted_request` returns the last observed response (which may be e.g. 4xx/5xx) rather than raising. That is actually the behavior the callers rely on (so `assert response["status_code"] == 201` fails with the real status code rather than a timeout), but it is non-obvious from the name alone. A one-line docstring clarifying "returns the last response on deadline; only raises if no response was ever received" would prevent future misuse.
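A possible shape for the helper with the suggested docstring; the signature and the fake request below are illustrative, not the suite's actual `_wait_for_accepted_request`:

```python
import time

def wait_for_accepted_request(send_request, deadline_s=10.0, interval_s=0.5):
    """Poll until a 201 is observed.

    Returns the last observed response when the deadline expires (so
    callers can assert on the real status code); raises only if no
    response was ever received.
    """
    last = None
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        last = send_request()
        if last["status_code"] == 201:
            return last
        time.sleep(interval_s)
    if last is None:
        raise RuntimeError("no response received before deadline")
    return last

# Simulated server that accepts on the third attempt:
attempts = {"n": 0}
def fake_request():
    attempts["n"] += 1
    return {"status_code": 201 if attempts["n"] >= 3 else 503}

print(wait_for_accepted_request(fake_request, interval_s=0.01)["status_code"])  # → 201
```

The docstring makes the deadline contract explicit at the definition site, which is the whole point of the nitpick above.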
59-74: Property‑name match in startup error string is substring‑based and could misfire.

The check `property_name in error_message` will also match if the property name happens to appear as a substring of an unrelated error (e.g., a different property whose name contains `http_server.max_connections` — unlikely today but fragile). Matching the more specific log form (`unknown configuration property '<name>'`) in both branches would make the skip logic harder to trigger accidentally. Optional — the current form is good enough for the known error shapes.

♻️ Proposed tightening

```diff
     for property_name in required_properties:
-        if (property_name in error_message or
-                f"unknown configuration property '{property_name}'" in log_contents):
+        unknown_marker = f"unknown configuration property '{property_name}'"
+        if unknown_marker in error_message or unknown_marker in log_contents:
             try:
                 service.stop()
             except Exception:
                 LOGGER.debug("stop after skip failed", exc_info=True)
             pytest.skip(f"{property_name} is not supported by this Fluent Bit binary")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py` around lines 59 - 74, The skip logic in _start_service_or_skip is using a substring check (property_name in error_message) which can false-positive; tighten it by matching the specific unknown-property form in both places: check for the exact phrase f"unknown configuration property '{property_name}'" (or an equivalent anchored regex) against error_message and log_contents instead of using a bare substring containment, so the function only skips when the error explicitly reports the given property as unknown; update references to error_message and log_contents accordingly.
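The tightened check reduces to matching the full log phrase rather than the bare property name; a small sketch (the function name is hypothetical):

```python
def reports_unknown_property(property_name, error_message, log_contents):
    """True only when the error explicitly flags the property as unknown."""
    unknown_marker = f"unknown configuration property '{property_name}'"
    return unknown_marker in error_message or unknown_marker in log_contents


prop = "http_server.max_connections"

# explicit unknown-property error: skipping the test is justified
skip_hit = reports_unknown_property(
    prop, f"unknown configuration property '{prop}'", "")

# unrelated error that merely mentions the name: must not trigger a skip
false_hit = reports_unknown_property(
    prop, f"invalid value for {prop}", "")
```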
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 97e44797-e0ee-4f41-8105-f31406bfba33
📒 Files selected for processing (8)
- include/fluent-bit/http_server/flb_http_server.h
- plugins/out_prometheus_exporter/prom_http.c
- plugins/out_vivo_exporter/vivo_http.c
- src/http_server/flb_http_server.c
- src/http_server/flb_http_server_config_map.c
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
- tests/internal/http_server.c
✅ Files skipped from review due to trivial changes (2)
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- tests/internal/http_server.c
🚧 Files skipped from review as they are similar to previous changes (4)
- plugins/out_prometheus_exporter/prom_http.c
- plugins/out_vivo_exporter/vivo_http.c
- src/http_server/flb_http_server_config_map.c
- src/http_server/flb_http_server.c
Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>

Force-pushed from 6f6bf60 to acb63fb
🧹 Nitpick comments (3)
tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py (3)
59-75: Non-skip error path does not stop the service.

When `FluentBitStartupError` is raised but the error does not match any of `required_properties`, `raise` re-throws without calling `service.stop()`, so the pre-start receiver (started by `_start_receiver`) stays running until pytest tears down the process. Mirror the skip path's cleanup to keep behavior symmetric:

♻️ Suggested tweak

```diff
     for property_name in required_properties:
         unknown_property_error = f"unknown configuration property '{property_name}'"
         if (unknown_property_error in error_message or
                 unknown_property_error in log_contents):
             try:
                 service.stop()
             except Exception:
                 LOGGER.debug("stop after skip failed", exc_info=True)
             pytest.skip(f"{property_name} is not supported by this Fluent Bit binary")
+    try:
+        service.stop()
+    except Exception:
+        LOGGER.debug("stop after startup failure failed", exc_info=True)
     raise
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py` around lines 59 - 75, The error path in _start_service_or_skip currently re-raises FluentBitStartupError without stopping the started service; modify _start_service_or_skip so that before re-raising the exception (in the final raise when no required_properties matched) it attempts to stop the service (call service.stop()) and log any stop failure (use LOGGER.debug with exc_info=True) similar to the skip branch, then re-raise the original FluentBitStartupError; keep the existing checks that inspect error_message and _read_fluent_bit_log and ensure the stop attempt happens for the non-skip path as well.
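Folded into a helper, the symmetric cleanup looks roughly like this (a sketch: the skip branch for unsupported properties is elided, the exception class mirrors the suite's name, and the service bodies are stand-ins):

```python
import logging

LOGGER = logging.getLogger(__name__)


class FluentBitStartupError(Exception):
    pass


def start_service_or_fail(service):
    """Start the service; on startup failure, stop it before re-raising.

    The pytest.skip branch is omitted here; the point is that the non-skip
    path also reaches service.stop() before the exception propagates.
    """
    try:
        service.start()
    except FluentBitStartupError:
        try:
            service.stop()
        except Exception:
            LOGGER.debug("stop after startup failure failed", exc_info=True)
        raise


class FailingService:
    """Stand-in whose startup always fails."""

    def __init__(self):
        self.stopped = False

    def start(self):
        raise FluentBitStartupError("listener could not bind")

    def stop(self):
        self.stopped = True


svc = FailingService()
reraised = False
try:
    start_service_or_fail(svc)
except FluentBitStartupError:
    reraised = True
```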
207-258: LGTM; consider parameterizing to dedupe with the HTTP/1.1 variant.

The partial preface (`PRI * HTTP/2.0\r\n\r\nSM\r\n`, missing the trailing `\r\n\r\n`) correctly simulates a stalled prior-knowledge HTTP/2 client, and the `curl_supports_http2()` guard is right. The only nit is that this test is ~95% identical to `test_in_http_idle_timeout_evicts_partial_request_connection`; a `pytest.mark.parametrize` over `(http_mode, partial_bytes, extra_asserts)` would cut the duplication without losing readability. Optional — fine to leave as-is for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py` around lines 207 - 258, The two nearly identical tests test_in_http_idle_timeout_evicts_partial_http2_preface_connection and test_in_http_idle_timeout_evicts_partial_request_connection should be consolidated via pytest.mark.parametrize to remove duplication: extract the differing pieces (http_mode value, the partial preface bytes sent, and any variant-specific assertions) and parametrize the test function to iterate over those tuples; update the test body to use the parameters where run_curl_request, socket.sendall, and _wait_for_accepted_request are invoked and preserve existing asserts (e.g., response["http_version"] checks) via an extra_asserts lambda or flag so behavior for both HTTP/1.1 and HTTP/2 is retained.
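One way the two variants could be folded together; the tuple fields follow the `(http_mode, partial_bytes, ...)` shape suggested above, but the helper names, payload bytes, and callables below are placeholders, not the suite's real fixtures:

```python
# With pytest this table would feed
# @pytest.mark.parametrize("http_mode, partial_bytes, check_http2", VARIANTS)
VARIANTS = [
    # HTTP/1.1 variant: headers sent, body withheld (illustrative bytes)
    ("http1", b"POST / HTTP/1.1\r\nContent-Length: 10\r\n\r\n", False),
    # HTTP/2 variant: connection preface missing its trailing \r\n\r\n
    ("http2", b"PRI * HTTP/2.0\r\n\r\nSM\r\n", True),
]


def idle_timeout_case(http_mode, partial_bytes, check_http2,
                      send_partial, wait_accept):
    """Shared body: stall one connection, then confirm a healthy client
    is eventually accepted once the idle timeout evicts the stalled one."""
    send_partial(partial_bytes)
    response = wait_accept(http_mode)
    assert response["status_code"] == 201
    if check_http2:
        assert response["http_version"].startswith("2")


# drive both variants with stubbed I/O
stalled = []
for mode, payload, http2 in VARIANTS:
    idle_timeout_case(
        mode, payload, http2,
        send_partial=stalled.append,
        wait_accept=lambda m: {"status_code": 201, "http_version": "2"})
```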
78-105: Fallback return shape is inconsistent with the HTTP/2 test's expectations.

The degenerate fallback at line 105 returns `{"status_code": 0}`, but the HTTP/2 test reads `response["http_version"]` unconditionally at line 257. This path is only reachable if the loop body never runs (e.g., `timeout <= 0`), so it's effectively dead with the current defaults — still worth making defensive to avoid a future surprise:

🛡️ Suggested tweak

```diff
-    return {"status_code": 0}
+    return {"status_code": 0, "http_version": ""}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py` around lines 78 - 105, The fallback return in _wait_for_accepted_request currently returns {"status_code": 0} which is missing keys the HTTP/2 test expects (e.g., "http_version"); update the degenerate return value in _wait_for_accepted_request to include the same response shape used by run_curl_request (at minimum add "http_version": None or a sensible default like "HTTP/2" along with "status_code": 0) so callers that read response["http_version"] won't raise on the fallback path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4f933fa7-747e-4ba1-b675-99442d8386db
📒 Files selected for processing (2)
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
- tests/integration/scenarios/in_http_max_connections/tests/test_in_http_max_connections_001.py
✅ Files skipped from review due to trivial changes (1)
- tests/integration/scenarios/in_http_max_connections/config/in_http_max_connections.yaml
This PR hardens Fluent Bit’s shared HTTP server against stalled and partially-sent client connections by introducing a default per-connection idle timeout and cleaning up session lifecycle handling around dropped or timed-out sockets.
It also propagates the timeout setting to HTTP services that construct server options manually, adds internal coverage for timeout/config behavior, and adds integration tests that verify blocked connections are evicted and normal traffic recovers correctly for both HTTP/1.1 and HTTP/2.
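The eviction behavior those integration tests exercise can be illustrated with a toy Python server. This is a sketch of the concept only: Fluent Bit enforces the idle timeout inside its C event loop, and the 0.2 s timeout and partial HTTP/2 preface below are illustrative choices, not the PR's defaults.

```python
import selectors
import socket
import threading
import time


def idle_timeout_server(listener, stop, idle_timeout=0.2):
    """Toy accept loop that evicts connections silent for > idle_timeout."""
    sel = selectors.DefaultSelector()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    last_seen = {}  # connection -> time of last activity
    while not stop.is_set():
        for key, _ in sel.select(timeout=0.05):
            if key.fileobj is listener:
                conn, _addr = listener.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
                last_seen[conn] = time.monotonic()
            else:
                data = key.fileobj.recv(4096)
                if data:  # any bytes reset the idle clock
                    last_seen[key.fileobj] = time.monotonic()
                else:     # peer closed first
                    sel.unregister(key.fileobj)
                    key.fileobj.close()
                    last_seen.pop(key.fileobj, None)
        now = time.monotonic()
        for conn in [c for c, t in last_seen.items()
                     if now - t > idle_timeout]:
            sel.unregister(conn)  # reap: the stalled client sees a close
            conn.close()
            del last_seen[conn]


listener = socket.create_server(("127.0.0.1", 0))
stop = threading.Event()
threading.Thread(target=idle_timeout_server, args=(listener, stop),
                 daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"PRI * HTTP/2.0\r\n\r\nSM\r\n")  # partial preface, then stall
client.settimeout(5.0)
evicted = client.recv(1) == b""  # b"" means the server closed the socket
stop.set()
client.close()
listener.close()
```

A client that sends a partial preface and then goes silent is closed by the server, while a new, healthy client remains free to connect.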
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit

- New Features
- Improvements / Bug Fixes
- Tests