engine: ignore duplicate STOP to prevent shutdown spin#11747
engine: ignore duplicate STOP to prevent shutdown spin#11747jinyongchoi wants to merge 2 commits intofluent:masterfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a guard in the engine STOP handling to skip re-entrance when shutdown is already in progress, and adds a POSIX-only runtime test that calls shutdown twice under a watchdog to ensure the engine does not busy-loop. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller (SIGTERM / plugin)
participant Engine as Engine (flb_engine_manager)
participant Config as Config (config->is_shutting_down)
participant EventLoop as EventLoop/Timerfd
Caller->>Engine: send FLB_ENGINE_STOP
Engine->>Config: read is_shutting_down
alt not shutting down
Engine->>EventLoop: register shutdown timer / set shutting flag
Engine->>Engine: perform flush and shutdown sequence
else already shutting down
Engine-->>Caller: return 0 (no-op)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR makes Fluent Bit’s engine shutdown path idempotent by ignoring duplicate FLB_ENGINE_STOP messages once shutdown has already started, preventing an epoll/timerfd busy-loop that can pin CPU and block termination (Fixes #11744).
Changes:
- Ignore duplicate STOP events in
flb_engine_manager()whenconfig->is_shutting_downis already set. - Add a runtime regression test that triggers back-to-back STOP requests and asserts
flb_stop()returns within the grace window. - Register the new runtime test in the runtime test CMake target list.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/flb_engine.c |
Adds an idempotency guard for duplicate STOP messages to prevent shutdown spin. |
tests/runtime/core_shutdown_spin.c |
Introduces a regression test for duplicate STOP shutdown behavior. |
tests/runtime/CMakeLists.txt |
Registers the new runtime test for CTest execution. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/runtime/core_shutdown_spin.c`:
- Around line 22-41: The test core_shutdown_spin.c uses POSIX-only APIs (alarm,
sigaction, SIGALRM, STDERR_FILENO, _exit, and the timeout_abort handler) and is
being registered unconditionally; update tests/runtime/CMakeLists.txt to guard
the test registration for core_shutdown_spin.c with the existing Windows check
pattern (wrap the add_test/add_executable lines for core_shutdown_spin.c inside
an if(NOT FLB_SYSTEM_WINDOWS) ... endif block) so the test (and its use of
timeout_abort, SIGALRM, alarm, STDERR_FILENO, _exit) is only added on
non-Windows platforms.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 76542311-7e98-455b-af44-bc625fc54bc6
📒 Files selected for processing (3)
src/flb_engine.ctests/runtime/CMakeLists.txttests/runtime/core_shutdown_spin.c
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 839473d39e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
When FLB_ENGINE_STOP arrives more than once in quick succession (e.g. an input plugin's internal flb_engine_exit() followed by an external SIGTERM), the second invocation re-enters the STOP handler block in flb_engine_start() and resets config->event_shutdown->status to MK_EVENT_NONE while the shutdown timerfd is still registered in the kernel's epoll set. The event loop dispatcher then drops the timer event because of the 'status != MK_EVENT_NONE' guard in flb_event_load_bucket_queue(), but the level-triggered timerfd keeps reporting EPOLLIN. The pipeline thread busy-loops in epoll_wait() at 100% CPU, grace_count never advances, and the process fails to terminate. Swallow duplicate STOP messages at the flb_engine_manager() boundary once shutdown is already in progress (config->is_shutting_down is set by flb_engine_stop_ingestion() during the first STOP). The first STOP arms the shutdown timer and drives the grace flow; any further STOPs would only corrupt existing event state without benefit. Periodic work during shutdown (flushing, task draining, grace counter) is already handled by the 1s tick in the FLB_ENGINE_SHUTDOWN branch, so swallowing the duplicate is safe. Signed-off-by: jinyong.choi <inimax801@gmail.com>
839473d to
cbd652c
Compare
Add core_shutdown_spin.c covering the duplicate-FLB_ENGINE_STOP busy-spin bug fixed in the previous commit. The test builds a minimal lib-input -> null-output pipeline, invokes flb_engine_exit() twice in quick succession, and asserts that flb_stop() returns within the grace period. A SIGALRM watchdog (SHUTDOWN_WATCHDOG_SEC=10) bounds the wait: if the guard regresses, pthread_join on the spinning worker never returns, the handler aborts the process with a visible FAIL message and exit code 1. This avoids relying on CTest's per-test timeout (1500s default) and surfaces the regression quickly regardless of how the binary is invoked. Signed-off-by: jinyong.choi <inimax801@gmail.com>
cbd652c to
00a1c2b
Compare
Fix a shutdown spin where duplicate
FLB_ENGINE_STOPmessages (e.g. an input plugin's internalflb_engine_exit()followed by an external SIGTERM, or any path that writes STOP twice toch_manager) cause the pipeline thread to busy-loop at 100% CPU and the process fails to terminate.The second STOP re-enters the handler block in
flb_engine_start()and resetsconfig->event_shutdown->statustoMK_EVENT_NONEwhile the shutdown timerfd is still registered in epoll. The dispatcher then drops the timer event via thestatus != MK_EVENT_NONEguard inflb_event_load_bucket_queue(), but the level-triggered timerfd keeps reporting EPOLLIN — busy-loop,grace_countnever advances,flb_engine_shutdown()unreachable.Fix: swallow duplicate STOP at the
flb_engine_manager()boundary whenconfig->is_shutting_downis already set (first STOP sets it viaflb_engine_stop_ingestion()). Periodic shutdown work (flush, task drain, grace counter) is already driven by the 1s tick in theFLB_ENGINE_SHUTDOWNbranch, so swallowing the duplicate is safe.I considered putting the guard at each
flb_engine_exit()call site or insideflb_engine_exit()itself, but chose to place it inflb_engine_manager()for the following reasons. I'd be happy to move it if maintainers prefer a different layer.flb_engine_exit()has 12+ call sites (in_tail, in_exec, in_stdin, out_exit, filter_expect, flb_lib, winsvc, ...) that look like independent shutdown triggers, so a caller-side check looked prone to races:config->is_shutting_downis set by the engine loop rather than byflb_engine_exit(), so two producers could both observeFALSEand each send a STOP.if (shutdown_fd <= 0)), so extending the same idempotency to theevent_shutdownreset seemed consistent with that existing pattern.Fixes #11744
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
Shutdown timing race — not reproducible via configuration, so a regression test is included instead:
tests/runtime/core_shutdown_spin.creproduces the duplicate-STOP race via two back-to-backflb_engine_exit()calls and assertsflb_stop()returns within the grace period.If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Tests