engine: ignore duplicate STOP to prevent shutdown spin#11747

Open
jinyongchoi wants to merge 2 commits into fluent:master from jinyongchoi:fix/engine-shutdown-spin

@jinyongchoi jinyongchoi commented Apr 25, 2026

Fix a shutdown spin where duplicate FLB_ENGINE_STOP messages (e.g. an input plugin's internal flb_engine_exit() followed by an external SIGTERM, or any path that writes STOP twice to ch_manager) cause the pipeline thread to busy-loop at 100% CPU and prevent the process from terminating.

The second STOP re-enters the handler block in flb_engine_start() and resets config->event_shutdown->status to MK_EVENT_NONE while the shutdown timerfd is still registered in epoll. The dispatcher then drops the timer event via the status != MK_EVENT_NONE guard in flb_event_load_bucket_queue(), but the level-triggered timerfd keeps reporting EPOLLIN. The result is a busy loop: grace_count never advances and flb_engine_shutdown() is never reached.
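To illustrate the level-triggered behaviour described above, here is a minimal, Linux-only sketch. It is not Fluent Bit code: poll_timer() and its parameters are hypothetical names made up for this example. It arms a one-shot timerfd, registers it with epoll without EPOLLET, and counts how often epoll_wait() reports it, either draining the timer with read() or dropping the event the way the dispatcher does once the status has been reset.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

/*
 * Hypothetical helper (not part of Fluent Bit): poll a one-shot timerfd
 * `rounds` times and return how many times epoll reported it as readable.
 * drain != 0: read the expiration counter, which clears the readiness.
 * drain == 0: drop the event without reading, as a skipped handler would.
 */
static int poll_timer(int rounds, int drain)
{
    int i;
    int hits = 0;
    uint64_t expirations;
    struct epoll_event ev = {0};
    struct epoll_event out;
    struct itimerspec its = {0};
    int efd = epoll_create1(0);
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);

    if (efd < 0 || tfd < 0) {
        return -1;
    }

    its.it_value.tv_nsec = 1000000;   /* one-shot, expires after 1 ms */
    ev.events = EPOLLIN;              /* level-triggered: no EPOLLET */
    ev.data.fd = tfd;

    if (timerfd_settime(tfd, 0, &its, NULL) != 0 ||
        epoll_ctl(efd, EPOLL_CTL_ADD, tfd, &ev) != 0) {
        close(tfd);
        close(efd);
        return -1;
    }

    for (i = 0; i < rounds; i++) {
        /* 50 ms timeout so the drained case does not block forever */
        if (epoll_wait(efd, &out, 1, 50) == 1) {
            hits++;
            if (drain &&
                read(tfd, &expirations, sizeof(expirations)) < 0) {
                break;
            }
        }
    }

    close(tfd);
    close(efd);
    return hits;
}
```

Calling poll_timer(5, 0) should report the timer on every round, since the expiration is never consumed, while poll_timer(5, 1) should report it once and then time out on the remaining rounds. The undrained case is the shape of the spin: epoll_wait() keeps returning immediately with the same event.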

Fix: swallow duplicate STOP at the flb_engine_manager() boundary when config->is_shutting_down is already set (first STOP sets it via flb_engine_stop_ingestion()). Periodic shutdown work (flush, task drain, grace counter) is already driven by the 1s tick in the FLB_ENGINE_SHUTDOWN branch, so swallowing the duplicate is safe.

I considered putting the guard at each flb_engine_exit() call site or inside flb_engine_exit() itself, but chose to place it in flb_engine_manager() for the following reasons. I'd be happy to move it if maintainers prefer a different layer.

  • flb_engine_exit() has 12+ call sites (in_tail, in_exec, in_stdin, out_exit, filter_expect, flb_lib, winsvc, ...) that look like independent shutdown triggers, so a caller-side check looked prone to races: config->is_shutting_down is set by the engine loop rather than by flb_engine_exit(), so two producers could both observe FALSE and each send a STOP.
  • The engine loop is a single-threaded consumer, which seemed like a natural serialization point to make STOP handling idempotent without introducing new locks or atomics.
  • From what I can tell, the timerfd creation was already guarded (if (shutdown_fd <= 0)), so extending the same idempotency to the event_shutdown reset seemed consistent with that existing pattern.
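The reasoning in the bullets above can be sketched with a plain pipe standing in for ch_manager. This is a simplified model under stated assumptions, not the actual patch: struct demo_engine, demo_exit(), demo_manager() and DEMO_STOP are hypothetical names, and the message layout is assumed for illustration. It shows that a caller-side check does not stop the second STOP from being queued, because the flag only changes once the consumer handles the first one, while a check on the consumer side makes the handling idempotent.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_STOP 1ULL   /* hypothetical stand-in for FLB_ENGINE_STOP */

struct demo_engine {
    int ch[2];             /* pipe standing in for ch_manager */
    int is_shutting_down;  /* written only by the consumer */
    int timer_arms;        /* times the shutdown state was (re)initialized */
    int swallowed;         /* duplicate STOPs ignored by the consumer */
};

static int demo_init(struct demo_engine *e)
{
    e->is_shutting_down = 0;
    e->timer_arms = 0;
    e->swallowed = 0;
    return pipe(e->ch);
}

/*
 * Producer side. The check below is the caller-side guard that was
 * considered: it reads a flag that only the consumer updates, so two
 * producers that run before the consumer both see 0 and both send STOP.
 * Returns 1 if a STOP was queued, 0 if skipped, -1 on error.
 */
static int demo_exit(struct demo_engine *e)
{
    uint64_t msg = DEMO_STOP;

    if (e->is_shutting_down) {
        return 0;
    }
    if (write(e->ch[1], &msg, sizeof(msg)) != (ssize_t) sizeof(msg)) {
        return -1;
    }
    return 1;
}

/*
 * Consumer side. A single reader handles messages one at a time, so
 * checking the flag here needs no lock: the first STOP starts shutdown,
 * any later STOP is swallowed before it can reset existing state.
 */
static int demo_manager(struct demo_engine *e)
{
    uint64_t msg;

    if (read(e->ch[0], &msg, sizeof(msg)) != (ssize_t) sizeof(msg)) {
        return -1;
    }
    if (msg == DEMO_STOP) {
        if (e->is_shutting_down) {
            e->swallowed++;
            return 0;
        }
        e->is_shutting_down = 1;
        e->timer_arms++;
    }
    return 0;
}
```

If demo_exit() is called twice before demo_manager() runs, both calls queue a STOP. Two demo_manager() calls then leave timer_arms at 1 and swallowed at 1, which is the behaviour the guard is meant to provide.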

Fixes #11744


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • [N/A] Debug log output from testing the change
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

This is a shutdown timing race that is not reproducible via configuration, so a regression test is included instead:

  • tests/runtime/core_shutdown_spin.c reproduces the duplicate-STOP race via two back-to-back flb_engine_exit() calls and asserts flb_stop() returns within the grace period.
  • A SIGALRM watchdog (10s) bounds the wait so the regression surfaces quickly instead of hanging for the 1500s CTest default.
  • Verified: PASS in ~1.9s with fix, FAIL in ~11s without fix (watchdog fires).
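For readers unfamiliar with the watchdog pattern mentioned above, a minimal POSIX sketch follows. It is not the code from core_shutdown_spin.c: watchdog_arm(), watchdog_disarm() and the message text are hypothetical, and only the general approach (SIGALRM, sigaction(), alarm(), write() to STDERR_FILENO, _exit()) follows what the PR describes.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Signal handler: runs if the alarm expires before it is cancelled.
 * Only async-signal-safe calls are used here (write and _exit).
 */
static void timeout_abort(int sig)
{
    static const char msg[] = "FAIL: shutdown did not finish in time\n";

    (void) sig;
    if (write(STDERR_FILENO, msg, sizeof(msg) - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
    _exit(1);
}

/* Install the handler and start a one-shot alarm of `seconds` seconds. */
static int watchdog_arm(unsigned int seconds)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timeout_abort;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGALRM, &sa, NULL) != 0) {
        return -1;
    }
    alarm(seconds);
    return 0;
}

/* Cancel the pending alarm; returns the seconds that were still left. */
static unsigned int watchdog_disarm(void)
{
    return alarm(0);
}
```

A test would call watchdog_arm(10) before the blocking shutdown call and watchdog_disarm() right after it returns. If the call never returns, the handler ends the process with exit code 1 instead of leaving the run to the CTest timeout.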

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Prevented duplicate shutdown handling to avoid a busy-loop or hang when stop/exit is triggered multiple times, ensuring graceful shutdown.
  • Tests

    • Added a regression test (non-Windows) that verifies rapid consecutive stop/exit calls are handled without spinning and complete within a watchdog time limit.


coderabbitai Bot commented Apr 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d484b261-8350-405c-aef3-f78b608a793a

📥 Commits

Reviewing files that changed from the base of the PR and between cbd652c and 00a1c2b.

📒 Files selected for processing (2)
  • tests/runtime/CMakeLists.txt
  • tests/runtime/core_shutdown_spin.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/runtime/core_shutdown_spin.c

📝 Walkthrough

Walkthrough

Adds a guard in the engine STOP handling to skip re-entrance when shutdown is already in progress, and adds a POSIX-only runtime test that calls shutdown twice under a watchdog to ensure the engine does not busy-loop.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Engine Core: src/flb_engine.c | Added an early config->is_shutting_down check in the FLB_ENGINE_STOP handling path to return immediately if shutdown is active, preventing repeated flush/timer state resets. |
| Runtime Test: tests/runtime/CMakeLists.txt, tests/runtime/core_shutdown_spin.c | Registered a POSIX-only test executable and added duplicate_stop_no_spin, which issues two consecutive flb_engine_exit calls with a SIGALRM watchdog to assert shutdown completes within a time limit. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller (SIGTERM / plugin)
    participant Engine as Engine (flb_engine_manager)
    participant Config as Config (config->is_shutting_down)
    participant EventLoop as EventLoop/Timerfd

    Caller->>Engine: send FLB_ENGINE_STOP
    Engine->>Config: read is_shutting_down
    alt not shutting down
        Engine->>EventLoop: register shutdown timer / set shutting flag
        Engine->>Engine: perform flush and shutdown sequence
    else already shutting down
        Engine-->>Caller: return 0 (no-op)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

backport to v4.0.x

Suggested reviewers

  • edsiper

Poem

🐰 A double-tap once made me spin,
I hopped and checked the timer's din.
Now one tap sets the gentle pace,
No busy-loop to chase or race.
Quiet fields, and graceful grace. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: preventing duplicate STOP events from causing shutdown spin. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #11744 by implementing idempotent STOP handling via config->is_shutting_down check to prevent re-entrance. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the duplicate STOP issue: engine logic fix, test infrastructure setup, and regression test. |


Copilot AI left a comment


Pull request overview

This PR makes Fluent Bit’s engine shutdown path idempotent by ignoring duplicate FLB_ENGINE_STOP messages once shutdown has already started, preventing an epoll/timerfd busy-loop that can pin CPU and block termination (Fixes #11744).

Changes:

  • Ignore duplicate STOP events in flb_engine_manager() when config->is_shutting_down is already set.
  • Add a runtime regression test that triggers back-to-back STOP requests and asserts flb_stop() returns within the grace window.
  • Register the new runtime test in the runtime test CMake target list.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/flb_engine.c | Adds an idempotency guard for duplicate STOP messages to prevent shutdown spin. |
| tests/runtime/core_shutdown_spin.c | Introduces a regression test for duplicate STOP shutdown behavior. |
| tests/runtime/CMakeLists.txt | Registers the new runtime test for CTest execution. |


Comment thread tests/runtime/core_shutdown_spin.c
Comment thread tests/runtime/CMakeLists.txt Outdated
Comment thread src/flb_engine.c Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/runtime/core_shutdown_spin.c`:
- Around line 22-41: The test core_shutdown_spin.c uses POSIX-only APIs (alarm,
sigaction, SIGALRM, STDERR_FILENO, _exit, and the timeout_abort handler) and is
being registered unconditionally; update tests/runtime/CMakeLists.txt to guard
the test registration for core_shutdown_spin.c with the existing Windows check
pattern (wrap the add_test/add_executable lines for core_shutdown_spin.c inside
an if(NOT FLB_SYSTEM_WINDOWS) ... endif block) so the test (and its use of
timeout_abort, SIGALRM, alarm, STDERR_FILENO, _exit) is only added on
non-Windows platforms.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76542311-7e98-455b-af44-bc625fc54bc6

📥 Commits

Reviewing files that changed from the base of the PR and between 29deec9 and 839473d.

📒 Files selected for processing (3)
  • src/flb_engine.c
  • tests/runtime/CMakeLists.txt
  • tests/runtime/core_shutdown_spin.c

Comment thread tests/runtime/core_shutdown_spin.c

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 839473d39e


Comment thread tests/runtime/core_shutdown_spin.c
When FLB_ENGINE_STOP arrives more than once in quick succession (e.g.
an input plugin's internal flb_engine_exit() followed by an external
SIGTERM), the second invocation re-enters the STOP handler block in
flb_engine_start() and resets config->event_shutdown->status to
MK_EVENT_NONE while the shutdown timerfd is still registered in the
kernel's epoll set.

The event loop dispatcher then drops the timer event because of the
'status != MK_EVENT_NONE' guard in flb_event_load_bucket_queue(), but
the level-triggered timerfd keeps reporting EPOLLIN. The pipeline
thread busy-loops in epoll_wait() at 100% CPU, grace_count never
advances, and the process fails to terminate.

Swallow duplicate STOP messages at the flb_engine_manager() boundary
once shutdown is already in progress (config->is_shutting_down is set
by flb_engine_stop_ingestion() during the first STOP). The first STOP
arms the shutdown timer and drives the grace flow; any further STOPs
would only corrupt existing event state without benefit. Periodic
work during shutdown (flushing, task draining, grace counter) is
already handled by the 1s tick in the FLB_ENGINE_SHUTDOWN branch, so
swallowing the duplicate is safe.

Signed-off-by: jinyong.choi <inimax801@gmail.com>

Add core_shutdown_spin.c covering the duplicate-FLB_ENGINE_STOP
busy-spin bug fixed in the previous commit. The test builds a minimal
lib-input -> null-output pipeline, invokes flb_engine_exit() twice in
quick succession, and asserts that flb_stop() returns within the grace
period.

A SIGALRM watchdog (SHUTDOWN_WATCHDOG_SEC=10) bounds the wait: if the
guard regresses, pthread_join on the spinning worker never returns,
the handler aborts the process with a visible FAIL message and exit
code 1. This avoids relying on CTest's per-test timeout (1500s
default) and surfaces the regression quickly regardless of how the
binary is invoked.

Signed-off-by: jinyong.choi <inimax801@gmail.com>

Development

Successfully merging this pull request may close these issues.

engine: shutdown hangs at 100% CPU when duplicate STOP signals arrive

2 participants