
filter_kubernetes: destroy TLS session to prevent SSL object leak #11730

Open
ShelbyZ wants to merge 2 commits into fluent:master from ShelbyZ:filter-ssl-fix

ShelbyZ commented Apr 20, 2026

Summary

Problem:

The fetch_pod_service_map background thread creates its own event loop but never drains the destroy_queue, so TLS sessions released via conn_release accumulate indefinitely — one per fetch interval.

Fix:

Explicitly call flb_tls_session_destroy before conn_release on all exit paths in fetch_pod_service_map. flb_tls_session_destroy self-nulls connection->tls_session, so the deferred destroy_conn safely skips it with no double-free.
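
The accumulation and the fix can be pictured with a small standalone model. This is only a sketch under stated assumptions: `struct conn`, `struct tls_session`, `conn_create`, `run_fetches`, and `drain_destroy_queue` are hypothetical stand-ins, not the Fluent Bit types or functions, and the back-pointer used for self-nulling is an assumed layout. The model shows sessions piling up when the queue is never drained, and the live count staying at zero when the session is destroyed before release.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the Fluent Bit types; not the real structs. */
struct conn;

struct tls_session {
    struct conn *connection;      /* back-pointer used for self-nulling */
};

struct conn {
    struct tls_session *tls_session;
    struct conn *next;            /* link in the destroy queue */
};

static struct conn *destroy_queue = NULL;  /* models the undrained queue */
static int live_sessions = 0;              /* models allocated SSL objects */

static struct conn *conn_create(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    struct tls_session *s = calloc(1, sizeof(*s));

    assert(c != NULL && s != NULL);
    s->connection = c;
    c->tls_session = s;
    live_sessions++;
    return c;
}

/* Frees the session and clears the owner's pointer (the self-null step). */
static void tls_session_destroy(struct tls_session *s)
{
    if (s == NULL) {
        return;
    }
    s->connection->tls_session = NULL;
    free(s);
    live_sessions--;
}

/* Release only queues the connection; the real cleanup is deferred. */
static void conn_release(struct conn *c)
{
    c->next = destroy_queue;
    destroy_queue = c;
}

/* Deferred cleanup; skips sessions that were already destroyed. */
static void drain_destroy_queue(void)
{
    while (destroy_queue != NULL) {
        struct conn *c = destroy_queue;

        destroy_queue = c->next;
        tls_session_destroy(c->tls_session);  /* no-op when already NULL */
        free(c);
    }
}

/* Runs a number of fetch intervals and returns the live session count. */
static int run_fetches(int intervals, int destroy_before_release)
{
    for (int i = 0; i < intervals; i++) {
        struct conn *c = conn_create();

        if (destroy_before_release) {
            tls_session_destroy(c->tls_session);
        }
        conn_release(c);
    }
    return live_sessions;
}
```

In this model, calling `run_fetches(n, 0)` leaves `n` sessions alive until something drains the queue, while `run_fetches(n, 1)` leaves none. A later `drain_destroy_queue()` call is safe in both cases because the session pointer is already NULL after an explicit destroy.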

Testing
Before we can approve your change, please submit the following in a comment:

Valgrind/fluent-bit - https://gist.github.com/ShelbyZ/35d555b041fd1693f60aabf7c05d3616

Early Testing

Valgrind reported only 157 bytes definitely lost — unchanged across all runs. The SSL* objects remained reachable via the destroy_queue at all times, so memcheck classified them as "still reachable" at exit rather than leaked. This class of growth is invisible to memcheck; it only shows up as unbounded RSS growth during runtime.

How it was discovered

Fluent Bit was built with jemalloc --enable-prof and run with:

MALLOC_CONF=prof:true,prof_active:true,prof_thread_active_init:true,prof_leak:true,
prof_final:true,lg_prof_interval:32,prof_accum:true,prof_prefix:/tmp/jeprof,
lg_prof_sample:17,background_thread:true,abort_conf:true

Mid-run heap diffs via jeprof --base isolated runtime growth from init allocations and pointed directly at update_pod_service_map → fetch_pod_service_map as 84–115% of heap growth. RSS comparison across four clusters confirmed the fix — all four showed negative RSS delta after the change.

Changes to the AWS for Fluent Bit debug image that capture heap and RSS data during a run were introduced in aws/aws-for-fluent-bit#1115.

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Strengthened TLS session cleanup in the Kubernetes AWS filter to ensure sessions are reliably destroyed on error and normal paths, preventing resource leaks and potential crashes.
  • Tests

    • Added tests validating TLS session lifecycle and ensuring session destruction is performed exactly once to avoid double-free and stability regressions.

coderabbitai Bot commented Apr 20, 2026

Walkthrough

Added explicit flb_tls_session_destroy() calls in Kubernetes AWS plugin error and cleanup paths to ensure TLS sessions are freed reliably. Expanded upstream/TLS tests with a shared connection setup helper and a new test that verifies no double-free when a TLS session is destroyed before connection release.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kubernetes AWS plugin: plugins/filter_kubernetes/kubernetes_aws.c | Call flb_tls_session_destroy(u_conn->tls_session) (with null checks) on additional error/cleanup paths: HTTP client creation failure, HTTP request failure/non-200 response, and after normal response parsing. Updated in-code comment about TLS draining behavior. |
| Upstream TLS tests: tests/internal/upstream_tls.c | Added destroy_calls counter and test_session_destroy() callback; introduced setup_conn() helper to centralize test connection creation; refactored an existing test and added test_tls_session_destroy_no_double_free() which heap-allocates flb_connection/flb_tls_session, calls flb_tls_session_destroy() before release, and asserts a single backend destroy invocation. Updated TEST_LIST. |
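
The shape of the new test can be illustrated with a standalone sketch. Everything here is an assumption made for illustration: `mock_conn`, `mock_session`, `count_destroy`, `mock_session_destroy`, `mock_conn_pending_destroy`, and `run_no_double_free_case` are hypothetical names, and the code does not use the Fluent Bit test harness or its real structs. It only shows the idea of counting backend destroy calls and checking that destroying the session before release yields exactly one call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mock types; not struct flb_connection or flb_tls_session. */
struct mock_conn;

struct mock_session {
    struct mock_conn *connection;
    void (*backend_destroy)(struct mock_session *);
};

struct mock_conn {
    struct mock_session *tls_session;
    int dynamically_allocated;   /* when set, the destroy path frees the conn */
};

static int destroy_calls = 0;

/* Callback that only counts how often the backend destroy hook runs. */
static void count_destroy(struct mock_session *s)
{
    (void) s;
    destroy_calls++;
}

/* Invokes the backend hook, clears the owner's pointer, frees the session. */
static void mock_session_destroy(struct mock_session *s)
{
    if (s == NULL) {
        return;
    }
    s->backend_destroy(s);
    s->connection->tls_session = NULL;
    free(s);
}

/* Models the deferred destroy: cleans up any remaining session and the conn. */
static void mock_conn_pending_destroy(struct mock_conn *c)
{
    mock_session_destroy(c->tls_session);   /* skipped when already NULL */
    if (c->dynamically_allocated) {
        free(c);
    }
}

/* Returns the number of backend destroy calls for the early-destroy case. */
static int run_no_double_free_case(void)
{
    struct mock_conn *c = calloc(1, sizeof(*c));
    struct mock_session *s = calloc(1, sizeof(*s));

    assert(c != NULL && s != NULL);
    destroy_calls = 0;
    c->dynamically_allocated = 1;
    s->connection = c;
    s->backend_destroy = count_destroy;
    c->tls_session = s;

    mock_session_destroy(c->tls_session);   /* explicit destroy first */
    mock_conn_pending_destroy(c);           /* deferred path must not repeat it */
    return destroy_calls;
}
```

Heap-allocating both objects matters in this sketch because the deferred path then really frees the connection, so a second destroy of the session would be a genuine double free rather than a no-op on stack memory.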

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

backport to v4.2.x

Suggested reviewers

  • edsiper
  • cosmo0920

Poem

🐰 I hopped through code at break of day,
Destroyed a session safely on my way,
No double frees to make me fret,
Cleanup neat — no memory debt! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'filter_kubernetes: destroy TLS session to prevent SSL object leak' directly and accurately summarizes the main change: explicit TLS session destruction in the filter_kubernetes module to prevent SSL object accumulation. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/filter_kubernetes/kubernetes_aws.c`:
- Around line 249-251: The calls to flb_tls_session_destroy(u_conn->tls_session)
are unconditional, but that function is only available when FLB_HAVE_TLS is
defined. Wrap each TLS-session cleanup (the blocks referencing
u_conn->tls_session) in `#ifdef FLB_HAVE_TLS` / `#endif` guards (the same
pattern used in src/flb_upstream.c) so that non-TLS builds do not reference
flb_tls_session_destroy. Apply this to the three cleanup sites that currently
call flb_tls_session_destroy(u_conn->tls_session).

In `@tests/internal/upstream_tls.c`:
- Line 165: The test-list entry string and function name on the single line
exceed the 120-char limit; split the entry into two indented lines so the string
literal and the function identifier are on separate lines (e.g., keep
"tls_session_destroy_before_conn_release_prevents_double_free" on the first line
and place test_tls_session_destroy_before_conn_release_prevents_double_free on
the next indented line) to ensure the line length is under the limit while
preserving the array entry syntax.
- Around line 113-153: The test currently uses a stack-allocated struct
flb_connection (conn) so the dynamically_allocated flag prevents flb_free from
running and the test doesn't exercise the real cleanup path; change the test to
heap-allocate the connection (e.g. conn = flb_calloc(1, sizeof(struct
flb_connection)) and check non-NULL), update uses of conn to the pointer, set
conn->dynamically_allocated = FLB_TRUE before calling flb_upstream_conn_release
/ flb_upstream_conn_pending_destroy, and ensure any allocated resources are
closed/freed at the end so the test exercises the real dynamic destroy path in
flb_upstream_conn_pending_destroy().
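
The `FLB_HAVE_TLS` guard requested for kubernetes_aws.c could look roughly like the following. This is a hedged sketch, not the PR's code: `demo_conn`, `demo_tls_session_destroy`, and `demo_cleanup` are hypothetical names, and the macro is defined inside the snippet only so that the TLS branch compiles here. In a real build it would come from the build configuration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: defined here only so the sketch exercises the TLS branch.
 * In a real build the macro comes from the build configuration.
 */
#define FLB_HAVE_TLS 1

/* Hypothetical connection type; not the real struct flb_connection. */
struct demo_conn {
#ifdef FLB_HAVE_TLS
    void *tls_session;
#endif
    int released;
};

#ifdef FLB_HAVE_TLS
static int sessions_destroyed = 0;

/* Hypothetical stand-in for flb_tls_session_destroy(). */
static void demo_tls_session_destroy(struct demo_conn *c)
{
    c->tls_session = NULL;
    sessions_destroyed++;
}
#endif

/* Stand-in for one cleanup site: TLS teardown is compiled only with TLS. */
static void demo_cleanup(struct demo_conn *c)
{
    assert(c != NULL);
#ifdef FLB_HAVE_TLS
    if (c->tls_session != NULL) {
        demo_tls_session_destroy(c);
    }
#endif
    c->released = 1;   /* models the conn_release step */
}
```

With the `#define` removed, the same `demo_cleanup` body still compiles and never references the TLS symbol, which is the property the review comment asks for.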

📥 Commits

Reviewing files that changed from the base of the PR and between 29deec9 and b7bfb0c.

📒 Files selected for processing (2)
  • plugins/filter_kubernetes/kubernetes_aws.c
  • tests/internal/upstream_tls.c

Comment thread plugins/filter_kubernetes/kubernetes_aws.c
Comment thread tests/internal/upstream_tls.c Outdated
Comment thread tests/internal/upstream_tls.c Outdated
ShelbyZ added 2 commits April 21, 2026 02:01
Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>
…uble-free

Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>
ShelbyZ changed the title from "filter_kubernetes: destroy TLS session explicitly to prevent SSL object accumulation in background thread" to "filter_kubernetes: destroy TLS session to prevent SSL object leak" on Apr 21, 2026