filter_kubernetes: destroy TLS session to prevent SSL object leak by ShelbyZ · Pull Request #11730 · fluent/fluent-bit

ShelbyZ · 2026-04-20T23:45:14Z

Summary

Problem:

The fetch_pod_service_map background thread creates its own event loop but never drains the destroy_queue, so TLS sessions released via conn_release accumulate indefinitely — one per fetch interval.

Fix:

Explicitly call flb_tls_session_destroy before conn_release on all exit paths in fetch_pod_service_map. flb_tls_session_destroy self-nulls connection->tls_session, so the deferred destroy_conn safely skips it with no double-free.
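
The self-nulling behaviour described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `demo_conn`, `demo_session`, `demo_session_destroy` and `demo_conn_deferred_destroy` are hypothetical stand-ins, not the Fluent Bit types or functions, and the back-pointer used to clear the owner is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the Fluent Bit structs. */
struct demo_conn;

struct demo_session {
    struct demo_conn *owner;          /* back-pointer used to self-null */
};

struct demo_conn {
    struct demo_session *tls_session;
};

static int backend_destroy_calls = 0; /* counts real teardown work */

/* Frees the session and clears the owner's pointer to it. */
static void demo_session_destroy(struct demo_session *s)
{
    if (s == NULL) {
        return;
    }
    if (s->owner != NULL) {
        s->owner->tls_session = NULL;
    }
    backend_destroy_calls++;
    free(s);
}

/* Models the deferred connection teardown: a NULL session is skipped. */
static void demo_conn_deferred_destroy(struct demo_conn *c)
{
    assert(c != NULL);
    if (c->tls_session != NULL) {
        demo_session_destroy(c->tls_session);
    }
    free(c);
}

/*
 * Destroy the session eagerly, then run the deferred teardown.
 * Returns the number of backend destroys, or -1 on setup failure.
 */
static int demo_eager_then_deferred(void)
{
    struct demo_conn *c = calloc(1, sizeof(*c));
    struct demo_session *s = calloc(1, sizeof(*s));
    int nulled;

    if (c == NULL || s == NULL) {
        free(c);
        free(s);
        return -1;
    }
    s->owner = c;
    c->tls_session = s;

    backend_destroy_calls = 0;
    demo_session_destroy(c->tls_session); /* eager destroy, self-nulls */
    nulled = (c->tls_session == NULL);
    demo_conn_deferred_destroy(c);        /* sees NULL, skips the session */
    return nulled ? backend_destroy_calls : -1;
}
```

In this model the eager call performs the only teardown, so the later deferred pass has nothing left to free and cannot double-free.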

Testing
Before we can approve your change, please submit the following in a comment:

Example configuration file for the change - using EKS Addon - https://github.com/aws-observability/helm-charts/blob/main/charts/amazon-cloudwatch-observability/values.yaml
[N/A] Debug log output from testing the change - logs from the long run were too large to attach (15 GB) 👎
Attached Valgrind output that shows no leaks or memory corruption was found

Valgrind/fluent-bit - https://gist.github.com/ShelbyZ/35d555b041fd1693f60aabf7c05d3616

Early Testing

Valgrind reported only 157 bytes definitely lost — unchanged across all runs. The SSL* objects remained reachable via the destroy_queue at all times, so memcheck classified them as "still reachable" at exit rather than leaked. This class of growth is invisible to memcheck; it only shows up as unbounded RSS growth during runtime.
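
To make the "still reachable" point concrete, here is a minimal sketch of a deferred-destroy queue that is filled every interval but never drained. All names (`pending`, `destroy_queue`, `enqueue_for_destroy`, `drain_destroy_queue`, `simulate_intervals`) are hypothetical and only model the behaviour described above; the 64-byte allocation merely stands in for an SSL object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a deferred-destroy queue; not Fluent Bit code. */
struct pending {
    struct pending *next;
    void *payload;
};

/* A global root: everything linked here stays reachable until drained. */
static struct pending *destroy_queue = NULL;
static size_t queued = 0;

/* Defer destruction of payload by parking it on the queue. */
static int enqueue_for_destroy(void *payload)
{
    struct pending *node;

    assert(payload != NULL);
    node = malloc(sizeof(*node));
    if (node == NULL) {
        return -1;
    }
    node->payload = payload;
    node->next = destroy_queue;
    destroy_queue = node;
    queued++;
    return 0;
}

/* Free everything parked on the queue; returns how many entries were freed. */
static size_t drain_destroy_queue(void)
{
    size_t freed = 0;

    while (destroy_queue != NULL) {
        struct pending *node = destroy_queue;
        destroy_queue = node->next;
        free(node->payload);
        free(node);
        freed++;
    }
    queued = 0;
    return freed;
}

/* Run n "fetch intervals"; returns the queue length left behind. */
static size_t simulate_intervals(size_t n, int drain_each_interval)
{
    size_t i;

    for (i = 0; i < n; i++) {
        void *session = malloc(64); /* stands in for an SSL object */
        if (session == NULL) {
            break;
        }
        if (enqueue_for_destroy(session) != 0) {
            free(session);
            break;
        }
        if (drain_each_interval) {
            drain_destroy_queue();
        }
    }
    return queued;
}
```

Without draining, the queue grows by one entry per interval, yet every entry is still linked from the global root, which is why a leak checker that only looks for lost pointers at exit would not flag it.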

How it was discovered

Fluent Bit was built with jemalloc --enable-prof and run with:

MALLOC_CONF=prof:true,prof_active:true,prof_thread_active_init:true,prof_leak:true,
prof_final:true,lg_prof_interval:32,prof_accum:true,prof_prefix:/tmp/jeprof,
lg_prof_sample:17,background_thread:true,abort_conf:true

Mid-run heap diffs via jeprof --base isolated runtime growth from init allocations and pointed directly at update_pod_service_map → fetch_pod_service_map as 84–115% of heap growth. RSS comparison across four clusters confirmed the fix — all four showed negative RSS delta after the change.

Changes to the AWS for Fluent Bit debug image that capture heap and RSS during a run were introduced in aws/aws-for-fluent-bit#1115.

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

[N/A] Run local packaging test showing all targets (including any new ones) build.
[N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

[N/A] Documentation required for this feature

Backporting

[N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

Bug Fixes
- Strengthened TLS session cleanup in the Kubernetes AWS filter to ensure sessions are reliably destroyed on error and normal paths, preventing resource leaks and potential crashes.
Tests
- Added tests validating TLS session lifecycle and ensuring session destruction is performed exactly once to avoid double-free and stability regressions.

coderabbitai · 2026-04-20T23:45:40Z

📝 Walkthrough

Added explicit flb_tls_session_destroy() calls in Kubernetes AWS plugin error and cleanup paths to ensure TLS sessions are freed reliably. Expanded upstream/TLS tests with a shared connection setup helper and a new test that verifies no double-free when a TLS session is destroyed before connection release.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kubernetes AWS plugin: `plugins/filter_kubernetes/kubernetes_aws.c` | Call `flb_tls_session_destroy(u_conn->tls_session)` (with null checks) on additional error/cleanup paths: HTTP client creation failure, HTTP request failure/non-200 response, and after normal response parsing. Updated in-code comment about TLS draining behavior. |
| Upstream TLS tests: `tests/internal/upstream_tls.c` | Added `destroy_calls` counter and `test_session_destroy()` callback; introduced `setup_conn()` helper to centralize test connection creation; refactored an existing test and added `test_tls_session_destroy_no_double_free()` which heap-allocates `flb_connection`/`flb_tls_session`, calls `flb_tls_session_destroy()` before release, and asserts a single backend destroy invocation. Updated `TEST_LIST`. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

- filter_kubernetes: don't recycle connections in fetch_pod_service_map #11600: Adjusts fetch_pod_service_map upstream/TLS teardown sequencing to avoid TLS session leaks and double-frees.
- filter_kubernetes: Adjust cleanup ordering to avoid use-after-free [4.2 backport] #11445: Changes teardown ordering in Kubernetes AWS TLS handling to prevent use-after-free of TLS/session resources.
- upstream: Move clearing TLS session from prepare phase to destroy phase #10886: Modifies TLS session lifecycle and destruction timing to prevent double-free/UAF issues.

Suggested labels

backport to v4.2.x

Suggested reviewers

edsiper
cosmo0920

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'filter_kubernetes: destroy TLS session to prevent SSL object leak' directly and accurately summarizes the main change: explicit TLS session destruction in the filter_kubernetes module to prevent SSL object accumulation. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/filter_kubernetes/kubernetes_aws.c`:
- Around line 249-251: The calls to flb_tls_session_destroy(u_conn->tls_session)
are unconditional but that function is only available when FLB_HAVE_TLS is
defined; wrap each TLS-session cleanup (the blocks referencing
u_conn->tls_session) in `#ifdef FLB_HAVE_TLS` / `#endif` guards (same pattern used
in src/flb_upstream.c) so non-TLS builds won't reference
flb_tls_session_destroy; apply this to the three cleanup sites that currently
call flb_tls_session_destroy(u_conn->tls_session).

In `@tests/internal/upstream_tls.c`:
- Line 165: The test-list entry string and function name on the single line
exceed the 120-char limit; split the entry into two indented lines so the string
literal and the function identifier are on separate lines (e.g., keep
"tls_session_destroy_before_conn_release_prevents_double_free" on the first line
and place test_tls_session_destroy_before_conn_release_prevents_double_free on
the next indented line) to ensure the line length is under the limit while
preserving the array entry syntax.
- Around line 113-153: The test currently uses a stack-allocated struct
flb_connection (conn) so the dynamically_allocated flag prevents flb_free from
running and the test doesn't exercise the real cleanup path; change the test to
heap-allocate the connection (e.g. conn = flb_calloc(1, sizeof(struct
flb_connection)) and check non-NULL), update uses of conn to the pointer, set
conn->dynamically_allocated = FLB_TRUE before calling flb_upstream_conn_release
/ flb_upstream_conn_pending_destroy, and ensure any allocated resources are
closed/freed at the end so the test exercises the real dynamic destroy path in
flb_upstream_conn_pending_destroy().
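
The guard pattern requested in the first inline comment might look roughly like the sketch below. It is an illustration only: `DEMO_HAVE_TLS`, `demo_conn`, `demo_tls_session_destroy` and `demo_cleanup` are hypothetical names standing in for a build flag such as `FLB_HAVE_TLS` and the plugin's cleanup code, and the destroy helper only clears the pointer instead of calling a real TLS backend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature macro standing in for a flag such as FLB_HAVE_TLS. */
#define DEMO_HAVE_TLS 1

/* Hypothetical connection type; not the Fluent Bit struct. */
struct demo_conn {
    void *tls_session;
    int released;
};

#ifdef DEMO_HAVE_TLS
/* Only compiled when TLS support is built in. */
static void demo_tls_session_destroy(struct demo_conn *c)
{
    c->tls_session = NULL; /* a real backend would free the session here */
}
#endif

/* Cleanup path: the TLS call is referenced only inside the guard. */
static int demo_cleanup(struct demo_conn *c)
{
    assert(c != NULL);
#ifdef DEMO_HAVE_TLS
    if (c->tls_session != NULL) {
        demo_tls_session_destroy(c);
    }
#endif
    c->released = 1;
    return c->released;
}
```

Removing the `#define` in this sketch compiles out both the helper and its call site, so a build without TLS never references the destroy function.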

📥 Commits

Reviewing files that changed from the base of the PR and between 29deec9 and b7bfb0c.

📒 Files selected for processing (2)

plugins/filter_kubernetes/kubernetes_aws.c
tests/internal/upstream_tls.c

ShelbyZ requested review from cosmo0920 and edsiper as code owners April 20, 2026 23:45

github-actions Bot added the docs-required label Apr 20, 2026

ShelbyZ temporarily deployed to pr April 20, 2026 23:45 — with GitHub Actions Inactive

coderabbitai Bot reviewed Apr 20, 2026

Comment thread plugins/filter_kubernetes/kubernetes_aws.c

Comment thread tests/internal/upstream_tls.c Outdated

Comment thread tests/internal/upstream_tls.c Outdated

ShelbyZ temporarily deployed to pr April 21, 2026 00:16 — with GitHub Actions Inactive

ShelbyZ added 2 commits April 21, 2026 02:01

filter_kubernetes: destroy TLS session to prevent SSL object leak

044568c

Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>

tests: internal: upstream_tls: verify TLS session destroy prevents double-free

4379b9f

Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>

ShelbyZ force-pushed the filter-ssl-fix branch from b7bfb0c to 4379b9f Compare April 21, 2026 02:04

ShelbyZ changed the title ~~filter_kubernetes: destroy TLS session explicitly to prevent SSL object accumulation in background thread~~ filter_kubernetes: destroy TLS session to prevent SSL object leak Apr 21, 2026

ShelbyZ temporarily deployed to pr April 21, 2026 02:04 — with GitHub Actions Inactive

ShelbyZ temporarily deployed to pr April 21, 2026 02:24 — with GitHub Actions Inactive

cosmo0920 approved these changes Apr 21, 2026

cosmo0920 added this to the Fluent Bit v5.0.4 milestone Apr 21, 2026

ShelbyZ mentioned this pull request Apr 22, 2026

filter_kubernetes: destroy upstream and TLS context on happy path exit #11738

Open

1 task

singholt approved these changes Apr 22, 2026

cosmo0920 mentioned this pull request Apr 23, 2026

Kubernetes memory leak with tail input plugin, http and es output plugins #10974

Open

ShelbyZ wants to merge 2 commits into fluent:master from ShelbyZ:filter-ssl-fix