filter_kubernetes: destroy TLS session to prevent SSL object leak#11730
filter_kubernetes: destroy TLS session to prevent SSL object leak#11730ShelbyZ wants to merge 2 commits intofluent:masterfrom
Conversation
📝 WalkthroughWalkthroughAdded explicit Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/filter_kubernetes/kubernetes_aws.c`:
- Around line 249-251: The calls to flb_tls_session_destroy(u_conn->tls_session)
are unconditional but that function is only available when FLB_HAVE_TLS is
defined; wrap each TLS-session cleanup (the blocks referencing
u_conn->tls_session) in `#ifdef` FLB_HAVE_TLS / `#endif` guards (same pattern used
in src/flb_upstream.c) so non-TLS builds won't reference
flb_tls_session_destroy; apply this to the three cleanup sites that currently
call flb_tls_session_destroy(u_conn->tls_session).
In `@tests/internal/upstream_tls.c`:
- Line 165: The test-list entry string and function name on the single line
exceed the 120-char limit; split the entry into two indented lines so the string
literal and the function identifier are on separate lines (e.g., keep
"tls_session_destroy_before_conn_release_prevents_double_free" on the first line
and place test_tls_session_destroy_before_conn_release_prevents_double_free on
the next indented line) to ensure the line length is under the limit while
preserving the array entry syntax.
- Around line 113-153: The test currently uses a stack-allocated struct
flb_connection (conn) so the dynamically_allocated flag prevents flb_free from
running and the test doesn't exercise the real cleanup path; change the test to
heap-allocate the connection (e.g. conn = flb_calloc(1, sizeof(struct
flb_connection)) and check non-NULL), update uses of conn to the pointer, set
conn->dynamically_allocated = FLB_TRUE before calling flb_upstream_conn_release
/ flb_upstream_conn_pending_destroy, and ensure any allocated resources are
closed/freed at the end so the test exercises the real dynamic destroy path in
flb_upstream_conn_pending_destroy().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 89969ef3-9bfc-4109-9e09-effce6d9c25e
📒 Files selected for processing (2)
plugins/filter_kubernetes/kubernetes_aws.ctests/internal/upstream_tls.c
Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>
…uble-free Signed-off-by: Shelby Hagman <shelbyzh@amazon.com>
Summary
Problem:
The fetch_pod_service_map background thread creates its own event loop but never drains the destroy_queue, so TLS sessions released via conn_release accumulate indefinitely — one per fetch interval.
Fix:
Explicitly call flb_tls_session_destroy before conn_release on all exit paths in fetch_pod_service_map. flb_tls_session_destroy self-nulls connection->tls_session, so the deferred destroy_conn safely skips it with no double-free.
Testing
Before we can approve your change; please submit the following in a comment:
Valgrind/fluent-bit - https://gist.github.com/ShelbyZ/35d555b041fd1693f60aabf7c05d3616
Early Testing
Valgrind reported only 157 bytes definitely lost — unchanged across all runs. The SSL* objects remained reachable via the destroy_queue at all times, so memcheck classified them as "still reachable" at exit rather than leaked. This class of growth is invisible to memcheck; it only shows up as unbounded RSS growth during runtime.
How it was discovered
Fluent Bit was built with jemalloc
--enable-profand run with:Mid-run heap diffs via jeprof --base isolated runtime growth from init allocations and pointed directly at update_pod_service_map → fetch_pod_service_map as 84–115% of heap growth. RSS comparison across four clusters confirmed the fix — all four showed negative RSS delta after the change.
Introduced changes to AWS for Fluent Bit debug image to capture heap/rss during run - aws/aws-for-fluent-bit#1115,
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Bug Fixes
Tests