fix(tracing): properly shutdown tracer provider by siavashs · Pull Request #5131 · prometheus/alertmanager

siavashs · 2026-03-30T10:02:44Z

Fix tracer provider reloads in different scenarios:

disabled -> enabled: install new tracer provider
enabled -> enabled: keep provider up and swap with new one
enabled -> disabled: shutdown old tracer provider, use noop provider
build failure: keep current provider

Pull Request Checklist

Please check all the applicable boxes.

Which user-facing changes does this PR introduce?

Summary by CodeRabbit

Refactor
- Improved tracing configuration flow to serialize enable/disable, safely handle reconfiguration, and avoid races during provider replacement and shutdown.
- Simplified tracer provider shutdown handling.
Bug Fixes
- Shutdown errors are now logged as warnings and no longer interrupt configuration or stop operations.
Tests
- Added tests for tracing enable/disable, reloads, invalid config handling, and state preservation.

coderabbitai · 2026-03-30T10:03:02Z

📝 Walkthrough

Walkthrough

Manager now serializes tracer-provider install/uninstall and shutdown with a mutex. ApplyConfig swaps global provider to noop when disabling or builds-and-swaps when enabling, then shuts down the previous provider while logging shutdown errors as warnings. Stop also serializes and disables before shutdown. Tests cover enable/disable, reload, and failure preservation.

Changes

Cohort / File(s)	Summary
Tracer provider lifecycle `tracing/tracing.go`	Add `mtx sync.Mutex` to serialize `ApplyConfig` and `Stop`; disable flow swaps global provider to noop then shuts down previous provider; enable/reconfigure builds new provider before swapping, updates state, then shuts down previous provider; shutdown errors logged as warnings; simplify shutdown wrapper.
Tracing tests `tracing/tracing_test.go`	Add tests covering noop-span behavior when tracing disabled, enabled behavior after ApplyConfig, `tracingEnabled` transitions, reload behavior (provider swap), and apply-failure preserving prior provider/shutdownFunc; add imports (`context`, `reflect`, `attribute`).

Sequence Diagram(s)

sequenceDiagram
    participant Manager as Manager (ApplyConfig / Stop)
    participant Builder as buildTracerProvider
    participant Global as Global TracerProvider
    participant Prev as Previous TracerProvider

    Manager->>Manager: lock mtx
    alt Enable / Reconfigure
        Manager->>Builder: build new tracer provider (returns TP + shutdown)
        Builder-->>Manager: new TP + shutdown func
        Manager->>Global: swap global TP -> new TP
        Manager->>Prev: record previous TP
        Manager->>Prev: Prev.Shutdown(ctx) (log warnings on error)
        Manager->>Manager: set tracingEnabled=true, set shutdownFunc, update config
    else Disable
        Manager->>Global: swap global TP -> noop TP
        Manager->>Prev: record previous TP
        Manager->>Prev: Prev.Shutdown(ctx) (log warnings on error)
        Manager->>Manager: set tracingEnabled=false, clear shutdownFunc, update config
    end
    Manager->>Manager: unlock mtx

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix(tracing): use noop spans when tracing disabled #5118: Modifies tracing Manager logic (ApplyConfig / tracing enable-disable) in the same area.

Suggested labels

kind/enhancement

Suggested reviewers

ultrotter
TheMeier

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description clearly outlines the fix across multiple scenarios, the bugfix checklist is marked as incomplete (no tests checkbox selected), and the release-notes section is empty despite this being a bugfix affecting behavior.	Mark the bugfix checkbox indicating tests were added (5 new test cases are present), and provide a release-notes entry using [BUGFIX] prefix to document the tracer provider shutdown behavior changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(tracing): properly shutdown tracer provider' directly aligns with the core change: fixing tracer provider shutdown behavior during reloads in various scenarios.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tracing/tracing.go (1)

103-136: Please pin the reload state machine with focused tests.

This method now has four intentionally different transitions: disabled→enabled, enabled→enabled, enabled→disabled, and build failure keeping the current provider. A table-driven test around ApplyConfig would make future ordering changes here much safer.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tracing/tracing.go` around lines 103 - 136, Add table-driven unit tests for
ApplyConfig that pin the four state transitions: disabled→enabled,
enabled→enabled, enabled→disabled, and build failure (which must leave the
current provider/config intact). For each case, arrange initial state on the
manager (tracingEnabled, m.config, m.shutdownFunc) and stub or mock
buildTracerProvider to return a new tracer provider+shutdown or an error; then
call ApplyConfig and assert the expected outcomes: whether tracingEnabled was
toggled, otel tracer provider swapped, m.shutdownFunc set/cleared, old shutdown
invoked on swap, config updated on success, and that on build failure the
previous provider/config/shutdownFunc remain unchanged. Use table-driven
subtests to enumerate cases so future reorderings of ApplyConfig are covered.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tracing/tracing.go`:
- Around line 125-136: Add a sync.Mutex to the Manager struct and use it to
serialize all reads/writes of m.shutdownFunc and m.config in ApplyConfig and
Stop: in ApplyConfig lock the mutex before reading oldShutdown, setting
otel.SetTracerProvider(tp), setting tracingEnabled, assigning m.shutdownFunc =
shutdownFunc and m.config = cfg, then unlock before calling oldShutdown(); in
Stop lock the mutex while checking and clearing m.shutdownFunc and m.config
(capture the shutdownFunc into a local variable), clear tracingEnabled and reset
provider, unlock, then call the captured shutdownFunc outside the lock to avoid
double shutdown and races; ensure all references to m.shutdownFunc and m.config
in both methods are protected by this mutex.

---

Nitpick comments:
In `@tracing/tracing.go`:
- Around line 103-136: Add table-driven unit tests for ApplyConfig that pin the
four state transitions: disabled→enabled, enabled→enabled, enabled→disabled, and
build failure (which must leave the current provider/config intact). For each
case, arrange initial state on the manager (tracingEnabled, m.config,
m.shutdownFunc) and stub or mock buildTracerProvider to return a new tracer
provider+shutdown or an error; then call ApplyConfig and assert the expected
outcomes: whether tracingEnabled was toggled, otel tracer provider swapped,
m.shutdownFunc set/cleared, old shutdown invoked on swap, config updated on
success, and that on build failure the previous provider/config/shutdownFunc
remain unchanged. Use table-driven subtests to enumerate cases so future
reorderings of ApplyConfig are covered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a7d5f532-b15c-49c4-b1fd-ed1d86ae3035

📥 Commits

Reviewing files that changed from the base of the PR and between b8725fe and eadfab0.

📒 Files selected for processing (1)

tracing/tracing.go

tracing/tracing.go

Fix tracer provider reloads in different scenarios: - disabled -> enabled: install new tracer provider - enabled -> enabled: keep provider up and swap with new one - enabled -> disabled: shutdown old tracer provider, use noop provider - build failure: keep current provider Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

🧹 Nitpick comments (2)

tracing/tracing_test.go (2)
146-147: Consider adding a comment explaining the pointer comparison technique.

The fmt.Sprintf("%p", ...) approach for comparing function pointers works correctly but is non-obvious. A brief comment would help future readers understand why value equality (require.Equal(t, shutdownBefore, m.shutdownFunc)) isn't used (functions aren't directly comparable in Go).
📝 Suggested documentation
 	require.NotNil(t, m.shutdownFunc)
-	// shutdownFunc should be the same as before the failed apply.
+	// shutdownFunc should be the same instance as before the failed apply.
+	// Functions aren't directly comparable in Go, so we compare pointer addresses.
 	require.Equal(t, fmt.Sprintf("%p", shutdownBefore), fmt.Sprintf("%p", m.shutdownFunc))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tracing/tracing_test.go` around lines 146 - 147, Add a brief inline comment
above the comparison that explains why fmt.Sprintf("%p", ...) is used to compare
function pointers (e.g., functions aren't directly comparable in Go) and that
the intent is to verify that shutdownBefore and m.shutdownFunc refer to the same
function address; reference the symbols shutdownBefore and m.shutdownFunc and
place the comment immediately before the require.Equal call so future readers
understand why value equality isn't used.
54-57: Cleanup should shut down the tracer provider to avoid resource leaks.

The cleanup sets m.shutdownFunc = nil without calling it first. This leaves the tracer provider's resources (e.g., exporters, connections) uncleaned. Compare with TestTracerProviderShutdown at line 246, which properly calls m.Stop().
♻️ Suggested cleanup pattern
 	t.Cleanup(func() {
+		m.Stop()
 		tracingEnabled.Store(false)
-		m.shutdownFunc = nil
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tracing/tracing_test.go` around lines 54 - 57, The cleanup currently just
sets tracingEnabled.Store(false) and m.shutdownFunc = nil, leaking tracer
resources; change the cleanup to invoke the tracer shutdown path before nil-ing
it (call m.Stop() or, if Stop is not exposed, call the stored m.shutdownFunc()
and handle its error) so the tracer provider and exporters are properly shut
down, then set m.shutdownFunc = nil; reference symbols: tracingEnabled,
m.shutdownFunc, and the existing TestTracerProviderShutdown pattern (m.Stop())
to mirror the correct shutdown behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tracing/tracing_test.go`:
- Around line 146-147: Add a brief inline comment above the comparison that
explains why fmt.Sprintf("%p", ...) is used to compare function pointers (e.g.,
functions aren't directly comparable in Go) and that the intent is to verify
that shutdownBefore and m.shutdownFunc refer to the same function address;
reference the symbols shutdownBefore and m.shutdownFunc and place the comment
immediately before the require.Equal call so future readers understand why value
equality isn't used.
- Around line 54-57: The cleanup currently just sets tracingEnabled.Store(false)
and m.shutdownFunc = nil, leaking tracer resources; change the cleanup to invoke
the tracer shutdown path before nil-ing it (call m.Stop() or, if Stop is not
exposed, call the stored m.shutdownFunc() and handle its error) so the tracer
provider and exporters are properly shut down, then set m.shutdownFunc = nil;
reference symbols: tracingEnabled, m.shutdownFunc, and the existing
TestTracerProviderShutdown pattern (m.Stop()) to mirror the correct shutdown
behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 19ef68e4-471c-42b6-a95e-4257e67a7623

📥 Commits

Reviewing files that changed from the base of the PR and between eadfab0 and fd68082.

📒 Files selected for processing (2)

tracing/tracing.go
tracing/tracing_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

tracing/tracing.go

TheMeier

LGTM

tracing/tracing.go

tracing/tracing_test.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tracing/tracing_test.go (1)
54-57: Consider calling the shutdown function before clearing it.

The cleanup sets m.shutdownFunc = nil without invoking it, which skips the provider's graceful shutdown. While this likely doesn't cause issues in tests (no actual OTLP endpoint), it would be cleaner to call the shutdown or use m.Stop() to properly release resources.
Suggested cleanup pattern
 t.Cleanup(func() {
-	tracingEnabled.Store(false)
-	m.shutdownFunc = nil
+	if m.shutdownFunc != nil {
+		_ = m.shutdownFunc()
+	}
+	tracingEnabled.Store(false)
+	m.shutdownFunc = nil
 })
Or simply:
 t.Cleanup(func() {
-	tracingEnabled.Store(false)
-	m.shutdownFunc = nil
+	m.Stop()
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tracing/tracing_test.go` around lines 54 - 57, The test cleanup currently
clears tracingEnabled and sets m.shutdownFunc = nil without invoking the
shutdown; change the cleanup to call the provider shutdown before clearing it
(e.g., if m.shutdownFunc != nil { m.shutdownFunc() } or call m.Stop()), then set
m.shutdownFunc = nil and tracingEnabled.Store(false) so the provider can
gracefully shut down; reference m.shutdownFunc and m.Stop() in the cleanup
logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tracing/tracing.go`:
- Line 127: The error string returned by the fmt.Errorf call that currently
reads "Failed to build a new tracer provider: %w" should start with a lowercase
letter per Go convention; update that fmt.Errorf invocation (the one returning
the "Failed to build a new tracer provider" error in tracing.go) to use "failed
to build a new tracer provider: %w" so the error string is not capitalized.

---

Nitpick comments:
In `@tracing/tracing_test.go`:
- Around line 54-57: The test cleanup currently clears tracingEnabled and sets
m.shutdownFunc = nil without invoking the shutdown; change the cleanup to call
the provider shutdown before clearing it (e.g., if m.shutdownFunc != nil {
m.shutdownFunc() } or call m.Stop()), then set m.shutdownFunc = nil and
tracingEnabled.Store(false) so the provider can gracefully shut down; reference
m.shutdownFunc and m.Stop() in the cleanup logic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5ff65672-6c45-420a-b467-57051753cebd

📥 Commits

Reviewing files that changed from the base of the PR and between fd68082 and 3767cc3.

📒 Files selected for processing (2)

tracing/tracing.go
tracing/tracing_test.go

tracing/tracing.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tracing/tracing_test.go`:
- Around line 70-72: The test relies on package-global mutable state
tracingEnabled; before asserting require.False(t, tracingEnabled.Load())
explicitly set the global to a known value (e.g., call
tracingEnabled.Store(false) or the package helper that disables tracing) so the
assertion is independent of test order, and optionally restore the previous
value at test teardown; update the test around the require.False check to
explicitly reset tracingEnabled first and then assert.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1886d06a-d142-4d96-b09b-95bfbe28c09b

📥 Commits

Reviewing files that changed from the base of the PR and between 3767cc3 and bd8085a.

📒 Files selected for processing (1)

tracing/tracing_test.go

coderabbitai · 2026-04-03T14:05:06Z

tracing/tracing_test.go

+	// Initially disabled.
+	require.False(t, tracingEnabled.Load())
+


⚠️ Potential issue | 🟡 Minor

Make initial-state assertion independent from test order.

At Line 71, the test assumes tracingEnabled starts as false, but this is package-global mutable state touched by other tests. Please reset state explicitly in this test before asserting to avoid shuffle/order flakiness.

🔧 Minimal hardening patch

func TestTracingEnabledFlagTransitions(t *testing.T) { + tracingEnabled.Store(false) + otel.SetTracerProvider(noop.NewTracerProvider()) + t.Cleanup(func() { + tracingEnabled.Store(false) + otel.SetTracerProvider(noop.NewTracerProvider()) + }) + m := NewManager(promslog.NewNopLogger()) t.Cleanup(m.Stop) // Initially disabled. require.False(t, tracingEnabled.Load())

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Initially disabled.

require.False(t, tracingEnabled.Load())

func TestTracingEnabledFlagTransitions(t *testing.T) {

tracingEnabled.Store(false)

otel.SetTracerProvider(noop.NewTracerProvider())

t.Cleanup(func() {

tracingEnabled.Store(false)

otel.SetTracerProvider(noop.NewTracerProvider())

})

m := NewManager(promslog.NewNopLogger())

t.Cleanup(m.Stop)

// Initially disabled.

require.False(t, tracingEnabled.Load())

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tracing/tracing_test.go` around lines 70 - 72, The test relies on package-global mutable state tracingEnabled; before asserting require.False(t, tracingEnabled.Load()) explicitly set the global to a known value (e.g., call tracingEnabled.Store(false) or the package helper that disables tracing) so the assertion is independent of test order, and optionally restore the previous value at test teardown; update the test around the require.False check to explicitly reset tracingEnabled first and then assert.

SoloJacobs · 2026-04-03T15:05:26Z

LGTM, I specifically like that the tests enforce what the change accomplishes 👍

siavashs requested a review from a team as a code owner March 30, 2026 10:02

siavashs requested a review from ultrotter March 30, 2026 10:03

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

tracing/tracing.go Show resolved Hide resolved

siavashs force-pushed the fix/tracing-reload branch from eadfab0 to fd68082 Compare March 30, 2026 10:14

coderabbitai bot reviewed Mar 30, 2026

View reviewed changes

TheMeier approved these changes Mar 30, 2026

View reviewed changes

tracing/tracing.go Show resolved Hide resolved

tracing/tracing_test.go Outdated Show resolved Hide resolved

use better func comparison in tests

e3d714a

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

siavashs force-pushed the fix/tracing-reload branch from 3767cc3 to e3d714a Compare April 3, 2026 13:53

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

tracing/tracing.go Outdated Show resolved Hide resolved

fix(tracing): call stop for test cleanup

bd8085a

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

SoloJacobs approved these changes Apr 4, 2026

View reviewed changes

		// Initially disabled.
		require.False(t, tracingEnabled.Load())

Conversation

siavashs commented Mar 30, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Checklist

Which user-facing changes does this PR introduce?

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

TheMeier left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

SoloJacobs commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

siavashs commented Mar 30, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 30, 2026 •

edited

Loading