Respect global config for OTel tracing and metrics enabled#4326
Respect global config for OTel tracing and metrics enabled#4326gkatz2 wants to merge 2 commits intostacklok:mainfrom
Conversation
thv config otel set-tracing-enabled false was silently ignored by thv run because getTelemetryFromFlags had no fallback for these two fields, and buildRunConfig bypassed the fallback entirely for the proxy runner config. Users could not globally disable telemetry without passing CLI flags on every invocation. Fixes stacklok#4323 Signed-off-by: Greg Katz <gkatz@indeed.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes thv run ignoring global OpenTelemetry tracing-enabled / metrics-enabled settings by introducing proper config fallbacks and representing “unset vs explicitly false” in the global config model.
Changes:
- Update global
OpenTelemetryConfigto use*boolforTracingEnabled/MetricsEnabledto preserve “never set” vs “explicitly disabled”. - Extend
getTelemetryFromFlagsfallbacks to include tracing/metrics enabled, and use the resolved values when building telemetry config. - Update config CLI setters/getters/unsetters and expand unit test coverage for telemetry flag/config resolution.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
pkg/config/config.go |
Switch tracing/metrics enabled fields to *bool and add rationale comment. |
cmd/thv/app/run_flags.go |
Add tracing/metrics config fallback and adjust telemetry creation / legacy runconfig wiring. |
cmd/thv/app/otel.go |
Update config mutation and printing logic to work with *bool. |
cmd/thv/app/run_flags_test.go |
Add/adjust tests for tracing/metrics fallback behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| MetricsEnabled *bool `yaml:"metrics-enabled"` | ||
| TracingEnabled *bool `yaml:"tracing-enabled"` | ||
| Insecure bool `yaml:"insecure,omitempty"` | ||
| EnablePrometheusMetricsPath bool `yaml:"enable-prometheus-metrics-path,omitempty"` | ||
| UseLegacyAttributes *bool `yaml:"use-legacy-attributes"` |
There was a problem hiding this comment.
MetricsEnabled/TracingEnabled are now *bool but the YAML tags no longer use omitempty. When the OTEL block is present (e.g., endpoint set) and these pointers are nil, yaml.Marshal will emit metrics-enabled: null / tracing-enabled: null, which is user-facing noise and ambiguous given the new nil=never-set semantics. Consider adding omitempty back to the pointer fields (so nil is omitted but explicit false/true is preserved), or otherwise ensuring nil values don’t serialize as null.
| MetricsEnabled *bool `yaml:"metrics-enabled"` | |
| TracingEnabled *bool `yaml:"tracing-enabled"` | |
| Insecure bool `yaml:"insecure,omitempty"` | |
| EnablePrometheusMetricsPath bool `yaml:"enable-prometheus-metrics-path,omitempty"` | |
| UseLegacyAttributes *bool `yaml:"use-legacy-attributes"` | |
| MetricsEnabled *bool `yaml:"metrics-enabled,omitempty"` | |
| TracingEnabled *bool `yaml:"tracing-enabled,omitempty"` | |
| Insecure bool `yaml:"insecure,omitempty"` | |
| EnablePrometheusMetricsPath bool `yaml:"enable-prometheus-metrics-path,omitempty"` | |
| UseLegacyAttributes *bool `yaml:"use-legacy-attributes,omitempty"` |
| func getOtelMetricsEnabledCmdFunc(_ *cobra.Command, _ []string) error { | ||
| configProvider := config.NewDefaultProvider() | ||
| cfg := configProvider.GetConfig() | ||
|
|
||
| fmt.Printf("Current OpenTelemetry metrics enabled: %t\n", cfg.OTEL.MetricsEnabled) | ||
| metricsEnabled := cfg.OTEL.MetricsEnabled != nil && *cfg.OTEL.MetricsEnabled | ||
| fmt.Printf("Current OpenTelemetry metrics enabled: %t\n", metricsEnabled) | ||
| return nil | ||
| } | ||
|
|
||
| func unsetOtelMetricsEnabledCmdFunc(_ *cobra.Command, _ []string) error { | ||
| configProvider := config.NewDefaultProvider() | ||
| cfg := configProvider.GetConfig() | ||
|
|
||
| if !cfg.OTEL.MetricsEnabled { | ||
| if cfg.OTEL.MetricsEnabled == nil || !*cfg.OTEL.MetricsEnabled { | ||
| fmt.Println("OpenTelemetry metrics enabled is already disabled.") | ||
| return nil |
There was a problem hiding this comment.
The get-otel-metrics-enabled output treats nil as false, but with the new *bool semantics nil means “never set” (and thv run will fall back to the CLI default, which is true). This can mislead users into thinking metrics are globally disabled when the setting is actually unset. Also, unset-otel-metrics-enabled currently returns early when the value is explicitly set to false, which prevents unsetting a previously-disabled value back to “never set”. Adjust the getter to report the unset state (or effective default), and allow unsetting regardless of current value (only short-circuit when already nil).
| func getOtelTracingEnabledCmdFunc(_ *cobra.Command, _ []string) error { | ||
| configProvider := config.NewDefaultProvider() | ||
| cfg := configProvider.GetConfig() | ||
|
|
||
| fmt.Printf("Current OpenTelemetry tracing enabled: %t\n", cfg.OTEL.TracingEnabled) | ||
| tracingEnabled := cfg.OTEL.TracingEnabled != nil && *cfg.OTEL.TracingEnabled | ||
| fmt.Printf("Current OpenTelemetry tracing enabled: %t\n", tracingEnabled) | ||
| return nil | ||
| } | ||
|
|
||
| func unsetOtelTracingEnabledCmdFunc(_ *cobra.Command, _ []string) error { | ||
| configProvider := config.NewDefaultProvider() | ||
| cfg := configProvider.GetConfig() | ||
|
|
||
| if !cfg.OTEL.TracingEnabled { | ||
| if cfg.OTEL.TracingEnabled == nil || !*cfg.OTEL.TracingEnabled { | ||
| fmt.Println("OpenTelemetry tracing enabled is already disabled.") | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Same issue as metrics: get-otel-tracing-enabled reports false when the config value is nil (unset), even though the effective behavior falls back to the CLI default true. Additionally, unset-otel-tracing-enabled currently refuses to unset when the value is explicitly false, so users can’t revert from “explicitly disabled” to “never set”. Update the getter to distinguish unset vs explicitly false, and change unset to only no-op when already nil.
| @@ -716,7 +728,7 @@ func configureMiddlewareAndOptions( | |||
| runFlags.JWKSAllowPrivateIP, runFlags.InsecureAllowHTTP, oidcScopes, | |||
| ), | |||
| runner.WithTelemetryConfigFromFlags(finalOtelEndpoint, runFlags.OtelEnablePrometheusMetricsPath, | |||
| runFlags.OtelTracingEnabled, runFlags.OtelMetricsEnabled, runFlags.OtelServiceName, | |||
| finalTracingEnabled, finalMetricsEnabled, runFlags.OtelServiceName, | |||
| finalOtelSamplingRate, runFlags.OtelHeaders, runFlags.OtelInsecure, finalOtelEnvironmentVariables, | |||
| runFlags.OtelUseLegacyAttributes, | |||
| ), | |||
There was a problem hiding this comment.
WithTelemetryConfigFromFlags is still receiving several raw runFlags values (e.g., otel-enable-prometheus-metrics-path, otel-insecure, otel-use-legacy-attributes) rather than the resolved values with global-config fallback. This can make the legacy RunConfig.TelemetryConfig diverge from the middleware telemetryConfig (e.g., global Prometheus metrics path enabled won’t be reflected, affecting exported runconfigs/K8s export). Consider deriving these values from telemetryConfig (when non-nil) or passing the resolved outputs from getTelemetryFromFlags into this call.
| // If both tracing and metrics are disabled, skip telemetry entirely. | ||
| // This allows users to disable telemetry via global config while keeping | ||
| // the endpoint configured for later re-enablement. | ||
| if !otelTracingEnabled && !otelMetricsEnabled && !otelEnablePrometheusMetricsPath { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
createTelemetryConfig now has a new early-return path when both tracing and metrics are disabled. Given this is the core behavior change enabling global disablement, it should have a unit test asserting that telemetry config becomes nil when otelEndpoint is set but both signals (and Prometheus metrics path) are disabled, and that CLI flags still override config where applicable.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4326 +/- ##
==========================================
- Coverage 68.87% 68.45% -0.43%
==========================================
Files 478 479 +1
Lines 48306 48645 +339
==========================================
+ Hits 33272 33298 +26
- Misses 12320 12378 +58
- Partials 2714 2969 +255 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Cover the createTelemetryConfig early-return when both tracing and metrics are disabled with the endpoint still configured. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Greg Katz <gkatz@indeed.com>
| cfg := configProvider.GetConfig() | ||
|
|
||
| if !cfg.OTEL.MetricsEnabled { | ||
| if cfg.OTEL.MetricsEnabled == nil || !*cfg.OTEL.MetricsEnabled { |
There was a problem hiding this comment.
I think there's a subtle issue with the unset guard here. It treats "not configured" (nil) and "explicitly set to false" the same way — so if someone runs set-metrics-enabled false and later tries to unset it, they get "already disabled" and the explicit false stays in the config file, still overriding the CLI default.
Should we check just == nil instead? That way unset always removes the entry. Same thing applies to unsetOtelTracingEnabledCmdFunc at line 473.
| otelUseLegacyAttributes bool) ( | ||
| string, float64, []string, bool, bool, bool) { | ||
| otelUseLegacyAttributes bool, otelTracingEnabled bool, otelMetricsEnabled bool) ( | ||
| string, float64, []string, bool, bool, bool, bool, bool) { |
There was a problem hiding this comment.
This is now returning 8 positional values with 5 consecutive bools — it's getting pretty easy to accidentally swap them at call sites. Would you be up for pulling these into a small struct in this PR, or would you rather do that as a follow-up?
| // Extract resolved tracing/metrics values from the middleware telemetry config. | ||
| // These must match what setupTelemetryConfiguration resolved (with global config | ||
| // fallbacks) rather than the raw runFlags values, which ignore global config. | ||
| finalTracingEnabled := runFlags.OtelTracingEnabled |
There was a problem hiding this comment.
Heads up — when both signals get disabled via global config, createTelemetryConfig returns nil, and the fallback here ends up using the raw runFlags values (which default to true). So the resolved "disabled" state gets lost. Right now it doesn't cause problems because extractTelemetryValues(nil) returns an empty endpoint and MaybeMakeConfig bails out too, but it feels fragile.
Maybe we could pass the resolved values from getTelemetryFromFlags to both consumers, or just default to false here when telemetryConfig is nil?
Summary
thv config otel set-tracing-enabled falsewas silently ignored bythv runbecausegetTelemetryFromFlagshad no fallback forTracingEnabledorMetricsEnabled, andbuildRunConfigbypassed the fallback entirely for the proxy runner configFixes #4323
Type of change
Test plan
task test)task lint-fix)Deployed a local build and verified:
tracing-enabled: falseandmetrics-enabled: false,thv runproduces a runconfig withnulltelemetry (no telemetry initialized)--otel-tracing-enabled=true, CLI flag correctly overrides global configconfig.yamlnow showstracing-enabled: falseinstead of silently dropping the fieldChanges
pkg/config/config.goTracingEnabled/MetricsEnabledfromboolwithomitemptyto*boolwithoutomitempty, matching the existingUseLegacyAttributespatterncmd/thv/app/run_flags.gogetTelemetryFromFlags; use resolved values inbuildRunConfig; return nil fromcreateTelemetryConfigwhen both signals are disabledcmd/thv/app/otel.go*boolcmd/thv/app/run_flags_test.goDoes this introduce a user-facing change?
thv config otel set-tracing-enabledandset-metrics-enablednow take effect when starting servers withthv run. Previously these settings were silently ignored.Special notes for reviewers
The
*boolpattern (nil = never set, false = explicitly disabled) is necessary because the CLI defaults for these flags aretrue, unlikeInsecureandEnablePrometheusMetricsPathwhich default tofalse. A struct-level comment onOpenTelemetryConfigexplains this distinction.Generated with Claude Code