feat(trace-exporter): add fail-closed fallback to v04 #2037
anais-raison wants to merge 11 commits into
🎉 All green! 🧪 All tests passed. 🔗 Commit SHA: 73cbd6d
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2037 +/- ##
==========================================
+ Coverage 72.75% 72.83% +0.08%
==========================================
Files 458 458
Lines 75827 76005 +178
==========================================
+ Hits 55168 55361 +193
+ Misses 20659 20644 -15
…allback # Conflicts: # libdd-data-pipeline/src/trace_exporter/trace_serializer.rs # libdd-trace-utils/src/msgpack_encoder/v04/mod.rs # libdd-trace-utils/src/msgpack_encoder/v1/mod.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73cbd6de94
    header_tags,
    &self.metadata,
    self.agent_payload_response_version.as_ref(),
    self.effective_output_format(),
Snapshot effective output format per send
send_trace_chunks_inner uses self.effective_output_format() here for payload encoding, but the URL is resolved later through get_agent_url(), which reads v1_active again. During concurrent traffic, another thread can run check_agent_info and flip v1_active between those two reads (for example when /info changes), causing a v0.4 payload to be sent to /v1.0/traces or a v1 payload to /v0.4/traces. That protocol/path mismatch can produce avoidable request failures and dropped traces during agent capability transitions.
@anais-raison I believe this is a valid concern.
What is the expected behavior? I assume the common scenario is that we don't have the response from /info yet, so we fall back to v0.4 until agent_info is populated.
To solve for that scenario you probably want to capture the effective output at the start and pass that value through the call chain so we don't wind up serializing for one format but sending to the other format's endpoint.
Is it possible that, after /info reports V1 as supported, the agent later changes that to false? If so, the fix becomes more complicated because we have to consider the retry logic, which currently doesn't re-check for V1 support.
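The suggestion to capture the effective output format once and pass it through the call chain could look roughly like the sketch below. It is a stripped-down model, not the exporter's real code: `Exporter`, `plan_send`, `set_v1_active`, and `traces_path` are hypothetical names, and the `/v0.4/traces` and `/v1.0/traces` paths are the ones quoted in the review above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Wire formats discussed in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OutputFormat {
    V04,
    V1,
}

impl OutputFormat {
    /// Endpoint path that matches this format.
    pub fn traces_path(self) -> &'static str {
        match self {
            OutputFormat::V04 => "/v0.4/traces",
            OutputFormat::V1 => "/v1.0/traces",
        }
    }
}

/// Hypothetical exporter reduced to its negotiation state.
pub struct Exporter {
    v1_requested: bool,
    v1_active: AtomicBool,
}

impl Exporter {
    pub fn new(v1_requested: bool) -> Self {
        Self { v1_requested, v1_active: AtomicBool::new(false) }
    }

    /// Stands in for the update done after reading `/info`.
    pub fn set_v1_active(&self, active: bool) {
        self.v1_active.store(active, Ordering::Relaxed);
    }

    /// Reads `v1_active` exactly once.
    pub fn effective_output_format(&self) -> OutputFormat {
        if self.v1_requested && self.v1_active.load(Ordering::Relaxed) {
            OutputFormat::V1
        } else {
            OutputFormat::V04
        }
    }

    /// Takes one snapshot and derives both the encoding and the URL path from it,
    /// so a concurrent flip of `v1_active` cannot split the two.
    pub fn plan_send(&self) -> (OutputFormat, &'static str) {
        let format = self.effective_output_format();
        (format, format.traces_path())
    }
}
```

With this shape, a flip of `v1_active` after `plan_send` returns only affects the next send. Retries would need to reuse the same snapshot (or re-encode) to stay consistent, which is the complication raised above.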
    /// Returns true when the `DD_TRACE_AGENT_PROTOCOL_VERSION` environment variable opts in to
    /// the V1 protocol. Accepts `"1"` and `"1.0"`. Other values (including unset) yield false.
    fn env_requests_v1_protocol() -> bool {
        Self::parse_v1_protocol_env(std::env::var("DD_TRACE_AGENT_PROTOCOL_VERSION").ok())
We should not be checking env vars in libdatadog (normally). This should be handled by the SDK and passed via the builder.
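One way to read this suggestion: the SDK parses the environment variable and the library only sees an explicit builder flag. The sketch below assumes that split. `ExporterBuilder`, `ExporterConfig`, and `configure_from_sdk` are hypothetical; `enable_v1_protocol()` and `parse_v1_protocol_env` are named in this PR, but their real signatures are not shown here, so the ones below are assumptions.

```rust
/// Accepts "1" and "1.0"; any other value (including unset) yields false.
/// Signature assumed from the snippet under review.
pub fn parse_v1_protocol_env(value: Option<String>) -> bool {
    matches!(value.as_deref(), Some("1") | Some("1.0"))
}

/// Hypothetical result of building the exporter configuration.
pub struct ExporterConfig {
    pub v1_requested: bool,
}

/// Hypothetical builder: the library never touches the environment itself.
#[derive(Default)]
pub struct ExporterBuilder {
    v1_requested: bool,
}

impl ExporterBuilder {
    /// Explicit opt-in, set by the caller.
    pub fn enable_v1_protocol(mut self) -> Self {
        self.v1_requested = true;
        self
    }

    pub fn build(self) -> ExporterConfig {
        ExporterConfig { v1_requested: self.v1_requested }
    }
}

/// SDK-side glue: the SDK reads its own configuration (here, the raw value of
/// DD_TRACE_AGENT_PROTOCOL_VERSION) and forwards the decision to the builder.
pub fn configure_from_sdk(env_value: Option<String>) -> ExporterConfig {
    let mut builder = ExporterBuilder::default();
    if parse_v1_protocol_env(env_value) {
        builder = builder.enable_v1_protocol();
    }
    builder.build()
}
```

An SDK would call something like `configure_from_sdk(std::env::var("DD_TRACE_AGENT_PROTOCOL_VERSION").ok())`, keeping the environment lookup out of the library.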
    .info
    .endpoints
    .as_ref()
    .is_some_and(|e| e.iter().any(|p| p == "/v1.0/traces"));
/v1.0/traces should probably be a named const.
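A minimal sketch of what that could look like. The constant name `V1_TRACES_ENDPOINT` and the free function `agent_supports_v1` are hypothetical; the check itself mirrors the snippet above.

```rust
/// Path the agent lists in `/info` when it accepts V1 trace payloads.
pub const V1_TRACES_ENDPOINT: &str = "/v1.0/traces";

/// Returns true only when the endpoint list is present and contains the V1 path.
pub fn agent_supports_v1(endpoints: Option<&[String]>) -> bool {
    endpoints.is_some_and(|e| e.iter().any(|p| p == V1_TRACES_ENDPOINT))
}
```

The same constant could then be reused wherever the request URL is built, so the advertised path and the path actually used cannot drift apart.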
    // and the payload is encoded as V1.
    let start = std::time::Instant::now();
    while libdd_data_pipeline::agent_info::get_agent_info().is_none() {
        if start.elapsed() > std::time::Duration::from_secs(5) {
I think this is probably ok, but we need to be careful anytime we introduce waiting to tests. Could we try to stress test this a bit to see if it's flaky on CI? I'd suggest temporarily modifying the CI workflow to run the tracing integration step 1000x on the various platforms in our matrix in a loop to see if any flakes appear.
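If the wait stays in the test, one option is to factor it into a small bounded-polling helper so the timeout and poll interval are explicit and reusable. This is only a sketch; `wait_until` is a hypothetical helper and not part of the test suite shown in this PR.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Polls `ready` until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false on timeout.
pub fn wait_until(
    timeout: Duration,
    poll_interval: Duration,
    mut ready: impl FnMut() -> bool,
) -> bool {
    let start = Instant::now();
    loop {
        if ready() {
            return true;
        }
        if start.elapsed() >= timeout {
            return false;
        }
        // Sleep between checks so the loop does not spin a core.
        thread::sleep(poll_interval);
    }
}
```

In the test above, the condition would be a closure that checks whether `get_agent_info()` returns `Some`, and a `false` return would fail the test with a clear timeout message.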
    .as_ref()
    .is_some_and(|e| e.iter().any(|p| p == "/v1.0/traces"));
    let previous = self.v1_active.swap(supports_v1, Ordering::Relaxed);
    match (previous, supports_v1) {
Should we also log (false, false)? Or is there somewhere else we should log the scenario where the SDK is requesting V1, but the agent doesn't support it? I'm not sure if we care about that?
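To make the question concrete, here is a sketch that names all four `(previous, supports_v1)` combinations. The `V1Transition` enum and `update_v1_active` function are hypothetical; in the real code each arm would log (or not) rather than return a value.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// The four possible outcomes of refreshing V1 support from `/info`.
#[derive(Debug, PartialEq, Eq)]
pub enum V1Transition {
    Enabled,          // (false, true): agent now advertises V1
    Disabled,         // (true, false): agent stopped advertising V1
    StillUnsupported, // (false, false): V1 wanted but never available
    StillSupported,   // (true, true): no change
}

/// Stores the new value and reports which transition happened.
pub fn update_v1_active(v1_active: &AtomicBool, supports_v1: bool) -> V1Transition {
    // `swap` returns the previous value, so the transition is observed atomically.
    let previous = v1_active.swap(supports_v1, Ordering::Relaxed);
    match (previous, supports_v1) {
        (false, true) => V1Transition::Enabled,
        (true, false) => V1Transition::Disabled,
        (false, false) => V1Transition::StillUnsupported,
        (true, true) => V1Transition::StillSupported,
    }
}
```

Since `v1_active` starts out false, `(false, false)` is hit on every refresh against an agent without V1, so logging it unconditionally at a high level could be noisy.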
    match (previous, supports_v1) {
        (false, true) => debug!("V1 trace protocol enabled (agent advertises /v1.0/traces)"),
        (true, false) => {
            warn!("V1 trace protocol no longer advertised by agent; falling back to v0.4")
Related to this comment, when would this happen?
    @@ -347,6 +351,9 @@ impl<C: HttpClientCapability + SleepCapability + MaybeSend + Sync + 'static> Tra
    fn check_agent_info(&self) {
        if let Some(agent_info) = agent_info::get_agent_info() {
            if self.has_agent_info_state_changed(&agent_info) {
I might be concern trolling here...but if a likely scenario we have to support is the agent going from advertising V1 support to not advertising it, then it's possible we get trapped in a loop where we never get a correct state hash from the agent.
- Agent says it's ok to send V1, we send V1.
- Agent changes its mind and does not support V1
- We check if agent state has changed based on the last successful response that happened before the agent changed its mind.
- We send V1, get a 404 because the endpoint isn't available. We don't get a new state hash on a 404.
- We keep sending V1, keep 404'ing, never get a new state hash.
We poll /info every 5 minutes independently of the state hash. But 5 minutes is a long time to be dropping traces.
I think this would also apply to CSS, which has the same logic. Is this a valid concern?
CC @VianneyRuhlmann and @ajgajg1134
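The sequence described in this comment can be modeled with a small simulation. This is a deliberately simplified model under stated assumptions, not the exporter's logic: `Agent`, `Response`, and `Exporter` are hypothetical types, the state hash is reduced to a `u64`, and the periodic /info poll is left out so that only the response-driven path is exercised.

```rust
/// Hypothetical agent: may or may not serve the V1 endpoint.
pub struct Agent {
    pub supports_v1: bool,
    pub state_hash: u64,
}

pub struct Response {
    pub status: u16,
    /// Only successful responses carry a state hash in this model.
    pub state_hash: Option<u64>,
}

impl Agent {
    pub fn handle(&self, path: &str) -> Response {
        if path == "/v1.0/traces" && !self.supports_v1 {
            Response { status: 404, state_hash: None }
        } else {
            Response { status: 200, state_hash: Some(self.state_hash) }
        }
    }
}

/// Hypothetical exporter that refreshes its view only when a response
/// reports a state hash different from the last one it saw.
pub struct Exporter {
    pub v1_active: bool,
    pub last_state_hash: u64,
}

impl Exporter {
    pub fn send(&mut self, agent: &Agent) -> u16 {
        let path = if self.v1_active { "/v1.0/traces" } else { "/v0.4/traces" };
        let response = agent.handle(path);
        if let Some(hash) = response.state_hash {
            if hash != self.last_state_hash {
                self.last_state_hash = hash;
                // Stands in for re-reading /info after a state change.
                self.v1_active = agent.supports_v1;
            }
        }
        response.status
    }
}
```

In this model, once the agent drops V1, every send returns 404 without a hash, so `v1_active` never flips back until something outside the send path (such as the periodic poll) intervenes.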
What does this PR do?
Adds runtime V1 trace protocol negotiation with fail-closed fallback to V0.4. Opt in via enable_v1_protocol() or DD_TRACE_AGENT_PROTOCOL_VERSION=1. V1 is only used after the agent advertises /v1.0/traces in /info.
Motivation
APMSP-2809
Makes V1 safe to enable: falls back to V0.4 against agents that don't support it.
Additional Notes
Dynamic rollback supported (if the agent stops advertising V1, the exporter switches back to V0.4).
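As a compact illustration of the fail-closed rule described here, the decision can be written as a pure function over the opt-in flag and what is known from /info. `TraceProtocol` and `negotiate` are hypothetical names; `None` models the period before /info has been read.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TraceProtocol {
    V04,
    V1,
}

/// `agent_advertises_v1` is `None` until /info has been fetched.
/// Anything other than "requested and positively advertised" yields V0.4.
pub fn negotiate(v1_requested: bool, agent_advertises_v1: Option<bool>) -> TraceProtocol {
    match (v1_requested, agent_advertises_v1) {
        (true, Some(true)) => TraceProtocol::V1,
        _ => TraceProtocol::V04,
    }
}
```

Re-evaluating this function on each /info refresh gives the rollback behavior: when the advertised value goes from `Some(true)` to `Some(false)`, the result returns to V0.4.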
How to test the change?
All tests pass.
End-to-end tests against a real agent were also run and passed.