Improve tracing for parallel fetch that blocks cloud agent init.#12843
Improve tracing for parallel fetch that blocks cloud agent init.#12843vorporeal wants to merge 2 commits into
Conversation
|
I'm starting a first review of this pull request. You can view the conversation on Warp. I completed the review and no human review was requested for this pull request. Comment Powered by Oz |
There was a problem hiding this comment.
Overview
This PR adds #[tracing::instrument] spans (tagged tags.cloud_agent) across the cloud-agent attachment and secret fetch paths, and refactors two closures into named inner async fns so they can be instrumented. The changes are observability-only and behavior-preserving.
Concerns
- Formatting will fail CI (must fix before merge): the new
async fn innersignature incrates/managed_secrets/src/manager.rsis ~135 columns andrustfmtrewraps it. I confirmedrustfmt --checkreformats this signature, so./script/format(required per WARP.md) will report a diff. See inline suggestion. - Per-attempt error noise (non-blocking):
download_attachment::attemptis instrumented witherrand runs insidewith_bounded_retry, so every transient failure that is later retried successfully will still be recorded at ERROR level. The outerdownload_attachmentalready haserrfor the terminal failure, so considererr(level = "warn")on the innerattemptto avoid logging recoverable transient failures as errors.
Verdict
Found: 0 critical, 1 important, 2 suggestions
Approve with nits — looks correct as an observability-only change; please run ./script/format so CI passes before taking this out of draft.
This review was generated with Oz.
There was a problem hiding this comment.
Overview
Pure observability change: adds #[tracing::instrument(skip_all, err, fields(tags.cloud_agent = true))] spans to the four parallel fetches that run during cloud agent init (attachments, handoff snapshot attachments, ambient task, task attachments, handoff snapshot, and managed secrets). The refactors in download_attachment (extracting an inner attempt fn) and ManagedSecretManager::get_task_secrets (extracting an inner inner fn) are needed because #[tracing::instrument] can't be applied to a bare async move block / retry closure, so the bodies are moved into instrumentable async fns. This matches the existing pattern in the repo (e.g. git_credentials::try_refresh, persistence::initialize). Logic is preserved and the changes look correct. Not user-facing, so no visual evidence is required.
Concerns
-
download_attachment—erron the innerattemptspan (app/src/ai/agent_sdk/driver/attachments.rs): the innerattemptfn is wrapped bywith_bounded_retry, so instrumenting it witherrwill record an ERROR status on a span for every transient failure that is subsequently retried and succeeds. Since you're triagingSTATUS_CODE_ERRORspans, this may add noise (errored child spans even when the overall download succeeds). If you only want true terminal failures flagged, consider keepingerron the outerdownload_attachmentspan only and dropping it fromattempt(still capturing per-attempt timing). Non-blocking — surfacing each failed attempt may well be the intent. -
Inner
attemptspan name (app/src/ai/agent_sdk/driver/attachments.rs): unlikeget_task_secretswhich setsname = "get_task_secrets", the innerattemptfn span will be namedattempt, which is ambiguous in traces. Considername = "download_attachment::attempt"to match themodule::fnconvention used elsewhere (e.g.git_credentials::try_refresh). -
Formatting: the new inner-fn region won't pass
cargo fmt --checkas written —crates/managed_secrets/src/manager.rshas an over-lengthasync fn inner(...)signature, and the blank line introduced beforewith_bounded_retryinattachments.rscarries trailing whitespace. Please run./script/formatbefore un-drafting.
Verdict
Found: 0 critical, 0 important, 3 suggestions
Approve with nits — clean, correct observability change following established conventions; please run ./script/format and consider the err/naming notes above.
There was a problem hiding this comment.
Overview
This PR adds tracing spans around cloud-agent initialization fetches for task metadata, attachment fetch/download, handoff snapshot attachments, and managed secret retrieval. The implementation is scoped to observability: it skips arguments, marks cloud-agent spans, and adds the missing tracing dependency for warp_managed_secrets.
Concerns
- No blocking correctness, security, or spec-alignment concerns found in the annotated diff.
Verdict
Found: 0 critical, 0 important, 0 suggestions
Approve
Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).
Powered by Oz
There was a problem hiding this comment.
Overview
Adds #[tracing::instrument(skip_all, err, fields(tags.cloud_agent = true))] spans to the cloud-agent init fetch paths: the attachment download helpers (fetch_and_download_attachments, fetch_and_download_handoff_snapshot_attachments, download_attachment + an extracted attempt inner fn), three AIClient methods in ai.rs (get_ambient_agent_task, get_task_attachments, get_handoff_snapshot_attachments), and ManagedSecretManager::get_task_secrets (refactored to delegate to an instrumented inner async fn). Also wires tracing into the warp_managed_secrets crate.
The instrumentation follows the existing convention in this codebase (e.g. get_ai_conversation in ai.rs and the many tags.cloud_agent = true spans across app/). All instrumented functions return anyhow::Result, so the err directive is valid. The two structural refactors (extracting attempt in download_attachment and inner in get_task_secrets) preserve behavior — retry semantics and the + use<> capture bound are unchanged, and warp_managed_secrets compiles cleanly.
Concerns
None blocking.
- Minor / optional: in
download_attachment(app/src/ai/agent_sdk/driver/attachments.rs), both the outer fn and the innerattemptfn carryerr. On a terminal failure the error will be recorded twice in tracing — once in the per-attempt span and once in the outer span. This is fine if per-attempt error events are intentional; otherwise consider droppingerrfrom one of the two to avoid duplicate error events.
This is a backend-only observability change with no user-visible behavior, so no screenshots/recordings are needed.
Verdict
Found: 0 critical, 0 important, 1 suggestion
Approve

Description
I found a couple traces on staging where cloud agent initialization fails during a span that currently covers four parallel async fetches of various things the agent needs before it can start running.
Example 1: trace
Example 2: trace
In order to better understand exactly what's going wrong here, I've added additional trace spans so we can see the contribution of each of these components, and get better insight into what's slow and what fails.