Skip to content

Improve tracing for parallel fetch that blocks cloud agent init.#12843

Open
vorporeal wants to merge 2 commits into
masterfrom
david/improve-tracing-for-cloud-agent-init-parallel-fetch
Open

Improve tracing for parallel fetch that blocks cloud agent init.#12843
vorporeal wants to merge 2 commits into
masterfrom
david/improve-tracing-for-cloud-agent-init-parallel-fetch

Conversation

@vorporeal

@vorporeal vorporeal commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Description

I found a couple traces on staging where cloud agent initialization fails during a span that currently covers four parallel async fetches of various things the agent needs before it can start running.

Example 1: trace
Example 2: trace

In order to better understand exactly what's going wrong here, I've added additional trace spans so we can see the contribution of each of these components, and get better insight into what's slow and what fails.

@cla-bot cla-bot Bot added the cla-signed label Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vorporeal vorporeal marked this pull request as ready for review June 19, 2026 23:52
@oz-for-oss

oz-for-oss Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@vorporeal

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I completed the review and no human review was requested for this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@warp-dev-github-integration warp-dev-github-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR adds #[tracing::instrument] spans (tagged tags.cloud_agent) across the cloud-agent attachment and secret fetch paths, and refactors two closures into named inner async fns so they can be instrumented. The changes are observability-only and behavior-preserving.

Concerns

  • Formatting will fail CI (must fix before merge): the new async fn inner signature in crates/managed_secrets/src/manager.rs is ~135 columns and rustfmt rewraps it. I confirmed rustfmt --check reformats this signature, so ./script/format (required per WARP.md) will report a diff. See inline suggestion.
  • Per-attempt error noise (non-blocking): download_attachment::attempt is instrumented with err and runs inside with_bounded_retry, so every transient failure that is later retried successfully will still be recorded at ERROR level. The outer download_attachment already has err for the terminal failure, so consider err(level = "warn") on the inner attempt to avoid logging recoverable transient failures as errors.

Verdict

Found: 0 critical, 1 important, 2 suggestions

Approve with nits — looks correct as an observability-only change; please run ./script/format so CI passes before taking this out of draft.

This review was generated with Oz.

Comment thread crates/managed_secrets/src/manager.rs
Comment thread app/src/ai/agent_sdk/driver/attachments.rs

@warp-dev-github-integration warp-dev-github-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Pure observability change: adds #[tracing::instrument(skip_all, err, fields(tags.cloud_agent = true))] spans to the four parallel fetches that run during cloud agent init (attachments, handoff snapshot attachments, ambient task, task attachments, handoff snapshot, and managed secrets). The refactors in download_attachment (extracting an inner attempt fn) and ManagedSecretManager::get_task_secrets (extracting an inner inner fn) are needed because #[tracing::instrument] can't be applied to a bare async move block / retry closure, so the bodies are moved into instrumentable async fns. This matches the existing pattern in the repo (e.g. git_credentials::try_refresh, persistence::initialize). Logic is preserved and the changes look correct. Not user-facing, so no visual evidence is required.

Concerns

  • download_attachmenterr on the inner attempt span (app/src/ai/agent_sdk/driver/attachments.rs): the inner attempt fn is wrapped by with_bounded_retry, so instrumenting it with err will record an ERROR status on a span for every transient failure that is subsequently retried and succeeds. Since you're triaging STATUS_CODE_ERROR spans, this may add noise (errored child spans even when the overall download succeeds). If you only want true terminal failures flagged, consider keeping err on the outer download_attachment span only and dropping it from attempt (still capturing per-attempt timing). Non-blocking — surfacing each failed attempt may well be the intent.

  • Inner attempt span name (app/src/ai/agent_sdk/driver/attachments.rs): unlike get_task_secrets which sets name = "get_task_secrets", the inner attempt fn span will be named attempt, which is ambiguous in traces. Consider name = "download_attachment::attempt" to match the module::fn convention used elsewhere (e.g. git_credentials::try_refresh).

  • Formatting: the new inner-fn region won't pass cargo fmt --check as written — crates/managed_secrets/src/manager.rs has an over-length async fn inner(...) signature, and the blank line introduced before with_bounded_retry in attachments.rs carries trailing whitespace. Please run ./script/format before un-drafting.

Verdict

Found: 0 critical, 0 important, 3 suggestions

Approve with nits — clean, correct observability change following established conventions; please run ./script/format and consider the err/naming notes above.

@oz-for-oss oz-for-oss Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR adds tracing spans around cloud-agent initialization fetches for task metadata, attachment fetch/download, handoff snapshot attachments, and managed secret retrieval. The implementation is scoped to observability: it skips arguments, marks cloud-agent spans, and adds the missing tracing dependency for warp_managed_secrets.

Concerns

  • No blocking correctness, security, or spec-alignment concerns found in the annotated diff.

Verdict

Found: 0 critical, 0 important, 0 suggestions

Approve

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

@warp-dev-github-integration warp-dev-github-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Adds #[tracing::instrument(skip_all, err, fields(tags.cloud_agent = true))] spans to the cloud-agent init fetch paths: the attachment download helpers (fetch_and_download_attachments, fetch_and_download_handoff_snapshot_attachments, download_attachment + an extracted attempt inner fn), three AIClient methods in ai.rs (get_ambient_agent_task, get_task_attachments, get_handoff_snapshot_attachments), and ManagedSecretManager::get_task_secrets (refactored to delegate to an instrumented inner async fn). Also wires tracing into the warp_managed_secrets crate.

The instrumentation follows the existing convention in this codebase (e.g. get_ai_conversation in ai.rs and the many tags.cloud_agent = true spans across app/). All instrumented functions return anyhow::Result, so the err directive is valid. The two structural refactors (extracting attempt in download_attachment and inner in get_task_secrets) preserve behavior — retry semantics and the + use<> capture bound are unchanged, and warp_managed_secrets compiles cleanly.

Concerns

None blocking.

  • Minor / optional: in download_attachment (app/src/ai/agent_sdk/driver/attachments.rs), both the outer fn and the inner attempt fn carry err. On a terminal failure the error will be recorded twice in tracing — once in the per-attempt span and once in the outer span. This is fine if per-attempt error events are intentional; otherwise consider dropping err from one of the two to avoid duplicate error events.

This is a backend-only observability change with no user-visible behavior, so no screenshots/recordings are needed.

Verdict

Found: 0 critical, 0 important, 1 suggestion

Approve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant