feat(results): write normalized transcript artifacts#1521

Merged
christso merged 4 commits into main from feat/av-wrf-transcript-artifacts
Jun 26, 2026

Conversation

@christso

Collaborator

Summary

AgentV run bundles now expose the ADR 0008 raw-plus-normalized transcript split: transcript.jsonl is the portable conversation transcript and transcript-raw.jsonl preserves native provider evidence when available. Result indexes, per-attempt manifests, projection bundles, combine/export paths, and Dashboard read models now carry explicit transcript_path, transcript_raw_path, and metrics_path fields instead of using the raw sidecar as the transcript discovery path.

Normalized transcript rows use the compact turn shape from docs/adr/0008-normalized-transcript-artifact-contract.md, including joined tool_use.result blocks for completed tool calls. Derived behavior summaries remain in metrics.json; the Dashboard transcript tab renders normalized text, tool names, inputs, statuses, durations, and results while keeping older transcript rows readable for compatibility.

Bead: av-wrf.1
Design source: docs/adr/0008-normalized-transcript-artifact-contract.md from ADR PR #1520.
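
As a rough illustration of the compact turn shape described above, the following sketch models one normalized row with a joined tool_use.result. Only the top-level keys v, agent, type, and content, the tool_use.result block, and the status values come from this PR; the other field names (id, name, input, text, duration_ms) and the sample values are assumptions, and ADR 0008 remains the source of truth.

```typescript
// Hypothetical types. Only v/agent/type/content, tool_use.result, and the
// status values are named in this PR; everything else is an assumption.
type ToolResultStatus = "success" | "error" | "timeout" | "cancelled" | "unknown";

interface ToolUseResult {
  status: ToolResultStatus;
  output?: unknown; // omitted when the provider supplied no payload
  duration_ms?: number; // assumed snake_case spelling
}

interface TextBlock {
  type: "text";
  text: string;
}

interface ToolUseBlock {
  type: "tool_use";
  id: string; // assumed field name
  name: string;
  input: unknown;
  result?: ToolUseResult; // present only for completed (joined) tool calls
}

interface NormalizedTranscriptRow {
  v: number;
  agent: string;
  type: "user" | "assistant"; // assumed value set
  content: Array<TextBlock | ToolUseBlock>;
}

// Illustrative sample row, not copied from a real run bundle.
const exampleRow: NormalizedTranscriptRow = {
  v: 1,
  agent: "pi-cli",
  type: "assistant",
  content: [
    { type: "text", text: "Running the tests." },
    {
      type: "tool_use",
      id: "call_1",
      name: "shell",
      input: { command: "bun test" },
      result: { status: "success", output: "ok", duration_ms: 1200 },
    },
  ],
};

// transcript.jsonl holds one such JSON object per line.
const jsonlLine = JSON.stringify(exampleRow);
```

Keeping the result inside the tool_use block means a reader does not have to correlate separate call and result rows.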

Verification

  • bun run lint
  • bun run typecheck
  • bun run build
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts
  • bun test packages/core/test/evaluation/orchestrator.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx
  • bun test apps/cli/test/eval.integration.test.ts apps/cli/test/commands/results/serve.test.ts
  • Dashboard browser UAT on a temporary local run workspace at http://localhost:3127: verified normalized transcript text plus joined tool call input, success status, duration, and result render in the Transcript tab. Evidence screenshot was saved outside the public repo under /tmp/agentv-transcript-uat-evidence/ and is not committed.

Notes

  • Code review: skipped the dedicated review. This Codex harness has no runnable Tier 1 review command, and the CE Tier 2 workflow is a procedural multi-agent prompt rather than a local command. The coordinator asked to finish the current coherent implementation PR and leave schema simplification to a follow-up worker.
  • Simplify pass: skipped, per the coordinator's instruction not to spend more time optimizing or splitting out schema simplification concerns in this implementation pass.

Post-Deploy Monitoring & Validation

  • Watch CI for failures in artifact writer, result export, Dashboard, and result server tests.
  • Search logs or CI output for transcript.jsonl, transcript-raw.jsonl, transcript_path, transcript_raw_path, and the error message "Line 1 is not a transcript JSONL row".
  • Healthy signal: new run bundles include normalized run-N/transcript.jsonl, raw run-N/transcript-raw.jsonl, explicit path fields in index.jsonl/result.json, and Dashboard transcript views load without parse errors.
  • Failure signal: Dashboard transcript tab reports missing/dangling transcript artifacts for new runs, result exports omit transcript sidecars, or downstream tests still expect transcript_path to point at transcript-raw.jsonl.
  • Mitigation trigger: if new runs cannot be inspected in Dashboard or exported result bundles lose transcript discoverability, revert this PR or hotfix path-field propagation while preserving raw sidecar files.
  • Validation window: first CI run and the first dogfood provider run from Bead av-wrf.2; owner: AgentV maintainers/coordinator.
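
The healthy and failure signals above could be spot-checked with a small helper along these lines. This is a minimal sketch, not part of AgentV: the IndexRecord type and checkTranscriptPathFields function are hypothetical, and only the three field names and the two file names come from this PR. The raw path is treated as optional because raw evidence is only preserved when available.

```typescript
// Hypothetical shape of one index.jsonl / result.json record; only the three
// field names are taken from the PR description.
interface IndexRecord {
  transcript_path?: string;
  transcript_raw_path?: string;
  metrics_path?: string;
}

// Returns a list of problems; an empty list corresponds to the healthy signal.
function checkTranscriptPathFields(record: IndexRecord): string[] {
  const problems: string[] = [];
  // transcript_raw_path is not required: raw evidence exists only when available.
  for (const key of ["transcript_path", "metrics_path"] as const) {
    if (!record[key]) problems.push(`missing ${key}`);
  }
  // Failure signal: the normalized path still points at the raw sidecar.
  if (record.transcript_path?.endsWith("transcript-raw.jsonl")) {
    problems.push("transcript_path points at the raw sidecar");
  }
  return problems;
}

const healthy = checkTranscriptPathFields({
  transcript_path: "run-1/transcript.jsonl",
  transcript_raw_path: "run-1/transcript-raw.jsonl",
  metrics_path: "run-1/metrics.json",
});

const legacy = checkTranscriptPathFields({
  transcript_path: "run-1/transcript-raw.jsonl",
});
```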

Compound Engineering
Codex

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 26, 2026


Deploying agentv with Cloudflare Pages

Latest commit: c9837e4
Status: ✅  Deploy successful!
Preview URL: https://3bade894.agentv.pages.dev
Branch Preview URL: https://feat-av-wrf-transcript-artif.agentv.pages.dev

@christso

Collaborator Author

Findings:

  1. P1 packages/core/src/evaluation/run-artifacts.ts:856 - metrics.json now records source_artifacts.trace_path: "trace.json", but the per-run writer only calls buildTraceEnvelopeSidecar() and never writes trace.json in run-N/. I reproduced this with a local sample run: run-1/metrics.json points at trace.json, while the directory contains grading.json, metrics.json, result.json, timing.json, transcript.jsonl, transcript-raw.jsonl, and outputs/answer.md, but no trace.json. This makes the metrics provenance path dangling. Either write the trace sidecar when advertising it, or do not emit trace_path in metrics.source_artifacts until the file exists.

  2. P2 packages/core/src/import/types.ts:489 - normalizedToolResult() drops tool_use.result whenever toolCall.output === undefined, even when toolCall.status is error, timeout, cancelled, or unknown and/or durationMs is known. The status then moves to non-contract metadata.status at lines 502-510, so consumers looking for the ADR 0008 tool_use.result.status will miss failed/cancelled tool outcomes. Create a result whenever status/duration/output is available, with output omitted when absent.

  3. P2 .agents/verification.md:130 - the agent-facing dogfood instructions still say repeat-run folders contain transcript.json and that per-run metrics.json should not be written. This PR changes the artifact contract to transcript.jsonl, transcript-raw.jsonl, and per-run metrics.json, so future agents following the always-read verification guide will validate the wrong layout. Update this guide with the ADR 0008 layout.

  4. P1 dogfood evidence gap - I do not see the required dogfood evidence in PR #1521, Bead av-wrf.2, or local/remote agentv-private evidence branches. The PR mentions a temporary screenshot under /tmp/agentv-transcript-uat-evidence/, but the clarified acceptance criteria require agent-browser visual evidence, local ~/.agentv/config.yaml project registration coverage/rationale, and live realistic eval coverage or documented infeasibility for the WTG and next-evals-oss-agentv evals. Without that evidence, this cannot be approved under the requested review standard.
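
The behavior requested in finding 2 can be sketched as follows. This is not the code in packages/core/src/import/types.ts: the ToolCall and ToolUseResult types, the function name, the duration_ms spelling, and the choice of "unknown" when no status is recorded are all assumptions made for illustration. The idea is simply that a result is emitted whenever status, duration, or output is available, with output omitted when absent.

```typescript
// Hypothetical internal and normalized shapes; status values follow the review text.
type ToolStatus = "ok" | "error" | "timeout" | "cancelled" | "unknown";
type ResultStatus = "success" | "error" | "timeout" | "cancelled" | "unknown";

interface ToolCall {
  status?: ToolStatus;
  output?: unknown;
  durationMs?: number;
}

interface ToolUseResult {
  status: ResultStatus;
  output?: unknown;
  duration_ms?: number; // assumed snake_case spelling
}

function normalizedToolResultSketch(call: ToolCall): ToolUseResult | undefined {
  const hasEvidence =
    call.status !== undefined ||
    call.durationMs !== undefined ||
    call.output !== undefined;
  // With no evidence at all there is nothing to join onto the tool_use block.
  if (!hasEvidence) return undefined;

  const result: ToolUseResult = {
    // Assumption: a call with no recorded status is reported as "unknown".
    status: call.status === "ok" ? "success" : (call.status ?? "unknown"),
  };
  // Optional fields are added only when known, so failed calls keep their
  // status even though they carry no output.
  if (call.output !== undefined) result.output = call.output;
  if (call.durationMs !== undefined) result.duration_ms = call.durationMs;
  return result;
}
```

With this shape, a call such as { status: "error", durationMs: 40 } still yields a result carrying the error status, which is the case the original early return on output === undefined was dropping.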

Schema simplification assessment:

The raw/normalized/metrics split, transcript_path/transcript_raw_path fields, and joined tool_use.result shape are aligned with ADR 0008. I do not think the new normalized TypeScript interfaces and a trace-envelope-to-normalized converter are avoidable if AgentV is going to emit this contract from existing trace envelopes. The implementation mostly keeps optional fields optional: model is only emitted when discoverable, and raw_refs are type-only and not forced into generated rows.

The main simplification pressure is to avoid growing a second event/index layer. This PR does not add one. Dashboard does convert normalized rows into its existing timeline row shape for compatibility; that is acceptable as a transition, but the tool_use.result status bug above should be fixed so the compatibility conversion does not hide failed/cancelled tool results.

Verification run locally on PR head a5f462df in detached worktree ../transcript-review-pr1521:

  • bun install passed.
  • bun --filter @agentv/core build && bun --filter @agentv/sdk build passed.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts passed: 86 tests.
  • bun test packages/core/test/evaluation/orchestrator.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx passed: 96 tests.
  • bun test apps/cli/test/eval.integration.test.ts apps/cli/test/commands/results/serve.test.ts had one transient failure in "supports repeatable --test-id flags with OR matching", caused by a missing temp diagnostics file; an isolated rerun passed.
  • bun test apps/cli/test/eval.integration.test.ts -t "supports repeatable --test-id flags with OR matching" passed.
  • bun run typecheck passed.
  • bun run lint passed.
  • cd apps/dashboard && bun run build passed.
  • Generated and inspected /tmp/agentv-pr1521-sample: index.jsonl exposes transcript_path, transcript_raw_path, and metrics_path; transcript.jsonl is normalized turn JSONL; transcript-raw.jsonl contains raw/compatibility evidence. This sample also exposed the dangling metrics.source_artifacts.trace_path finding above.

CI is green on PR #1521, but review verdict is request changes because findings #1 and #4 are blocking for artifact correctness and release evidence.

@christso

Collaborator Author

Addressed the concrete review findings in b642fe39:

  • Removed dangling metrics.json provenance by omitting source_artifacts.trace_path unless a trace sidecar path is actually supplied. Added regression coverage that run-1/trace.json is not present and metrics.json no longer advertises it.
  • Updated normalized transcript tool results so tool_use.result is emitted when status, duration, or output is available, including failed tool calls with output === undefined.
  • Updated .agents/verification.md repeat-run guidance to match ADR 0008 / this PR: transcript.jsonl, transcript-raw.jsonl, and per-run metrics.json under run-N/.
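
The first fix above amounts to only advertising a provenance path when one was actually supplied. A minimal sketch of that pattern, assuming a hypothetical buildSourceArtifacts helper and a source_artifacts object that also lists the transcript paths (only trace_path is confirmed by this thread):

```typescript
// Hypothetical provenance block for metrics.json.
interface SourceArtifacts {
  transcript_path: string;
  transcript_raw_path?: string;
  trace_path?: string;
}

interface SidecarPaths {
  transcriptPath: string;
  transcriptRawPath?: string;
  tracePath?: string;
}

function buildSourceArtifacts(paths: SidecarPaths): SourceArtifacts {
  return {
    transcript_path: paths.transcriptPath,
    // Conditional spreads keep absent sidecars out of the serialized JSON
    // instead of writing a path that nothing was written to.
    ...(paths.transcriptRawPath ? { transcript_raw_path: paths.transcriptRawPath } : {}),
    ...(paths.tracePath ? { trace_path: paths.tracePath } : {}),
  };
}

const withoutTrace = buildSourceArtifacts({
  transcriptPath: "transcript.jsonl",
  transcriptRawPath: "transcript-raw.jsonl",
});

const withTrace = buildSourceArtifacts({
  transcriptPath: "transcript.jsonl",
  tracePath: "trace.json",
});
```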

Verification run locally:

  • bun install
  • bun --filter @agentv/core build && bun --filter @agentv/sdk build
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts
  • bun test apps/dashboard/src/components/transcript-timeline.test.tsx
  • bun run typecheck
  • bun run lint
  • git diff --check

Pushed to feat/av-wrf-transcript-artifacts. CI is running on the pushed commit.

@christso

Collaborator Author

Implemented the Pi transcript join fix in 4c0cb6fc.

What I found:

  • Pi CLI raw JSONL can include separate tool_execution_start / tool_execution_end events.
  • tool_execution_end carries the result payload in result, linked back to the call by toolCallId. When timestamps are present, start/end timestamps also give duration.
  • The existing parser already reconstructed event-sourced tool calls with output, but injectEventToolCalls() treated a matching assistant tool_use as a duplicate and skipped the event-sourced call. That discarded the available result/status/timing before normalized transcript.jsonl was written.
  • This was a join gap, not a missing raw-data gap. The fix does not fabricate from metrics.json; it only joins payloads present in Pi raw stream events.

What changed:

  • Matching Pi event-sourced tool calls now enrich existing assistant tool calls by toolCallId or tool+input instead of being dropped.
  • Joined evidence preserves output, status, start/end time, and duration when available.
  • If Pi emits a tool_execution_end.result without an explicit status, the joined call is treated as successful (ok), which normalizes to tool_use.result.status: "success".
  • Added a Pi-style regression that starts with an assistant tool_use plus a separate tool_execution_end.result and asserts normalized transcript output includes joined tool_use.result with status, output, and duration.
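
The join described above can be sketched roughly like this. It is not the injectEventToolCalls() implementation: the types, the enrichToolCalls name, the millisecond timestamp fields, and the JSON-based input comparison are assumptions for illustration. It shows the three behaviors listed: match by toolCallId first and then by tool plus input, enrich instead of dropping, and default to ok when a result arrives without a status.

```typescript
// Hypothetical shapes for an assistant tool call and a call reconstructed
// from tool_execution_start / tool_execution_end events.
interface AssistantToolCall {
  id?: string;
  tool: string;
  input: unknown;
  output?: unknown;
  status?: "ok" | "error";
  durationMs?: number;
}

interface EventToolCall {
  toolCallId?: string;
  tool: string;
  input: unknown;
  result?: unknown;
  status?: "ok" | "error";
  startedAt?: number; // assumed epoch milliseconds
  endedAt?: number;
}

// Simplistic comparison; it is sensitive to key order in object inputs.
const sameInput = (a: unknown, b: unknown): boolean =>
  JSON.stringify(a) === JSON.stringify(b);

function enrichToolCalls(
  assistantCalls: AssistantToolCall[],
  eventCalls: EventToolCall[],
): AssistantToolCall[] {
  const merged = assistantCalls.map((call) => ({ ...call }));
  for (const event of eventCalls) {
    // Prefer an exact id match, then fall back to tool + input.
    let target =
      merged.find((c) => event.toolCallId !== undefined && c.id === event.toolCallId) ??
      merged.find((c) => c.tool === event.tool && sameInput(c.input, event.input));
    if (!target) {
      // No assistant tool_use to enrich, so keep the event-sourced call.
      target = { tool: event.tool, input: event.input };
      if (event.toolCallId !== undefined) target.id = event.toolCallId;
      merged.push(target);
    }
    if (event.result !== undefined) target.output = event.result;
    // A result without an explicit status is treated as successful (ok).
    const status = event.status ?? (event.result !== undefined ? "ok" : undefined);
    if (status !== undefined) target.status = status;
    if (event.startedAt !== undefined && event.endedAt !== undefined) {
      target.durationMs = event.endedAt - event.startedAt;
    }
  }
  return merged;
}
```

Because only fields present on the event are copied, evidence already on the assistant call is never overwritten with undefined.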

Verification:

  • bun install passed.
  • bun --filter @agentv/core build && bun --filter @agentv/sdk build passed.
  • bun test packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts passed: 8 tests.
  • bun test packages/core/test/evaluation/trace-envelope.test.ts passed: 11 tests.
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts passed: 56 tests.
  • bun test apps/dashboard/src/components/transcript-timeline.test.tsx passed: 4 tests.
  • bun run typecheck passed.
  • bun run lint passed.
  • git diff --check passed.

Pushed to feat/av-wrf-transcript-artifacts. No PR merge performed.

@christso

Collaborator Author

Dogfood evidence for Bead av-wrf.2 is published here:

Verdict: FAIL for live joined-result contract evidence.

What passed:

  • bun run build passed on 4c0cb6fc.
  • Successful live Azure pi-cli fallback run exercised shell, file read, edit/write, and rerun behavior, scoring 100%.
  • Successful bundle contains transcript-raw.jsonl and normalized transcript.jsonl.
  • Normalized transcript is JSONL, has snake_case top-level keys, has 7 assistant tool_use blocks, and no transcript-line o11y blob.
  • Index/result artifacts expose transcript_path, transcript_raw_path, metrics_path, and grading/result fields; derived behavior summaries remain in metrics/result artifacts.
  • Dashboard UAT used agent-browser; screenshots are in screenshots/agent-browser/, including transcript-expanded-drawer-b642.png.

Blocking finding:

  • The current live Azure Pi run has normalized tool_use entries but no result/status sections: joined_results: 0. The raw Pi stream contains toolResult boundaries but no result payloads, so the Dashboard can render tool IDs and arguments but not result/status sections. Fixture-backed join coverage exists separately in 4c0cb6fc, but this dogfood pass does not provide live joined-result evidence.
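
A count like joined_results: 0 can be reproduced from a normalized transcript with a short script. The sketch below assumes that each row's content is an array of blocks and that a joined tool call carries a result key; countJoinedResults is a hypothetical helper, not the tooling used for the dogfood pass.

```typescript
interface JoinCounts {
  toolUses: number;
  joinedResults: number;
}

// Counts tool_use blocks and how many of them carry a joined result.
function countJoinedResults(jsonl: string): JoinCounts {
  const counts: JoinCounts = { toolUses: 0, joinedResults: 0 };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate a trailing newline
    const row = JSON.parse(line) as { content?: unknown };
    if (!Array.isArray(row.content)) continue;
    for (const block of row.content as Array<{ type?: string; result?: unknown }>) {
      if (block?.type !== "tool_use") continue;
      counts.toolUses += 1;
      if (block.result !== undefined) counts.joinedResults += 1;
    }
  }
  return counts;
}

// Illustrative two-row transcript: one bare tool_use, one with a joined result.
const sampleJsonl = [
  JSON.stringify({ v: 1, agent: "pi-cli", type: "assistant", content: [{ type: "tool_use", name: "read", input: {} }] }),
  JSON.stringify({ v: 1, agent: "pi-cli", type: "assistant", content: [{ type: "tool_use", name: "shell", input: {}, result: { status: "success" } }] }),
].join("\n");

const sampleCounts = countJoinedResults(sampleJsonl);
```

A transcript like the live Azure Pi run described above would report seven tool uses and zero joined results under this count.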

Provider results with local OpenAI endpoint (http://127.0.0.1:10531/v1) where applicable:

  • pi-cli: command exited 0, but scored 0% and produced no substantive assistant/tool activity against the local endpoint.
  • codex-sdk: command exited 2, HTTP 401 on /v1/responses, token_expired.
  • copilot-sdk: command exited 2, HTTP 401 authentication failure against the local endpoint.

Requested external evals:

  • EntityProcess/next-evals-oss-agentv eval agentv/gpt-5.3-codex-xhigh-agents-md.eval.yaml, test agent-000-app-router-migration-simple: ran with local pi-cli, exited 0, scored 0%, no substantive assistant/tool activity.
  • WiseTechGlobal/WTG.AI.Prompts eval evals/cargowise/database/data-transformation-pr50857-e2e.eval.yaml, test pr50857-online-chunking-review: infeasible before provider execution because AgentV rejected workspace.repos[].source with the error "workspace.repos[].source has been removed. Use workspace.repos[].repo."

Local Dashboard registration coverage:

  • Updated local ~/.agentv/config.yaml only, not committed.
  • Registered/verified agentv, next-evals-oss-agentv, wtg-ai-prompts, swe-bench, and temporary av-wrf-tool-eval so Dashboard could discover the public checkout, requested downstream eval repos, SWE-bench reference, and compact tool-capture dogfood project.

@christso

Collaborator Author

Dogfood evidence refreshed and pushed to private branch EntityProcess/agentv-private:evidence/av-wrf-transcript-artifacts at 39dc188e9d349d4142c084e5d6661102c31f15ff.

Scope covered:

  • PR head: 4c0cb6fc6605764fe3ddae6d18edd3b03b57ad77.
  • Local ~/.agentv/config.yaml project registration: agentv, next-evals-oss-agentv, wtg-ai-prompts, swe-bench, and temporary av-wrf-tool-eval; rationale recorded in config/agentv-project-registration-summary.txt.
  • Local endpoint provider attempts on the compact repo-edit eval:
    • pi-cli: exits 0 but returns an empty assistant turn; one normalized line, zero tool calls/results.
    • codex-sdk: HTTP 401 token_expired from http://127.0.0.1:10531/v1/responses.
    • copilot-sdk: HTTP 401 authentication failure against http://127.0.0.1:10531/v1.
  • Requested eval coverage:
    • next-evals-oss-agentv eval ran with local pi-cli, scored 0%, and produced no substantive tool activity.
    • WTG.AI.Prompts eval is blocked before provider execution because the eval still uses removed workspace.repos[].source; current AgentV requires workspace.repos[].repo.
  • agent-browser screenshots are included under screenshots/agent-browser-b642/, including transcript-expanded-drawer-b642.png.

Important finding:

  • Current-head 4c0cb6fc live Pi reruns through Azure/local endpoint did not produce a fresh substantive tool transcript.
  • The earlier substantive Azure Pi run from this dogfood pass (b642fe39) produced seven normalized tool_use blocks but zero tool_use.result objects. The raw stream had tool-result boundaries but no result payload bytes, so the Dashboard renders tool IDs/arguments but no result/status sections.
  • Separately, 4c0cb6fc has fixture-backed Pi event join coverage for raw streams that do contain tool_execution_end.result payloads.

Verdict: evidence is complete, but this is not a full live pass for joined tool results. It validates artifact emission, dashboard rendering of the normalized transcript surface, local provider blockers, and the raw-stream limitation found in live Pi evidence.

@christso left a comment

Collaborator Author

Findings:

  1. P1 apps/cli/src/commands/results/manifest.ts:227 - loadManifestResults() still hydrates traces by reading resolveTranscriptPath(record), and resolveTranscriptPath() now prioritizes record.transcript_path. After this PR, that path points at the new normalized { v, agent, type, content } rows, but hydration still calls traceFromTranscriptJsonLines() for the legacy agentv.transcript.v1 row shape. The exception is swallowed and the loader falls back to answer.md, dropping user turns and tool calls for any newly written run that is later consumed through results export, projection bundle generation, results report, results shared, retry-errors, or inspect helpers. I reproduced this at PR head with writeArtifactsFromResults() followed by loadManifestResults(): the written transcript.jsonl contained a user row and a joined tool_use.result, while the loaded trace became only [{ role: "assistant", content: "done" }] and trace.toolCalls was {}. Please either teach manifest hydration to parse the normalized transcript contract, or deliberately hydrate replay traces from transcript_raw_path/legacy rows while keeping Dashboard on normalized transcript_path.

Schema simplification assessment:

The raw/normalized/metrics split, explicit transcript_path/transcript_raw_path/metrics_path fields, and joined tool_use.result blocks match ADR 0008. I do not see a justified pre-merge simplification that removes behavior: the normalized TypeScript row/block types and trace-envelope converter are doing real boundary work, and optional model/raw_refs are not forced into emitted rows. The one simplification I would avoid is adding another event/index schema to fix the finding above; the smallest fix is a normalized-transcript-to-Trace reader or a clear fallback to the raw/legacy transcript for replay-only consumers.

Dogfood and residual risk:

I inspected the public dogfood comment and private evidence branch EntityProcess/agentv-private:evidence/av-wrf-transcript-artifacts at 39dc188e9d349d4142c084e5d6661102c31f15ff. Evidence is complete for artifact emission, Dashboard rendering of normalized rows, path fields, metrics separation, and provider blocker documentation. It is still not a full live pass for joined Pi tool results: current-head Pi attempts were empty, and the earlier substantive b642 Azure Pi run had seven tool_use blocks but zero tool_use.result objects because the raw stream lacked result payload bytes.

Verification:

  • GitHub CI for 4c0cb6fc6605764fe3ddae6d18edd3b03b57ad77 is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded.
  • Local: bun install passed.
  • Local: bun run build passed.
  • Local focused tests passed after build: bun test packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts (104 pass).

Verdict: request changes. The artifact writer, Dashboard parser, metrics separation, and Pi fixture join look good, but the manifest rehydration regression breaks normalized transcript replay/export consumers before merge.

Note: GitHub would not allow this account to submit a formal REQUEST_CHANGES review on its own PR, so this is posted as a COMMENT review with the same blocking verdict.

@christso

Collaborator Author

Fixed the manifest hydration compatibility issue in c9837e4.

Summary:

  • loadManifestResults() no longer parses transcript_path, transcript_raw_path, or legacy artifact_pointers.transcript with the legacy TranscriptJsonLine parser.
  • trace hydration now uses explicit trace_path when it points to a valid AgentV trace envelope; otherwise it intentionally falls back to the existing minimal trace from input/output/summary.
  • Legacy raw transcript hydration is intentionally hard-deprecated per product direction. transcript_raw_path remains raw evidence only and is not a replay source.

Regression coverage:

  • normalized transcript_path + trace_path hydrates tool evidence from trace.json.
  • normalized-only transcript_path does not crash and falls back to minimal trace.
  • transcript_raw_path and legacy transcript artifact pointers do not hydrate replay traces.
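
The hydration rule covered by these regressions can be summarized in a short sketch. It is not the loadManifestResults() code: the ManifestRecord and Trace types, the injected readEnvelope callback, and the source tag are hypothetical, and the minimal trace here uses only the output, whereas the real fallback draws on input, output, and summary.

```typescript
// Hypothetical record and trace shapes for illustration only.
interface ManifestRecord {
  trace_path?: string;
  transcript_path?: string;
  transcript_raw_path?: string;
  output?: string;
}

interface Trace {
  messages: Array<{ role: string; content: string }>;
  toolCalls: Record<string, unknown>;
  source: "trace_envelope" | "minimal";
}

// Stand-in for reading and validating a trace envelope file;
// it returns undefined when the file is not a valid envelope.
type EnvelopeReader = (path: string) => Pick<Trace, "messages" | "toolCalls"> | undefined;

function hydrateTraceSketch(record: ManifestRecord, readEnvelope: EnvelopeReader): Trace {
  if (record.trace_path) {
    const envelope = readEnvelope(record.trace_path);
    if (envelope) return { ...envelope, source: "trace_envelope" };
  }
  // transcript_path and transcript_raw_path are deliberately never read here:
  // the normalized transcript is for display/query and the raw file is evidence only.
  const messages = record.output !== undefined
    ? [{ role: "assistant", content: record.output }]
    : [];
  return { messages, toolCalls: {}, source: "minimal" };
}
```

Passing the reader in as a callback keeps the decision logic testable without touching the filesystem.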

Verification:

  • bun install
  • bun --filter @agentv/core build && bun --filter @agentv/sdk build
  • bun test apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts
  • bun --filter agentv typecheck
  • bunx biome check apps/cli/src/commands/results/manifest.ts apps/cli/test/commands/results/shared.test.ts

@christso left a comment

Collaborator Author

Findings: none blocking.

Post-fix review notes:

  • Confirmed PR head is c9837e4356f6f2f294e00904af8fa40be3f70fa9 and the PR is open/mergeable.
  • CI is green at that head: Build, Typecheck, Lint, Test, Check Links, Validate Evals, Validate Marketplace, and Cloudflare Pages all passed.
  • Inspected c9837e43, especially apps/cli/src/commands/results/manifest.ts and apps/cli/test/commands/results/shared.test.ts. loadManifestResults() no longer parses transcript_path, artifact_pointers.transcript, or transcript_raw_path through legacy traceFromTranscriptJsonLines; it now hydrates trace evidence only from explicit trace_path trace envelopes when available, otherwise falls back to minimal input/output/summary trace projection.
  • Focused local verification passed: bun install, bun --filter @agentv/core build && bun --filter @agentv/sdk build, and bun test apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts (95 pass, 0 fail).

Schema simplification: the hard deprecation is sufficient for this PR. transcript.jsonl remains a normalized display/query transcript, transcript-raw.jsonl remains raw evidence, and manifest trace hydration is now tied to the explicit trace_path envelope. I would not require further pre-merge simplification beyond possible naming cleanup later, such as renaming the now-stale hydrateTranscriptTrace option/comment, because that is internal wording and does not preserve the legacy replay behavior.

Residual risk: dogfood evidence remains accepted as previously posted, but the joined Pi tool-result path still lacks a full live current-head pass because live current-head Pi attempts were empty and the earlier b642 raw stream lacked result payloads.

@christso merged commit 66c6024 into main on Jun 26, 2026 (8 checks passed)
@christso deleted the feat/av-wrf-transcript-artifacts branch on June 26, 2026 04:55

@christso left a comment

Collaborator Author

Findings: none blocking.

Final review against the hard-deprecation contract:

  • loadManifestResults() no longer parses transcript_path, transcript_raw_path, or artifact_pointers.transcript with the legacy TranscriptJsonLine replay reader.
  • Explicit trace_path is now the only rich replay source in manifest loading, and it is read as an AgentV trace envelope via fromTraceEnvelopeWire().
  • Without a valid trace_path, manifest loading intentionally falls back to the minimal input/output/summary trace.
  • transcript.jsonl remains the normalized display/query artifact, transcript-raw.jsonl remains raw provider/harness evidence only, and metrics.json remains the behavior/o11y summary artifact.

Schema simplification assessment:

The c9837e4 fix is the right simplification for the updated product direction. It removes the compatibility path instead of adding a normalized-transcript shim or raw-transcript replay fallback. I do not see a further pre-merge simplification that preserves the requested behavior; the remaining normalized row/block types and converter are doing the artifact-boundary work from ADR 0008.

Verification:

  • PR head reviewed: c9837e4356f6f2f294e00904af8fa40be3f70fa9.
  • GitHub CI is green at this head: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages succeeded.
  • Local focused tests passed: bun test apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx (107 pass).
  • Local direct round-trip check: normalized-only transcript_path falls back to minimal answer trace; adding explicit trace_path preserves tool-call evidence from the trace envelope.
  • Local bun --filter agentv typecheck passed.
  • Local bunx biome check apps/cli/src/commands/results/manifest.ts apps/cli/test/commands/results/shared.test.ts passed.

Dogfood status:

I carried forward the prior evidence review for EntityProcess/agentv-private:evidence/av-wrf-transcript-artifacts at 39dc188e9d349d4142c084e5d6661102c31f15ff. Evidence is complete for artifact emission, Dashboard rendering of normalized rows, explicit path fields, metrics separation, and provider blocker documentation. Residual risk remains: it is not a full live pass for joined Pi tool results because current-head Pi attempts were empty and the substantive b642 stream lacked result payload bytes. Fixture-backed Pi coverage for raw streams containing tool_execution_end.result payloads is present.

Verdict: ready from final review. GitHub does not allow this account to submit a formal approval on its own PR, so this is a COMMENT review with the final ready verdict.

@christso

Collaborator Author

Follow-up for Bead av-wrf.1 review fixes is in #1523 because this PR was already merged before the final metrics fix could become part of its recorded head.

Completed:

  • Pushed feat/av-wrf-transcript-artifacts with 43035379, but GitHub kept this merged PR's head at c9837e43.
  • Opened #1523 (fix(results): omit missing trace sidecar from metrics) from a clean origin/main branch with the one-line fix: metrics.json no longer advertises source_artifacts.trace_path unless a trace sidecar path is explicitly provided.
  • Verified the other requested fixes are already present from the merged implementation: normalizedToolResult emits tool_use.result when status/duration/output exists, and .agents/verification.md documents transcript.jsonl, transcript-raw.jsonl, and per-run metrics.json.

Verification on #1523 branch:

  • bun --filter @agentv/core build
  • bun test apps/cli/test/commands/eval/artifact-writer.test.ts packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts apps/cli/test/commands/results/shared.test.ts (73 pass)
  • bun run lint
  • bun run typecheck
