feat(results): write normalized transcript artifacts#1521
Conversation
Deploying agentv with
|
| Latest commit: |
c9837e4
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3bade894.agentv.pages.dev |
| Branch Preview URL: | https://feat-av-wrf-transcript-artif.agentv.pages.dev |
|
Findings:
Schema simplification assessment: The raw/normalized/metrics split, The main simplification pressure is to avoid growing a second event/index layer. This PR does not add one. Dashboard does convert normalized rows into its existing timeline row shape for compatibility; that is acceptable as a transition, but the Verification run locally on PR head
CI is green on PR #1521, but review verdict is request changes because findings #1 and #4 are blocking for artifact correctness and release evidence. |
|
Addressed the concrete review findings in
Verification run locally:
Pushed to |
|
Implemented the Pi transcript join fix in What I found:
What changed:
Verification:
Pushed to |
|
Dogfood evidence for Bead
Verdict: FAIL for live joined-result contract evidence. What passed:
Blocking finding:
Provider results with local OpenAI endpoint (
Requested external evals:
Local Dashboard registration coverage:
|
|
Dogfood evidence refreshed and pushed to private branch Scope covered:
Important finding:
Verdict: evidence is complete, but this is not a full live pass for joined tool results. It validates artifact emission, dashboard rendering of the normalized transcript surface, local provider blockers, and the raw-stream limitation found in live Pi evidence. |
christso
left a comment
There was a problem hiding this comment.
Findings:
- P1
apps/cli/src/commands/results/manifest.ts:227-loadManifestResults()still hydrates traces by readingresolveTranscriptPath(record), andresolveTranscriptPath()now prioritizesrecord.transcript_path. After this PR, that path points at the new normalized{ v, agent, type, content }rows, but hydration still callstraceFromTranscriptJsonLines()for the legacyagentv.transcript.v1row shape. The exception is swallowed and the loader falls back toanswer.md, dropping user turns and tool calls for any newly written run that is later consumed throughresults export, projection bundle generation,results report,results shared, retry-errors, or inspect helpers. I reproduced this at PR head withwriteArtifactsFromResults()followed byloadManifestResults(): the writtentranscript.jsonlcontained a user row and a joinedtool_use.result, while the loaded trace became only[{ role: "assistant", content: "done" }]andtrace.toolCallswas{}. Please either teach manifest hydration to parse the normalized transcript contract, or deliberately hydrate replay traces fromtranscript_raw_path/legacy rows while keeping Dashboard on normalizedtranscript_path.
Schema simplification assessment:
The raw/normalized/metrics split, explicit transcript_path/transcript_raw_path/metrics_path fields, and joined tool_use.result blocks match ADR 0008. I do not see a justified pre-merge simplification that removes behavior: the normalized TypeScript row/block types and trace-envelope converter are doing real boundary work, and optional model/raw_refs are not forced into emitted rows. The one simplification I would avoid is adding another event/index schema to fix the finding above; the smallest fix is a normalized-transcript-to-Trace reader or a clear fallback to the raw/legacy transcript for replay-only consumers.
Dogfood and residual risk:
I inspected the public dogfood comment and private evidence branch EntityProcess/agentv-private:evidence/av-wrf-transcript-artifacts at 39dc188e9d349d4142c084e5d6661102c31f15ff. Evidence is complete for artifact emission, Dashboard rendering of normalized rows, path fields, metrics separation, and provider blocker documentation. It is still not a full live pass for joined Pi tool results: current-head Pi attempts were empty, and the earlier substantive b642 Azure Pi run had seven tool_use blocks but zero tool_use.result objects because the raw stream lacked result payload bytes.
Verification:
- GitHub CI for
4c0cb6fc6605764fe3ddae6d18edd3b03b57ad77is green: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages all succeeded. - Local:
bun installpassed. - Local:
bun run buildpassed. - Local focused tests passed after build:
bun test packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts(104 pass).
Verdict: request changes. The artifact writer, Dashboard parser, metrics separation, and Pi fixture join look good, but the manifest rehydration regression breaks normalized transcript replay/export consumers before merge.
Note: GitHub would not allow this account to submit a formal REQUEST_CHANGES review on its own PR, so this is posted as a COMMENT review with the same blocking verdict.
|
Fixed the manifest hydration compatibility issue in c9837e4. Summary:
Regression coverage:
Verification:
|
christso
left a comment
There was a problem hiding this comment.
Findings: none blocking.
Post-fix review notes:
- Confirmed PR head is
c9837e4356f6f2f294e00904af8fa40be3f70fa9and the PR is open/mergeable. - CI is green at that head: Build, Typecheck, Lint, Test, Check Links, Validate Evals, Validate Marketplace, and Cloudflare Pages all passed.
- Inspected
c9837e43, especiallyapps/cli/src/commands/results/manifest.tsandapps/cli/test/commands/results/shared.test.ts.loadManifestResults()no longer parsestranscript_path,artifact_pointers.transcript, ortranscript_raw_paththrough legacytraceFromTranscriptJsonLines; it now hydrates trace evidence only from explicittrace_pathtrace envelopes when available, otherwise falls back to minimal input/output/summary trace projection. - Focused local verification passed:
bun install,bun --filter @agentv/core build && bun --filter @agentv/sdk build, andbun test apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts(95 pass, 0 fail).
Schema simplification: the hard deprecation is sufficient for this PR. transcript.jsonl remains a normalized display/query transcript, transcript-raw.jsonl remains raw evidence, and manifest trace hydration is now tied to the explicit trace_path envelope. I would not require further pre-merge simplification beyond possible naming cleanup later, such as renaming the now-stale hydrateTranscriptTrace option/comment, because that is internal wording and does not preserve the legacy replay behavior.
Residual risk: dogfood evidence remains accepted as previously posted, but the joined Pi tool-result path still lacks a full live current-head pass because live current-head Pi attempts were empty and the earlier b642 raw stream lacked result payloads.
christso
left a comment
There was a problem hiding this comment.
Findings: none blocking.
Final review against the hard-deprecation contract:
loadManifestResults()no longer parsestranscript_path,transcript_raw_path, orartifact_pointers.transcriptwith the legacyTranscriptJsonLinereplay reader.- Explicit
trace_pathis now the only rich replay source in manifest loading, and it is read as an AgentV trace envelope viafromTraceEnvelopeWire(). - Without a valid
trace_path, manifest loading intentionally falls back to the minimal input/output/summary trace. transcript.jsonlremains the normalized display/query artifact,transcript-raw.jsonlremains raw provider/harness evidence only, andmetrics.jsonremains the behavior/o11y summary artifact.
Schema simplification assessment:
The c9837e4 fix is the right simplification for the updated product direction. It removes the compatibility path instead of adding a normalized-transcript shim or raw-transcript replay fallback. I do not see a further pre-merge simplification that preserves the requested behavior; the remaining normalized row/block types and converter are doing the artifact-boundary work from ADR 0008.
Verification:
- PR head reviewed:
c9837e4356f6f2f294e00904af8fa40be3f70fa9. - GitHub CI is green at this head: Build, Typecheck, Lint, Test, Check Links, Validate Marketplace, Validate Evals, and Cloudflare Pages succeeded.
- Local focused tests passed:
bun test apps/cli/test/commands/results/shared.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.ts packages/core/test/evaluation/providers/pi-cli-tool-extraction.test.ts apps/dashboard/src/components/transcript-timeline.test.tsx(107 pass). - Local direct round-trip check: normalized-only
transcript_pathfalls back to minimal answer trace; adding explicittrace_pathpreserves tool-call evidence from the trace envelope. - Local
bun --filter agentv typecheckpassed. - Local
bunx biome check apps/cli/src/commands/results/manifest.ts apps/cli/test/commands/results/shared.test.tspassed.
Dogfood status:
I carried forward the prior evidence review for EntityProcess/agentv-private:evidence/av-wrf-transcript-artifacts at 39dc188e9d349d4142c084e5d6661102c31f15ff. Evidence is complete for artifact emission, Dashboard rendering of normalized rows, explicit path fields, metrics separation, and provider blocker documentation. Residual risk remains: it is not a full live pass for joined Pi tool results because current-head Pi attempts were empty and the substantive b642 stream lacked result payload bytes. Fixture-backed Pi coverage for raw streams containing tool_execution_end.result payloads is present.
Verdict: ready from final review. GitHub does not allow this account to submit a formal approval on its own PR, so this is a COMMENT review with the final ready verdict.
|
Follow-up for Bead av-wrf.1 review fixes is in #1523 because this PR was already merged before the final metrics fix could become part of its recorded head. Completed:
Verification on #1523 branch:
|
Summary
AgentV run bundles now expose the ADR 0008 raw-plus-normalized transcript split:
transcript.jsonlis the portable conversation transcript andtranscript-raw.jsonlpreserves native provider evidence when available. Result indexes, per-attempt manifests, projection bundles, combine/export paths, and Dashboard read models now carry explicittranscript_path,transcript_raw_path, andmetrics_pathfields instead of using the raw sidecar as the transcript discovery path.Normalized transcript rows use the compact turn shape from
docs/adr/0008-normalized-transcript-artifact-contract.md, including joinedtool_use.resultblocks for completed tool calls. Derived behavior summaries remain inmetrics.json; the Dashboard transcript tab renders normalized text, tool names, inputs, statuses, durations, and results while keeping older transcript rows readable for compatibility.Bead:
av-wrf.1Design source:
docs/adr/0008-normalized-transcript-artifact-contract.mdfrom ADR PR #1520.Verification
bun run lintbun run typecheckbun run buildbun test apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/results/export.test.tsbun test packages/core/test/evaluation/orchestrator.test.ts apps/dashboard/src/components/transcript-timeline.test.tsxbun test apps/cli/test/eval.integration.test.ts apps/cli/test/commands/results/serve.test.tshttp://localhost:3127: verified normalized transcript text plus joined tool call input, success status, duration, and result render in the Transcript tab. Evidence screenshot was saved outside the public repo under/tmp/agentv-transcript-uat-evidence/and is not committed.Notes
Post-Deploy Monitoring & Validation
transcript.jsonl,transcript-raw.jsonl,transcript_path,transcript_raw_path, andLine 1 is not a transcript JSONL row.run-N/transcript.jsonl, rawrun-N/transcript-raw.jsonl, explicit path fields inindex.jsonl/result.json, and Dashboard transcript views load without parse errors.transcript_pathto point attranscript-raw.jsonl.av-wrf.2; owner: AgentV maintainers/coordinator.