Merged
3 changes: 0 additions & 3 deletions .github/workflows/ci.yml
@@ -143,8 +143,5 @@ jobs:
- name: Check evals directories have eval files
run: bun scripts/validate-eval-dirs.ts

-- name: Run Phoenix adapter dry-run smoke
-run: bun run phoenix:assert-smoke
-
- name: Validate eval schemas
run: bun apps/cli/dist/cli.js validate 'examples/features/**/evals/**/*.eval.yaml' 'examples/features/**/*.EVAL.yaml'
137 changes: 0 additions & 137 deletions bun.lock

Large diffs are not rendered by default.

13 changes: 9 additions & 4 deletions docs/adr/2026-06-11-phoenix-observability-adapter.md
@@ -20,7 +20,7 @@ Relevant existing seams already point in this direction:

- Provider and grader registries support narrow registration points.
- `.agentv/providers/`, `.agentv/assertions/`, and `.agentv/graders/` use convention-based local discovery instead of a broad plugin host.
-- Earlier `packages/phoenix-adapter/` experiments kept Phoenix-specific behavior outside core and reported unsupported mappings explicitly. That experiment is not the supported product path for AgentV completed runs or transcripts.
+- Earlier Phoenix adapter experiments kept Phoenix-specific behavior outside core and reported unsupported mappings explicitly. Those experiments are not the supported product path for AgentV completed runs or transcripts.
- The trace evaluation plan requires generic OTLP/OpenInference mapping without Phoenix-specific assumptions in core.

## Decision
@@ -35,7 +35,10 @@ AgentV core should own:
- generic OTLP/OpenInference import/export mapping where it is backend-neutral;
- small registry/discovery primitives for extension points.

-Phoenix integration should live outside core behind an adapter boundary, currently `packages/phoenix-adapter/`. The first implementation does not need package loading or package naming; a local resolver module is enough. The adapter boundary may expose:
+Phoenix integration should live outside core behind a narrow local adapter or
+resolver boundary when needed. No maintained workspace package currently owns
+that boundary. The first implementation does not need package loading or package
+naming; a local resolver module is enough. Such a custom boundary may expose:

- a Phoenix OTel backend resolver;
- Phoenix/OpenInference span-kind mapping;
@@ -72,7 +75,9 @@ Registration/discovery should remain boring and local-first. In this ADR, "plugi
- keep `execution.otel_backend: <name>` and `--otel-backend <name>` as the user-facing selectors;
- do not add package names, package auto-installation, a remote marketplace, trust prompts, or a general-purpose plugin host for this need.

-The earlier prototype exposed a resolver, for example `phoenixOtelBackend`, so users could opt in from project config or a local `.agentv/otel-backends/phoenix.mjs` file. Treat that as a custom/legacy path, not as the supported AgentV-to-Phoenix product boundary.
+The earlier prototype exposed a resolver so users could opt in from project config
+or a local `.agentv/otel-backends/phoenix.mjs` file. Treat that as a
+custom/legacy path, not as the supported AgentV-to-Phoenix product boundary.

## Migration path for Phoenix

@@ -81,7 +86,7 @@ The earlier prototype exposed a resolver, for example `phoenixOtelBackend`, so u
- `OTEL_EXPORTER_OTLP_HEADERS`
- `--otel-file` for offline OTLP JSON export
2. Add a tiny backend resolver seam only if ergonomic backend names are needed.
-3. Implement Phoenix endpoint/header/project routing in the Phoenix adapter boundary, not in core.
+3. Keep any custom Phoenix endpoint/header/project routing outside core and outside the supported AgentV artifact path.
4. Keep Phoenix out of Dashboard runtime fetch paths; use safe external links instead.
5. Consider moving existing vendor-specific core presets to the same resolver model later, but do not couple that cleanup to the Phoenix decision unless the implementation already touches the preset registry.
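
Reviewer note: the "tiny backend resolver seam" in step 2 could be as small as the sketch below. This is an illustration under assumptions, not code from this PR or from AgentV. Every type and function name here (`OtelBackendConfig`, `registerOtelBackend`, `resolveOtelBackend`, `parseOtlpHeaders`) is hypothetical. Only the `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` environment variables and the `http://localhost:4318` OTLP/HTTP default come from the OpenTelemetry conventions.

```typescript
// Hypothetical shapes; none of these names are taken from AgentV.
interface OtelBackendConfig {
  endpoint: string;
  headers: Record<string, string>;
}

type Env = Record<string, string | undefined>;
type OtelBackendResolver = (env: Env) => OtelBackendConfig;

// Parse the OTEL_EXPORTER_OTLP_HEADERS format: "key1=value1,key2=value2".
function parseOtlpHeaders(raw: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (!raw) return headers;
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip malformed entries with no key
    headers[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
  }
  return headers;
}

// A plain in-memory map: no package loading, no plugin host.
const resolvers = new Map<string, OtelBackendResolver>();

function registerOtelBackend(name: string, resolver: OtelBackendResolver): void {
  resolvers.set(name, resolver);
}

function resolveOtelBackend(name: string, env: Env): OtelBackendConfig {
  const resolver = resolvers.get(name);
  if (!resolver) {
    throw new Error(`Unknown OTel backend: ${name}`);
  }
  return resolver(env);
}

// Backend-neutral default driven only by the standard OTLP variables.
registerOtelBackend("otlp", (env) => ({
  endpoint: env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318",
  headers: parseOtlpHeaders(env.OTEL_EXPORTER_OTLP_HEADERS),
}));
```

In this sketch a vendor-specific resolver would be registered from a local module outside core, which keeps the name-to-config lookup the only thing core has to know about.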

@@ -43,7 +43,7 @@ Use these as design references, not as feature mandates:
- Margin and Terminal-Bench: filesystem-native benchmark packaging, conventional task files, setup scripts, scoring scripts, and immutable artifacts. AgentV should document and template this shape instead of adding `workspace`, `oracle`, `variants`, or `expected_artifacts` as broad core fields.
- Pi coding agent: skills and extensions separate agent-facing procedural guidance from runtime code. Its docs show skills as portable `SKILL.md` directories with scripts/assets, and extensions as typed runtime hooks. AgentV should copy the progressive-disclosure authoring pattern for eval builders.
- Composio Agent Orchestrator: swappable TypeScript plugin interfaces for narrow responsibilities. Its plugin-slot model is useful as a boundary pattern, but AgentV should avoid a general orchestrator plugin host until concrete runtime extension gaps appear.
-- Phoenix: official TypeScript packages (`@arizeai/phoenix-client`, `@arizeai/phoenix-evals`, `@arizeai/phoenix-otel`) make it a good private export/conversion target for result and trace integration.
+- Phoenix: official TypeScript packages (`@arizeai/phoenix-client`, `@arizeai/phoenix-evals`, `@arizeai/phoenix-otel`) remain useful peer-framework research material, but the 2026-06-20 product boundary supersedes AgentV-to-Phoenix result or trace export.
- promptfoo: Node package and JavaScript assertion/provider hooks make it a good private conversion target, especially for YAML matrix configs and JS assertion migration.
- Braintrust: TypeScript SDK and `Eval(data, task, scores)` model make it a good private conversion target for dataset/task/score loops, experiment metadata, trial counts, and hosted result upload.

@@ -113,7 +113,7 @@ Source-backed findings from the initial code analysis:
- promptfoo can mirror simple AgentV rubric examples with `llm-rubric` and script assertions. AgentV `tool-trajectory` is the largest parity gap because promptfoo trace/trajectory assertions depend on promptfoo trace conventions rather than AgentV `Message[].toolCalls`; a custom provider/metadata adapter is required.
- Braintrust TypeScript `Eval(name, { data, task, scores })` maps cleanly to AgentV's case/task/score model. The lossy point is that AgentV rich assertion arrays with evidence/verdict/type become Braintrust score metadata unless a deeper adapter is built.
- Phoenix TypeScript is split across dataset creation, experiment running, evaluators, and OTel. It is strong for persisted datasets/experiments and traces, but less direct for local YAML wrapping because normal `runExperiment` flow expects a Phoenix dataset/server round trip.
-- AgentV already has a Phoenix adapter package, but its support matrix is intentionally narrow and deterministic. Private experiments should use that as evidence, not widen public scope prematurely.
+- AgentV previously carried a private Phoenix adapter experiment with an intentionally narrow and deterministic support matrix. Treat that as historical evidence, not a reason to widen public scope.

Workspace/container findings from Terminal-Bench, Harbor, and Margin:

@@ -139,7 +139,10 @@ Extend the existing `agentv create` scaffolding into reusable templates:
- `agentv create eval --template terminal-task`
- `agentv create eval --template promptfoo-adapter`
- `agentv create eval --template braintrust-export`
-- `agentv create eval --template phoenix-export`
+
+Do not add a Phoenix export template. The later Phoenix read-only correlation
+boundary supersedes AgentV-to-Phoenix dataset, experiment, result, or trace
+export templates.

The first implementation can stay static and local, similar to the current `EVAL_TEMPLATES` object in `apps/cli/src/commands/create/commands.ts`. Do not introduce remote template registries, package installation, trust prompts, or plugin loading yet.
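
Reviewer note: a static, local template table of the kind described above might look like the following sketch. It is an assumption for illustration only. The real `EVAL_TEMPLATES` object may have a different shape, and `EvalTemplateSketch`, `TEMPLATE_SKETCHES`, `renderTemplate`, the `{{name}}` placeholder, and the file contents are all hypothetical.

```typescript
// Hypothetical shape for illustration; the real EVAL_TEMPLATES object may differ.
interface EvalTemplateSketch {
  description: string;
  // Relative file path -> file contents, with {{name}} as a placeholder.
  files: Record<string, string>;
}

// Templates are plain data compiled into the CLI: no registry, no install step.
const TEMPLATE_SKETCHES: Record<string, EvalTemplateSketch> = {
  "terminal-task": {
    description: "Filesystem-native task with a setup script",
    files: {
      "{{name}}/setup.sh": '#!/usr/bin/env bash\necho "setting up {{name}}"\n',
      "{{name}}/README.md": "# {{name}}\n",
    },
  },
};

// Resolve a template by name and substitute the eval name into paths and contents.
function renderTemplate(templateName: string, evalName: string): Record<string, string> {
  const template = TEMPLATE_SKETCHES[templateName];
  if (!template) {
    throw new Error(`Unknown template: ${templateName}`);
  }
  const fill = (text: string): string => text.split("{{name}}").join(evalName);
  const rendered: Record<string, string> = {};
  for (const [path, contents] of Object.entries(template.files)) {
    rendered[fill(path)] = fill(contents);
  }
  return rendered;
}
```

Because the table is a literal object, a template that is deliberately not offered (such as a Phoenix export template) simply has no entry, and asking for it fails with an unknown-template error.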

@@ -185,7 +188,9 @@ Likely docs locations:

Add private examples, not core adapters, for:

-- Phoenix: export AgentV results/traces into Phoenix using the TS packages.
+- Phoenix: compare peer-framework DX around independently emitted traces and
+safe `external_trace` link-out metadata. Do not export AgentV-owned results,
+traces, transcripts, datasets, experiments, or indexes into Phoenix.
- promptfoo: convert promptfoo-style YAML or JS assertions into ordinary AgentV evals/assertions where feasible.
- Braintrust: export AgentV cases/results into Braintrust's TypeScript `Eval(data, task, scores)` shape.

@@ -257,7 +262,7 @@ framework-parity/
run-phoenix.ts
```

-This subtree should be clearly marked private/internal and should not be mirrored into public AgentV examples until findings are scrubbed.
+This subtree should be clearly marked private/internal and should not be mirrored into public AgentV examples until findings are scrubbed. Any Phoenix files in this historical peer-framework research tree must stay outside the supported AgentV product path and must not become AgentV-to-Phoenix artifact export guidance.

Initial reference evals to consider:

@@ -390,7 +395,7 @@ For private conversion work:

- Which AgentV evals should be mirrored first: one simple text/rubric eval plus one workspace/tool-trajectory eval, or only WTG-relevant prompt evals?
- Should promptfoo import/export be a CLI command later, or stay as documented conversion scripts until demand is proven?
-- Should Phoenix/Braintrust integrations be examples only, or wrappers that consume AgentV JSONL output?
+- Should Braintrust integrations be examples only, or wrappers that consume AgentV JSONL output? Phoenix work is superseded by the read-only external-trace correlation boundary.

## Decision

@@ -404,7 +409,7 @@ Proceed as a plan, not a brainstorm, because the product question is now concret
- `av-r0s.5.6` - analysis(private): compare peer native ports against AgentV
- `av-r0s.5.8` - design(private): minimal AgentV workspace/container primitive
- `av-r0s.5.1` - tooling(private): extract promptfoo exporter requirements after hand ports
-- `av-r0s.5.2` - tooling(private): prototype Braintrust and Phoenix replay adapters
+- `av-r0s.5.2` - tooling(private): prototype Braintrust replay adapters and historical Phoenix peer-framework research only
- `av-r0s.5.3` - docs(agentv): decide sanitized promotion path from private parity experiments
- `av-r0s.5.4` - closed as superseded by source-specific hand-port beads
- `av-w9p` - closed as superseded by `av-r0s.1`
15 changes: 9 additions & 6 deletions docs/plans/2026-06-21-001-feat-av-quf-results-storage-plan.md
@@ -17,9 +17,11 @@ publication export, an append-only mutable-operation log, and an S3-compatible
object-storage tier.

The canonical AgentV run artifacts stay `benchmark.json`, `index.jsonl`, per-test
-grading/timing files, `outputs/trace.json`, and derived transcript artifacts. GitHub,
-Backblaze B2, Phoenix, Hugging Face, and Dashboard are projections, viewers, or storage
-backends over those artifacts.
+grading/timing files, `outputs/trace.json`, and derived transcript artifacts.
+GitHub and Backblaze B2 are storage/publication targets over those artifacts.
+Dashboard and Hugging Face are viewers or publication surfaces. Phoenix is only
+a link-out viewer when safe `external_trace` metadata points at independently
+emitted spans; it is not an AgentV artifact projection or storage backend.

---

@@ -58,7 +60,7 @@ without creating another hosted results platform inside AgentV.
- Implementing storage backends, S3, oplog, retention, or export code in this bead.
- Adding GitHub issues or tracker runtime state.
- Creating windowed branches, per-run branches, or a hosted Dashboard replacement.
-- Making Phoenix, Hugging Face, B2, or GitHub the canonical results model.
+- Making Phoenix canonical, making Phoenix an AgentV artifact projection target, or making Hugging Face, B2, or GitHub the canonical results model.

### Deferred to Follow-Up Work

@@ -852,8 +854,9 @@ results:
- [ ] The artifact sidecar is called `artifacts`, not `artifact-blobs` or `blob`.
- [ ] The plan has no windowed or per-run branches.
- [ ] Path sharding is deferred until realistic measurement proves need.
-- [ ] AgentV artifacts remain canonical; Dashboard, Hugging Face, Phoenix, B2, and
-GitHub are projections/viewers/storage backends.
+- [ ] AgentV artifacts remain canonical; Dashboard and Hugging Face are viewers
+or publication surfaces, B2 and GitHub are storage/publication targets, and
+Phoenix is link-out correlation only when safe external trace metadata exists.
- [ ] File/function-level implementation guidance names current result repo, remote,
serve, export, artifact-writer, and Dashboard surfaces.
- [ ] Test plan covers core, CLI, Dashboard, and docs-facing behavior.