EntityProcess · christso · Jun 22, 2026 · Jun 22, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -143,8 +143,5 @@ jobs:
       - name: Check evals directories have eval files
         run: bun scripts/validate-eval-dirs.ts
 
-      - name: Run Phoenix adapter dry-run smoke
-        run: bun run phoenix:assert-smoke
-
       - name: Validate eval schemas
         run: bun apps/cli/dist/cli.js validate 'examples/features/**/evals/**/*.eval.yaml' 'examples/features/**/*.EVAL.yaml'
diff --git a/bun.lock b/bun.lock
diff --git a/docs/adr/2026-06-11-phoenix-observability-adapter.md b/docs/adr/2026-06-11-phoenix-observability-adapter.md
@@ -20,7 +20,7 @@ Relevant existing seams already point in this direction:
 
 - Provider and grader registries support narrow registration points.
 - `.agentv/providers/`, `.agentv/assertions/`, and `.agentv/graders/` use convention-based local discovery instead of a broad plugin host.
-- Earlier `packages/phoenix-adapter/` experiments kept Phoenix-specific behavior outside core and reported unsupported mappings explicitly. That experiment is not the supported product path for AgentV completed runs or transcripts.
+- Earlier Phoenix adapter experiments kept Phoenix-specific behavior outside core and reported unsupported mappings explicitly. Those experiments are not the supported product path for AgentV completed runs or transcripts.
 - The trace evaluation plan requires generic OTLP/OpenInference mapping without Phoenix-specific assumptions in core.
 
 ## Decision
@@ -35,7 +35,10 @@ AgentV core should own:
 - generic OTLP/OpenInference import/export mapping where it is backend-neutral;
 - small registry/discovery primitives for extension points.
 
-Phoenix integration should live outside core behind an adapter boundary, currently `packages/phoenix-adapter/`. The first implementation does not need package loading or package naming; a local resolver module is enough. The adapter boundary may expose:
+Phoenix integration should live outside core behind a narrow local adapter or
+resolver boundary when needed. No maintained workspace package currently owns
+that boundary. The first implementation does not need package loading or package
+naming; a local resolver module is enough. Such a custom boundary may expose:
 
 - a Phoenix OTel backend resolver;
 - Phoenix/OpenInference span-kind mapping;
@@ -72,7 +75,9 @@ Registration/discovery should remain boring and local-first. In this ADR, "plugi
 - keep `execution.otel_backend: <name>` and `--otel-backend <name>` as the user-facing selectors;
 - do not add package names, package auto-installation, a remote marketplace, trust prompts, or a general-purpose plugin host for this need.
 
-The earlier prototype exposed a resolver, for example `phoenixOtelBackend`, so users could opt in from project config or a local `.agentv/otel-backends/phoenix.mjs` file. Treat that as a custom/legacy path, not as the supported AgentV-to-Phoenix product boundary.
+The earlier prototype exposed a resolver so users could opt in from project config
+or a local `.agentv/otel-backends/phoenix.mjs` file. Treat that as a
+custom/legacy path, not as the supported AgentV-to-Phoenix product boundary.
 
 ## Migration path for Phoenix
 
@@ -81,7 +86,7 @@ The earlier prototype exposed a resolver, for example `phoenixOtelBackend`, so u
    - `OTEL_EXPORTER_OTLP_HEADERS`
    - `--otel-file` for offline OTLP JSON export
 2. Add a tiny backend resolver seam only if ergonomic backend names are needed.
-3. Implement Phoenix endpoint/header/project routing in the Phoenix adapter boundary, not in core.
+3. Keep any custom Phoenix endpoint/header/project routing outside core and outside the supported AgentV artifact path.
 4. Keep Phoenix out of Dashboard runtime fetch paths; use safe external links instead.
 5. Consider moving existing vendor-specific core presets to the same resolver model later, but do not couple that cleanup to the Phoenix decision unless the implementation already touches the preset registry.
 

diff --git a/docs/plans/2026-06-06-001-agentv-eval-authoring-extensibility-plan.md b/docs/plans/2026-06-06-001-agentv-eval-authoring-extensibility-plan.md
@@ -43,7 +43,7 @@ Use these as design references, not as feature mandates:
 - Margin and Terminal-Bench: filesystem-native benchmark packaging, conventional task files, setup scripts, scoring scripts, and immutable artifacts. AgentV should document and template this shape instead of adding `workspace`, `oracle`, `variants`, or `expected_artifacts` as broad core fields.
 - Pi coding agent: skills and extensions separate agent-facing procedural guidance from runtime code. Its docs show skills as portable `SKILL.md` directories with scripts/assets, and extensions as typed runtime hooks. AgentV should copy the progressive-disclosure authoring pattern for eval builders.
 - Composio Agent Orchestrator: swappable TypeScript plugin interfaces for narrow responsibilities. Its plugin-slot model is useful as a boundary pattern, but AgentV should avoid a general orchestrator plugin host until concrete runtime extension gaps appear.
-- Phoenix: official TypeScript packages (`@arizeai/phoenix-client`, `@arizeai/phoenix-evals`, `@arizeai/phoenix-otel`) make it a good private export/conversion target for result and trace integration.
+- Phoenix: official TypeScript packages (`@arizeai/phoenix-client`, `@arizeai/phoenix-evals`, `@arizeai/phoenix-otel`) remain useful peer-framework research material, but the 2026-06-20 product boundary supersedes AgentV-to-Phoenix result or trace export.
 - promptfoo: Node package and JavaScript assertion/provider hooks make it a good private conversion target, especially for YAML matrix configs and JS assertion migration.
 - Braintrust: TypeScript SDK and `Eval(data, task, scores)` model make it a good private conversion target for dataset/task/score loops, experiment metadata, trial counts, and hosted result upload.
 
@@ -113,7 +113,7 @@ Source-backed findings from the initial code analysis:
 - promptfoo can mirror simple AgentV rubric examples with `llm-rubric` and script assertions. AgentV `tool-trajectory` is the largest parity gap because promptfoo trace/trajectory assertions depend on promptfoo trace conventions rather than AgentV `Message[].toolCalls`; a custom provider/metadata adapter is required.
 - Braintrust TypeScript `Eval(name, { data, task, scores })` maps cleanly to AgentV's case/task/score model. The lossy point is that AgentV rich assertion arrays with evidence/verdict/type become Braintrust score metadata unless a deeper adapter is built.
 - Phoenix TypeScript is split across dataset creation, experiment running, evaluators, and OTel. It is strong for persisted datasets/experiments and traces, but less direct for local YAML wrapping because normal `runExperiment` flow expects a Phoenix dataset/server round trip.
-- AgentV already has a Phoenix adapter package, but its support matrix is intentionally narrow and deterministic. Private experiments should use that as evidence, not widen public scope prematurely.
+- AgentV previously carried a private Phoenix adapter experiment with an intentionally narrow and deterministic support matrix. Treat that as historical evidence, not a reason to widen public scope.
 
 Workspace/container findings from Terminal-Bench, Harbor, and Margin:
 
@@ -139,7 +139,10 @@ Extend the existing `agentv create` scaffolding into reusable templates:
 - `agentv create eval --template terminal-task`
 - `agentv create eval --template promptfoo-adapter`
 - `agentv create eval --template braintrust-export`
-- `agentv create eval --template phoenix-export`
+
+Do not add a Phoenix export template. The later Phoenix read-only correlation
+boundary supersedes AgentV-to-Phoenix dataset, experiment, result, or trace
+export templates.
 
 The first implementation can stay static and local, similar to the current `EVAL_TEMPLATES` object in `apps/cli/src/commands/create/commands.ts`. Do not introduce remote template registries, package installation, trust prompts, or plugin loading yet.
 
@@ -185,7 +188,9 @@ Likely docs locations:
 
 Add private examples, not core adapters, for:
 
-- Phoenix: export AgentV results/traces into Phoenix using the TS packages.
+- Phoenix: compare peer-framework DX around independently emitted traces and
+  safe `external_trace` link-out metadata. Do not export AgentV-owned results,
+  traces, transcripts, datasets, experiments, or indexes into Phoenix.
 - promptfoo: convert promptfoo-style YAML or JS assertions into ordinary AgentV evals/assertions where feasible.
 - Braintrust: export AgentV cases/results into Braintrust's TypeScript `Eval(data, task, scores)` shape.
 
@@ -257,7 +262,7 @@ framework-parity/
     run-phoenix.ts
 ```
 
-This subtree should be clearly marked private/internal and should not be mirrored into public AgentV examples until findings are scrubbed.
+This subtree should be clearly marked private/internal and should not be mirrored into public AgentV examples until findings are scrubbed. Any Phoenix files in this historical peer-framework research tree must stay outside the supported AgentV product path and must not become AgentV-to-Phoenix artifact export guidance.
 
 Initial reference evals to consider:
 
@@ -390,7 +395,7 @@ For private conversion work:
 
 - Which AgentV evals should be mirrored first: one simple text/rubric eval plus one workspace/tool-trajectory eval, or only WTG-relevant prompt evals?
 - Should promptfoo import/export be a CLI command later, or stay as documented conversion scripts until demand is proven?
-- Should Phoenix/Braintrust integrations be examples only, or wrappers that consume AgentV JSONL output?
+- Should Braintrust integrations be examples only, or wrappers that consume AgentV JSONL output? Phoenix work is superseded by the read-only external-trace correlation boundary.
 
 ## Decision
 
@@ -404,7 +409,7 @@ Proceed as a plan, not a brainstorm, because the product question is now concret
 - `av-r0s.5.6` - analysis(private): compare peer native ports against AgentV
 - `av-r0s.5.8` - design(private): minimal AgentV workspace/container primitive
 - `av-r0s.5.1` - tooling(private): extract promptfoo exporter requirements after hand ports
-- `av-r0s.5.2` - tooling(private): prototype Braintrust and Phoenix replay adapters
+- `av-r0s.5.2` - tooling(private): prototype Braintrust replay adapters and historical Phoenix peer-framework research only
 - `av-r0s.5.3` - docs(agentv): decide sanitized promotion path from private parity experiments
 - `av-r0s.5.4` - closed as superseded by source-specific hand-port beads
 - `av-w9p` - closed as superseded by `av-r0s.1`
diff --git a/docs/plans/2026-06-21-001-feat-av-quf-results-storage-plan.md b/docs/plans/2026-06-21-001-feat-av-quf-results-storage-plan.md
@@ -17,9 +17,11 @@ publication export, an append-only mutable-operation log, and an S3-compatible
 object-storage tier.
 
 The canonical AgentV run artifacts stay `benchmark.json`, `index.jsonl`, per-test
-grading/timing files, `outputs/trace.json`, and derived transcript artifacts. GitHub,
-Backblaze B2, Phoenix, Hugging Face, and Dashboard are projections, viewers, or storage
-backends over those artifacts.
+grading/timing files, `outputs/trace.json`, and derived transcript artifacts.
+GitHub and Backblaze B2 are storage/publication targets over those artifacts.
+Dashboard and Hugging Face are viewers or publication surfaces. Phoenix is only
+a link-out viewer when safe `external_trace` metadata points at independently
+emitted spans; it is not an AgentV artifact projection or storage backend.
 
 ---
 
@@ -58,7 +60,7 @@ without creating another hosted results platform inside AgentV.
 - Implementing storage backends, S3, oplog, retention, or export code in this bead.
 - Adding GitHub issues or tracker runtime state.
 - Creating windowed branches, per-run branches, or a hosted Dashboard replacement.
-- Making Phoenix, Hugging Face, B2, or GitHub the canonical results model.
+- Making Phoenix canonical, making Phoenix an AgentV artifact projection target, or making Hugging Face, B2, or GitHub the canonical results model.
 
 ### Deferred to Follow-Up Work
 
@@ -852,8 +854,9 @@ results:
 - [ ] The artifact sidecar is called `artifacts`, not `artifact-blobs` or `blob`.
 - [ ] The plan has no windowed or per-run branches.
 - [ ] Path sharding is deferred until realistic measurement proves need.
-- [ ] AgentV artifacts remain canonical; Dashboard, Hugging Face, Phoenix, B2, and
-  GitHub are projections/viewers/storage backends.
+- [ ] AgentV artifacts remain canonical; Dashboard and Hugging Face are viewers
+  or publication surfaces, B2 and GitHub are storage/publication targets, and
+  Phoenix is link-out correlation only when safe external trace metadata exists.
 - [ ] File/function-level implementation guidance names current result repo, remote,
   serve, export, artifact-writer, and Dashboard surfaces.
 - [ ] Test plan covers core, CLI, Dashboard, and docs-facing behavior.