feat: bundle all skills in CLI + single agentv-dev wrapper + subagent pipeline#1231

Merged
christso merged 22 commits into EntityProcess:main from tsoyangbot:feat/1224-bundled-skills on May 11, 2026
Conversation

@tsoyangbot (Collaborator) commented May 7, 2026

AgentV Bundled Skills + Subagent Pipeline

Supersedes #1226.

What changed

Single agentv-dev wrapper skill in plugins/agentv-dev/skills/agentv-dev/SKILL.md — replaces 7 individual skill wrappers. Users install it once via npx skills add EntityProcess/agentv and get one skill that lists all CLI skills.

All skills bundled into the CLI dist (sourced from skills-data/ at the repo root, shipped at apps/cli/dist/skills/):

  • agentv-bench — run evals, benchmark, optimize, autoresearch
  • agentv-eval-writer — write/edit eval YAML
  • agentv-eval-review — review/lint eval quality
  • agentv-governance — governance blocks (OWASP, MITRE, EU AI Act)
  • agentv-trace-analyst — analyze traces, find regressions

Agent loads wrapper → picks skill → agentv skills get <name> → CLI serves full content (version-matched).

Subagent pipeline improvements:

  • Added disk-read guidance after executor completion (prevents read_agent loops)
  • pipeline description in top-level --help for discoverability
  • Rubrics assertions include criteria array in llm_graders/ output

Architecture

plugins/agentv-dev/skills/agentv-dev/SKILL.md   ← thin wrapper (installed by user)
skills-data/agentv-bench/                        ← full content (bundled into CLI)
apps/cli/dist/skills/agentv-bench/               ← shipped in npm package

Verified

  • Sonnet 4.6: pipeline discovered, 7/7 PASS, 100%, 7m10s
  • GPT-5.4 (clean dir, no AGENTS.md): pipeline discovered, 6/7 PASS, 90.5%, 8m52s
  • GPT-5.4 (with AGENTS.md): missed pipeline, used eval run, 5/7 PASS, 81%

GPT-5.4 follows AGENTS.md before skills. Without AGENTS.md, skill discovery works correctly.

christso and others added 6 commits May 7, 2026 03:15
Skills are now bundled inside the CLI npm package (`apps/cli/skills/`
→ `dist/skills/` at build time), version-matched to the binary. A new
`agentv skills` subcommand serves the bundled content without any
separate plugin install step.

- `agentv skills list` — list available skill names (--json)
- `agentv skills get <name>` — print SKILL.md content (--full, --json)
- `agentv skills get --all` — print all skills
- `agentv skills path [<name>]` — print resolved skills directory

Resolution walks upward from the module file, validating by SKILL.md
presence to avoid false matches. Prefers `dist/skills/` (production
layout) over bare `skills/` (source layout).
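The upward walk described here might look like the following minimal TypeScript sketch. The function and constant names are illustrative, not the actual implementation; the candidate order matches the final resolver in this PR (dist/skills/ → skills-data/ → skills/):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Candidate directory names tried at each ancestor, production layout first.
const CANDIDATES = ["dist/skills", "skills-data", "skills"];

function hasSkillMd(dir: string): boolean {
  // A directory only counts if at least one skill inside it contains a
  // SKILL.md — this validation avoids false matches on unrelated folders.
  if (!fs.existsSync(dir)) return false;
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .some((e) => e.isDirectory() && fs.existsSync(path.join(dir, e.name, "SKILL.md")));
}

function resolveSkillsDir(startDir: string): string | undefined {
  // Walk upward from the module file's directory toward the filesystem root.
  let dir = startDir;
  while (true) {
    for (const candidate of CANDIDATES) {
      const full = path.join(dir, candidate);
      if (hasSkillMd(full)) return full;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return undefined; // reached the root without a match
    dir = parent;
  }
}
```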

The marketplace plugin SKILL.md files are converted to discovery stubs
that redirect agents to `agentv skills get <name>`. Full skill content
lives in `apps/cli/skills/` as the single source of truth.

Docs: update installation.mdx so the canonical setup is
`npm install -g agentv` alone; the allagents plugin step moves to an
optional "Claude Code Plugin" section.

Closes EntityProcess#1224

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes EntityProcess#1229.

- skills get <name> --ref <file>: load a single reference without --full.
  Searches references/, templates/, agents/, then the skill root. Auto-
  appends .md if the caller passed a bare name. --ref is incompatible
  with --all and takes precedence over --full.
- readSkill --full now also collects agents/ alongside references/ and
  templates/, so agent role definitions ship together with the skill.
- Drop scripts/ and assets/ from every bundled skill. Scripts already
  duplicated CLI behavior (onboard-agentv.sh ↔ agentv init,
  trajectory.html / eval_review.html ↔ agentv studio); lint_eval.py
  is replaced by an inline structural checklist in agentv-eval-review's
  SKILL.md until a dedicated 'agentv eval lint' lands.
- Refresh the affected SKILL.md files: agentv-onboarding now invokes
  agentv init directly (no platform script), agentv-eval-review
  inlines the deterministic checks the deleted lint script performed,
  and every skill documents 'skills get --ref <file>' / 'skills path'
  for selective reference loading.
- Tests: extend the skills unit test fixture to exercise agents/ and
  bare-root files; assert findRefFile lookup order, .md auto-append,
  and miss path.
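The lookup order described above can be sketched as follows. findRefFile is named in the test bullet, but the signature and return convention here are assumptions:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Subdirectories searched before falling back to the skill root.
const REF_DIRS = ["references", "templates", "agents"];

function findRefFile(skillDir: string, ref: string): string | undefined {
  // Auto-append .md when the caller passed a bare name like "subagent-pipeline".
  const names = path.extname(ref) ? [ref] : [ref, `${ref}.md`];
  // Search references/, templates/, agents/, then the skill root, in order.
  const searchDirs = [...REF_DIRS.map((d) => path.join(skillDir, d)), skillDir];
  for (const dir of searchDirs) {
    for (const name of names) {
      const candidate = path.join(dir, name);
      if (fs.existsSync(candidate) && fs.statSync(candidate).isFile()) {
        return candidate;
      }
    }
  }
  return undefined; // miss path: caller reports the reference as not found
}
```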
…er pattern

Skills are now sourced from <repo-root>/skills-data/ instead of
apps/cli/skills/. This mirrors agent-browser's top-level skill-data/
layout and keeps user-authored content out of the CLI workspace.

- git mv apps/cli/skills → skills-data
- tsup.config.ts: srcSkillsDir now resolves to ../../skills-data
- skills-resolver in src/commands/skills/index.ts learns a third
  candidate name (skills-data/) so dev-mode source runs
  (bun apps/cli/src/cli.ts skills …) keep working without first
  building. Order at each ancestor: dist/skills/ → skills-data/ →
  skills/ (legacy fallback).
- Build output stays at dist/skills/, so the npm tarball is unchanged.
- Verified: bun run build, dist/skills/ populated, node dist/cli.js
  skills list / get --ref / path all return expected content. Source
  mode (no dist) also resolves via skills-data/.
When pipeline input or pipeline run detects a non-CLI target (subagent-as-target
mode), print actionable next steps for the orchestrating agent:

- Dispatch executor subagents per test case
- Run code graders via pipeline grade
- Dispatch LLM grader subagents (read agents/grader.md)
- Merge scores via pipeline bench

Also point to the full procedure reference:
  agentv skills get agentv-bench --ref subagent-pipeline

This addresses the gap where agents running in subagent mode had no visibility
into what to do after pipeline input extracted the test cases.
@christso force-pushed the feat/1224-bundled-skills branch from 29d80bf to 23bbe3f on May 10, 2026 11:50
christso added 8 commits May 10, 2026 14:28
When the agent IS the target (subagent-as-target mode), the pipeline
guidance now tells the agent to grade its own outputs against criteria
rather than dispatching separate grader subagents.

The agent already IS the LLM — it can read its own response.md,
evaluate against criteria.md, and write llm_grader_results directly.

Updated:
- pipeline input: guidance says "grade your own responses"
- pipeline run: same guidance for subagent mode
- subagent-pipeline.md: clarifies self-grading in subagent mode
Revert over-correction — the main agent should NOT grade its own outputs.
Instead it spawns grader subagents (one per test x LLM grader pair) using
agents/grader.md as their instructions.

The orchestrating agent dispatches:
1. Executor subagents (one per test case)
2. Grader subagents (one per test x LLM grader pair)
3. Runs pipeline bench to merge scores

agents/grader.md defines the full grading procedure for spawned subagents.
…instructions

The main agent reads agents/grader.md and embeds its full content as
system instructions in each grader subagent prompt. Subagents do not
self-discover the file — they need it passed to them.
rubrics assertions are normalized to type: llm-grader with a rubrics
array by the grader parser. But writeGraderConfigs only wrote
prompt_content (empty for rubrics) and dropped the rubrics array.

Now includes the rubrics criteria array in llm_graders/<name>.json so
grader subagents can evaluate each criterion directly.
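The fix can be sketched as a per-grader helper under these assumptions (writeGraderConfigs and the JSON field names follow the commit message; the exact types are guesses):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface LlmGrader {
  name: string;
  type: "llm-grader";
  prompt_content?: string; // empty for rubrics-style assertions
  rubrics?: string[];      // criteria array normalized by the grader parser
}

function writeGraderConfig(outDir: string, grader: LlmGrader): string {
  const file = path.join(outDir, "llm_graders", `${grader.name}.json`);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  // Previously only prompt_content was written, silently dropping the rubrics
  // array; including it lets grader subagents evaluate each criterion directly.
  fs.writeFileSync(
    file,
    JSON.stringify(
      {
        name: grader.name,
        type: grader.type,
        prompt_content: grader.prompt_content ?? "",
        rubrics: grader.rubrics ?? [],
      },
      null,
      2,
    ),
  );
  return file;
}
```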
- eval run: print TIP about pipeline when target is claude-cli/copilot-cli
- pipeline --help: description now says use this for agent targets
- pipeline run --help: hints about executor subagents for agent targets

Previously Claude would default to eval run and never discover pipeline.
Now both the top-level help and the eval run output guide toward pipeline.
…l --help

Pipeline now shows: Subagent-mode eval pipeline (input → executor
subagents → grade → bench) — use this when the eval target is an
AI agent (Claude, Codex, etc.)

This means Claude/Codex can discover pipeline from agentv --help
without needing a nudge.
Agents read CLAUDE.md before running tasks. Without this note,
they default to eval run instead of pipeline for agent targets.
@christso (Collaborator) commented:

Copilot Eval Pipeline Test — GPT-5.4 vs Sonnet 4.6

Setup: npx skills add tsoyangbot/agentv → installs agentv-bench skill to .agents/skills/. No CLAUDE.md/AGENTS.md modifications. Eval: evals/self/skills/output-correctness.eval.yaml (16 code graders + 8 LLM graders).

Sonnet 4.6 — Pipeline discovered, full subagent flow

| Phase | Result |
| --- | --- |
| Skill discovery | skill(agentv-bench) loaded from .agents/skills/ |
| pipeline input | 7 test cases extracted |
| Executor subagents | 7/7 spawned in parallel, all wrote response.md |
| Code graders | 16/16 passed via pipeline grade |
| LLM graders | 6 grader subagents dispatched with agents/grader.md embedded |
| pipeline bench + validate | Merged and validated |
| Final | 7/7 PASS, 100% pass rate, 7m10s |

GPT-5.4 — Pipeline not discovered, went to eval run

| Phase | Result |
| --- | --- |
| Skill discovery | Read AGENTS.md first, found bun eval, never loaded skill |
| Command used | agentv eval run ... --target claude (CLI mode, not subagent) |
| Subagent flow | Skipped entirely — CLI handled execution + grading internally |
| Final | 5/7 PASS, 81%, 6m25s |

Failed tests (GPT-5.4):

  • cli-command-correct-syntax — 0% (output did not match expected agentv skills commands)
  • grader-config-valid — 67% (matched type: llm-grader but rubric judged response incomplete)

Root Cause

Sonnet reads the installed skill (.agents/skills/agentv-bench/SKILL.md) before AGENTS.md and discovers pipeline. GPT-5.4 follows CLAUDE.md instruction ("Read @AGENTS.md before any task") literally, reads AGENTS.md first, finds bun eval, and never checks the skill.

What This Means

  • The skill + CLI approach works — Sonnet proves it end-to-end
  • GPT-5.4 needs explicit guidance — either CLAUDE.md redirect or stronger skill description that triggers before AGENTS.md is read
  • eval run still produces valid results — just uses CLI mode instead of subagent orchestration

Commands Used

# Sonnet 4.6 (pipe mode)
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo

# GPT-5.4 (pipe mode)
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo --model gpt-5.4

# Skill install
npx skills add tsoyangbot/agentv -y

Artifacts

  • Sonnet: .agentv/results/runs/default/2026-05-11T01-06-08-924Z/
  • GPT-5.4: .agentv/results/runs/default/2026-05-11T01-42-13-535Z/

@christso (Collaborator) commented:

Copilot Eval Pipeline Test — GPT-5.4 in Clean Directory (No AGENTS.md)

Setup: Clean temp dir (/tmp/agentv-test/) with npx skills add tsoyangbot/agentv -y. No AGENTS.md, no CLAUDE.md, no project docs. Eval: evals/self/skills/output-correctness.eval.yaml (16 code graders + 8 LLM graders).

Result: GPT-5.4 discovered pipeline from skill alone

| Phase | Result |
| --- | --- |
| Skill discovery | Found .agents/skills/agentv-bench/, loaded onboarding, installed CLI |
| Pipeline discovery | agentv --help → found pipeline → read subagent-pipeline.md |
| pipeline input | 7 test cases extracted |
| Executor subagents | 7/7 spawned in parallel, all wrote response.md |
| Code graders | 16/16 passed via pipeline grade |
| LLM graders | 6 grader subagents dispatched with agents/grader.md embedded |
| pipeline bench | Merged scores, validated |
| Final | 6/7 PASS, 90.5% pass rate, 8m52s |

Tests below 100%:

  • cli-command-correct-syntax — 66.7% (output missed agentv skills get agentv-bench --full)
  • grader-config-valid — 83.3% (used type: llm-grader but omitted prompt: field)

Key Finding

Without AGENTS.md, GPT-5.4 follows the skill correctly. The previous test (in the agentv repo with AGENTS.md) failed because GPT-5.4 read AGENTS.md first and went to eval run instead of pipeline. The clean directory test proves the skill + CLI approach works for GPT-5.4.

Implications

  1. The skill is sufficient — no CLAUDE.md or AGENTS.md modifications needed
  2. AGENTS.md is the problem — it redirects GPT-5.4 away from the skill
  3. Fix options:
    • Add a one-line redirect in AGENTS.md: "For running evals, use the agentv-bench skill"
    • Or: make the skill description more aggressive so copilot loads it before AGENTS.md
    • Or: document that users should run evals from a clean directory (not the repo root)

Commands Used

# Setup clean directory
mkdir -p /tmp/agentv-test/evals/self/skills
cp evals/self/skills/output-correctness.eval.yaml /tmp/agentv-test/evals/self/skills/
cp -r evals/self/skills/fixtures /tmp/agentv-test/evals/self/skills/
cd /tmp/agentv-test && npx skills add tsoyangbot/agentv -y

# Run eval with GPT-5.4
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo --model gpt-5.4

Artifacts

  • Clean dir GPT-5.4: .agentv/results/runs/default/2026-05-11T01-56-43-040Z/
  • Previous repo GPT-5.4 (failed pipeline): .agentv/results/runs/default/2026-05-11T01-42-13-535Z/
  • Sonnet 4.6 (repo, worked): .agentv/results/runs/default/2026-05-11T01-06-08-924Z/

@christso changed the title from "chore: trim bundle to agentv-bench only (supersedes #1226)" to "feat: bundle all skills in CLI + single agentv-dev wrapper + subagent pipeline" on May 11, 2026
@christso (Collaborator) commented:

WTG.AI.Prompts Repo Test — GPT-5.4 Subagent Mode

Repo: WiseTechGlobal/WTG.AI.Prompts (CargoWise domain-specific evals)
Eval: evals/neo/neo-wi-research.eval.yaml (6 code graders + 6 LLM graders)
Model: GPT-5.4
Mode: Subagent (explicitly requested)

Result

| Metric | Value |
| --- | --- |
| Tests | 6/6 PASS |
| Pass rate | 100% |
| Mode | Subagent pipeline |
| Time | 7m48s |

Flow

  1. Copilot read repo's run-evalcases skill → Section 5 directed to subagent mode
  2. pipeline input → extracted 6 test cases
  3. 6 executor subagents dispatched in parallel → all wrote response.md
  4. pipeline grade → code graders passed
  5. 6 LLM grader subagents dispatched with agents/grader.md embedded
  6. pipeline bench + results validate → merged and validated

Key finding

When explicitly asked for subagent mode, GPT-5.4 follows the correct pipeline. The repo's run-evalcases skill (Section 5) redirects to agentv-bench for subagent mode. The previous CLI mode result was because copilot chose CLI mode when .env had API keys.

All tests this session

| Test | Model | Mode | Result | Time |
| --- | --- | --- | --- | --- |
| agentv repo (clean dir) | GPT-5.4 | Subagent | 6/7 PASS (90.5%) | 8m52s |
| agentv repo (with AGENTS.md) | GPT-5.4 | CLI | 5/7 PASS (81%) | 6m25s |
| agentv repo | Sonnet 4.6 | Subagent | 7/7 PASS (100%) | 7m10s |
| WTG.AI.Prompts | GPT-5.4 | CLI | 6/6 PASS (100%) | 3m59s |
| WTG.AI.Prompts | GPT-5.4 | Subagent | 6/6 PASS (100%) | 7m48s |

christso added 3 commits May 11, 2026 07:28
- Replace hardcoded [references, templates, agents] with dynamic
  iteration over ALL subdirectories in each skill folder
- Add listSkillSubdirs() helper that reads directory entries at runtime
- --full now includes scripts/, assets/, fixtures/, etc. automatically
- --ref searches all subdirectories (not just the hardcoded three)
- Bundle trajectory.html and lint_eval.py scripts in skills-data
- No more code changes needed when adding new subdirectory types
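A minimal sketch of the dynamic iteration above (listSkillSubdirs is named in the commit message; the signature is an assumption):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

function listSkillSubdirs(skillDir: string): string[] {
  // Discover ALL subdirectories at runtime (references/, templates/, agents/,
  // scripts/, assets/, fixtures/, ...) instead of a hardcoded list, so new
  // subdirectory types require no code changes.
  return fs
    .readdirSync(skillDir, { withFileTypes: true })
    .filter((e) => e.isDirectory())
    .map((e) => e.name)
    .sort();
}
```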
Schema was missing after plugins/ cleanup during skills consolidation.
Regenerated with bun packages/core/scripts/generate-eval-schema.ts.
1752 tests pass, 0 fail.
@christso force-pushed the feat/1224-bundled-skills branch from 187c697 to abd08cb on May 11, 2026 06:17
- Update generate script to write to skills-data/ instead of plugins/
- Update test to read from skills-data/
- Delete redundant plugins/ copy
- skills-data/ is now the single source of truth for all skill content
@christso christso merged commit 5bdb960 into EntityProcess:main May 11, 2026
4 checks passed