feat: bundle all skills in CLI + single agentv-dev wrapper + subagent pipeline#1231

Merged
christso merged 22 commits into EntityProcess:main from tsoyangbot:feat/1224-bundled-skills on May 11, 2026
Conversation

@tsoyangbot (Collaborator) commented May 7, 2026

AgentV Bundled Skills + Subagent Pipeline

Supersedes #1226.

What changed

Single agentv-dev wrapper skill in plugins/agentv-dev/skills/agentv-dev/SKILL.md — replaces 7 individual skill wrappers. Users install it once via npx skills add EntityProcess/agentv and get one skill that lists all CLI skills.

All skills bundled into the CLI dist (sourced from skills-data/ at the repo root, shipped at apps/cli/dist/skills/):

  • agentv-bench — run evals, benchmark, optimize, autoresearch
  • agentv-eval-writer — write/edit eval YAML
  • agentv-eval-review — review/lint eval quality
  • agentv-governance — governance blocks (OWASP, MITRE, EU AI Act)
  • agentv-trace-analyst — analyze traces, find regressions

Agent loads wrapper → picks skill → agentv skills get <name> → CLI serves full content (version-matched).

Subagent pipeline improvements:

  • Added disk-read guidance after executor completion (prevents read_agent loops)
  • pipeline description in top-level --help for discoverability
  • Rubrics assertions include criteria array in llm_graders/ output

Architecture

plugins/agentv-dev/skills/agentv-dev/SKILL.md   ← thin wrapper (installed by user)
skills-data/agentv-bench/                        ← full content (bundled into CLI)
apps/cli/dist/skills/agentv-bench/               ← shipped in npm package

Verified

  • Sonnet 4.6: pipeline discovered, 7/7 PASS, 100%, 7m10s
  • GPT-5.4 (clean dir, no AGENTS.md): pipeline discovered, 6/7 PASS, 90.5%, 8m52s
  • GPT-5.4 (with AGENTS.md): missed pipeline, used eval run, 5/7 PASS, 81%

GPT-5.4 follows AGENTS.md before skills. Without AGENTS.md, skill discovery works correctly.

christso and others added 6 commits May 7, 2026 03:15
Skills are now bundled inside the CLI npm package (`apps/cli/skills/`
→ `dist/skills/` at build time), version-matched to the binary. A new
`agentv skills` subcommand serves the bundled content without any
separate plugin install step.

- `agentv skills list` — list available skill names (--json)
- `agentv skills get <name>` — print SKILL.md content (--full, --json)
- `agentv skills get --all` — print all skills
- `agentv skills path [<name>]` — print resolved skills directory

Resolution walks upward from the module file, validating by SKILL.md
presence to avoid false matches. Prefers `dist/skills/` (production
layout) over bare `skills/` (source layout).
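The upward walk described here might look like the following minimal TypeScript sketch. The function and constant names are illustrative, not the actual implementation; the candidate order matches the final resolver in this PR (dist/skills/ → skills-data/ → skills/):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Candidate directory names tried at each ancestor, production layout first.
const CANDIDATES = ["dist/skills", "skills-data", "skills"];

function hasSkillMd(dir: string): boolean {
  // A directory only counts if at least one skill inside it contains a
  // SKILL.md — this validation avoids false matches on unrelated folders.
  if (!fs.existsSync(dir)) return false;
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .some((e) => e.isDirectory() && fs.existsSync(path.join(dir, e.name, "SKILL.md")));
}

function resolveSkillsDir(startDir: string): string | undefined {
  // Walk upward from the module file's directory toward the filesystem root.
  let dir = startDir;
  while (true) {
    for (const candidate of CANDIDATES) {
      const full = path.join(dir, candidate);
      if (hasSkillMd(full)) return full;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return undefined; // reached the root without a match
    dir = parent;
  }
}
```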

The marketplace plugin SKILL.md files are converted to discovery stubs
that redirect agents to `agentv skills get <name>`. Full skill content
lives in `apps/cli/skills/` as the single source of truth.

Docs: update installation.mdx so the canonical setup is
`npm install -g agentv` alone; the allagents plugin step moves to an
optional "Claude Code Plugin" section.

Closes EntityProcess#1224

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes EntityProcess#1229.

- skills get <name> --ref <file>: load a single reference without --full.
  Searches references/, templates/, agents/, then the skill root. Auto-
  appends .md if the caller passed a bare name. --ref is incompatible
  with --all and takes precedence over --full.
- readSkill --full now also collects agents/ alongside references/ and
  templates/, so agent role definitions ship together with the skill.
- Drop scripts/ and assets/ from every bundled skill. Scripts already
  duplicated CLI behavior (onboard-agentv.sh ↔ agentv init,
  trajectory.html / eval_review.html ↔ agentv studio); lint_eval.py
  is replaced by an inline structural checklist in agentv-eval-review's
  SKILL.md until a dedicated 'agentv eval lint' lands.
- Refresh the affected SKILL.md files: agentv-onboarding now invokes
  agentv init directly (no platform script), agentv-eval-review
  inlines the deterministic checks the deleted lint script performed,
  and every skill documents 'skills get --ref <file>' / 'skills path'
  for selective reference loading.
- Tests: extend the skills unit test fixture to exercise agents/ and
  bare-root files; assert findRefFile lookup order, .md auto-append,
  and miss path.
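The lookup order described above can be sketched as follows. findRefFile is named in the test bullet, but the signature and return convention here are assumptions:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Subdirectories searched before falling back to the skill root.
const REF_DIRS = ["references", "templates", "agents"];

function findRefFile(skillDir: string, ref: string): string | undefined {
  // Auto-append .md when the caller passed a bare name like "subagent-pipeline".
  const names = path.extname(ref) ? [ref] : [ref, `${ref}.md`];
  // Search references/, templates/, agents/, then the skill root, in order.
  const searchDirs = [...REF_DIRS.map((d) => path.join(skillDir, d)), skillDir];
  for (const dir of searchDirs) {
    for (const name of names) {
      const candidate = path.join(dir, name);
      if (fs.existsSync(candidate) && fs.statSync(candidate).isFile()) {
        return candidate;
      }
    }
  }
  return undefined; // miss path: caller reports the reference as not found
}
```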
…er pattern

Skills are now sourced from <repo-root>/skills-data/ instead of
apps/cli/skills/. This mirrors agent-browser's top-level skill-data/
layout and keeps user-authored content out of the CLI workspace.

- git mv apps/cli/skills → skills-data
- tsup.config.ts: srcSkillsDir now resolves to ../../skills-data
- skills-resolver in src/commands/skills/index.ts learns a third
  candidate name (skills-data/) so dev-mode source runs
  (bun apps/cli/src/cli.ts skills …) keep working without first
  building. Order at each ancestor: dist/skills/ → skills-data/ →
  skills/ (legacy fallback).
- Build output stays at dist/skills/, so the npm tarball is unchanged.
- Verified: bun run build, dist/skills/ populated, node dist/cli.js
  skills list / get --ref / path all return expected content. Source
  mode (no dist) also resolves via skills-data/.
When pipeline input or pipeline run detects a non-CLI target (subagent-as-target
mode), print actionable next steps for the orchestrating agent:

- Dispatch executor subagents per test case
- Run code graders via pipeline grade
- Dispatch LLM grader subagents (read agents/grader.md)
- Merge scores via pipeline bench

Also point to the full procedure reference:
  agentv skills get agentv-bench --ref subagent-pipeline

This addresses the gap where agents running in subagent mode had no visibility
into what to do after pipeline input extracted the test cases.
@christso force-pushed the feat/1224-bundled-skills branch from 29d80bf to 23bbe3f on May 10, 2026 11:50
christso added 8 commits May 10, 2026 14:28
When the agent IS the target (subagent-as-target mode), the pipeline
guidance now tells the agent to grade its own outputs against criteria
rather than dispatching separate grader subagents.

The agent already IS the LLM — it can read its own response.md,
evaluate against criteria.md, and write llm_grader_results directly.

Updated:
- pipeline input: guidance says "grade your own responses"
- pipeline run: same guidance for subagent mode
- subagent-pipeline.md: clarifies self-grading in subagent mode
Revert over-correction — the main agent should NOT grade its own outputs.
Instead it spawns grader subagents (one per test x LLM grader pair) using
agents/grader.md as their instructions.

The orchestrating agent dispatches:
1. Executor subagents (one per test case)
2. Grader subagents (one per test x LLM grader pair)
3. Runs pipeline bench to merge scores

agents/grader.md defines the full grading procedure for spawned subagents.
…instructions

The main agent reads agents/grader.md and embeds its full content as
system instructions in each grader subagent prompt. Subagents do not
self-discover the file — they need it passed to them.
rubrics assertions are normalized to type: llm-grader with a rubrics
array by the grader parser. But writeGraderConfigs only wrote
prompt_content (empty for rubrics) and dropped the rubrics array.

Now includes the rubrics criteria array in llm_graders/<name>.json so
grader subagents can evaluate each criterion directly.
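The fix can be sketched as a per-grader helper under these assumptions (writeGraderConfigs and the JSON field names follow the commit message; the exact types are guesses):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface LlmGrader {
  name: string;
  type: "llm-grader";
  prompt_content?: string; // empty for rubrics-style assertions
  rubrics?: string[];      // criteria array normalized by the grader parser
}

function writeGraderConfig(outDir: string, grader: LlmGrader): string {
  const file = path.join(outDir, "llm_graders", `${grader.name}.json`);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  // Previously only prompt_content was written, silently dropping the rubrics
  // array; including it lets grader subagents evaluate each criterion directly.
  fs.writeFileSync(
    file,
    JSON.stringify(
      {
        name: grader.name,
        type: grader.type,
        prompt_content: grader.prompt_content ?? "",
        rubrics: grader.rubrics ?? [],
      },
      null,
      2,
    ),
  );
  return file;
}
```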
- eval run: print TIP about pipeline when target is claude-cli/copilot-cli
- pipeline --help: description now says use this for agent targets
- pipeline run --help: hints about executor subagents for agent targets

Previously Claude would default to eval run and never discover pipeline.
Now both the top-level help and the eval run output guide toward pipeline.
…l --help

Pipeline now shows: Subagent-mode eval pipeline (input → executor
subagents → grade → bench) — use this when the eval target is an
AI agent (Claude, Codex, etc.)

This means Claude/Codex can discover pipeline from agentv --help
without needing a nudge.
Agents read CLAUDE.md before running tasks. Without this note,
they default to eval run instead of pipeline for agent targets.
@christso (Collaborator) commented:

Copilot Eval Pipeline Test — GPT-5.4 vs Sonnet 4.6

Setup: npx skills add tsoyangbot/agentv → installs agentv-bench skill to .agents/skills/. No CLAUDE.md/AGENTS.md modifications. Eval: evals/self/skills/output-correctness.eval.yaml (16 code graders + 8 LLM graders).

Sonnet 4.6 — Pipeline discovered, full subagent flow

| Phase | Result |
| --- | --- |
| Skill discovery | skill(agentv-bench) loaded from .agents/skills/ |
| pipeline input | 7 test cases extracted |
| Executor subagents | 7/7 spawned in parallel, all wrote response.md |
| Code graders | 16/16 passed via pipeline grade |
| LLM graders | 6 grader subagents dispatched with agents/grader.md embedded |
| pipeline bench + validate | Merged and validated |
| Final | 7/7 PASS, 100% pass rate, 7m10s |

GPT-5.4 — Pipeline not discovered, went to eval run

| Phase | Result |
| --- | --- |
| Skill discovery | Read AGENTS.md first, found bun eval, never loaded skill |
| Command used | agentv eval run ... --target claude (CLI mode, not subagent) |
| Subagent flow | Skipped entirely — CLI handled execution + grading internally |
| Final | 5/7 PASS, 81%, 6m25s |

Failed tests (GPT-5.4):

  • cli-command-correct-syntax — 0% (output did not match expected agentv skills commands)
  • grader-config-valid — 67% (matched type: llm-grader but rubric judged response incomplete)

Root Cause

Sonnet reads the installed skill (.agents/skills/agentv-bench/SKILL.md) before AGENTS.md and discovers pipeline. GPT-5.4 follows CLAUDE.md instruction ("Read @AGENTS.md before any task") literally, reads AGENTS.md first, finds bun eval, and never checks the skill.

What This Means

  • The skill + CLI approach works — Sonnet proves it end-to-end
  • GPT-5.4 needs explicit guidance — either CLAUDE.md redirect or stronger skill description that triggers before AGENTS.md is read
  • eval run still produces valid results — just uses CLI mode instead of subagent orchestration

Commands Used

# Sonnet 4.6 (pipe mode)
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo

# GPT-5.4 (pipe mode)
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo --model gpt-5.4

# Skill install
npx skills add tsoyangbot/agentv -y

Artifacts

  • Sonnet: .agentv/results/runs/default/2026-05-11T01-06-08-924Z/
  • GPT-5.4: .agentv/results/runs/default/2026-05-11T01-42-13-535Z/

@christso (Collaborator) commented:

Copilot Eval Pipeline Test — GPT-5.4 in Clean Directory (No AGENTS.md)

Setup: Clean temp dir (/tmp/agentv-test/) with npx skills add tsoyangbot/agentv -y. No AGENTS.md, no CLAUDE.md, no project docs. Eval: evals/self/skills/output-correctness.eval.yaml (16 code graders + 8 LLM graders).

Result: GPT-5.4 discovered pipeline from skill alone

| Phase | Result |
| --- | --- |
| Skill discovery | Found .agents/skills/agentv-bench/, loaded onboarding, installed CLI |
| Pipeline discovery | agentv --help → found pipeline → read subagent-pipeline.md |
| pipeline input | 7 test cases extracted |
| Executor subagents | 7/7 spawned in parallel, all wrote response.md |
| Code graders | 16/16 passed via pipeline grade |
| LLM graders | 6 grader subagents dispatched with agents/grader.md embedded |
| pipeline bench | Merged scores, validated |
| Final | 6/7 PASS, 90.5% pass rate, 8m52s |

Tests below 100%:

  • cli-command-correct-syntax — 66.7% (output missed agentv skills get agentv-bench --full)
  • grader-config-valid — 83.3% (used type: llm-grader but omitted prompt: field)

Key Finding

Without AGENTS.md, GPT-5.4 follows the skill correctly. The previous test (in the agentv repo with AGENTS.md) failed because GPT-5.4 read AGENTS.md first and went to eval run instead of pipeline. The clean directory test proves the skill + CLI approach works for GPT-5.4.

Implications

  1. The skill is sufficient — no CLAUDE.md or AGENTS.md modifications needed
  2. AGENTS.md is the problem — it redirects GPT-5.4 away from the skill
  3. Fix options:
    • Add a one-line redirect in AGENTS.md: "For running evals, use the agentv-bench skill"
    • Or: make the skill description more aggressive so copilot loads it before AGENTS.md
    • Or: document that users should run evals from a clean directory (not the repo root)

Commands Used

# Setup clean directory
mkdir -p /tmp/agentv-test/evals/self/skills
cp evals/self/skills/output-correctness.eval.yaml /tmp/agentv-test/evals/self/skills/
cp -r evals/self/skills/fixtures /tmp/agentv-test/evals/self/skills/
cd /tmp/agentv-test && npx skills add tsoyangbot/agentv -y

# Run eval with GPT-5.4
timeout 600 copilot -p "run evals on evals/self/skills/output-correctness.eval.yaml using agentv" --yolo --model gpt-5.4

Artifacts

  • Clean dir GPT-5.4: .agentv/results/runs/default/2026-05-11T01-56-43-040Z/
  • Previous repo GPT-5.4 (failed pipeline): .agentv/results/runs/default/2026-05-11T01-42-13-535Z/
  • Sonnet 4.6 (repo, worked): .agentv/results/runs/default/2026-05-11T01-06-08-924Z/

@christso changed the title from "chore: trim bundle to agentv-bench only (supersedes #1226)" to "feat: bundle all skills in CLI + single agentv-dev wrapper + subagent pipeline" on May 11, 2026
@christso (Collaborator) commented:

WTG.AI.Prompts Repo Test — GPT-5.4 Subagent Mode

Repo: WiseTechGlobal/WTG.AI.Prompts (CargoWise domain-specific evals)
Eval: evals/neo/neo-wi-research.eval.yaml (6 code graders + 6 LLM graders)
Model: GPT-5.4
Mode: Subagent (explicitly requested)

Result

| Metric | Value |
| --- | --- |
| Tests | 6/6 PASS |
| Pass rate | 100% |
| Mode | Subagent pipeline |
| Time | 7m48s |

Flow

  1. Copilot read repo's run-evalcases skill → Section 5 directed to subagent mode
  2. pipeline input → extracted 6 test cases
  3. 6 executor subagents dispatched in parallel → all wrote response.md
  4. pipeline grade → code graders passed
  5. 6 LLM grader subagents dispatched with agents/grader.md embedded
  6. pipeline bench + results validate → merged and validated

Key finding

When explicitly asked for subagent mode, GPT-5.4 follows the correct pipeline. The repo's run-evalcases skill (Section 5) redirects to agentv-bench for subagent mode. The previous CLI mode result was because copilot chose CLI mode when .env had API keys.

All tests this session

| Test | Model | Mode | Result | Time |
| --- | --- | --- | --- | --- |
| agentv repo (clean dir) | GPT-5.4 | Subagent | 6/7 PASS (90.5%) | 8m52s |
| agentv repo (with AGENTS.md) | GPT-5.4 | CLI | 5/7 PASS (81%) | 6m25s |
| agentv repo | Sonnet 4.6 | Subagent | 7/7 PASS (100%) | 7m10s |
| WTG.AI.Prompts | GPT-5.4 | CLI | 6/6 PASS (100%) | 3m59s |
| WTG.AI.Prompts | GPT-5.4 | Subagent | 6/6 PASS (100%) | 7m48s |

christso added 3 commits May 11, 2026 07:28
- Replace hardcoded [references, templates, agents] with dynamic
  iteration over ALL subdirectories in each skill folder
- Add listSkillSubdirs() helper that reads directory entries at runtime
- --full now includes scripts/, assets/, fixtures/, etc. automatically
- --ref searches all subdirectories (not just the hardcoded three)
- Bundle trajectory.html and lint_eval.py scripts in skills-data
- No more code changes needed when adding new subdirectory types
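A minimal sketch of the dynamic iteration above (listSkillSubdirs is named in the commit message; the signature is an assumption):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

function listSkillSubdirs(skillDir: string): string[] {
  // Discover ALL subdirectories at runtime (references/, templates/, agents/,
  // scripts/, assets/, fixtures/, ...) instead of a hardcoded list, so new
  // subdirectory types require no code changes.
  return fs
    .readdirSync(skillDir, { withFileTypes: true })
    .filter((e) => e.isDirectory())
    .map((e) => e.name)
    .sort();
}
```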
Schema was missing after plugins/ cleanup during skills consolidation.
Regenerated with bun packages/core/scripts/generate-eval-schema.ts.
1752 tests pass, 0 fail.
@christso force-pushed the feat/1224-bundled-skills branch from 187c697 to abd08cb on May 11, 2026 06:17
- Update generate script to write to skills-data/ instead of plugins/
- Update test to read from skills-data/
- Delete redundant plugins/ copy
- skills-data/ is now the single source of truth for all skill content
@christso christso merged commit 5bdb960 into EntityProcess:main May 11, 2026
4 checks passed