feat(evaluation): inline experiment runtime in eval files by christso · Pull Request #1524 · EntityProcess/agentv

christso · 2026-06-26T07:42:53Z

Summary

Eval files now carry the experiment runtime directly, so a suite can be promoted into a runnable experiment without a separate, unpublished `experiment.yaml` artifact. The canonical runtime block is `experiment:`; the legacy `execution:` key remains as a compatibility alias, and specifying both in the same file now fails validation.
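As a rough illustration of the alias rule above, the sketch below resolves the effective runtime block and rejects files that set both keys. The `EvalFile` and `RuntimeConfig` shapes and the `resolveRuntime` name are hypothetical and are not taken from the AgentV codebase; only the `experiment` and `execution` key names come from this PR.

```typescript
// Hypothetical shapes: the real eval schema is not shown in this PR description.
interface RuntimeConfig {
  target?: string;
  threshold?: number;
  repeat?: number;
}

interface EvalFile {
  name: string;
  experiment?: RuntimeConfig; // canonical runtime block
  execution?: RuntimeConfig; // legacy compatibility alias
}

// Returns the effective runtime block, or throws when both keys are present.
function resolveRuntime(file: EvalFile): RuntimeConfig | undefined {
  if (file.experiment !== undefined && file.execution !== undefined) {
    throw new Error(
      `${file.name}: use either "experiment" or legacy "execution", not both`,
    );
  }
  return file.experiment ?? file.execution;
}
```

A file that only uses `execution:` keeps working under this rule, which is what makes the alias a compatibility path rather than a breaking rename.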

This also moves the previous experiment composition behavior into `tests:`. Include entries use `type: suite | tests`, support glob expansion and selection by test id, tags, and metadata, and allow scoped `run:` overrides for repeat/threshold-style behavior. Suite imports preserve the child suite's task context while the parent runtime controls execution; raw-case shorthand remains raw-case only.
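To make the selection behavior concrete, here is a minimal sketch of filtering cases by id, tags, and metadata. The `select.test_ids`, `select.tags`, and `select.metadata` key names come from the design notes; the `TestCase` shape, the `selectTests` name, and the matching semantics (all provided criteria must match, any listed tag is enough) are assumptions, since the PR text does not spell them out.

```typescript
// Hypothetical test case shape for illustration only.
interface TestCase {
  id: string;
  tags?: string[];
  metadata?: Record<string, string>;
}

interface Select {
  test_ids?: string[];
  tags?: string[];
  metadata?: Record<string, string>;
}

// Assumed semantics: criteria are ANDed together; within `tags`, any match is enough;
// every `metadata` key/value pair must match exactly. Omitted criteria match everything.
function selectTests(cases: TestCase[], select: Select = {}): TestCase[] {
  return cases.filter((c) => {
    const idOk = !select.test_ids || select.test_ids.includes(c.id);
    const tagOk =
      !select.tags || select.tags.some((t) => c.tags?.includes(t) ?? false);
    const metaOk =
      !select.metadata ||
      Object.entries(select.metadata).every(([k, v]) => c.metadata?.[k] === v);
    return idOk && tagOk && metaOk;
  });
}
```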

Result artifacts are now written under `.agentv/results/<eval-name>/<timestamp>/...`. Imported suite cases are nested under their source suite name. Circular suite imports are rejected before execution with an error that reports the import chain; non-cyclic deep chains still work.
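One common way to get this behavior is a depth-first walk that tracks the current import chain. The sketch below is an illustration of that technique, not the PR's implementation: it works on an in-memory graph keyed by already-canonicalized paths, and the `ImportGraph` type, the `assertNoImportCycle` name, and the error wording are all hypothetical.

```typescript
// Maps a canonical eval file path to the canonical paths of the suites it imports.
// Path canonicalization is assumed to have happened before this graph is built.
type ImportGraph = Map<string, string[]>;

// Throws with the full import chain if a cycle is reachable from `entry`.
function assertNoImportCycle(graph: ImportGraph, entry: string): void {
  const done = new Set<string>(); // files whose imports were fully checked
  const chain: string[] = []; // files on the current import path

  const visit = (file: string): void => {
    const seenAt = chain.indexOf(file);
    if (seenAt !== -1) {
      const cycle = [...chain.slice(seenAt), file].join(" -> ");
      throw new Error(`Circular suite import: ${cycle}`);
    }
    if (done.has(file)) return; // shared, already-verified suite (diamond import)
    chain.push(file);
    for (const child of graph.get(file) ?? []) visit(child);
    chain.pop();
    done.add(file);
  };

  visit(entry);
}
```

Tracking the chain separately from the `done` set is what lets a suite be imported from two different parents without being reported as a cycle.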

Design notes

Area	Decision
Runtime config	`experiment:` in `*.eval.yaml`; `execution:` is legacy alias only
Composition	`tests[].include` plus `type: suite | tests`
Selection	`select.test_ids`, `select.tags`, and `select.metadata`
Scoped runtime	`run:` overrides at include/test level for threshold, repeat, timeout, and budget fields
Suite imports	Recursive, deterministic, cycle-checked by canonical eval file path
Standalone experiments	Removed unpublished experiment schema, fixtures, and examples
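The scoped `run:` overrides in the table can be pictured as a layered merge. The following sketch assumes that narrower scopes win (test-level over include-level over the file's runtime block); that precedence, the `mergeRun` name, and the exact field keys (`timeout_seconds`, `budget_usd`) are assumptions made for illustration, since the PR only names the threshold, repeat, timeout, and budget categories.

```typescript
// Hypothetical field names covering the four categories listed in the design notes.
interface RunOverrides {
  threshold?: number;
  repeat?: number;
  timeout_seconds?: number;
  budget_usd?: number;
}

// Fields set on `override` replace those on `base`; unset fields fall through.
function mergeRun(base: RunOverrides, override: RunOverrides = {}): RunOverrides {
  return {
    threshold: override.threshold ?? base.threshold,
    repeat: override.repeat ?? base.repeat,
    timeout_seconds: override.timeout_seconds ?? base.timeout_seconds,
    budget_usd: override.budget_usd ?? base.budget_usd,
  };
}

// Example layering: file-level runtime, then an include's run:, then a single test's run:.
const fileRuntime: RunOverrides = { threshold: 0.8, repeat: 1 };
const includeRun: RunOverrides = { repeat: 3 };
const testRun: RunOverrides = { threshold: 0.95 };
const effectiveRun = mergeRun(mergeRun(fileRuntime, includeRun), testRun);
```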

Validation

bun packages/core/scripts/generate-eval-schema.ts
bun test packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/trials.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/eval/result-layout.test.ts apps/cli/test/eval.integration.test.ts
bun --filter '@agentv/core' typecheck
bun run typecheck
bun run lint
bun run validate:examples
bun run build
bun test packages/core/test/evaluation/eval-inline-experiment.test.ts --test-name-pattern "(direct circular|indirect circular|deep non-cyclic)"

Not run: full bun run test; live provider/LLM-grader dogfood.

cloudflare-workers-and-pages · 2026-06-26T07:43:40Z

Deploying agentv with Cloudflare Pages

Latest commit:	`61a2639`
Status:	✅ Deploy successful!
Preview URL:	https://0290700a.agentv.pages.dev
Branch Preview URL:	https://impl-inline-experiment.agentv.pages.dev


christso · 2026-06-26T07:47:11Z

Review: not ready to merge

I found and fixed one review issue in `09c411f3`: include-object entries now require an explicit `type: suite | tests`, matching the docs and ADR direction.

Remaining blocker:

Severity	File	Issue
P1	`apps/cli/src/commands/eval/run-eval.ts:1599`	The first eval file's `experiment:` is applied to global options before every eval file is prepared.

`runEvalCommand()` loads `primarySuite` and applies `primarySuite.experimentConfig` to `options` globally. Later, each file calls `applyExperimentOptions(options, suite.experimentConfig)`, but that helper preserves already-populated fields such as `cliTargets`, `agentTimeoutSeconds`, `budgetUsd`, and `threshold`. In a multi-file run where `a.eval.yaml` and `b.eval.yaml` both define `experiment:`, `b` can therefore inherit `a`'s target and runtime settings and fail to apply its own.

Recommended fix: do not globalize the first suite's experiment runtime for multi-file runs. Either apply each eval file's `experiment:` only inside per-file metadata/execution, or explicitly reject multiple eval files with inline `experiment:` until per-file setup/runtime semantics are implemented. Add an integration test with two eval files that use different experiment targets to prevent regressions.
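The leak described above can be reproduced in a few lines. This is a simplified stand-in, not the code in `run-eval.ts`: the option field names (`cliTargets`, `agentTimeoutSeconds`, `budgetUsd`, `threshold`) and the `applyExperimentOptions` name come from the review, while the `ExperimentConfig` shape, the non-mutating style, and the `resolveGlobally` / `resolvePerFile` helpers are hypothetical.

```typescript
interface RunOptions {
  cliTargets?: string[];
  agentTimeoutSeconds?: number;
  budgetUsd?: number;
  threshold?: number;
}

// Hypothetical experiment config shape for illustration.
interface ExperimentConfig {
  targets?: string[];
  timeoutSeconds?: number;
  budgetUsd?: number;
  threshold?: number;
}

// Simplified stand-in for the helper named in the review: it only fills fields
// that are still unset, so values already present on `options` are preserved.
function applyExperimentOptions(
  options: RunOptions,
  exp?: ExperimentConfig,
): RunOptions {
  return {
    cliTargets: options.cliTargets ?? exp?.targets,
    agentTimeoutSeconds: options.agentTimeoutSeconds ?? exp?.timeoutSeconds,
    budgetUsd: options.budgetUsd ?? exp?.budgetUsd,
    threshold: options.threshold ?? exp?.threshold,
  };
}

// Problematic pattern: the first file's experiment is folded into shared options,
// so later files find every field already populated.
function resolveGlobally(cli: RunOptions, files: ExperimentConfig[]): RunOptions[] {
  const shared = applyExperimentOptions(cli, files[0]);
  return files.map((exp) => applyExperimentOptions(shared, exp));
}

// Isolated pattern: each file starts from the untouched CLI options.
function resolvePerFile(cli: RunOptions, files: ExperimentConfig[]): RunOptions[] {
  return files.map((exp) => applyExperimentOptions(cli, exp));
}
```

With two files that name different targets, `resolveGlobally` hands the second file the first file's target, while `resolvePerFile` keeps them separate and still lets explicit CLI values take precedence for every file.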

Validation run after the small review fix:

bun packages/core/scripts/generate-eval-schema.ts
bun test packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
bun run lint
bun --filter '@agentv/core' typecheck

feat(evaluation): inline experiment runtime in eval files

ab12bd3

fix(review): require explicit include entry type

09c411f

christso force-pushed the impl-inline-experiment branch from acbb9a3 to 5f85536 on June 26, 2026 08:52

docs(eval): prefer rubric assertion shorthand

1dddabf

christso force-pushed the impl-inline-experiment branch from 5f85536 to 1dddabf on June 26, 2026 08:53

christso added 4 commits June 26, 2026 11:12

Merge remote-tracking branch 'origin/main' into impl-inline-experiment

287e79e

docs(eval): clarify lifecycle ownership

7e4f459

fix(eval): isolate inline experiment runtime per file

02eef0b

fix(eval): nest imported suite artifacts

61a2639

christso merged commit 5686100 into main Jun 26, 2026
8 checks passed

christso deleted the impl-inline-experiment branch June 26, 2026 09:53

christso merged 7 commits into main from impl-inline-experiment
