feat(evaluation): inline experiment runtime in eval files #1524

Merged: christso merged 7 commits into main from impl-inline-experiment on Jun 26, 2026
@christso

Collaborator

Summary

Eval files now carry experiment runtime directly, so a suite can be promoted into a runnable experiment without a separate unpublished experiment.yaml artifact. The canonical runtime block is experiment:. The legacy execution: block remains as a compatibility alias, and declaring both in the same file now fails validation.
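A minimal sketch of the alias rule described above, assuming a parsed eval file is a plain object. The `EvalFile` and `RuntimeConfig` shapes and the `resolveRuntime` helper are hypothetical names for illustration, not the code in this PR.

```typescript
// Hypothetical shapes: only the `experiment` / `execution` keys come from the PR text.
interface RuntimeConfig {
  target?: string;
  threshold?: number;
  repeat?: number;
}

interface EvalFile {
  name: string;
  experiment?: RuntimeConfig; // canonical runtime block
  execution?: RuntimeConfig; // legacy compatibility alias
}

// Returns the effective runtime config, rejecting files that declare both blocks.
function resolveRuntime(file: EvalFile): RuntimeConfig {
  if (file.experiment !== undefined && file.execution !== undefined) {
    throw new Error(
      `${file.name}: declare either "experiment" or legacy "execution", not both`,
    );
  }
  // Prefer the canonical block; fall back to the alias, then to an empty config.
  return file.experiment ?? file.execution ?? {};
}
```

Rejecting the combination outright, rather than merging the two blocks, avoids having to define precedence between the canonical block and its alias.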

This also moves the previous experiment composition behavior into tests:. Include entries use type: suite | tests, support glob expansion and selection by test id, tags, and metadata, and allow scoped run: overrides for repeat/threshold-style behavior. Suite imports preserve child suite task context while the parent runtime controls execution; raw-case shorthand remains raw-case only.
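The selection behavior can be sketched as a filter over test cases. This assumes that the three criteria are combined with AND and that a tag filter matches when any listed tag is present; the PR text does not state either rule, and `TestCase`, `Select`, and `selectTests` are hypothetical names.

```typescript
// Hypothetical shapes; the select keys mirror select.test_ids / select.tags / select.metadata.
interface TestCase {
  id: string;
  tags?: string[];
  metadata?: Record<string, string>;
}

interface Select {
  test_ids?: string[];
  tags?: string[];
  metadata?: Record<string, string>;
}

// Keeps the tests that satisfy every criterion present in `select`.
function selectTests(tests: TestCase[], select: Select = {}): TestCase[] {
  return tests.filter((test) => {
    if (select.test_ids && !select.test_ids.includes(test.id)) return false;
    // Assumption: a test matches if it carries at least one of the requested tags.
    if (select.tags && !select.tags.some((tag) => (test.tags ?? []).includes(tag))) {
      return false;
    }
    if (select.metadata) {
      // Assumption: every requested metadata key must match exactly.
      for (const [key, value] of Object.entries(select.metadata)) {
        if (test.metadata?.[key] !== value) return false;
      }
    }
    return true;
  });
}
```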

Result artifacts now write under .agentv/results/<eval-name>/<timestamp>/.... Imported suite cases are nested under their source suite name. Circular suite imports are rejected before execution with an error that reports the import chain; deep non-cyclic chains still work.
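One way to reject circular imports while still allowing deep non-cyclic chains is a depth-first walk that tracks the current import chain. The sketch below works on an in-memory graph keyed by canonical file path; `ImportGraph` and `assertNoImportCycle` are hypothetical names, and the actual implementation may differ.

```typescript
// Hypothetical in-memory graph: canonical eval file path -> paths of imported suites.
type ImportGraph = Map<string, string[]>;

// Throws with the offending import chain if `entry` can reach itself through imports.
function assertNoImportCycle(graph: ImportGraph, entry: string): void {
  const done = new Set<string>(); // files whose imports are fully checked

  const visit = (path: string, chain: string[]): void => {
    const seenAt = chain.indexOf(path);
    if (seenAt !== -1) {
      // The file is already on the current chain, so this import closes a cycle.
      const cycle = [...chain.slice(seenAt), path].join(" -> ");
      throw new Error(`Circular suite import: ${cycle}`);
    }
    if (done.has(path)) return; // shared import reached via another branch
    for (const child of graph.get(path) ?? []) {
      visit(child, [...chain, path]);
    }
    done.add(path);
  };

  visit(entry, []);
}
```

Keying the graph by canonical path matters because the same file can be referenced through different relative paths; without canonicalization a cycle could go unnoticed.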

Design notes

Area                     Decision
Runtime config           experiment: in *.eval.yaml; execution: is legacy alias only
Composition              tests[].include plus type: suite | tests
Selection                select.test_ids, select.tags, and select.metadata
Scoped runtime           run: overrides at include/test level for threshold, repeat, timeout, and budget fields
Suite imports            Recursive, deterministic, cycle-checked by canonical eval file path
Standalone experiments   Removed unpublished experiment schema, fixtures, and examples
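The scoped run: overrides can be pictured as layered configs where the most specific layer wins. This is only a sketch: the precedence order (parent runtime, then include-level, then test-level) and the field names `timeoutSeconds` and `budgetUsd` are assumptions, and `RunOverrides` and `mergeRun` are hypothetical.

```typescript
// Hypothetical override shape covering the threshold, repeat, timeout, and budget fields.
interface RunOverrides {
  threshold?: number;
  repeat?: number;
  timeoutSeconds?: number;
  budgetUsd?: number;
}

// Merges layers from least to most specific; later defined values override earlier ones.
function mergeRun(...layers: Array<RunOverrides | undefined>): RunOverrides {
  return layers.reduce<RunOverrides>(
    (acc, layer) => ({
      threshold: layer?.threshold ?? acc.threshold,
      repeat: layer?.repeat ?? acc.repeat,
      timeoutSeconds: layer?.timeoutSeconds ?? acc.timeoutSeconds,
      budgetUsd: layer?.budgetUsd ?? acc.budgetUsd,
    }),
    {},
  );
}
```

For example, `mergeRun(parentRuntime, includeRun, testRun)` would let a single test raise its repeat count without touching the parent threshold.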

Validation

  • bun packages/core/scripts/generate-eval-schema.ts
  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/trials.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/eval/result-layout.test.ts apps/cli/test/eval.integration.test.ts
  • bun --filter '@agentv/core' typecheck
  • bun run typecheck
  • bun run lint
  • bun run validate:examples
  • bun run build
  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts --test-name-pattern "(direct circular|indirect circular|deep non-cyclic)"

Not run: full bun run test; live provider/LLM-grader dogfood.



cloudflare-workers-and-pages (Bot) commented Jun 26, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 61a2639
Status: ✅  Deploy successful!
Preview URL: https://0290700a.agentv.pages.dev
Branch Preview URL: https://impl-inline-experiment.agentv.pages.dev

@christso

Collaborator Author

Review: not ready to merge

I found and fixed one review issue in 09c411f3: include-object entries now require an explicit type: suite | tests, matching the docs and ADR direction.

Remaining blocker:

Severity: P1
File: apps/cli/src/commands/eval/run-eval.ts:1599
Issue: The first eval file's experiment: is applied to global options before every eval file is prepared.

runEvalCommand() loads primarySuite and applies primarySuite.experimentConfig into options globally. Later, each file calls applyExperimentOptions(options, suite.experimentConfig), but that helper preserves already-populated fields such as cliTargets, agentTimeoutSeconds, budgetUsd, and threshold. In a multi-file run where a.eval.yaml and b.eval.yaml both define experiment:, b can inherit a's target/runtime and fail to apply its own experiment settings.

Recommended fix: do not globalize the first suite's experiment runtime for multi-file runs. Either apply each eval file's experiment: only inside per-file metadata/execution, or explicitly reject multiple eval files with inline experiment: until per-file setup/runtime semantics are implemented. Add an integration test with two eval files that use different experiment targets to prevent regressions.
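The failure mode and the first suggested fix can be illustrated with a fill-only-if-empty merge. This is a simplified stand-in, not the real applyExperimentOptions or runEvalCommand: `fillMissing`, `prepareGlobalized`, and `preparePerFile` are hypothetical, and the option shapes are reduced to two fields.

```typescript
// Reduced, hypothetical option shapes; field names echo those mentioned in the review.
interface RunOptions {
  cliTargets?: string[];
  threshold?: number;
}

interface ExperimentConfig {
  targets?: string[];
  threshold?: number;
}

interface Suite {
  file: string;
  experimentConfig?: ExperimentConfig;
}

// Stand-in for a helper that preserves already-populated fields.
function fillMissing(options: RunOptions, exp?: ExperimentConfig): RunOptions {
  return {
    cliTargets: options.cliTargets ?? exp?.targets,
    threshold: options.threshold ?? exp?.threshold,
  };
}

// Problem shape: the first suite's experiment is folded into shared options first,
// so later suites find every field already populated.
function prepareGlobalized(cli: RunOptions, suites: Suite[]): RunOptions[] {
  const globalOptions = fillMissing(cli, suites[0]?.experimentConfig);
  return suites.map((suite) => fillMissing(globalOptions, suite.experimentConfig));
}

// Fix sketch: each file starts from the untouched CLI options.
function preparePerFile(cli: RunOptions, suites: Suite[]): RunOptions[] {
  return suites.map((suite) => fillMissing(cli, suite.experimentConfig));
}
```

With two suites targeting `target-a` and `target-b`, the globalized version resolves both to `target-a`, while the per-file version keeps each suite's own target and still lets explicit CLI values win.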

Validation run after the small review fix:

  • bun packages/core/scripts/generate-eval-schema.ts
  • bun test packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts
  • bun run lint
  • bun --filter '@agentv/core' typecheck

christso force-pushed the impl-inline-experiment branch from acbb9a3 to 5f85536 (June 26, 2026 08:52)
christso force-pushed the impl-inline-experiment branch from 5f85536 to 1dddabf (June 26, 2026 08:53)
christso merged commit 5686100 into main (Jun 26, 2026); 8 checks passed
christso deleted the impl-inline-experiment branch (June 26, 2026 09:53)