feat(evaluation): inline experiment runtime in eval files #1524
Deploying agentv with
| Latest commit: | 61a2639 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0290700a.agentv.pages.dev |
| Branch Preview URL: | https://impl-inline-experiment.agentv.pages.dev |
Review: not ready to merge
I found and fixed one review issue in
Remaining blocker:
Recommended fix: do not globalize the first suite's experiment runtime for multi-file runs. Either apply each eval file's
Validation run after the small review fix:
acbb9a3 to 5f85536
5f85536 to 1dddabf
Summary
Eval files now carry experiment runtime directly, so a suite can be promoted into a runnable experiment without a separate unpublished `experiment.yaml` artifact. The canonical runtime block is `experiment:`, legacy `execution:` remains a compatibility alias, and using both now fails validation.

This also moves the previous experiment composition behavior into `tests:`. Include entries use `type: suite | tests`, support glob expansion and selection by test id, tags, and metadata, and allow scoped `run:` overrides for repeat/threshold-style behavior. Suite imports preserve child suite task context while the parent runtime controls execution; raw-case shorthand remains raw-case only.

Result artifacts now write under `.agentv/results/<eval-name>/<timestamp>/...`. Imported suite cases are nested under their source suite name, and circular suite imports are rejected before execution with the import chain, while non-cyclic deep chains still work.

Design notes
- `experiment:` in `*.eval.yaml`; `execution:` is legacy alias only
- `tests[].include` plus `type: suite`
- `select.test_ids`, `select.tags`, and `select.metadata`
- `run:` overrides at include/test level for threshold, repeat, timeout, and budget fields

Validation
- `bun packages/core/scripts/generate-eval-schema.ts`
- `bun test packages/core/test/evaluation/eval-inline-experiment.test.ts packages/core/test/evaluation/trials.test.ts packages/core/test/evaluation/validation/eval-file-schema.test.ts packages/core/test/evaluation/validation/eval-validator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts apps/cli/test/commands/eval/artifact-writer.test.ts apps/cli/test/commands/eval/result-layout.test.ts apps/cli/test/eval.integration.test.ts`
- `bun --filter '@agentv/core' typecheck`
- `bun run typecheck`
- `bun run lint`
- `bun run validate:examples`
- `bun run build`
- `bun test packages/core/test/evaluation/eval-inline-experiment.test.ts --test-name-pattern "(direct circular|indirect circular|deep non-cyclic)"`

Not run: full `bun run test`; live provider/LLM-grader dogfood.
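
For reviewers, the circular-import behavior described in the Summary (reject a cycle before execution and report the import chain, while deep non-cyclic chains pass) can be illustrated with a small depth-first search. This is only a sketch under assumptions, not the code in this PR: `ImportGraph` and `findImportCycle` are hypothetical names, and the graph is assumed to already map each suite to the suites it includes.

```typescript
// Hypothetical shape (not from the PR): suite name -> names of the suites it includes.
type ImportGraph = Record<string, string[]>;

// Returns the import chain that closes a cycle (e.g. ["a", "b", "a"]),
// or null when no cycle is reachable from `root`.
function findImportCycle(graph: ImportGraph, root: string): string[] | null {
  const done = new Set<string>(); // suites fully explored without finding a cycle
  const stack: string[] = [];     // current import chain, starting at root

  const visit = (suite: string): string[] | null => {
    const seenAt = stack.indexOf(suite);
    // The suite is already on the current chain: report the looping part of the chain.
    if (seenAt !== -1) return [...stack.slice(seenAt), suite];
    // Already explored through another path (diamond import): not a cycle.
    if (done.has(suite)) return null;

    stack.push(suite);
    for (const child of graph[suite] ?? []) {
      const cycle = visit(child);
      if (cycle) return cycle;
    }
    stack.pop();
    done.add(suite);
    return null;
  };

  return visit(root);
}
```

Keeping the current chain separate from the set of finished suites is what lets a suite be imported through two different paths without being misreported as circular.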