Open
Description
Objective
Add a --threshold CLI flag to agentv eval that fails (exit 1) if the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without needing agentv compare --baseline.
Design Latitude
- Add --threshold <number> flag (0-1 scale) to agentv eval
- Also support execution.threshold in EVAL.yaml for per-suite defaults
- CLI flag overrides YAML value
- After all tests complete, compute mean score; if below threshold, exit code 1 with summary
- Integrate with existing JUnit XML output (test-level pass/fail based on threshold)
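One possible shape for the resolution and gating logic described above, as a minimal sketch. All names here (ThresholdSources, resolveThreshold, meanScore, passesGate) are hypothetical and not taken from the AgentV codebase. Treating an empty suite as a mean of 0 and rejecting values outside 0-1 are assumptions, not part of this issue.

```typescript
// Hypothetical types and helpers for illustration; not AgentV internals.
interface ThresholdSources {
  cliThreshold?: number;  // value passed via --threshold, if any
  yamlThreshold?: number; // execution.threshold from EVAL.yaml, if any
}

// The CLI flag overrides the YAML value. undefined means no gate is configured.
function resolveThreshold(src: ThresholdSources): number | undefined {
  const value = src.cliThreshold ?? src.yamlThreshold;
  if (value === undefined) return undefined;
  // Assumption: values outside the 0-1 scale are rejected up front.
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`threshold must be between 0 and 1, got ${value}`);
  }
  return value;
}

// Mean score across all tests in the suite.
function meanScore(scores: number[]): number {
  if (scores.length === 0) return 0; // assumption: an empty suite scores 0
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// The gate passes when no threshold is set or the mean meets the threshold.
function passesGate(scores: number[], threshold: number | undefined): boolean {
  return threshold === undefined || meanScore(scores) >= threshold;
}
```

Using `??` rather than `||` keeps an explicit `--threshold 0` from falling through to the YAML value.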
Acceptance Signals
- agentv eval evals/ --threshold 0.6 exits 1 if mean score < 0.6
- execution.threshold: 0.6 in YAML has the same effect
- CLI --threshold overrides YAML execution.threshold
- Summary line printed: "Suite score: 0.53 (threshold: 0.6) — FAIL"
- Exit code 0 when score meets threshold
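The summary line and exit code from the signals above could be produced along these lines. This is a sketch only: GateResult and evaluateGate are hypothetical names, and the "PASS" label and two-decimal rounding are assumptions inferred from the single FAIL example in this issue.

```typescript
// Hypothetical helper for illustration; not AgentV's actual implementation.
interface GateResult {
  summary: string;
  exitCode: 0 | 1;
}

function evaluateGate(mean: number, threshold: number): GateResult {
  // Compare the unrounded mean; rounding is for display only.
  const passed = mean >= threshold;
  const verdict = passed ? "PASS" : "FAIL"; // "PASS" wording is an assumption
  return {
    summary: `Suite score: ${mean.toFixed(2)} (threshold: ${threshold}) — ${verdict}`,
    exitCode: passed ? 0 : 1,
  };
}

// Example: a suite mean of 0.53 against a threshold of 0.6.
const result = evaluateGate(0.53, 0.6);
console.log(result.summary);
```

A caller would then exit the process with result.exitCode. Because the comparison uses the unrounded mean, a mean of 0.599 would be displayed as 0.60 and still fail against a threshold of 0.6; whether that is acceptable is a design choice left open here.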
Non-Goals
- Not a replacement for agentv compare regression gating (different use case)
- Not per-test threshold override (use required for that)
- Not severity levels (feat(eval): composable quality gates with auto-remediation triggers #334 covers that separately)
Context
Identified via microsoft/skills harness research. The skills harness uses --threshold 60 (0-100 scale) for CI quality gates. AgentV currently has per-test required gates and agentv compare --baseline regression gating, but no suite-level threshold.