feat(cli): --threshold flag for suite-level quality gates #698

@christso

Objective

Add a --threshold CLI flag to agentv eval that makes the run fail (exit code 1) when the mean score across all tests falls below the specified threshold. This enables CI/CD quality gating without requiring agentv compare --baseline.

Design Latitude

  • Add --threshold <number> flag (0-1 scale) to agentv eval
  • Also support execution.threshold in EVAL.yaml for per-suite defaults
  • CLI flag overrides YAML value
  • After all tests complete, compute the mean score; if it is below the threshold, print a summary and exit with code 1
  • Integrate with existing JUnit XML output (test-level pass/fail based on threshold)
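The precedence and gating rules above can be sketched as follows. This is a minimal illustration, not AgentV's implementation: the function names `resolve_threshold` and `gate_exit_code` are hypothetical, and the behaviour when no threshold is set or no tests ran (exit 0) is an assumption the issue does not specify.

```python
from statistics import fmean
from typing import Optional, Sequence


def resolve_threshold(cli_value: Optional[float],
                      yaml_value: Optional[float]) -> Optional[float]:
    """Pick the effective threshold: the CLI flag wins over the YAML default.

    Returns None when neither source sets a threshold (no gate configured).
    """
    threshold = cli_value if cli_value is not None else yaml_value
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
    return threshold


def gate_exit_code(scores: Sequence[float],
                   threshold: Optional[float]) -> int:
    """Return 1 when the mean score is below the threshold, otherwise 0."""
    if threshold is None or not scores:
        # Assumption: with no gate configured, or nothing scored, the run passes.
        return 0
    return 1 if fmean(scores) < threshold else 0
```

Comparing against `None` rather than relying on truthiness keeps an explicit `--threshold 0` from silently falling back to the YAML value.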

Acceptance Signals

  • agentv eval evals/ --threshold 0.6 exits 1 if mean score < 0.6
  • execution.threshold: 0.6 in YAML has the same effect
  • CLI --threshold overrides YAML execution.threshold
  • Summary line printed: "Suite score: 0.53 (threshold: 0.6) — FAIL"
  • Exit code 0 when the mean score meets or exceeds the threshold
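One way to produce the summary line from the acceptance signals is sketched below. The helper name `format_summary` is hypothetical, and the "PASS" wording is an assumption, since the issue only shows the failing form of the line.

```python
def format_summary(mean_score: float, threshold: float) -> str:
    """Build the suite summary line in the format quoted in the issue."""
    # Assumption: a passing run uses the same layout with "PASS" as the verdict.
    verdict = "FAIL" if mean_score < threshold else "PASS"
    # \u2014 is the dash character used in the issue's example line.
    return (f"Suite score: {mean_score:.2f} "
            f"(threshold: {threshold:g}) \u2014 {verdict}")
```

The score is shown to two decimals, while the threshold is printed as entered (`0.6`, not `0.60`) to match the example in the issue.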

Non-Goals

Context

Identified via microsoft/skills harness research. The skills harness uses --threshold 60 (0-100 scale) for CI quality gates. AgentV currently has per-test required gates and agentv compare --baseline regression gating, but no suite-level threshold.
