feat(eval): Ralph Loop — iterative improvement with feedback injection #699

@christso

Objective

Add an iterative generate→evaluate→feedback→regenerate loop (Ralph Loop) to AgentV. When enabled, a failing test case is re-prompted with structured feedback about what went wrong, for up to N iterations or until the quality threshold is met.
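
A minimal sketch of the loop described above, to make the control flow concrete. It is not AgentV code: `ralph_loop`, `generate`, and `evaluate` are hypothetical names, and only the threshold and iteration-budget stop conditions are modeled here.

```python
from typing import Callable, List, Tuple

def ralph_loop(
    prompt: str,
    generate: Callable[[str], str],                      # hypothetical: calls the target
    evaluate: Callable[[str], Tuple[float, List[str]]],  # hypothetical: (score, failure messages)
    threshold: float = 0.8,
    max_iterations: int = 3,
) -> Tuple[str, List[float]]:
    """Re-prompt with feedback until the score meets the threshold or the budget runs out."""
    scores: List[float] = []
    current_prompt = prompt
    output = ""
    for _ in range(max_iterations):
        output = generate(current_prompt)
        score, failures = evaluate(output)
        scores.append(score)
        if score >= threshold:
            break
        # The original prompt is preserved; feedback is appended, never substituted.
        feedback = "\n".join(f"- {msg}" for msg in failures)
        current_prompt = f"{prompt}\n\nFeedback on previous attempt:\n{feedback}"
    return output, scores

# Stub target and evaluator, for illustration only.
def fake_generate(p: str) -> str:
    return "good" if "Feedback" in p else "bad"

def fake_evaluate(o: str) -> Tuple[float, List[str]]:
    return (0.9, []) if o == "good" else (0.4, ["missing required field"])

final_output, trajectory = ralph_loop("Write the report.", fake_generate, fake_evaluate)
```

With these stubs the first attempt fails, the second attempt sees the appended feedback and passes, so the loop stops after two iterations.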

Design Latitude

YAML configuration (suggested shape, flexible):

execution:
  ralph:
    max_iterations: 3
    threshold: 0.8
    improvement_threshold: 0.05
    feedback_template: ./feedback.md  # optional custom template
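
One way the `ralph` block could be validated after YAML parsing, as a sketch. `RalphConfig` and `ralph_config_from_mapping` are hypothetical names. The defaults simply reuse the example values above, and the 0 to 1 score range is an assumption; neither is specified by this issue.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass(frozen=True)
class RalphConfig:
    # Defaults mirror the example values in this issue; they are assumptions.
    max_iterations: int = 3
    threshold: float = 0.8
    improvement_threshold: float = 0.05
    feedback_template: Optional[str] = None  # path to a custom markdown template

    def __post_init__(self) -> None:
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be >= 1")
        if not 0.0 <= self.threshold <= 1.0:  # assumes scores are normalized to [0, 1]
            raise ValueError("threshold must be in [0, 1]")
        if self.improvement_threshold < 0.0:
            raise ValueError("improvement_threshold must be >= 0")

def ralph_config_from_mapping(execution: Dict[str, Any]) -> Optional[RalphConfig]:
    """Build a config from the parsed `execution` mapping; None means the loop is disabled."""
    raw = execution.get("ralph")
    if raw is None:
        return None
    return RalphConfig(**raw)

cfg = ralph_config_from_mapping(
    {"ralph": {"max_iterations": 3, "threshold": 0.8,
               "improvement_threshold": 0.05, "feedback_template": "./feedback.md"}}
)
```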

CLI flag: agentv eval evals/ --ralph --max-iterations 3

Key components

  1. Orchestrator loop — wraps existing evaluate-single-case flow with retry logic
  2. Feedback builder — converts assertion failures into structured LLM-actionable feedback injected into the next prompt
  3. Stop conditions (from microsoft/skills harness):
    • quality_threshold_met — score >= threshold
    • perfect_score — score >= 1.0
    • max_iterations_reached — exhausted budget
    • no_improvement — improvement < improvement_threshold
    • score_regression — score went down
  4. Per-iteration result tracking — score trajectory, which iteration passed, stop reason
  5. Feedback template system — default template that formats failures by severity with suggestions; user can override with custom markdown template
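
The stop conditions in item 3 could be checked after each iteration roughly as follows. This is a sketch: the function name is hypothetical, and both the precedence order and the choice to compare against the previous iteration (rather than the best score so far) are assumptions that this issue does not settle.

```python
from typing import List, Optional

def stop_reason(
    scores: List[float],
    threshold: float,
    improvement_threshold: float,
    max_iterations: int,
) -> Optional[str]:
    """Return the stop reason after the latest iteration, or None to keep iterating."""
    if not scores:
        raise ValueError("at least one iteration score is required")
    current = scores[-1]
    # Success conditions are checked first (assumed precedence).
    if current >= 1.0:
        return "perfect_score"
    if current >= threshold:
        return "quality_threshold_met"
    # Trend conditions need a previous iteration to compare against.
    if len(scores) >= 2:
        delta = current - scores[-2]
        if delta < 0:
            return "score_regression"
        if delta < improvement_threshold:
            return "no_improvement"
    if len(scores) >= max_iterations:
        return "max_iterations_reached"
    return None
```

Checking the success conditions before the trend conditions means a run that crosses the threshold on a small gain still reports `quality_threshold_met` instead of `no_improvement`.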

Result schema extension

{
  "ralph": {
    "iterations": 3,
    "scores": [0.4, 0.7, 0.9],
    "stop_reason": "quality_threshold_met",
    "improvement": 0.5
  }
}
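
A sketch of how this block could be derived from the score trajectory. `summarize_ralph` is a hypothetical helper, and defining `improvement` as last score minus first score is an assumption, chosen because it matches the example above (0.9 - 0.4 = 0.5).

```python
from typing import Any, Dict, List

def summarize_ralph(scores: List[float], stop_reason: str) -> Dict[str, Any]:
    """Build the `ralph` result block from per-iteration scores."""
    if not scores:
        raise ValueError("at least one iteration score is required")
    return {
        "ralph": {
            "iterations": len(scores),
            "scores": list(scores),
            "stop_reason": stop_reason,
            # Assumed definition: final score minus first score, rounded to hide float noise.
            "improvement": round(scores[-1] - scores[0], 4),
        }
    }

result = summarize_ralph([0.4, 0.7, 0.9], "quality_threshold_met")
```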

Acceptance Signals

  • agentv eval evals/ --ralph re-prompts failing tests with feedback
  • Feedback includes assertion failures grouped by severity
  • Stops when the threshold is met, max iterations are reached, or the score shows no improvement
  • Results include per-iteration scores and stop reason
  • Works with all target types (CLI agents, API providers)
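
A sketch of a feedback builder that groups assertion failures by severity, as the second signal requires. Everything named here is hypothetical: the `AssertionFailure` shape, the severity levels, and the `$score` / `$failures` placeholders (Python `string.Template` syntax) are assumptions, not the template format AgentV would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from string import Template
from typing import Dict, List

@dataclass
class AssertionFailure:
    name: str
    severity: str        # hypothetical levels: "critical", "major", "minor"
    message: str
    suggestion: str = ""

SEVERITY_ORDER = ["critical", "major", "minor"]

# Hypothetical default template; a custom markdown file could replace it.
DEFAULT_TEMPLATE = Template(
    "Your previous attempt scored $score. Fix the issues below and try again.\n\n$failures"
)

def build_feedback(
    failures: List[AssertionFailure],
    score: float,
    template: Template = DEFAULT_TEMPLATE,
) -> str:
    """Render failures grouped by severity, most severe first."""
    grouped: Dict[str, List[AssertionFailure]] = defaultdict(list)
    for failure in failures:
        grouped[failure.severity].append(failure)
    # Known severities in fixed order, then any unrecognized ones alphabetically.
    ordered = SEVERITY_ORDER + sorted(s for s in grouped if s not in SEVERITY_ORDER)
    sections = []
    for severity in ordered:
        items = grouped.get(severity)
        if not items:
            continue
        lines = [f"{severity.upper()}:"]
        for item in items:
            line = f"- {item.name}: {item.message}"
            if item.suggestion:
                line += f" (suggestion: {item.suggestion})"
            lines.append(line)
        sections.append("\n".join(lines))
    return template.substitute(score=f"{score:.2f}", failures="\n\n".join(sections))

feedback_text = build_feedback(
    [
        AssertionFailure("style", "minor", "line too long"),
        AssertionFailure("schema", "critical", "missing 'id' field", "add an 'id' key"),
    ],
    score=0.4,
)
```

The rendered text would be appended to the original prompt for the next iteration, which keeps the original prompt intact as the non-goals require.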

Non-Goals

  • Not multi-agent orchestration (single agent, iterative refinement)
  • Not automatic prompt rewriting (feedback is appended, original prompt preserved)
  • Not a replacement for trials (trials = same prompt N times; Ralph = feedback-augmented retries)

Context

Core pattern from the microsoft/skills eval harness. Named after the "Sensei" technique by Shayne Boyer. The skills harness uses this across 1114+ scenarios and reports significant quality improvements (often 40-60 point score jumps in 2-3 iterations).

Related
