feat(evaluation): migrate evaluation harnesses from playground #279

Draft
miguelgfierro wants to merge 56 commits into main from feat/evaluation-framework

miguelgfierro commented Jun 19, 2026

Problem

The first migration of evaluation code into fireflyframework_agentic/evaluation/ brought over too much infrastructure alongside the metrics: a CLI (flyeval), a five-gate framework (G1–G5), a registry/corpus/matcher pipeline, a scorecard renderer, champion persistence, run-config snapshotting, and statistical helpers. The actual value — the LLM judge metric functions — was buried under 10 supporting files.

The goal of this PR (revised) is to gut that infrastructure and keep only the measurement code.

What We Keep

All metric functions from both evaluation systems:

Flyradar G4:

  • [D] deterministic: source_coverage, excerpt_fill_rate
  • [E] embedding-based: semantic_recovery
  • [J] LLM judge: faithfulness, numeric_temporal_fidelity, citation_relevance, nc_semantic_precision, fabricated_entity, contradiction, open_gap, actionability, severity_calibration, answer_relevancy, surface_deduplication, comparative_vs_champion (champion passed as optional parameter — no persistence)

Flycanon:

  • Custom: contains_answer, addresses_question (median of N LLM calls per item)
  • RAGAS: answer_correctness, answer_relevancy, faithfulness, context_recall, context_precision

Retrieval (lab/retrieval_metrics.py): unchanged.

What We Delete

From fireflyframework_agentic/evaluation/:

| File | Reason |
| --- | --- |
| cli.py | flyeval CLI: experiment orchestration, not measurement |
| gates.py | G1–G5 gate framework: pipeline infrastructure |
| corpus.py | Corpus loader: pipeline infrastructure |
| registry.py | Registry management: pipeline infrastructure |
| matcher.py | Anchored matching utilities: pipeline infrastructure |
| scorecard.py | Scorecard renderer: reporting, not measurement |
| run_config_snapshot.py | Run config capture: pipeline infrastructure |
| models.py | EvalConfig, GateVerdict: only used by deleted files |
| stats.py | aa_band, aggregate_grounding: only used by deleted files |
| champion.py | Champion persistence: comparative_vs_champion accepts champion data as a parameter instead |

The tests for the deleted modules are removed as well: test_champion.py, test_gates.py, test_matcher.py, test_stats.py.

Target Package Layout

fireflyframework_agentic/evaluation/
├── __init__.py       # exports: EvalContext, AdvisoryReport, all metric functions
├── judge_client.py   # JudgeClient — async LLM scoring client (httpx.AsyncClient)
└── judge.py          # ALL metric functions + EvalContext + AdvisoryReport

Three files. No CLI. No gates. No registry.

Unified Interface

Every metric — flyradar [D], [E], [J], flycanon custom, and RAGAS — shares the same async signature:

async def metric_name(item: dict, ctx: EvalContext) -> float | None: ...

item is a plain dict with a normalized schema:

{
    "question": str,
    "answer": str,
    "reference": str,
    "contexts": list[str],
    # flyradar extras (optional):
    "sources": list[str],
    "excerpts": list[str],
    # for comparative_vs_champion (optional):
    "champion_answer": str | None,
}
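
The schema above could also be written down as a typed view of the dict. This is a minimal sketch rather than code from this PR: `EvalItem` and `missing_keys` are hypothetical names, and treating only `question` and `answer` as required is an assumption, since the description does not say which keys every metric needs.

```python
from __future__ import annotations

from typing import TypedDict


class EvalItem(TypedDict, total=False):
    """Hypothetical typed view of the normalized item dict (all keys optional)."""

    question: str
    answer: str
    reference: str
    contexts: list[str]
    sources: list[str]           # flyradar extra
    excerpts: list[str]          # flyradar extra
    champion_answer: str | None  # only read by comparative_vs_champion


def missing_keys(
    item: dict,
    required: tuple[str, ...] = ("question", "answer"),  # assumed minimum
) -> list[str]:
    """Return the required keys that are absent from item, in declaration order."""
    return [key for key in required if key not in item]
```

A metric could call `missing_keys(item)` first and return `None` when the list is non-empty, which keeps items plain dicts while still catching malformed input early.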

EvalContext is a Pydantic model carrying all dependencies:

from pydantic import BaseModel, ConfigDict

class EvalContext(BaseModel):
    # JudgeClient and OllamaEmbedder are plain classes, not Pydantic models
    model_config = ConfigDict(arbitrary_types_allowed=True)

    client: JudgeClient                     # async LLM call client (all metrics)
    embedder: OllamaEmbedder | None = None  # [E] metrics + RAGAS embeddings
    runs: int = 3                           # flycanon multi-run median

EvalContext has no ragas_llm or ragas_embeddings fields: RAGAS metrics wrap ctx.client and ctx.embedder in LangChain adapters internally, so callers see one client and one embedder.
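
The `runs` field drives the flycanon "median of N LLM calls per item" behaviour. One way that could look is sketched below. `median_of_runs` and `fake_judge_call` are hypothetical names, and dropping `None` results before taking the median is an assumption about how failed runs are treated.

```python
from __future__ import annotations

import asyncio
import statistics
from collections.abc import Awaitable, Callable


async def median_of_runs(
    score_once: Callable[[], Awaitable[float | None]],
    runs: int = 3,
) -> float | None:
    """Call score_once `runs` times concurrently and return the median score.

    Runs that return None are ignored (assumption); if every run returns
    None, the result is None.
    """
    results = await asyncio.gather(*(score_once() for _ in range(runs)))
    valid = [r for r in results if r is not None]
    return statistics.median(valid) if valid else None


async def _demo() -> float | None:
    # Stand-in for one LLM judge call; yields a fixed sequence of scores.
    canned = iter([1.0, 0.0, 1.0])

    async def fake_judge_call() -> float | None:
        return next(canned)

    return await median_of_runs(fake_judge_call, runs=3)


demo_score = asyncio.run(_demo())  # median of [1.0, 0.0, 1.0]
```

Taking the median rather than the mean means a single outlier judgement out of three does not move the item's score.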

Composable type alias:

from collections.abc import Awaitable, Callable
Metric = Callable[[dict, EvalContext], Awaitable[float | None]]

Example:

import asyncio

# Run inside a coroutine; KEY and item are supplied by the caller.
ctx = EvalContext(client=JudgeClient(model="claude-sonnet-4-6", api_key=KEY))
metrics: list[Metric] = [faithfulness, contains_answer, answer_correctness]
scores = await asyncio.gather(*[m(item, ctx) for m in metrics])
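
Because `asyncio.gather` returns results in the order its awaitables were passed, the scores can be paired back with their metrics by position. The sketch below runs without the package: `StubContext`, `answer_nonempty`, `needs_reference`, and `score_item` are hypothetical stand-ins, and reading `None` as "metric not applicable to this item" is an assumption.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from dataclasses import dataclass


@dataclass
class StubContext:
    """Stand-in for EvalContext so the sketch has no package dependency."""

    runs: int = 3


Metric = Callable[[dict, StubContext], Awaitable["float | None"]]


async def answer_nonempty(item: dict, ctx: StubContext) -> float | None:
    """Toy deterministic metric: 1.0 if the answer contains any text."""
    return 1.0 if item.get("answer", "").strip() else 0.0


async def needs_reference(item: dict, ctx: StubContext) -> float | None:
    """Toy metric that returns None when the item has no reference."""
    reference = item.get("reference")
    if not reference:
        return None
    return float(reference.lower() in item.get("answer", "").lower())


async def score_item(
    item: dict, ctx: StubContext, metrics: list[Metric]
) -> dict[str, float | None]:
    """Run all metrics concurrently and key each score by function name."""
    scores = await asyncio.gather(*(m(item, ctx) for m in metrics))
    return dict(zip((m.__name__ for m in metrics), scores))


report = asyncio.run(
    score_item(
        {"question": "Capital of France?", "answer": "Paris"},
        StubContext(),
        [answer_nonempty, needs_reference],
    )
)
```

Keying by `__name__` only works while metric names are unique across families, so two functions with the same name would need distinct keys.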

judge_client.py

Contains only JudgeClient — a thin async HTTP client for Anthropic/OpenAI/Ollama scoring calls:

class JudgeClient:
    async def chat_json(self, system: str, user: str, max_tokens: int = 200) -> dict: ...

Uses httpx.AsyncClient. Handles 429/5xx retry with Retry-After parsing. No embedding logic — embeddings come from embeddings/providers/ollama.py (existing async OllamaEmbedder). cosine_similarity imported from embeddings/similarity.py.
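
The retry behaviour described above can be sketched with the standard library alone. This is not the `JudgeClient` implementation: `retry_after_seconds` and `call_with_retry` are hypothetical names, the retryable status set and the exponential fallback delay are assumptions, and `send` stands in for the actual httpx request. `Retry-After` may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable, Mapping
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})  # assumed retry set

# (status, headers, body) triple; a stand-in for an HTTP response object.
Response = tuple[int, Mapping[str, str], object]


def retry_after_seconds(header: str | None, fallback: float) -> float:
    """Turn a Retry-After value into a non-negative delay in seconds."""
    if not header:
        return fallback
    value = header.strip()
    if value.isdigit():  # delta-seconds form, e.g. "7"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


async def call_with_retry(
    send: Callable[[], Awaitable[Response]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> tuple[int, object]:
    """Await send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = await send()
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            return status, body
        # Prefer the server's hint; otherwise back off exponentially.
        delay = retry_after_seconds(headers.get("retry-after"), base_delay * 2**attempt)
        await sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so tests can record delays instead of waiting. Note that a plain dict lookup is case-sensitive, whereas httpx response headers are case-insensitive.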

Dependencies

evaluation optional extra changes:

  • Remove: scipy (only used by deleted stats.py)
  • Add: ragas, langchain-anthropic, langchain-ollama
  • Keep: numpy

No changes to embeddings/ — existing async OllamaEmbedder used as-is.

Test plan

  • pytest tests/unit/evaluation/test_judge.py tests/unit/lab/test_retrieval_metrics.py — all passing
  • Each metric callable independently with a mocked EvalContext
  • asyncio.gather(*[m(item, ctx) for m in metrics]) composes correctly across families
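
Calling a metric against a mocked context, as the test plan describes, might look like the sketch below. `toy_judge_metric`, `make_mock_ctx`, and the `{"score": ...}` reply shape are hypothetical; only the `chat_json(system, user)` call mirrors the `JudgeClient` signature shown earlier, and returning `None` on a malformed reply is an assumption.

```python
from __future__ import annotations

import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


async def toy_judge_metric(item: dict, ctx) -> float | None:
    """Hypothetical [J]-style metric: asks the judge for a JSON score."""
    reply = await ctx.client.chat_json(
        system='Score the answer from 0 to 1. Reply as JSON: {"score": <float>}.',
        user=f"Question: {item['question']}\nAnswer: {item['answer']}",
    )
    score = reply.get("score")
    return float(score) if isinstance(score, (int, float)) else None


def make_mock_ctx(reply: dict) -> SimpleNamespace:
    """Build a context whose client returns a canned reply without any network call."""
    client = SimpleNamespace(chat_json=AsyncMock(return_value=reply))
    return SimpleNamespace(client=client, embedder=None, runs=3)


def test_metric_reads_score_from_judge_reply() -> None:
    ctx = make_mock_ctx({"score": 0.75})
    item = {"question": "q", "answer": "a"}
    assert asyncio.run(toy_judge_metric(item, ctx)) == 0.75
    ctx.client.chat_json.assert_awaited_once()


def test_metric_returns_none_on_malformed_reply() -> None:
    ctx = make_mock_ctx({"verdict": "good"})
    item = {"question": "q", "answer": "a"}
    assert asyncio.run(toy_judge_metric(item, ctx)) is None
```

Because every dependency travels through the context argument, the mock replaces the judge in one place and the metric itself needs no patching.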

miguelgfierro and others added 11 commits June 18, 2026 23:33
…try point (#268)

* feat(evaluation): add evaluation subpackage __init__ with gate/champion/judge/retrieval exports

* feat(evaluation): add EvalConfig and GateVerdict models

* feat(evaluation): add evaluation optional-deps and flyeval CLI entry point to pyproject.toml

* feat(evaluation): note evaluation as optional subpackage in top-level __init__ docstring

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add matcher primitives (anchored, matches, source_stem, tokens)

* feat(evaluation): add statistics helpers (aa_band, aggregate_grounding, left_skew_flag)

* feat(evaluation): export matcher and stats primitives from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add corpus loader and evidence verification module

* feat(evaluation): add lean-1 registry loader and RegistryItem/Registry models

* feat(evaluation): re-export corpus and registry symbols from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add G1-G5 gate framework (GateResult, run_gates, g2_recall_precision)

* feat(evaluation): export g2_recall_precision from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add scorecard renderer

* feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add JudgeClient and OllamaEmbedder (judge_client.py)

* feat(evaluation): add AdvisoryReport and run_judge with [D]/[E]/[J] metric families (judge.py)

* feat(evaluation): import cosine from judge_client in matcher.py

* feat(evaluation): export JudgeClient, OllamaEmbedder, build_embedder, cosine from evaluation package

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add ChampionRecord and champion management functions

* feat(evaluation): add run_config_snapshot for flyradar run configuration capture

* feat(evaluation): add flyeval CLI with gate, aa-band, day-zero, invalidate subcommands

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
)

* feat(lab): add retrieval_metrics module with compute_retrieval_metrics and RetrieverMetrics

* feat(lab): export RetrieverMetrics and compute_retrieval_metrics from lab package

* feat(evaluation): import RetrieverMetrics and compute_retrieval_metrics from lab.retrieval_metrics

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add flyradar gate evaluation example

* feat(evaluation): add flycanon RAG retrieval evaluation example

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
… metrics (#277)

* feat(evaluation): add tests/unit/evaluation package init

* feat(evaluation): add unit tests for matcher (anchored, source_stem, tokens, matches)

* feat(evaluation): add unit tests for stats (aa_band, aggregate_grounding, left_skew_flag)

* feat(evaluation): add unit tests for gates (GateResult, verdict, render_scorecard, g5_no_regression)

* feat(evaluation): add unit tests for champion (ChampionRecord, load/save/invalidate, input_hash)

* feat(evaluation): add unit tests for retrieval_metrics (compute_retrieval_metrics, RetrieverMetrics)

* feat(evaluation): fix boundary test for left_skew_flag (floating-point precision)

* feat(evaluation): fix no_answer_rate test to match implementation behaviour

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add evaluation package documentation

* docs(evaluation): mention evaluation subpackage in README

---------

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed
Comment thread fireflyframework_agentic/evaluation/scorecard.py Fixed
Comment thread examples/flycanon_eval_example.py Fixed
Comment thread tests/unit/evaluation/test_champion.py Fixed
Comment thread tests/unit/evaluation/test_matcher.py Fixed
Comment thread tests/unit/lab/test_retrieval_metrics.py Fixed
Comment thread tests/unit/lab/test_retrieval_metrics.py Fixed
fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)
Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed
miguelgfierro and others added 20 commits June 19, 2026 13:52
fix: resolve CI gate failures — lint, typecheck, ruff
…-metrics

refactor(evaluation): move retrieval_metrics from lab/ to evaluation/
…ics-as-functions

refactor(evaluation): replace RetrieverMetrics class with plain functions
…etrieval-metrics

Remove compute_retrieval_metrics() aggregate from evaluation