feat(evaluation): migrate evaluation harnesses from playground by miguelgfierro · Pull Request #279 · fireflyframework/fireflyframework-agentic

miguelgfierro · 2026-06-19T07:14:34Z

Problem

The first migration of evaluation code into fireflyframework_agentic/evaluation/ brought over too much infrastructure alongside the metrics: a CLI (flyeval), a five-gate framework (G1–G5), a registry/corpus/matcher pipeline, a scorecard renderer, champion persistence, run-config snapshotting, and statistical helpers. The actual value — the LLM judge metric functions — was buried under 10 supporting files.

The goal of this PR (revised) is to gut that infrastructure and keep only the measurement code.

What We Keep

All metric functions from both evaluation systems:

Flyradar G4:

[D] deterministic: source_coverage, excerpt_fill_rate
[E] embedding-based: semantic_recovery
[J] LLM judge: faithfulness, numeric_temporal_fidelity, citation_relevance, nc_semantic_precision, fabricated_entity, contradiction, open_gap, actionability, severity_calibration, answer_relevancy, surface_deduplication, comparative_vs_champion (champion passed as optional parameter — no persistence)

Flycanon:

Custom: contains_answer, addresses_question (median of N LLM calls per item)
RAGAS: answer_correctness, answer_relevancy, faithfulness, context_recall, context_precision

Retrieval (lab/retrieval_metrics.py): unchanged.

What We Delete

From fireflyframework_agentic/evaluation/:

| File | Reason |
|---|---|
| `cli.py` | `flyeval` CLI: experiment orchestration, not measurement |
| `gates.py` | G1–G5 gate framework: pipeline infrastructure |
| `corpus.py` | Corpus loader: pipeline infrastructure |
| `registry.py` | Registry management: pipeline infrastructure |
| `matcher.py` | Anchored matching utilities: pipeline infrastructure |
| `scorecard.py` | Scorecard renderer: reporting, not measurement |
| `run_config_snapshot.py` | Run config capture: pipeline infrastructure |
| `models.py` | `EvalConfig`, `GateVerdict`: only used by deleted files |
| `stats.py` | `aa_band`, `aggregate_grounding`: only used by deleted files |
| `champion.py` | Champion persistence; `comparative_vs_champion` accepts champion data as a parameter instead |

Tests for deleted modules also removed: test_champion.py, test_gates.py, test_matcher.py, test_stats.py.

Target Package Layout

fireflyframework_agentic/evaluation/
├── __init__.py       # exports: EvalContext, AdvisoryReport, all metric functions
├── judge_client.py   # JudgeClient — async LLM scoring client (httpx.AsyncClient)
└── judge.py          # ALL metric functions + EvalContext + AdvisoryReport

Three files. No CLI. No gates. No registry.

Unified Interface

Every metric — flyradar [D], [E], [J], flycanon custom, and RAGAS — shares the same async signature:

async def metric_name(item: dict, ctx: EvalContext) -> float | None

item is a plain dict with a normalized schema:

{
    "question": str,
    "answer": str,
    "reference": str,
    "contexts": list[str],
    # flyradar extras (optional):
    "sources": list[str],
    "excerpts": list[str],
    # for comparative_vs_champion (optional):
    "champion_answer": str | None,
}

EvalContext is a Pydantic model carrying all dependencies:

from pydantic import BaseModel, ConfigDict


class EvalContext(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    client: JudgeClient                     # async LLM call client (all metrics)
    embedder: OllamaEmbedder | None = None  # [E] metrics + RAGAS embeddings
    runs: int = 3                           # flycanon multi-run median

No ragas_llm / ragas_embeddings — RAGAS metrics wrap ctx.client and ctx.embedder in LangChain adapters internally. Callers see one client, one embedder.
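
The PR does not show the adapter code, so the following is only a sketch of the wrapping idea, written without importing LangChain. `FakeEmbedder` and its `embed` method are stand-ins (the real `OllamaEmbedder` method name is an assumption), and `EmbedderAdapter` is a hypothetical name. A real adapter would most likely subclass `langchain_core.embeddings.Embeddings` and also provide the synchronous `embed_documents` and `embed_query` methods.

```python
import asyncio


class FakeEmbedder:
    """Stand-in for the project's embedder; the `embed` method name is an assumption."""

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # Toy embedding: [character count, lowercase vowel count] per text.
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


class EmbedderAdapter:
    """Duck-typed adapter exposing the async method names LangChain embeddings use."""

    def __init__(self, embedder: FakeEmbedder) -> None:
        self._embedder = embedder

    async def aembed_documents(self, texts: list[str]) -> list[list[float]]:
        # Batch call: one vector per input text, in input order.
        return await self._embedder.embed(texts)

    async def aembed_query(self, text: str) -> list[float]:
        # Single query: reuse the batch call and unwrap the only vector.
        vectors = await self._embedder.embed([text])
        return vectors[0]
```

Keeping the adapter private to the RAGAS metrics is what lets `EvalContext` expose a single `embedder` field to callers.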

Composable type alias:

Metric = Callable[[dict, EvalContext], Awaitable[float | None]]

Example:

ctx = EvalContext(client=JudgeClient(model="claude-sonnet-4-6", api_key=KEY))
metrics: list[Metric] = [faithfulness, contains_answer, answer_correctness]
scores = await asyncio.gather(*[m(item, ctx) for m in metrics])

`judge_client.py`

Contains only JudgeClient — a thin async HTTP client for Anthropic/OpenAI/Ollama scoring calls:

class JudgeClient:
    async def chat_json(self, system: str, user: str, max_tokens: int = 200) -> dict: ...

Uses httpx.AsyncClient. Handles 429/5xx retry with Retry-After parsing. No embedding logic — embeddings come from embeddings/providers/ollama.py (existing async OllamaEmbedder). cosine_similarity imported from embeddings/similarity.py.

Dependencies

evaluation optional extra changes:

Remove: scipy (only used by deleted stats.py)
Add: ragas, langchain-anthropic, langchain-ollama
Keep: numpy

No changes to embeddings/ — existing async OllamaEmbedder used as-is.

Test plan

pytest tests/unit/evaluation/test_judge.py tests/unit/lab/test_retrieval_metrics.py — all passing
Each metric callable independently with a mocked EvalContext
asyncio.gather(*[m(item, ctx) for m in metrics]) composes correctly across families

…try point (#268)
* feat(evaluation): add evaluation subpackage __init__ with gate/champion/judge/retrieval exports
* feat(evaluation): add EvalConfig and GateVerdict models
* feat(evaluation): add evaluation optional-deps and flyeval CLI entry point to pyproject.toml
* feat(evaluation): note evaluation as optional subpackage in top-level __init__ docstring
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add matcher primitives (anchored, matches, source_stem, tokens)
* feat(evaluation): add statistics helpers (aa_band, aggregate_grounding, left_skew_flag)
* feat(evaluation): export matcher and stats primitives from evaluation package
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add corpus loader and evidence verification module
* feat(evaluation): add lean-1 registry loader and RegistryItem/Registry models
* feat(evaluation): re-export corpus and registry symbols from evaluation package
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add G1-G5 gate framework (GateResult, run_gates, g2_recall_precision)
* feat(evaluation): export g2_recall_precision from evaluation package
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add scorecard renderer
* feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add JudgeClient and OllamaEmbedder (judge_client.py)
* feat(evaluation): add AdvisoryReport and run_judge with [D]/[E]/[J] metric families (judge.py)
* feat(evaluation): import cosine from judge_client in matcher.py
* feat(evaluation): export JudgeClient, OllamaEmbedder, build_embedder, cosine from evaluation package
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add ChampionRecord and champion management functions
* feat(evaluation): add run_config_snapshot for flyradar run configuration capture
* feat(evaluation): add flyeval CLI with gate, aa-band, day-zero, invalidate subcommands
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

)
* feat(lab): add retrieval_metrics module with compute_retrieval_metrics and RetrieverMetrics
* feat(lab): export RetrieverMetrics and compute_retrieval_metrics from lab package
* feat(evaluation): import RetrieverMetrics and compute_retrieval_metrics from lab.retrieval_metrics
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add flyradar gate evaluation example
* feat(evaluation): add flycanon RAG retrieval evaluation example
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

… metrics (#277)
* feat(evaluation): add tests/unit/evaluation package init
* feat(evaluation): add unit tests for matcher (anchored, source_stem, tokens, matches)
* feat(evaluation): add unit tests for stats (aa_band, aggregate_grounding, left_skew_flag)
* feat(evaluation): add unit tests for gates (GateResult, verdict, render_scorecard, g5_no_regression)
* feat(evaluation): add unit tests for champion (ChampionRecord, load/save/invalidate, input_hash)
* feat(evaluation): add unit tests for retrieval_metrics (compute_retrieval_metrics, RetrieverMetrics)
* feat(evaluation): fix boundary test for left_skew_flag (floating-point precision)
* feat(evaluation): fix no_answer_rate test to match implementation behaviour
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

* feat(evaluation): add evaluation package documentation
* docs(evaluation): mention evaluation subpackage in README
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

miguelgfierro and others added 11 commits June 18, 2026 23:33

feat(evaluation): add scorecard renderer (#272)

d964ba1

feat(examples): add flyradar and flycanon evaluation examples (#276)

0acac37

docs(evaluation): add evaluation package documentation (#278)

f79439b

github-code-quality Bot found potential problems Jun 19, 2026

miguelgfierro added 3 commits June 19, 2026 09:24

remove examples/flyradar_eval_example.py

a1d28a5

ci: add --extra evaluation to typecheck and test sync steps

6161718

fix(evaluation): resolve all ruff lint errors (import sort, SIM108, B905, N806, UP035)

203134c

miguelgfierro mentioned this pull request Jun 19, 2026

fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example) #280

Merged

Merge pull request #280 from fireflyframework/fix/eval-ci-gate

ceaba78

fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)

github-code-quality Bot found potential problems Jun 19, 2026

Comment thread fireflyframework_agentic/evaluation/corpus.py Fixed

miguelgfierro added 12 commits June 19, 2026 13:18

chore(evaluation): delete cli.py

9c3555d

chore(evaluation): delete gates.py

e9fd965

chore(evaluation): delete corpus.py

38c3f60

chore(evaluation): delete registry.py

f819923

chore(evaluation): delete matcher.py

3bc0786

chore(evaluation): delete scorecard.py

9c43a32

chore(evaluation): delete run_config_snapshot.py

a3673b5

chore(evaluation): delete models.py

a51115e

chore(evaluation): delete stats.py

5074d14

chore(evaluation): delete champion.py

8716be9

chore(evaluation): delete test_champion.py

5c8fe8e

chore(evaluation): delete test_gates.py

fdc0277

miguelgfierro and others added 9 commits June 19, 2026 13:19

chore(evaluation): delete test_matcher.py

0732f85

chore(evaluation): delete test_stats.py

f769ef1

feat(evaluation): rewrite judge_client.py as async (httpx.AsyncClient)

2516052

feat(evaluation): rewrite judge.py — async metrics + EvalContext + flycanon + RAGAS

5609ab6

feat(evaluation): slim __init__.py to 3-file exports

7799185

chore(evaluation): update pyproject.toml — drop scipy, add ragas deps, remove flyeval entrypoint

9526f43

test(evaluation): add unit tests for judge.py metrics

d567552

chore: merge feat/evaluation-framework, keep simplification

0dd9bac

Merge pull request #282 from fireflyframework/feat/eval-simplification

561f9b5

feat(evaluation): simplify to 3-file metrics-only package

miguelgfierro mentioned this pull request Jun 19, 2026

docs: evaluation simplification design spec #281

Closed

1 task

miguelgfierro and others added 20 commits June 19, 2026 13:52

fix(lab): type-annotate out dict, remove quoted return type in retrieval_metrics

5646974

fix(lab): remove unused import math, fix import sort in test_retrieval_metrics

582d1c0

fix(evaluation): add type: ignore for pyright errors on RAGAS/langchain calls in judge.py

3e62b1f

Merge pull request #283 from fireflyframework/chore/eval-ci-fixes

a7e44d1

fix: resolve CI gate failures — lint, typecheck, ruff

Merge remote-tracking branch 'origin/main' into chore/sync-dev-with-main

6dd8575

refactor(evaluation): move retrieval_metrics.py from lab/ to evaluation/

3679dbc

refactor(evaluation): update imports — retrieval_metrics now in evaluation/

6bce374

refactor(evaluation): move test_retrieval_metrics.py to tests/unit/evaluation/

9229c43

Merge pull request #284 from fireflyframework/refactor/move-retrieval-metrics

4d9353d

refactor(evaluation): move retrieval_metrics from lab/ to evaluation/

refactor(evaluation): replace RetrieverMetrics class with plain functions

6cdd3db

refactor(evaluation): update __init__.py exports — replace RetrieverMetrics with individual functions

3a3c35f

test(evaluation): rewrite test_retrieval_metrics for individual metric functions

26bfe3b

Merge pull request #285 from fireflyframework/refactor/retrieval-metrics-as-functions

b029d36

refactor(evaluation): replace RetrieverMetrics class with plain functions

Remove compute_retrieval_metrics() and KS constant from retrieval_metrics

feadcbd

Remove compute_retrieval_metrics export from evaluation __init__

d54814f

Remove test_compute_retrieval_metrics_* tests

0853698

Update flycanon_eval_example to use plain metric functions instead of RetrieverMetrics

a7b1b91

Apply ruff format to retrieval_metrics.py

0c911b3

Apply ruff format to test_retrieval_metrics.py

ef16882

Merge pull request #286 from fireflyframework/refactor/drop-compute-retrieval-metrics

5a9926b

Remove compute_retrieval_metrics() aggregate from evaluation

miguelgfierro wants to merge 56 commits into main from feat/evaluation-framework
