feat(evaluation): migrate evaluation harnesses from playground#279
Draft
miguelgfierro wants to merge 56 commits into
…try point (#268) * feat(evaluation): add evaluation subpackage __init__ with gate/champion/judge/retrieval exports * feat(evaluation): add EvalConfig and GateVerdict models * feat(evaluation): add evaluation optional-deps and flyeval CLI entry point to pyproject.toml * feat(evaluation): note evaluation as optional subpackage in top-level __init__ docstring --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add matcher primitives (anchored, matches, source_stem, tokens) * feat(evaluation): add statistics helpers (aa_band, aggregate_grounding, left_skew_flag) * feat(evaluation): export matcher and stats primitives from evaluation package --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add corpus loader and evidence verification module * feat(evaluation): add lean-1 registry loader and RegistryItem/Registry models * feat(evaluation): re-export corpus and registry symbols from evaluation package --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add G1-G5 gate framework (GateResult, run_gates, g2_recall_precision) * feat(evaluation): export g2_recall_precision from evaluation package --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add scorecard renderer * feat(evaluation): export render_scorecard, verdict, VERDICT_PROMOTE/HOLD from scorecard module --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add JudgeClient and OllamaEmbedder (judge_client.py) * feat(evaluation): add AdvisoryReport and run_judge with [D]/[E]/[J] metric families (judge.py) * feat(evaluation): import cosine from judge_client in matcher.py * feat(evaluation): export JudgeClient, OllamaEmbedder, build_embedder, cosine from evaluation package --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add ChampionRecord and champion management functions * feat(evaluation): add run_config_snapshot for flyradar run configuration capture * feat(evaluation): add flyeval CLI with gate, aa-band, day-zero, invalidate subcommands --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
) * feat(lab): add retrieval_metrics module with compute_retrieval_metrics and RetrieverMetrics * feat(lab): export RetrieverMetrics and compute_retrieval_metrics from lab package * feat(evaluation): import RetrieverMetrics and compute_retrieval_metrics from lab.retrieval_metrics --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add flyradar gate evaluation example * feat(evaluation): add flycanon RAG retrieval evaluation example --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
… metrics (#277) * feat(evaluation): add tests/unit/evaluation package init * feat(evaluation): add unit tests for matcher (anchored, source_stem, tokens, matches) * feat(evaluation): add unit tests for stats (aa_band, aggregate_grounding, left_skew_flag) * feat(evaluation): add unit tests for gates (GateResult, verdict, render_scorecard, g5_no_regression) * feat(evaluation): add unit tests for champion (ChampionRecord, load/save/invalidate, input_hash) * feat(evaluation): add unit tests for retrieval_metrics (compute_retrieval_metrics, RetrieverMetrics) * feat(evaluation): fix boundary test for left_skew_flag (floating-point precision) * feat(evaluation): fix no_answer_rate test to match implementation behaviour --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
* feat(evaluation): add evaluation package documentation * docs(evaluation): mention evaluation subpackage in README --------- Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
fix(evaluation): resolve PR gate failures (lint, CI extras, remove flyradar example)
…, remove flyeval entrypoint
feat(evaluation): simplify to 3-file metrics-only package
…in calls in judge.py
fix: resolve CI gate failures — lint, typecheck, ruff
…-metrics refactor(evaluation): move retrieval_metrics from lab/ to evaluation/
…etrics with individual functions
…ics-as-functions refactor(evaluation): replace RetrieverMetrics class with plain functions
… RetrieverMetrics
…etrieval-metrics Remove compute_retrieval_metrics() aggregate from evaluation
Problem
The first migration of evaluation code into fireflyframework_agentic/evaluation/ brought over too much infrastructure alongside the metrics: a CLI (flyeval), a five-gate framework (G1–G5), a registry/corpus/matcher pipeline, a scorecard renderer, champion persistence, run-config snapshotting, and statistical helpers. The actual value, the LLM judge metric functions, was buried under 10 supporting files.

The goal of this PR (revised) is to gut that infrastructure and keep only the measurement code.
What We Keep
All metric functions from both evaluation systems:
Flyradar G4:
* [D] deterministic: source_coverage, excerpt_fill_rate
* [E] embedding-based: semantic_recovery
* [J] LLM judge: faithfulness, numeric_temporal_fidelity, citation_relevance, nc_semantic_precision, fabricated_entity, contradiction, open_gap, actionability, severity_calibration, answer_relevancy, surface_deduplication, comparative_vs_champion (champion passed as an optional parameter; no persistence)

Flycanon:
* contains_answer, addresses_question (median of N LLM calls per item)
* answer_correctness, answer_relevancy, faithfulness, context_recall, context_precision

Retrieval (lab/retrieval_metrics.py): unchanged.

What We Delete
From fireflyframework_agentic/evaluation/:
* cli.py: flyeval CLI; experiment orchestration, not measurement
* gates.py
* corpus.py
* registry.py
* matcher.py
* scorecard.py
* run_config_snapshot.py
* models.py: EvalConfig, GateVerdict; only used by deleted files
* stats.py: aa_band, aggregate_grounding; only used by deleted files
* champion.py: comparative_vs_champion accepts champion data as a parameter instead

Tests for deleted modules also removed: test_champion.py, test_gates.py, test_matcher.py, test_stats.py.

Target Package Layout
Three files. No CLI. No gates. No registry.
Unified Interface
Every metric (flyradar [D], [E], [J], flycanon custom, and RAGAS) shares the same async signature, called as m(item, ctx):
item is a plain dict with a normalized schema:

    {
        "question": str,
        "answer": str,
        "reference": str,
        "contexts": list[str],
        # flyradar extras (optional):
        "sources": list[str],
        "excerpts": list[str],
        # for comparative_vs_champion (optional):
        "champion_answer": str | None,
    }

EvalContext is a Pydantic model carrying all dependencies. It has no ragas_llm/ragas_embeddings fields: RAGAS metrics wrap ctx.client and ctx.embedder in LangChain adapters internally. Callers see one client, one embedder.

Composable type alias:
Example:
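A minimal sketch of how metrics with this shape could be composed, under several assumptions. EvalContext is modeled here as a stdlib dataclass rather than the Pydantic model described above, only to keep the sketch dependency-free. Its fields (client, embedder, n_calls) are guesses based on the ctx.client and ctx.embedder references. Metrics are assumed to return a float. FakeJudge, toy_source_coverage, toy_median_judge, and run_all are hypothetical stand-ins, not the package's implementations.

```python
import asyncio
import statistics
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class EvalContext:
    """Stand-in for the Pydantic EvalContext; field names are assumptions."""
    client: Any            # assumed to expose: async score(prompt) -> float
    embedder: Any = None   # unused in this sketch
    n_calls: int = 3       # hypothetical: judge calls per item


# Assumed alias: every metric takes (item, ctx) and returns a float score.
Metric = Callable[[dict, EvalContext], Awaitable[float]]


class FakeJudge:
    """Hypothetical offline judge that replays canned scores in order."""

    def __init__(self, scores: list[float]) -> None:
        self._scores = list(scores)
        self._i = 0

    async def score(self, prompt: str) -> float:
        value = self._scores[self._i % len(self._scores)]
        self._i += 1
        return value


async def toy_source_coverage(item: dict, ctx: EvalContext) -> float:
    """[D]-style toy metric: share of sources paired with a non-empty excerpt."""
    sources = item.get("sources", [])
    excerpts = item.get("excerpts", [])
    if not sources:
        return 0.0
    filled = sum(1 for e in excerpts[: len(sources)] if e.strip())
    return filled / len(sources)


async def toy_median_judge(item: dict, ctx: EvalContext) -> float:
    """[J]-style toy metric: median of ctx.n_calls judge scores for one item."""
    prompt = f"Q: {item['question']}\nA: {item['answer']}"
    scores = await asyncio.gather(
        *[ctx.client.score(prompt) for _ in range(ctx.n_calls)]
    )
    return statistics.median(scores)


async def run_all(item: dict, ctx: EvalContext, metrics: list[Metric]) -> list[float]:
    # Same composition as in the test plan: one gather across metric families.
    return await asyncio.gather(*[m(item, ctx) for m in metrics])


item = {
    "question": "What changed?",
    "answer": "The CLI was removed.",
    "reference": "The flyeval CLI was deleted.",
    "contexts": ["cli.py deleted"],
    "sources": ["a.pdf", "b.pdf"],
    "excerpts": ["some text", ""],
}
ctx = EvalContext(client=FakeJudge([0.2, 0.9, 0.6]))
results = asyncio.run(run_all(item, ctx, [toy_source_coverage, toy_median_judge]))
print(results)  # [0.5, 0.6]
```

Because both toy metrics accept the same (item, ctx) pair, the list passed to run_all can mix deterministic and judge-backed functions without any adapter code.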
judge_client.py
Contains only JudgeClient, a thin async HTTP client for Anthropic/OpenAI/Ollama scoring calls.

Uses httpx.AsyncClient. Handles 429/5xx retry with Retry-After parsing. No embedding logic: embeddings come from embeddings/providers/ollama.py (existing async OllamaEmbedder). cosine_similarity is imported from embeddings/similarity.py.

Dependencies
evaluation optional extra changes:
* scipy (only used by deleted stats.py)
* ragas, langchain-anthropic, langchain-ollama
* numpy

No changes to embeddings/: existing async OllamaEmbedder used as-is.

Test plan
* pytest tests/unit/evaluation/test_judge.py tests/unit/lab/test_retrieval_metrics.py: all passing
* EvalContext
* asyncio.gather(*[m(item, ctx) for m in metrics]) composes correctly across families
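The 429/5xx retry with Retry-After parsing mentioned under judge_client.py could look roughly like the sketch below. It is not JudgeClient's code: parse_retry_after and call_with_retry are hypothetical names, the transport is an injected send callable rather than httpx.AsyncClient so the sketch runs without network access, and the default delay, attempt count, and exact-case "Retry-After" header lookup are assumptions.

```python
import asyncio
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Awaitable, Callable, Mapping, Optional, Tuple

# (status, headers, body) triple standing in for an HTTP response.
Response = Tuple[int, Mapping[str, str], str]


def parse_retry_after(value: Optional[str], default: float = 1.0,
                      now: Optional[datetime] = None) -> float:
    """Seconds to wait for a Retry-After value (delta-seconds or HTTP-date)."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def is_retryable(status: int) -> bool:
    # 429 (rate limited) or any 5xx server error.
    return status == 429 or 500 <= status < 600


async def call_with_retry(
    send: Callable[[], Awaitable[Response]],
    max_attempts: int = 3,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Response:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = await send()
        if not is_retryable(status) or attempt == max_attempts:
            break
        await sleep(parse_retry_after(headers.get("Retry-After")))
    return status, headers, body


async def _demo():
    replies = iter([(429, {"Retry-After": "2"}, ""), (200, {}, "ok")])
    waits = []

    async def send() -> Response:
        return next(replies)

    async def fake_sleep(seconds: float) -> None:
        waits.append(seconds)  # record the delay instead of sleeping

    return await call_with_retry(send, sleep=fake_sleep), waits


demo_result, demo_waits = asyncio.run(_demo())
print(demo_result[0], demo_waits)  # 200 [2.0]
```

Injecting sleep keeps a unit test of the retry path instantaneous, which would fit a suite like tests/unit/evaluation/test_judge.py.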