Draft
56 commits
a2d6770
feat(evaluation): add evaluation subpackage skeleton and pyproject en…
miguelgfierro Jun 18, 2026
8676b6a
feat(evaluation): add matcher primitives and statistics helpers (#269)
miguelgfierro Jun 18, 2026
8eb2110
feat(evaluation): add corpus loader and registry modules (#270)
miguelgfierro Jun 18, 2026
ee64cfa
feat(evaluation): add G1-G5 gate framework (#271)
miguelgfierro Jun 18, 2026
d964ba1
feat(evaluation): add scorecard renderer (#272)
miguelgfierro Jun 18, 2026
09cfc34
feat(evaluation): add LLM-as-judge and judge client (#273)
miguelgfierro Jun 18, 2026
1906ede
feat(evaluation): add champion tracking and flyeval CLI (#274)
miguelgfierro Jun 18, 2026
4ab1d85
feat(lab): add retrieval metrics (hit@k, recall@k, MRR, MAP, nDCG) (#…
miguelgfierro Jun 18, 2026
0acac37
feat(examples): add flyradar and flycanon evaluation examples (#276)
miguelgfierro Jun 18, 2026
cc048cf
test(evaluation): add unit tests for evaluation package and retrieval…
miguelgfierro Jun 18, 2026
f79439b
docs(evaluation): add evaluation package documentation (#278)
miguelgfierro Jun 18, 2026
a1d28a5
remove examples/flyradar_eval_example.py
miguelgfierro Jun 19, 2026
6161718
ci: add --extra evaluation to typecheck and test sync steps
miguelgfierro Jun 19, 2026
203134c
fix(evaluation): resolve all ruff lint errors (import sort, SIM108, B…
miguelgfierro Jun 19, 2026
ceaba78
Merge pull request #280 from fireflyframework/fix/eval-ci-gate
miguelgfierro Jun 19, 2026
9c3555d
chore(evaluation): delete cli.py
miguelgfierro Jun 19, 2026
e9fd965
chore(evaluation): delete gates.py
miguelgfierro Jun 19, 2026
38c3f60
chore(evaluation): delete corpus.py
miguelgfierro Jun 19, 2026
f819923
chore(evaluation): delete registry.py
miguelgfierro Jun 19, 2026
3bc0786
chore(evaluation): delete matcher.py
miguelgfierro Jun 19, 2026
9c43a32
chore(evaluation): delete scorecard.py
miguelgfierro Jun 19, 2026
a3673b5
chore(evaluation): delete run_config_snapshot.py
miguelgfierro Jun 19, 2026
a51115e
chore(evaluation): delete models.py
miguelgfierro Jun 19, 2026
5074d14
chore(evaluation): delete stats.py
miguelgfierro Jun 19, 2026
8716be9
chore(evaluation): delete champion.py
miguelgfierro Jun 19, 2026
5c8fe8e
chore(evaluation): delete test_champion.py
miguelgfierro Jun 19, 2026
fdc0277
chore(evaluation): delete test_gates.py
miguelgfierro Jun 19, 2026
0732f85
chore(evaluation): delete test_matcher.py
miguelgfierro Jun 19, 2026
f769ef1
chore(evaluation): delete test_stats.py
miguelgfierro Jun 19, 2026
2516052
feat(evaluation): rewrite judge_client.py as async (httpx.AsyncClient)
miguelgfierro Jun 19, 2026
5609ab6
feat(evaluation): rewrite judge.py — async metrics + EvalContext + fl…
miguelgfierro Jun 19, 2026
7799185
feat(evaluation): slim __init__.py to 3-file exports
miguelgfierro Jun 19, 2026
9526f43
chore(evaluation): update pyproject.toml — drop scipy, add ragas deps…
miguelgfierro Jun 19, 2026
d567552
test(evaluation): add unit tests for judge.py metrics
miguelgfierro Jun 19, 2026
0dd9bac
chore: merge feat/evaluation-framework, keep simplification
miguelgfierro Jun 19, 2026
561f9b5
Merge pull request #282 from fireflyframework/feat/eval-simplification
miguelgfierro Jun 19, 2026
5646974
fix(lab): type-annotate out dict, remove quoted return type in retrie…
miguelgfierro Jun 19, 2026
582d1c0
fix(lab): remove unused import math, fix import sort in test_retrieva…
miguelgfierro Jun 19, 2026
3e62b1f
fix(evaluation): add type: ignore for pyright errors on RAGAS/langcha…
miguelgfierro Jun 19, 2026
a7e44d1
Merge pull request #283 from fireflyframework/chore/eval-ci-fixes
miguelgfierro Jun 19, 2026
6dd8575
Merge remote-tracking branch 'origin/main' into chore/sync-dev-with-main
miguelgfierro Jun 19, 2026
3679dbc
refactor(evaluation): move retrieval_metrics.py from lab/ to evaluation/
miguelgfierro Jun 19, 2026
6bce374
refactor(evaluation): update imports — retrieval_metrics now in evalu…
miguelgfierro Jun 19, 2026
9229c43
refactor(evaluation): move test_retrieval_metrics.py to tests/unit/ev…
miguelgfierro Jun 19, 2026
4d9353d
Merge pull request #284 from fireflyframework/refactor/move-retrieval…
miguelgfierro Jun 19, 2026
6cdd3db
refactor(evaluation): replace RetrieverMetrics class with plain funct…
miguelgfierro Jun 19, 2026
3a3c35f
refactor(evaluation): update __init__.py exports — replace RetrieverM…
miguelgfierro Jun 19, 2026
26bfe3b
test(evaluation): rewrite test_retrieval_metrics for individual metri…
miguelgfierro Jun 19, 2026
b029d36
Merge pull request #285 from fireflyframework/refactor/retrieval-metr…
miguelgfierro Jun 19, 2026
feadcbd
Remove compute_retrieval_metrics() and KS constant from retrieval_met…
miguelgfierro Jun 19, 2026
d54814f
Remove compute_retrieval_metrics export from evaluation __init__
miguelgfierro Jun 19, 2026
0853698
Remove test_compute_retrieval_metrics_* tests
miguelgfierro Jun 19, 2026
a7b1b91
Update flycanon_eval_example to use plain metric functions instead of…
miguelgfierro Jun 19, 2026
0c911b3
Apply ruff format to retrieval_metrics.py
miguelgfierro Jun 19, 2026
ef16882
Apply ruff format to test_retrieval_metrics.py
miguelgfierro Jun 19, 2026
5a9926b
Merge pull request #286 from fireflyframework/refactor/drop-compute-r…
miguelgfierro Jun 19, 2026
4 changes: 2 additions & 2 deletions .github/workflows/pr-gate.yml
@@ -57,7 +57,7 @@ jobs:
- uses: actions/setup-python@v6
with:
python-version: '3.13'
-- run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings
+- run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings --extra evaluation
- run: uv run pyright

test:
@@ -72,7 +72,7 @@ jobs:
- uses: actions/setup-python@v6
with:
python-version: '3.13'
-- run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings
+- run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra evaluation
- run: uv run pytest -m "not nightly" --cov --cov-report=term-missing

build:
7 changes: 7 additions & 0 deletions README.md
@@ -412,6 +412,12 @@ classDiagram
`EvalDataset` loads/saves test cases from JSON. `ModelComparison` runs the
same prompts across multiple agents for side-by-side analysis.

- **Evaluation** — Gate-based quality gates (G1–G5), LLM-as-judge advisory scoring,
champion/challenger tracking, and deterministic retrieval metrics for assessing
agent and pipeline outputs. The `flyeval` CLI drives the full gate pipeline from
the command line. Install with `pip install "fireflyframework-agentic[evaluation]"`.
See [docs/evaluation.md](docs/evaluation.md) for the full guide.
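
As a rough illustration of the "deterministic retrieval metrics" this README entry refers to (the commit list names hit@k, recall@k, MRR, MAP, and nDCG), the sketch below shows how two of them are commonly computed. The function names and signatures here are hypothetical and are not taken from the package's `retrieval_metrics.py`, which this diff does not show.

```python
from collections.abc import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear among the top-k retrieved ids.

    Hypothetical helper for illustration only; not the package's API.
    """
    if not relevant:
        return 0.0  # convention chosen here: no relevant items means recall 0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant id (ranks start at 1), or 0.0 if none is found.

    Averaging this value over a set of queries gives MRR.
    """
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    ranked = ["doc3", "doc1", "doc7"]
    gold = {"doc1", "doc9"}
    print(recall_at_k(ranked, gold, k=2))
    print(reciprocal_rank(ranked, gold))
```

Both functions depend only on the ranked ids and the gold set, so repeated runs on the same inputs give the same scores, unlike the LLM-as-judge scoring mentioned in the same entry.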

> **Optional developer tooling.** `fireflyframework_agentic.experiments` (A/B
> experiments) and `fireflyframework_agentic.lab` (offline evaluation /
> benchmarking) are leaf modules — nothing in the core imports them and they add
@@ -817,6 +823,7 @@ Detailed guides for each module:
- [Security](docs/security.md) — Prompt/output guards, at-rest encryption
- [Experiments](docs/experiments.md) — A/B testing, variant comparison
- [Lab](docs/lab.md) — Benchmarks, datasets, evaluators
- [Evaluation](docs/evaluation.md) — Gate pipeline, flyeval CLI, champion/challenger, retrieval metrics
- Studio — moved to [fireflyframework-agentic-studio](https://github.com/fireflyframework/fireflyframework-agentic-studio)
---
