diff --git a/.github/workflows/pr-gate.yml b/.github/workflows/pr-gate.yml
index d9767c01..399b45ca 100644
--- a/.github/workflows/pr-gate.yml
+++ b/.github/workflows/pr-gate.yml
@@ -57,7 +57,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings --extra evaluation
       - run: uv run pyright
 
   test:
@@ -72,7 +72,7 @@ jobs:
       - uses: actions/setup-python@v6
         with:
           python-version: '3.13'
-      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings
+      - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra evaluation
       - run: uv run pytest -m "not nightly" --cov --cov-report=term-missing
 
   build:
diff --git a/README.md b/README.md
index 9d005b23..904237da 100644
--- a/README.md
+++ b/README.md
@@ -412,6 +412,12 @@ classDiagram
   `EvalDataset` loads/saves test cases from JSON. `ModelComparison` runs the
   same prompts across multiple agents for side-by-side analysis.
 
+- **Evaluation** — Five-gate quality pipeline (G1–G5), LLM-as-judge advisory scoring,
+  champion/challenger tracking, and deterministic retrieval metrics for assessing
+  agent and pipeline outputs. The `flyeval` CLI drives the full gate pipeline from
+  the command line. Install with `pip install "fireflyframework-agentic[evaluation]"`.
+  See [docs/evaluation.md](docs/evaluation.md) for the full guide.
+
   > **Optional developer tooling.** `fireflyframework_agentic.experiments` (A/B
   > experiments) and `fireflyframework_agentic.lab` (offline evaluation /
   > benchmarking) are leaf modules — nothing in the core imports them and they add
@@ -817,6 +823,7 @@ Detailed guides for each module:
 - [Security](docs/security.md) — Prompt/output guards, at-rest encryption
 - [Experiments](docs/experiments.md) — A/B testing, variant comparison
 - [Lab](docs/lab.md) — Benchmarks, datasets, evaluators
+- [Evaluation](docs/evaluation.md) — Gate pipeline, flyeval CLI, champion/challenger, retrieval metrics
 - Studio — moved to [fireflyframework-agentic-studio](https://github.com/fireflyframework/fireflyframework-agentic-studio)
 ---
 
diff --git a/docs/evaluation.md b/docs/evaluation.md
new file mode 100644
index 00000000..c2abe319
--- /dev/null
+++ b/docs/evaluation.md
@@ -0,0 +1,435 @@
+# Evaluation Guide
+
+Copyright 2026 Firefly Software Foundation. Licensed under the Apache License 2.0.
+
+The Evaluation subpackage provides a five-gate quality pipeline (G1–G5), LLM-as-judge advisory scoring,
+champion/challenger tracking, and deterministic retrieval metrics for assessing agent outputs.
+
+---
+
+## Concepts
+
+### Gate pipeline
+
+The evaluation framework runs **five gates** in sequence. Every gate always runs — a failed
+gate raises a *flag*, not a veto, so the scorecard always carries the complete picture.
+
+| Gate | Name | Kind | Description |
+|------|------|------|-------------|
+| G1 | Structural & Safe | Deterministic | Schema validity, PII non-disclosure, empty-registry guard. |
+| G2 | Must-finds & Negative Controls | Deterministic | Lexical/semantic recall against the must-find registry; NC precision. |
+| G3 | Evidence (Grounding) | Deterministic | Excerpt-to-corpus anchoring; fabricated-evidence detection. |
+| G4 | LLM-as-a-Judge | Advisory (non-blocking) | Semantic faithfulness, entailment, gap detection — never changes the verdict. |
+| G5 | No-regression / Promotion | Human decision | Champion/challenger comparison with A/A noise band; collects sign-offs. |
+
+**No gate vetoes.** Failures append to the `GateResult` flags list and scoring continues.
+The scorecard carries every signal regardless of which gates fired.
+
+### GateResult
+
+`GateResult` is a dataclass returned by each gate:
+
+```python
+@dataclass
+class GateResult:
+    gate: str       # "G1", "G2", …, "G5"
+    passed: bool
+    reason_code: str = ""   # e.g. "SCHEMA_INVALID", "NC_HIT", "UNGROUNDED"
+    details: dict = field(default_factory=dict)
+```
+
+`str(gate_result)` prints `[G2] PASS` or `[G2] FLAG:NC_HIT`.
+
+### Verdict
+
+`verdict(gate_results)` returns `VERDICT_PROMOTE` or `VERDICT_HOLD`:
+
+- `VERDICT_PROMOTE` — all gates passed **and** G5 (the human sign-off gate) is present.
+- `VERDICT_HOLD` — any gate flagged, or G5 is missing.
+
+The CLI exits `0` on PROMOTE and `1` on HOLD, so it composes into CI.
+
+### Must-find registry
+
+A registry (`lean-1` schema) is a JSON file listing items the discovery output is
+expected to surface (`tier` L0–L3) and negative controls (NC) it must *not* assert.
+
+```json
+{
+  "schema_version": "lean-1",
+  "corpus": "banca-cordobesa",
+  "items": [
+    { "id": "ao-pep-4eyes", "tier": "L0", "scope": "decision",
+      "description": "PEP cases require a second analyst sign-off (4-eyes)",
+      "keywords": ["PEP", "4-eyes"],
+      "evidence": ["SOP-002-kyc-edd.md"] },
+    { "id": "ao-nc-realtime", "tier": "NC", "scope": "finding",
+      "description": "KYC-Hub synchronises in real time — factually false" }
+  ]
+}
+```
+
+Tier semantics: L0 = must-find control (a single miss flags the run), L1 = high-priority,
+L2 = important, L3 = nice-to-have (not counted in the recall floor).
+
+### Advisory judge (G4)
+
+G4 calls a chat LLM (or local Ollama model) for semantic checks the deterministic gates
+cannot perform: faithfulness, entailment, numeric/temporal fidelity, actionability,
+fabricated-entity detection, and more. It is:
+
+- **Non-blocking** — `AdvisoryReport` is carried separately and never enters `verdict()`.
+- **Non-deterministic** — each metric runs `judge_runs` times (default: 3) and the
+  median score is reported.
+- **Opt-in** — pass `--judge-model provider:model` to activate it; omit the flag to skip.
+
+### Champion/challenger pattern
+
+Champions are **per-corpus**. `ChampionRecord` persists the best-known run so that
+promotion decisions are made against a stable, signed baseline rather than the last run.
+
+```
+               ┌──────────────────────────────────────────┐
+               │  run result JSON (challenger)            │
+               └──────────────┬───────────────────────────┘
+                              │
+              ┌───────────────▼───────────────┐
+              │  G1 · G2 · G3 (deterministic) │
+              │  G4 (advisory, opt-in)         │
+              └───────────────┬───────────────┘
+                              │  flags + scores
+              ┌───────────────▼───────────────┐
+              │  G5 — no-regression vs        │
+              │  champion baseline + A/A band │
+              └───────────────┬───────────────┘
+                              │
+              ┌───────────────▼───────────────┐
+              │  Markdown scorecard           │
+              │  PROMOTE / HOLD               │
+              └───────────────────────────────┘
+```
+
+`invalidate_champion()` marks a baseline invalid. The `EMPTY_MUST_FIND` guard in G1
+prevents a fake-100% champion from being created against an empty registry.
+
+---
+
+## Installation
+
+The evaluation subpackage requires `scipy` and `numpy`. Install the optional extra:
+
+```bash
+pip install "fireflyframework-agentic[evaluation]"
+```
+
+The `flyeval` CLI entry point is registered automatically by the package. To verify:
+
+```bash
+flyeval --version
+```
+
+---
+
+## CLI
+
+All subcommands exit `0` on PROMOTE and `1` on HOLD.
+
+### `flyeval gate`
+
+Run the full gate pipeline against a result JSON and print a Markdown scorecard.
+
+```bash
+flyeval gate \
+  --result      runs/2026-06-18/output.json \
+  --registry    registries/banca-cordobesa.json \
+  --baseline    baselines/banca-cordobesa.json \
+  --judge-model anthropic:claude-3-5-haiku \
+  --judge-runs  3
+```
+
+Key flags:
+
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--result` | required | Path to the run's `output.json`. |
+| `--registry` | required | Must-find registry (lean-1 JSON). |
+| `--baseline` | — | Champion baseline JSON for G5 regression check. |
+| `--judge-model` | — | `provider:model` for G4 advisory judge. |
+| `--judge-runs` | 3 | Number of independent judge calls (median aggregation). |
+| `--no-judge` | — | Skip G4 entirely. |
+| `--recall-floor` | 0.70 | Minimum G2 recall before flagging. |
+| `--grounding-floor` | 0.90 | Minimum G3 grounding rate before flagging. |
+| `--corpus` | — | Path to the evidence corpus bundle for G3 verification. |
+| `--pii-list` | — | Path to a JSON array of names to scan for PII leaks (G1). |
+| `--embedder` | — | `provider:model` for semantic recall (G2 embedding path). |
+| `--model-id` | "unknown" | Identifier of the model under evaluation (for scorecard). |
+
+### `flyeval aa-band`
+
+Compute the A/A noise band from repeated runs of the same model. This establishes
+the noise floor before you set up the champion comparison.
+
+```bash
+flyeval aa-band \
+  --results runs/aa-run-1/output.json runs/aa-run-2/output.json runs/aa-run-3/output.json \
+  --registry registries/banca-cordobesa.json
+```
+
+The command prints per-metric variance and recommended noise floors.
+
+### `flyeval day-zero`
+
+Promote the very first champion for a corpus (Day-Zero protocol). Requires at least
+`--signoffs` sign-offs (default: 2) before PROMOTE is issued.
+
+```bash
+flyeval day-zero \
+  --result   runs/2026-06-18/output.json \
+  --registry registries/banca-cordobesa.json \
+  --baseline baselines/banca-cordobesa.json \
+  --signoffs 2
+```
+
+The command writes the new `ChampionRecord` into `--baseline` on success.
+
+### `flyeval invalidate`
+
+Mark the current champion invalid with a documented reason. Use this when the registry
+changes in a way that makes the existing champion incommensurable.
+
+```bash
+flyeval invalidate \
+  --baseline baselines/banca-cordobesa.json \
+  --reason   "Registry expanded from 39 to 94 items (lean-1 v2)."
+```
+
+---
+
+## Python API
+
+### Running gates
+
+```python
+import json
+from fireflyframework_agentic.evaluation import (
+    run_gates,
+    render_scorecard,
+    verdict,
+    load_registry,
+    VERDICT_PROMOTE,
+)
+
+result = json.loads(open("runs/2026-06-18/output.json").read())
+registry = load_registry("registries/banca-cordobesa.json")
+
+gate_results = run_gates(result, registry)
+scorecard_md = render_scorecard(
+    gate_results,
+    corpus="banca-cordobesa",
+    model_id="anthropic:claude-3-5-sonnet",
+    run_id="2026-06-18-sonnet-01",
+)
+print(scorecard_md)
+
+v = verdict(gate_results)
+print("Verdict:", v)  # "PROMOTE" or "HOLD"
+assert v == VERDICT_PROMOTE
+```
+
+### Champion management
+
+```python
+from fireflyframework_agentic.evaluation import (
+    load_champion,
+    save_champion,
+    invalidate_champion,
+    ChampionRecord,
+)
+
+# Load the current champion (returns None on Day Zero).
+champ = load_champion("baselines/banca-cordobesa.json")
+if champ is None:
+    print("Day Zero — no champion yet.")
+else:
+    print(f"Champion: {champ.run_id} | {champ.primary_metric()}={champ.primary_score():.3f}")
+
+# Save a new champion after a successful PROMOTE.
+new_champ = ChampionRecord(
+    corpus="banca-cordobesa",
+    run_id="2026-06-18-sonnet-01",
+    model_id="anthropic:claude-3-5-sonnet",
+    registry_sha256=registry.sha256(),  # `registry` comes from load_registry(), as in "Running gates"
+    scores={"lexical_recall": 0.857, "grounding_pct": 0.941},
+    human_sign_offs=["alice", "bob"],
+)
+save_champion("baselines/banca-cordobesa.json", new_champ)
+
+# Invalidate when the registry changes materially.
+invalidate_champion(
+    "baselines/banca-cordobesa.json",
+    reason="Registry expanded from 39 to 94 items.",
+)
+```
+
+### EvalConfig
+
+`EvalConfig` is a Pydantic model that captures the parameters of a single evaluation run.
+Use it to build reproducible, serialisable run records.
+
+```python
+from fireflyframework_agentic.evaluation.models import EvalConfig
+
+cfg = EvalConfig(
+    model_id="anthropic:claude-3-5-sonnet",
+    corpus="banca-cordobesa",
+    run_id="2026-06-18-sonnet-01",
+    registry_path="registries/banca-cordobesa.json",
+    corpus_path="corpora/banca-cordobesa/",
+    baseline_path="baselines/banca-cordobesa.json",
+    judge_model="anthropic:claude-3-5-haiku",
+    judge_runs=3,
+)
+print(cfg.model_dump_json(indent=2))
+```
+
+### Advisory judge (G4)
+
+```python
+from fireflyframework_agentic.evaluation import run_judge, JudgeClient, build_embedder
+
+client = JudgeClient(
+    chat_fn=my_chat_fn,        # callable(system: str, user: str) -> dict
+    embed_fn=build_embedder("ollama:bge-m3"),
+)
+
+advisory = run_judge(
+    result=result,
+    registry=registry,
+    client=client,
+    runs=3,
+    missed_ids=[],   # IDs the deterministic G2 missed — judge tries to recover them
+)
+print(advisory.scores)   # dict of metric -> float
+print(advisory.errors)   # any metrics that failed (best-effort, never raises)
+```
+
+---
+
+## Retrieval Metrics
+
+The `compute_retrieval_metrics()` function computes standard IR metrics over ranked
+retrieval results. It is imported from `fireflyframework_agentic.lab.retrieval_metrics`
+and re-exported by the evaluation package.
+
+Supported metrics at cut-offs k ∈ {1, 5, 10}:
+
+- **Hit@k** — at least one gold document in top-k.
+- **Recall@k** — fraction of gold documents in top-k.
+- **Precision@k** — fraction of top-k results that are gold.
+- **MRR@10** — mean reciprocal rank of the first gold hit.
+- **MAP@10** — mean average precision.
+- **nDCG@10** — normalised discounted cumulative gain.
+
+```python
+from fireflyframework_agentic.evaluation import compute_retrieval_metrics, RetrieverMetrics
+
+# Each row is a query; each row's "retrieved" list is ranked (rank=1 is top).
+rows = [
+    {
+        "query": "KYC enhanced due diligence steps",
+        "gold": ["SOP-002-kyc-edd.md", "INT-002-KYC-Jaime.md"],
+        "retrieved": [
+            {"rank": 1, "source_id": "SOP-002-kyc-edd.md", "is_gold": True},
+            {"rank": 2, "source_id": "SOP-001-account-opening.md", "is_gold": False},
+            {"rank": 3, "source_id": "INT-002-KYC-Jaime.md", "is_gold": True},
+        ],
+    },
+]
+
+metrics: RetrieverMetrics = compute_retrieval_metrics(rows)
+print(f"Recall@5:  {metrics.recall_5:.3f}")
+print(f"nDCG@10:   {metrics.ndcg_10:.3f}")
+print(f"MRR@10:    {metrics.mrr_10:.3f}")
+```
+
+`RetrieverMetrics` also carries optional fields when the raw rows include them:
+`no_answer_rate`, `citation_precision`, `mean_search_ms`, `mean_answer_ms`.
+
+---
+
+## Architecture
+
+```mermaid
+flowchart TD
+    R["result JSON\n(DiscoveryResult / output.json)"]
+    REG["Registry JSON\n(lean-1 must-find)"]
+    CORP["Corpus bundle\n(raw evidence documents)"]
+    BASE["Baseline JSON\n(champion record)"]
+
+    R --> G1["G1 · Structural & Safe\n(schema, PII, empty-registry)"]
+    REG --> G1
+    R --> G2["G2 · Recall & NC Precision\n(lexical + optional semantic)"]
+    REG --> G2
+    R --> G3["G3 · Grounding\n(excerpt anchoring, fabrication)"]
+    CORP --> G3
+    R --> G4["G4 · LLM Judge advisory\n(faithfulness, entailment, gaps)"]
+    REG --> G4
+    G1 --> SC["Markdown Scorecard\nrender_scorecard()"]
+    G2 --> SC
+    G3 --> SC
+    G4 -.advisory.-> SC
+    BASE --> G5["G5 · No-regression\n(A/A band, sign-offs)"]
+    G1 --> G5
+    G2 --> G5
+    G3 --> G5
+    G5 --> SC
+    SC --> V["verdict()\nPROMOTE / HOLD"]
+    V --> CHAMP["save_champion()\nor invalidate_champion()"]
+```
+
+---
+
+## Reference
+
+### Exports
+
+All symbols below are importable from `fireflyframework_agentic.evaluation`.
+
+| Symbol | Kind | Description |
+|--------|------|-------------|
+| `EvalConfig` | Pydantic model | Parameters for a single evaluation run. |
+| `GateResult` | Dataclass | Result of one gate: `gate`, `passed`, `reason_code`, `details`. |
+| `Verdict` | Constants class | `Verdict.PROMOTE`, `Verdict.HOLD`. |
+| `VERDICT_PROMOTE` | `str` | `"PROMOTE"`. |
+| `VERDICT_HOLD` | `str` | `"HOLD"`. |
+| `run_gates()` | Function | Run the three deterministic gates (G1–G3) plus the G5 shape and return results. |
+| `g2_recall_precision()` | Function | Run only G2 (recall + NC precision) and return `GateResult`. |
+| `verdict()` | Function | Derive PROMOTE/HOLD from a list of `GateResult`. |
+| `render_scorecard()` | Function | Render a Markdown scorecard from gate results and metadata. |
+| `ChampionRecord` | Dataclass | Per-corpus champion metadata and scores. |
+| `load_champion()` | Function | Load the current champion from `baseline.json`; returns `None` on Day Zero. |
+| `save_champion()` | Function | Persist a new champion to `baseline.json`. |
+| `invalidate_champion()` | Function | Mark the champion invalid with a reason string. |
+| `AdvisoryReport` | Dataclass | G4 judge output: `scores`, `errors`, `raw`. |
+| `run_judge()` | Function | Run the LLM-as-a-Judge advisory pass. |
+| `JudgeClient` | Dataclass | Holds `chat_fn` and `embed_fn` for the judge. |
+| `OllamaEmbedder` | Class | Local Ollama embedding callable (default BGE-M3). |
+| `build_embedder()` | Function | Factory: `"ollama:bge-m3"` → `OllamaEmbedder`. |
+| `cosine()` | Function | Cosine similarity between two numpy vectors. |
+| `Registry` | Dataclass | Parsed must-find registry with real items and NC items. |
+| `RegistryItem` | Dataclass | One must-find or NC item: `id`, `tier`, `scope`, `description`, …. |
+| `load_registry()` | Function | Parse and validate a lean-1 registry JSON file. |
+| `registry_sha256()` | Function | SHA-256 of a registry file path. |
+| `load_corpus()` | Function | Load and index a corpus bundle for G3 evidence verification. |
+| `corpus_sha256()` | Function | SHA-256 of a corpus directory or bundle. |
+| `verify_evidence_index()` | Function | Check each `evidence_index` entry against the corpus. |
+| `EMPTY` / `FABRICATED` / `SOURCE_UNKNOWN` / `VERIFIED` | `str` | Evidence verification status constants. |
+| `RetrieverMetrics` | Pydantic model | IR metrics: `recall_k`, `precision_k`, `ndcg_10`, `mrr_10`, `map_10`. |
+| `compute_retrieval_metrics()` | Function | Compute IR metrics from a list of ranked-retrieval result rows. |
+| `anchored()` | Function | True if claim and evidence share at least one non-trivial token. |
+| `matches()` | Function | Gate predicate: does a candidate match a registry item? |
+| `source_stem()` | Function | Normalise a `locator` path to its file stem for dedup. |
+| `tokens()` | Function | Tokenise text to a list of lowercase word strings. |
+| `aa_band()` | Function | Compute per-metric A/A noise floor from repeated runs. |
+| `aggregate_grounding()` | Function | Summarise grounding stats across a result's findings. |
+| `left_skew_flag()` | Function | True when the score distribution is left-skewed (over-optimistic). |
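Review note on `docs/evaluation.md`: the Verdict section says PROMOTE requires every gate to pass and a G5 result to be present, and that `str(gate_result)` prints `[G2] PASS` or `[G2] FLAG:NC_HIT`. The sketch below restates that rule so it can be checked against the real implementation. It is an assumption-laden illustration only: `GateResult` here is a local stand-in built from the documented fields, and `sketch_verdict` is a hypothetical name, not the package's `verdict()`.

```python
from dataclasses import dataclass, field

VERDICT_PROMOTE = "PROMOTE"
VERDICT_HOLD = "HOLD"


@dataclass
class GateResult:
    """Local stand-in with the fields documented in docs/evaluation.md."""

    gate: str
    passed: bool
    reason_code: str = ""
    details: dict = field(default_factory=dict)

    def __str__(self) -> str:
        # Documented rendering: "[G2] PASS" or "[G2] FLAG:NC_HIT".
        status = "PASS" if self.passed else f"FLAG:{self.reason_code}"
        return f"[{self.gate}] {status}"


def sketch_verdict(gate_results: list[GateResult]) -> str:
    """Hypothetical restatement: PROMOTE only if all passed and G5 is present."""
    has_g5 = any(r.gate == "G5" for r in gate_results)
    all_passed = all(r.passed for r in gate_results)
    return VERDICT_PROMOTE if has_g5 and all_passed else VERDICT_HOLD


if __name__ == "__main__":
    flagged = [
        GateResult("G1", True),
        GateResult("G2", False, "NC_HIT"),
        GateResult("G5", True),
    ]
    print(", ".join(str(r) for r in flagged))
    print(sketch_verdict(flagged))
```

A test along these lines would pin down the two HOLD paths the guide describes: a flagged gate, and a run with no G5 result at all.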
diff --git a/examples/flycanon_eval_example.py b/examples/flycanon_eval_example.py
new file mode 100644
index 00000000..30e66bd1
--- /dev/null
+++ b/examples/flycanon_eval_example.py
@@ -0,0 +1,376 @@
+# Copyright 2026 Firefly Software Foundation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""FlyCanon evaluation example — RAG retrieval benchmark with champion/challenger tracking.
+
+Demonstrates how to use ``fireflyframework_agentic.evaluation`` to replicate
+the flycanon experiment evaluation workflow:
+
+1. Load a results JSONL file produced by a flycanon retrieval pipeline.
+2. Compute deterministic IR metrics (Recall@k, Precision@k, MRR, nDCG, MAP).
+3. Compare against a saved baseline to detect regression.
+4. Print a formatted metrics table.
+5. Offer to promote the new run to champion when it beats the baseline.
+
+The champion/challenger pattern mirrors the flycanon_experiments harness:
+each run writes metrics to a file; ``approve`` promotes it by repointing
+baseline.json.  Here we replicate that flow using the framework's
+individual retrieval metric functions directly.
+
+Usage::
+
+    # Score a results file (no baseline comparison)
+    python examples/flycanon_eval_example.py --results-file results.jsonl
+
+    # Compare against a saved baseline
+    python examples/flycanon_eval_example.py \\
+        --results-file results.jsonl \\
+        --baseline baseline.json
+
+    # Promote if better (write new champion to baseline.json)
+    python examples/flycanon_eval_example.py \\
+        --results-file results.jsonl \\
+        --baseline baseline.json \\
+        --promote-if-better
+
+Exit codes: 0 = scored successfully, 1 = regression vs baseline or empty results file.
+
+Results JSONL format
+--------------------
+Each line is a JSON object representing one query's retrieval result::
+
+    {
+        "question": "What was Apple's revenue in Q4 2023?",
+        "gold": ["AAPL_10K_2023", "AAPL_10Q_Q4_2023"],
+        "retrieved": [
+            {"rank": 1, "source_id": "AAPL_10K_2023",  "is_gold": true},
+            {"rank": 2, "source_id": "MSFT_10K_2023",  "is_gold": false},
+            {"rank": 3, "source_id": "AAPL_10Q_Q4_2023", "is_gold": true}
+        ],
+        "answer": "Apple's revenue in Q4 2023 was $89.5 billion.",
+        "no_answer": false,
+        "citations": [
+            {"source_id": "AAPL_10K_2023", "is_gold": true}
+        ],
+        "search_ms": 142,
+        "answer_ms": 2310
+    }
+
+The ``gold`` list contains the source IDs that are considered correct answers.
+Each entry in ``retrieved`` must have a 1-based ``rank``, ``source_id`` (or
+``identities`` list), and ``is_gold`` bool.
+
+Baseline JSON format
+--------------------
+A flat JSON object with metric names as keys and float values::
+
+    {
+        "ndcg@10": 0.7234,
+        "mrr@10": 0.6891,
+        "recall@10": 0.8120,
+        "hit@10": 0.9100,
+        "map@10": 0.6543,
+        "n_queries": 200
+    }
+
+``--promote-if-better`` writes this flat format, with one key per computed metric.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+from fireflyframework_agentic.evaluation import (
+    citation_precision,
+    hit_at_k,
+    map_score,
+    mean_latency_ms,
+    mrr,
+    ndcg,
+    no_answer_rate,
+    precision_at_k,
+    recall_at_k,
+)
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+# Metrics that form the primary quality signal for champion/challenger
+# comparisons.  These are listed in priority order: nDCG@10 is the primary
+# ranking metric; MRR@10 measures how quickly the first gold result appears;
+# Recall@10 measures overall coverage; Hit@10 measures binary success rate;
+# MAP@10 measures precision across the ranked list.
+PRIMARY_METRICS = ["ndcg@10", "mrr@10", "recall@10", "hit@10", "map@10"]
+
+# Regression threshold: a metric must drop by more than this fraction of its
+# baseline value to be flagged as a regression (guards against noise).
+REGRESSION_THRESHOLD = 0.01
+
+
+def _load_jsonl(path: str) -> list[dict]:
+    """Load a newline-delimited JSON file, one object per line."""
+    lines = Path(path).read_text(encoding="utf-8").strip().splitlines()
+    return [json.loads(line) for line in lines if line.strip()]
+
+
+def _load_baseline(path: str) -> dict | None:
+    """Load a baseline JSON file, returning None if it does not exist."""
+    p = Path(path)
+    if not p.exists():
+        return None
+    return json.loads(p.read_text(encoding="utf-8"))
+
+
+def _save_baseline(path: str, metrics: dict) -> None:
+    """Write a flat metrics dict to the baseline JSON file."""
+    Path(path).write_text(json.dumps(metrics, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
+
+
+def _compute_metrics(results: list[dict]) -> dict:
+    """Compute all IR metrics and return a flat dict."""
+    return {
+        "n_queries": len(results),
+        "hit@1": hit_at_k(results, 1),
+        "hit@5": hit_at_k(results, 5),
+        "hit@10": hit_at_k(results, 10),
+        "recall@1": recall_at_k(results, 1),
+        "recall@5": recall_at_k(results, 5),
+        "recall@10": recall_at_k(results, 10),
+        "precision@1": precision_at_k(results, 1),
+        "precision@5": precision_at_k(results, 5),
+        "precision@10": precision_at_k(results, 10),
+        "mrr@10": mrr(results),
+        "map@10": map_score(results),
+        "ndcg@10": ndcg(results),
+        "no_answer_rate": no_answer_rate(results),
+        "citation_precision": citation_precision(results),
+        "mean_search_ms": mean_latency_ms(results, "search_ms"),
+        "mean_answer_ms": mean_latency_ms(results, "answer_ms"),
+    }
+
+
+def _print_metrics_table(flat: dict, baseline: dict | None) -> None:
+    """Print a formatted table comparing current metrics vs baseline."""
+
+    col_w = 22
+    num_w = 10
+    header = f"{'Metric':<{col_w}} {'Current':>{num_w}}"
+    if baseline:
+        header += f" {'Baseline':>{num_w}} {'Delta':>{num_w}}"
+    print(header)
+    print("-" * (col_w + 1 + num_w + (num_w * 2 + 2 if baseline else 0)))
+
+    for key, value in flat.items():
+        if value is None:
+            continue
+        # Format floats as 4 decimal places; ints as plain integers.
+        cur_str = f"{value:.4f}" if isinstance(value, float) else str(value)
+
+        row = f"{key:<{col_w}} {cur_str:>{num_w}}"
+        if baseline and key in baseline and isinstance(value, float):
+            base_val = baseline[key]
+            delta = value - base_val
+            delta_str = f"{delta:+.4f}"
+            row += f" {base_val:>{num_w}.4f} {delta_str:>{num_w}}"
+        print(row)
+
+    print()
+
+
+def _detect_regressions(flat: dict, baseline: dict) -> list[str]:
+    """Return the names of primary metrics that regressed vs baseline.
+
+    A regression is flagged when the new value drops by more than
+    REGRESSION_THRESHOLD * baseline_value (relative threshold).  This
+    guards against flagging noise as a regression.
+    """
+    regressions = []
+    for key in PRIMARY_METRICS:
+        new_val = flat.get(key)
+        base_val = baseline.get(key)
+        if new_val is None or base_val is None:
+            continue
+        if base_val > 0 and (base_val - new_val) / base_val > REGRESSION_THRESHOLD:
+            regressions.append(key)
+    return regressions
+
+
+def _beats_baseline(flat: dict, baseline: dict) -> bool:
+    """Return True if the new metrics strictly beat the baseline.
+
+    'Beat' means no primary metric has regressed beyond REGRESSION_THRESHOLD
+    AND at least one primary metric has improved; a tie returns False.
+    """
+    regressions = _detect_regressions(flat, baseline)
+    if regressions:
+        return False
+    # Check for at least one improvement.
+    for key in PRIMARY_METRICS:
+        new_val = flat.get(key)
+        base_val = baseline.get(key)
+        if new_val is not None and base_val is not None and new_val > base_val:
+            return True
+    return False
+
+
+# ---------------------------------------------------------------------------
+# Main evaluation flow
+# ---------------------------------------------------------------------------
+
+
+def run_evaluation(args: argparse.Namespace) -> int:
+    """Run retrieval metric scoring and optional champion/challenger comparison."""
+
+    # ------------------------------------------------------------------
+    # Step 1 — Load results from the JSONL file.
+    #
+    # Each line is one query's retrieval result.  The file is produced by
+    # a flycanon pipeline run (runner.run_queries writes results.jsonl).
+    # ------------------------------------------------------------------
+    print(f"Loading results  : {args.results_file}")
+    results = _load_jsonl(args.results_file)
+    print(f"  {len(results)} query results loaded.")
+
+    if not results:
+        print("ERROR: results file is empty.", file=sys.stderr)
+        return 1
+
+    # ------------------------------------------------------------------
+    # Step 2 — Compute deterministic IR metrics.
+    #
+    # Metrics are computed at cut-offs k ∈ {1, 5, 10} and include:
+    #   hit@k       -- at least one gold doc in top-k (binary)
+    #   recall@k    -- fraction of gold docs found in top-k
+    #   precision@k -- fraction of top-k that are gold
+    #   mrr@10      -- mean reciprocal rank of first gold hit
+    #   map@10      -- mean average precision
+    #   ndcg@10     -- normalised discounted cumulative gain
+    # ------------------------------------------------------------------
+    print("\nComputing retrieval metrics ...")
+    flat = _compute_metrics(results)
+
+    print(f"  nDCG@10    : {flat['ndcg@10']:.4f}")
+    print(f"  MRR@10     : {flat['mrr@10']:.4f}")
+    print(f"  Recall@10  : {flat['recall@10']:.4f}")
+    print(f"  Hit@10     : {flat['hit@10']:.4f}")
+    print(f"  MAP@10     : {flat['map@10']:.4f}")
+
+    # ------------------------------------------------------------------
+    # Step 3 — Load the baseline (champion) for regression detection.
+    # ------------------------------------------------------------------
+    baseline = None
+    if args.baseline:
+        baseline = _load_baseline(args.baseline)
+        if baseline:
+            print(f"\nLoaded baseline  : {args.baseline}")
+        else:
+            print(f"\nNo baseline found at {args.baseline} — first run, no comparison.")
+
+    # ------------------------------------------------------------------
+    # Step 4 — Print the full metrics table.
+    # ------------------------------------------------------------------
+    print("\n" + "=" * 56)
+    print("Retrieval Metrics")
+    print("=" * 56)
+    _print_metrics_table(flat, baseline)
+
+    # ------------------------------------------------------------------
+    # Step 5 — Regression check.
+    #
+    # Compare against the baseline on primary metrics.  Any regression
+    # blocks promotion and exits with code 1, whether or not
+    # --promote-if-better is set (a regressed run can never beat the baseline).
+    # ------------------------------------------------------------------
+
+    if baseline:
+        regressions = _detect_regressions(flat, baseline)
+        if regressions:
+            print(f"REGRESSION detected on: {', '.join(regressions)}")
+            print(f"  Threshold: {REGRESSION_THRESHOLD * 100:.0f}% relative drop on any primary metric.")
+        else:
+            better = _beats_baseline(flat, baseline)
+            if better:
+                print("Challenger BEATS baseline on at least one primary metric.")
+            else:
+                print("Challenger is on-par with baseline (no regression, no improvement).")
+
+        if regressions:
+            print("\nVerdict: HOLD — regression detected.  Tune the pipeline and re-run.")
+            return 1
+
+    # ------------------------------------------------------------------
+    # Step 6 — Champion promotion.
+    #
+    # When --promote-if-better is set and the run either has no baseline yet
+    # or strictly beats it, save the new metrics as the champion.  Future
+    # runs will compare against this updated record.
+    # ------------------------------------------------------------------
+    if args.promote_if_better and args.baseline:
+        if baseline is None or _beats_baseline(flat, baseline):
+            _save_baseline(args.baseline, flat)
+            print(f"\nChampion PROMOTED — metrics saved to {args.baseline}")
+        else:
+            print("\nNot promoted — challenger did not beat baseline on primary metrics.")
+
+    print("\nVerdict: HOLD" if baseline and _detect_regressions(flat, baseline) else "\nVerdict: PROMOTE")
+    return 0
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def build_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        prog="flycanon_eval_example",
+        description=(
+            "FlyCanon RAG retrieval benchmark — computes IR metrics from a results JSONL "
+            "and compares against a champion baseline."
+        ),
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    p.add_argument(
+        "--results-file",
+        required=True,
+        help="Path to results.jsonl produced by the flycanon pipeline.",
+    )
+    p.add_argument(
+        "--baseline",
+        default=None,
+        help=("Path to baseline.json (champion store).  When absent, scores are printed without comparison."),
+    )
+    p.add_argument(
+        "--promote-if-better",
+        action="store_true",
+        help=(
+            "When set, write new metrics to baseline.json if the challenger beats the "
+            "champion on primary metrics.  Has no effect when --baseline is omitted."
+        ),
+    )
+    return p
+
+
+def main() -> None:
+    parser = build_parser()
+    args = parser.parse_args()
+    sys.exit(run_evaluation(args))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/fireflyframework_agentic/__init__.py b/fireflyframework_agentic/__init__.py
index 993b0248..1736f1f4 100644
--- a/fireflyframework_agentic/__init__.py
+++ b/fireflyframework_agentic/__init__.py
@@ -24,6 +24,13 @@
 
     config = get_config()
     print(config.default_model)
+
+Optional subpackages (not imported eagerly at the top level):
+    fireflyframework_agentic.lab          -- sessions, benchmarks, datasets, evaluation orchestration
+    fireflyframework_agentic.experiments  -- experiment tracking and comparison
+    fireflyframework_agentic.evaluation   -- gate-based quality gates, LLM-as-judge advisory,
+                                            champion/challenger tracking, retrieval metrics
+                                            (requires the ``evaluation`` optional extra)
 """
 
 from importlib.metadata import PackageNotFoundError, version
diff --git a/fireflyframework_agentic/evaluation/__init__.py b/fireflyframework_agentic/evaluation/__init__.py
new file mode 100644
index 00000000..35dd32f7
--- /dev/null
+++ b/fireflyframework_agentic/evaluation/__init__.py
@@ -0,0 +1,111 @@
+from fireflyframework_agentic.evaluation.judge import (
+    AdvisoryReport as AdvisoryReport,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    EvalContext as EvalContext,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    Metric as Metric,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    actionability as actionability,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    addresses_question as addresses_question,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    answer_correctness as answer_correctness,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    answer_relevancy as answer_relevancy,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    citation_relevance as citation_relevance,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    comparative_vs_champion as comparative_vs_champion,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    contains_answer as contains_answer,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    context_precision as context_precision,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    context_recall as context_recall,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    contradiction as contradiction,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    excerpt_fill_rate as excerpt_fill_rate,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    fabricated_entity as fabricated_entity,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    faithfulness as faithfulness,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    nc_semantic_precision as nc_semantic_precision,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    numeric_temporal_fidelity as numeric_temporal_fidelity,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    open_gap as open_gap,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    ragas_faithfulness as ragas_faithfulness,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    run_judge as run_judge,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    semantic_recovery as semantic_recovery,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    severity_calibration as severity_calibration,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    source_coverage as source_coverage,
+)
+from fireflyframework_agentic.evaluation.judge import (
+    surface_deduplication as surface_deduplication,
+)
+from fireflyframework_agentic.evaluation.judge_client import (
+    JudgeClient as JudgeClient,
+)
+from fireflyframework_agentic.evaluation.judge_client import (
+    parse_model as parse_model,
+)
+from fireflyframework_agentic.evaluation.judge_client import (
+    same_provider as same_provider,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    citation_precision as citation_precision,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    hit_at_k as hit_at_k,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    map_score as map_score,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    mean_latency_ms as mean_latency_ms,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    mrr as mrr,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    ndcg as ndcg,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    no_answer_rate as no_answer_rate,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    precision_at_k as precision_at_k,
+)
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    recall_at_k as recall_at_k,
+)
diff --git a/fireflyframework_agentic/evaluation/judge.py b/fireflyframework_agentic/evaluation/judge.py
new file mode 100644
index 00000000..d5bcad66
--- /dev/null
+++ b/fireflyframework_agentic/evaluation/judge.py
@@ -0,0 +1,890 @@
+"""Evaluation judge — async metrics for flyradar and flycanon pipelines.
+
+Every metric: async def metric_name(item: dict, ctx: EvalContext) -> dict | float | None
+
+Flyradar item keys: findings, evidence_index, process_graph, proposed_actions,
+  workspace, reports, lexical_missed_ids, nc_items, champion
+Flycanon item keys: question, answer, reference, contexts
+"""
+
+from __future__ import annotations
+
+import asyncio
+import math
+import os
+import statistics
+from collections.abc import Awaitable, Callable
+from dataclasses import dataclass, field
+
+from pydantic import BaseModel, ConfigDict
+
+from fireflyframework_agentic.embeddings.providers.ollama import OllamaEmbedder
+from fireflyframework_agentic.embeddings.similarity import cosine_similarity
+from fireflyframework_agentic.evaluation.judge_client import JudgeClient, same_provider
+
+Metric = Callable[["dict", "EvalContext"], Awaitable["dict | float | None"]]
+
+SYSTEM = "You are a meticulous evaluator of a process-mining discovery report. Return ONLY a JSON object."
+
+SYSTEM_RAG = "You are an evaluator of a RAG system's answers. Return ONLY a JSON object."
+
+RUBRIC = (
+    "Score the ANSWER on two metrics:\n"
+    "- contains_answer (0.0-1.0): Does the answer contain the correct information from the REFERENCE?\n"
+    "- addresses_question (0.0-1.0): Does the answer directly address what the QUESTION is asking?\n"
+    'Reply with ONLY {"contains_answer": <float>, "addresses_question": <float>}.'
+)
+
+
+class EvalContext(BaseModel):
+    model_config = ConfigDict(arbitrary_types_allowed=True)
+
+    client: JudgeClient
+    embedder: OllamaEmbedder | None = None
+    runs: int = 3
+
+
+@dataclass
+class AdvisoryReport:
+    """The G4 output: a plain metrics bag, never a GateResult.
+
+    metrics maps metric-name -> small dict (the per-metric summary).  details
+    carries supporting context (counts, ids).  errors lists per-metric failures
+    captured by run_judge's best-effort try/except so nothing propagates.
+    """
+
+    judge_model: str
+    same_provider_caveat: bool
+    calibrated: bool  # ALWAYS False for now
+    runs: int
+    metrics: dict = field(default_factory=dict)
+    details: dict = field(default_factory=dict)
+    errors: list[str] = field(default_factory=list)
+
+
+# ── shared accessors ───────────────────────────────────────────────────────────
+
+
+def _evidence_index(item: dict) -> dict[str, dict]:
+    return {ev.get("id"): ev for ev in item.get("evidence_index", []) if ev.get("id")}
+
+
+def _cited_excerpts(finding: dict, evidence_index: dict[str, dict]) -> list[str]:
+    """Excerpts of the evidence a finding cites (via evidence_refs.evidence_id)."""
+    out: list[str] = []
+    for ref in finding.get("evidence_refs", []):
+        ev = evidence_index.get(ref.get("evidence_id", ""))
+        if ev:
+            excerpt = ev.get("excerpt") or ""
+            if excerpt:
+                out.append(excerpt)
+    return out
+
+
+def _output_text(item: dict) -> str:
+    """All free text the model emitted: finding titles+descriptions + reports."""
+    parts: list[str] = []
+    for f in item.get("findings", []):
+        parts.append(f.get("title", ""))
+        parts.append(f.get("description", ""))
+    for r in item.get("reports", []):
+        parts.append(str(r))
+    return "\n".join(p for p in parts if p)
+
+
+def _workspace_intention(item: dict) -> str:
+    ws = item.get("workspace") or {}
+    return f"{ws.get('name', '')}\n{ws.get('description', '')}".strip()
+
+
+def _coerce_float(value, default=None):
+    """Coerce a model-returned number/numeric-string to float; total (never raises)."""
+    try:
+        return float(value)
+    except (TypeError, ValueError):
+        return default
+
+
+def _source_stem(locator: str) -> str:
+    """Return the part before the first '#', or the full string if no '#'."""
+    idx = locator.find("#")
+    return locator[:idx] if idx != -1 else locator
+
+
+async def _gather_chat(chat_fn, prompts: list[tuple[str, str]]) -> list[dict]:
+    """Run (system, user) prompts concurrently; results keep prompt order and failed calls become {}."""
+    results = await asyncio.gather(*[chat_fn(s, u) for s, u in prompts], return_exceptions=True)
+    return [r if isinstance(r, dict) else {} for r in results]
+
+
+# ── [D] DETERMINISTIC — no LLM, always available ────────────────────────────────
+
+
+async def source_coverage(item: dict, ctx: EvalContext) -> dict:  # noqa: ARG001
+    """Distinct source documents cited by >=1 finding vs all source documents.
+
+    Returns {cited, total, orphaned} where orphaned is the sorted list of
+    source stems present in evidence_index but cited by no finding.
+    """
+    ev_idx = _evidence_index(item)
+    all_stems = {_source_stem(ev.get("locator", "")) for ev in item.get("evidence_index", []) if ev.get("locator")}
+    cited_stems: set[str] = set()
+    for f in item.get("findings", []):
+        for ref in f.get("evidence_refs", []):
+            ev = ev_idx.get(ref.get("evidence_id", ""))
+            if ev and ev.get("locator"):
+                cited_stems.add(_source_stem(ev["locator"]))
+    cited_stems &= all_stems
+    orphaned = sorted(all_stems - cited_stems)
+    return {"cited": len(cited_stems), "total": len(all_stems), "orphaned": orphaned}
+
+
+async def excerpt_fill_rate(item: dict, ctx: EvalContext) -> dict:  # noqa: ARG001
+    """Fraction of evidence_index entries with a non-empty excerpt.
+
+    Returns {populated, total}.
+    """
+    entries = item.get("evidence_index", [])
+    populated = sum(1 for ev in entries if (ev.get("excerpt") or "").strip())
+    return {"populated": populated, "total": len(entries)}
+
+
+# ── [E] EMBEDDING — needs embedder ───────────────────────────────────────────────
+
+
+async def semantic_recovery(item: dict, ctx: EvalContext, tau: float = 0.70) -> dict | None:
+    """Context-recall: recover lexical misses by embedding similarity.
+
+    Reads item["lexical_missed_ids"] (list of str).
+    Returns None if ctx.embedder is None.
+    """
+    if ctx.embedder is None:
+        return None
+
+    lexical_missed_ids: list[str] = item.get("lexical_missed_ids", [])
+    missed = set(lexical_missed_ids or [])
+
+    # Candidate surface: every finding description plus the excerpts of the
+    # evidence that finding cites.  Each missed finding is embedded and compared
+    # against this surface; a best cosine >= tau counts as recovered.
+    # NOTE: the surface includes the missed findings' own descriptions, so a
+    # missed finding can match itself.  nc_items are not used here; with no
+    # separate registry, the findings list is the recall denominator.
+    ev_idx = _evidence_index(item)
+    candidate_texts: list[str] = []
+    for f in item.get("findings", []):
+        desc = f.get("description", "")
+        if desc:
+            candidate_texts.append(desc)
+        candidate_texts.extend(_cited_excerpts(f, ev_idx))
+
+    # lexical_missed_ids carries only IDs, so descriptions are looked up on the
+    # findings.  With nothing to embed, return the lexical-only result.
+    all_findings = item.get("findings", [])
+    denom = max(len(all_findings), 1)
+    lexical_hits = sum(1 for f in all_findings if f.get("id") not in missed)
+
+    missed_descs: list[tuple[str, str]] = [
+        (f.get("id", ""), f.get("description", ""))
+        for f in all_findings
+        if f.get("id") in missed and f.get("description")
+    ]
+
+    if not missed_descs or not candidate_texts:
+        recovered_recall = lexical_hits / denom
+        return {
+            "lexical_recall": round(lexical_hits / denom, 4),
+            "recovered_recall": round(recovered_recall, 4),
+            "recovered": [],
+            "tau": tau,
+            "scored_denominator": denom,
+        }
+
+    item_texts = [desc for _fid, desc in missed_descs]
+    item_vecs = await ctx.embedder._embed_batch(item_texts)
+    cand_vecs = await ctx.embedder._embed_batch(candidate_texts)
+
+    recovered: list[dict] = []
+    for (fid, _desc), ivec in zip(missed_descs, item_vecs, strict=False):
+        best = max((cosine_similarity(ivec, cvec) for cvec in cand_vecs), default=0.0)
+        if best >= tau:
+            recovered.append({"id": fid, "cosine": round(best, 4)})
+
+    recovered_recall = (lexical_hits + len(recovered)) / denom
+    return {
+        "lexical_recall": round(lexical_hits / denom, 4),
+        "recovered_recall": round(recovered_recall, 4),
+        "recovered": recovered,
+        "tau": tau,
+        "scored_denominator": denom,
+    }
+
+
+# ── [J] JUDGE — needs chat_fn(system, user) -> dict ──────────────────────────────
+
+
+async def faithfulness(item: dict, ctx: EvalContext) -> dict:
+    """Entailment: does each finding's cited evidence SUPPORT its claim?
+
+    Returns {supported, total, unsupported_ids}.
+    """
+    ev_idx = _evidence_index(item)
+    findings = item.get("findings", [])
+    cited = [(f, _cited_excerpts(f, ev_idx)) for f in findings]
+    prompts = [
+        (
+            SYSTEM,
+            "Does the cited evidence span ENTAIL the claim made in this finding?\n"
+            'Reply with ONLY {"verdict": "SUPPORTED" or "NOT_SUPPORTED", "reason": "<one line>"}.\n\n'
+            f"FINDING: {f.get('description', '')}\n"
+            f"CITED EVIDENCE: {' || '.join(excerpts)}",
+        )
+        for f, excerpts in cited
+        if excerpts
+    ]
+    answers = iter(await _gather_chat(ctx.client.chat_json, prompts))
+    supported = 0
+    unsupported_ids: list[str] = []
+    for f, excerpts in cited:
+        fid = f.get("id", "?")
+        if not excerpts:
+            unsupported_ids.append(fid)
+            continue
+        verdict = str(next(answers).get("verdict", "")).upper()
+        if verdict == "SUPPORTED":
+            supported += 1
+        else:
+            unsupported_ids.append(fid)
+    return {"supported": supported, "total": len(findings), "unsupported_ids": unsupported_ids}
+
+
+async def numeric_temporal_fidelity(item: dict, ctx: EvalContext) -> dict:
+    """Flag numbers/dates asserted in a finding that do NOT match its evidence.
+
+    Returns {mismatches: [{finding_id, value, source}], count}.
+    """
+    ev_idx = _evidence_index(item)
+    scored = [(f, excerpts) for f in item.get("findings", []) if (excerpts := _cited_excerpts(f, ev_idx))]
+    prompts = [
+        (
+            SYSTEM,
+            "List every specific number or date asserted in the FINDING that does "
+            "NOT match the CITED EVIDENCE.\n"
+            'Reply with ONLY {"mismatches": [{"value": "<claimed>", "source": "<what the evidence says>"}]}. '
+            "Empty list if all match.\n\n"
+            f"FINDING: {f.get('description', '')}\n"
+            f"CITED EVIDENCE: {' || '.join(excerpts)}",
+        )
+        for f, excerpts in scored
+    ]
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+    mismatches: list[dict] = []
+    for (f, _excerpts), answer in zip(scored, answers, strict=False):
+        for m in answer.get("mismatches", []) or []:
+            mismatches.append(
+                {
+                    "finding_id": f.get("id", "?"),
+                    "value": m.get("value", ""),
+                    "source": m.get("source", ""),
+                }
+            )
+    return {"mismatches": mismatches, "count": len(mismatches)}
+
+
+async def citation_relevance(item: dict, ctx: EvalContext) -> dict:
+    """Context precision: fraction of cited passages actually relevant to the claim.
+
+    Returns {precision, relevant, total}.
+    """
+    ev_idx = _evidence_index(item)
+    prompts: list[tuple[str, str]] = []
+    for f in item.get("findings", []):
+        desc = f.get("description", "")
+        for ref in f.get("evidence_refs", []):
+            ev = ev_idx.get(ref.get("evidence_id", ""))
+            if not ev:
+                continue
+            excerpt = ev.get("excerpt") or ""
+            if not excerpt:
+                continue
+            prompts.append(
+                (
+                    SYSTEM,
+                    "Is this cited passage actually relevant to / used by this claim?\n"
+                    'Reply with ONLY {"relevant": "yes" or "no"}.\n\n'
+                    f"CLAIM: {desc}\n"
+                    f"CITED PASSAGE: {excerpt}",
+                )
+            )
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+    total = len(prompts)
+    relevant = sum(1 for a in answers if str(a.get("relevant", "")).lower() == "yes")
+    if not total:
+        return {"precision": None, "relevant": relevant, "total": total}
+    return {"precision": round(relevant / total, 4), "relevant": relevant, "total": total}
+
+
+async def nc_semantic_precision(item: dict, ctx: EvalContext) -> dict:
+    """Count negative-control falsehoods the output asserts or endorses.
+
+    Reads item["nc_items"] as list of {"id": ..., "description": ...} dicts.
+    Returns {asserted, total, asserted_ids}.
+    """
+    output_text = _output_text(item)
+    nc_items: list[dict] = item.get("nc_items") or []
+    prompts = [
+        (
+            SYSTEM,
+            "Does the OUTPUT assert or endorse the following FALSE statement?\n"
+            'Reply with ONLY {"asserted": "yes" or "no"}.\n\n'
+            f"FALSE STATEMENT: {nc.get('description', '')}\n"
+            f"OUTPUT:\n{output_text}",
+        )
+        for nc in nc_items
+    ]
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+    asserted_ids = [
+        nc.get("id", "?")
+        for nc, a in zip(nc_items, answers, strict=False)
+        if str(a.get("asserted", "")).lower() == "yes"
+    ]
+    return {"asserted": len(asserted_ids), "total": len(nc_items), "asserted_ids": asserted_ids}
+
+
+async def fabricated_entity(item: dict, ctx: EvalContext) -> dict:
+    """Count systems/orgs/metrics named in the output but absent from the corpus.
+
+    Returns {count, entities}.
+    """
+    output_text = _output_text(item)
+    corpus = "\n".join(f"{ev.get('locator', '')} :: {ev.get('excerpt', '')}" for ev in item.get("evidence_index", []))
+    user = (
+        "List any system, organization, or metric NAMED in the OUTPUT that does NOT "
+        "appear anywhere in the CORPUS EVIDENCE.\n"
+        'Reply with ONLY {"fabricated": ["<entity>", ...]}.  Empty list if none.\n\n'
+        f"OUTPUT:\n{output_text}\n\n"
+        f"CORPUS EVIDENCE:\n{corpus}"
+    )
+    answer = await ctx.client.chat_json(SYSTEM, user)
+    entities = answer.get("fabricated", []) or []
+    return {"count": len(entities), "entities": list(entities)}
+
+
+async def contradiction(item: dict, ctx: EvalContext) -> dict:
+    """Count internally contradictory finding pairs.
+
+    Returns {count, pairs}.
+    """
+    lines = []
+    for f in item.get("findings", []):
+        lines.append(f"{f.get('id', '?')}: {f.get('title', '')} — {f.get('description', '')}")
+    user = (
+        "Are any two of these FINDINGS mutually contradictory? List each contradicting pair.\n"
+        'Reply with ONLY {"pairs": [["<id_a>", "<id_b>"], ...]}.  Empty list if none.\n\n' + "\n".join(lines)
+    )
+    answer = await ctx.client.chat_json(SYSTEM, user)
+    pairs = answer.get("pairs", []) or []
+    return {"count": len(pairs), "pairs": [list(p) for p in pairs]}
+
+
+async def open_gap(item: dict, ctx: EvalContext) -> dict:
+    """G-Eval open probe: the most important process issue the output missed.
+
+    Returns {gap} — a free-text advisory narrative (no score).
+    """
+    pg = item.get("process_graph") or {}
+    pg_summary = f"process_graph has {len(pg.get('processes', []))} processes"
+    user = (
+        "Given this corpus scope and output, what important process issue did the "
+        "output FAIL to surface?\n"
+        'Reply with ONLY {"gap": "<the most important missed issue, one short paragraph>"}.\n\n'
+        f"WORKSPACE SCOPE: {_workspace_intention(item)}\n"
+        f"{pg_summary}\n"
+        f"OUTPUT:\n{_output_text(item)}"
+    )
+    answer = await ctx.client.chat_json(SYSTEM, user)
+    return {"gap": str(answer.get("gap", ""))}
+
+
+async def actionability(item: dict, ctx: EvalContext) -> dict:
+    """Average 0-1 rating of whether proposed actions are specific+quantified+linked.
+
+    Returns {score, rated}.
+    """
+    actions = item.get("proposed_actions", []) or []
+    finding_ids = {f.get("id") for f in item.get("findings", [])}
+    prompts = [
+        (
+            SYSTEM,
+            "Rate whether this proposed action is SPECIFIC, QUANTIFIED, and LINKED to a "
+            "finding.\n"
+            'Reply with ONLY {"score": <number 0-1>}.\n\n'
+            f"TITLE: {a.get('title', '')}\n"
+            f"DESCRIPTION: {a.get('description', '')}\n"
+            f"OWNER: {a.get('owner_persona', '')}  HORIZON: {a.get('horizon', '')}  "
+            f"LEVER: {a.get('lever', '')}  EFFORT: {a.get('effort', '')}\n"
+            f"EXPECTED_SAVINGS_FTE: {a.get('expected_savings_fte', '')}  "
+            f"EXPECTED_SAVINGS_USD: {a.get('expected_savings_usd', '')}\n"
+            f"LINKED_TO_FINDING: {a.get('finding_id') in finding_ids}",
+        )
+        for a in actions
+    ]
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+    scores: list[float] = []
+    for a in answers:
+        value = _coerce_float(a.get("score"))
+        if value is None:
+            continue
+        scores.append(value)
+    score = round(sum(scores) / len(scores), 4) if scores else None
+    return {"score": score, "rated": len(scores)}
+
+
+async def severity_calibration(item: dict, ctx: EvalContext) -> dict:
+    """Per-finding judgment of whether stated severity matches the evidence.
+
+    Returns {miscalibrated, total, verdicts: {finding_id: under|over|calibrated}}.
+    """
+    ev_idx = _evidence_index(item)
+    findings = item.get("findings", [])
+    prompts = [
+        (
+            SYSTEM,
+            "Does the STATED SEVERITY match what the CITED EVIDENCE supports?\n"
+            'Reply with ONLY {"calibration": "under" or "over" or "calibrated"}.\n\n'
+            f"STATED SEVERITY: {f.get('severity', '')}  SCORE: {f.get('score', '')}\n"
+            f"FINDING: {f.get('description', '')}\n"
+            f"CITED EVIDENCE: {' || '.join(_cited_excerpts(f, ev_idx))}",
+        )
+        for f in findings
+    ]
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+    verdicts: dict[str, str] = {}
+    miscalibrated = 0
+    for f, a in zip(findings, answers, strict=False):
+        verdict = str(a.get("calibration", "calibrated")).lower()
+        verdicts[f.get("id", "?")] = verdict
+        if verdict in ("under", "over"):
+            miscalibrated += 1
+    return {"miscalibrated": miscalibrated, "total": len(findings), "verdicts": verdicts}
+
+
+async def answer_relevancy(item: dict, ctx: EvalContext) -> dict:
+    """RAGAS-style: does the output address the stated workspace intention?
+
+    Returns {score} in [0,1], or {"score": None} when the judge's score cannot be coerced to float.
+    """
+    user = (
+        "Does the OUTPUT address the stated WORKSPACE INTENTION (on-topic, responsive)?\n"
+        'Reply with ONLY {"score": <number 0-1>}.\n\n'
+        f"WORKSPACE INTENTION: {_workspace_intention(item)}\n"
+        f"OUTPUT:\n{_output_text(item)}"
+    )
+    answer = await ctx.client.chat_json(SYSTEM, user)
+    return {"score": _coerce_float(answer.get("score"))}
+
+
+async def surface_deduplication(item: dict, ctx: EvalContext) -> dict:
+    """Fraction of near-duplicate process-graph node pairs that are genuinely distinct.
+
+    Returns {distinct, redundant, total, distinct_rate, redundant_pairs}.
+    """
+    pg = item.get("process_graph") or {}
+    procs = pg.get("processes", [])
+
+    def _toks(node: dict) -> frozenset[str]:
+        return frozenset(node.get("name", "").lower().split())
+
+    per_surface_cap = 10
+    candidates: list[tuple[str, dict, dict, str]] = []
+
+    if len(procs) >= 2:
+        pairs: list[tuple[float, dict, dict]] = []
+        for i in range(len(procs)):
+            for j in range(i + 1, len(procs)):
+                a_t, b_t = _toks(procs[i]), _toks(procs[j])
+                union = a_t | b_t
+                if not union:
+                    continue
+                jac = len(a_t & b_t) / len(union)
+                if jac >= 0.30:
+                    pairs.append((jac, procs[i], procs[j]))
+        pairs.sort(key=lambda x: x[0], reverse=True)
+        for _jac, a, b in pairs[:per_surface_cap]:
+            candidates.append(("process", a, b, ""))
+
+    for surface_key, attr in (("activity", "activities"), ("decision", "decisions")):
+        all_pairs: list[tuple[float, dict, dict, str]] = []
+        for proc in procs:
+            nodes = proc.get(attr, [])
+            proc_name = proc.get("name", "")
+            if len(nodes) < 2:
+                continue
+            for i in range(len(nodes)):
+                for j in range(i + 1, len(nodes)):
+                    a_t, b_t = _toks(nodes[i]), _toks(nodes[j])
+                    union = a_t | b_t
+                    if not union:
+                        continue
+                    jac = len(a_t & b_t) / len(union)
+                    if jac >= 0.30:
+                        all_pairs.append((jac, nodes[i], nodes[j], proc_name))
+        all_pairs.sort(key=lambda x: x[0], reverse=True)
+        for _jac, a, b, proc_name in all_pairs[:per_surface_cap]:
+            candidates.append((surface_key, a, b, proc_name))
+
+    if not candidates:
+        return {"distinct": 0, "redundant": 0, "total": 0, "distinct_rate": None, "redundant_pairs": []}
+
+    prompts = []
+    for surface, a, b, parent_proc in candidates:
+        ctx_line = f"\nPARENT PROCESS: {parent_proc}\n" if parent_proc else ""
+        prompts.append(
+            (
+                SYSTEM,
+                f"Are these two {surface} nodes genuinely DISTINCT process concepts, or is one a "
+                f"duplicate / sub-case / restatement of the other?\n"
+                f"{ctx_line}"
+                'Reply with ONLY {"verdict": "DISTINCT" or "DUPLICATE", "reason": "<one line>"}.\n\n'
+                f"{surface.upper()} A: {a.get('name', '')} — {a.get('description', '')}\n"
+                f"{surface.upper()} B: {b.get('name', '')} — {b.get('description', '')}",
+            )
+        )
+
+    answers = await _gather_chat(ctx.client.chat_json, prompts)
+
+    distinct = 0
+    redundant = 0
+    redundant_pairs: list[dict] = []
+    for (surface, a, b, _parent), answer in zip(candidates, answers, strict=False):
+        verdict = str(answer.get("verdict", "")).upper()
+        if verdict == "DISTINCT":
+            distinct += 1
+        else:
+            redundant += 1
+            redundant_pairs.append(
+                {
+                    "surface": surface,
+                    "a": a.get("name", ""),
+                    "b": b.get("name", ""),
+                    "reason": str(answer.get("reason", "")),
+                }
+            )
+
+    total = distinct + redundant
+    return {
+        "distinct": distinct,
+        "redundant": redundant,
+        "total": total,
+        "distinct_rate": round(distinct / total, 4) if total else None,
+        "redundant_pairs": redundant_pairs,
+    }
+
+
+async def comparative_vs_champion(item: dict, ctx: EvalContext) -> dict | None:
+    """Pairwise MT-Bench-style review of candidate vs champion (advisory only).
+
+    Returns None if item["champion"] is not present.
+    Returns {candidate, champion, more_consistent}.
+    """
+    champion = item.get("champion")
+    if champion is None:
+        return None
+    user = (
+        "Score the CANDIDATE and the CHAMPION outputs on five axes (1-5 each): "
+        "Coverage, Quality, Evidence, Actionability, Regression.  Then say which is "
+        "more internally consistent.\n"
+        "Reply with ONLY "
+        '{"candidate": {"coverage": x, "quality": x, "evidence": x, "actionability": x, "regression": x}, '
+        '"champion": {"coverage": x, "quality": x, "evidence": x, "actionability": x, "regression": x}, '
+        '"more_consistent": "candidate" or "champion"}.\n\n'
+        f"CANDIDATE:\n{_output_text(item)}\n\n"
+        f"CHAMPION:\n{_output_text(champion)}"
+    )
+    out = await ctx.client.chat_json(SYSTEM, user)
+    return {
+        "candidate": out.get("candidate", {}),
+        "champion": out.get("champion", {}),
+        "more_consistent": out.get("more_consistent", ""),
+    }
+
+
+# ── flycanon custom metrics ───────────────────────────────────────────────────────
+
+
+async def _rag_score_once(item: dict, ctx: EvalContext) -> dict | None:
+    """One judge call: the raw {"contains_answer", "addresses_question"} dict, or None if question/answer is empty."""
+    question = item.get("question", "")
+    reference = item.get("reference", "")
+    answer = item.get("answer", "")
+    if not question or not answer:
+        return None
+    user = f"QUESTION: {question}\nREFERENCE: {reference}\nANSWER: {answer}\n\n{RUBRIC}"
+    result = await ctx.client.chat_json(SYSTEM_RAG, user)
+    return result
+
+
+async def contains_answer(item: dict, ctx: EvalContext) -> float | None:
+    """Flycanon: does the answer contain the correct information from the reference?
+
+    Runs ctx.runs times and returns the median score.
+    Returns None if the item lacks question/answer.
+    """
+    scores: list[float] = []
+    for _ in range(max(1, ctx.runs)):
+        result = await _rag_score_once(item, ctx)
+        if result is None:
+            return None
+        val = _coerce_float(result.get("contains_answer"))
+        if val is not None:
+            scores.append(val)
+    if not scores:
+        return None
+    return round(statistics.median(scores), 4)
+
+
+async def addresses_question(item: dict, ctx: EvalContext) -> float | None:
+    """Flycanon: does the answer directly address what the question is asking?
+
+    Runs ctx.runs times and returns the median score.
+    Returns None if the item lacks question/answer.
+    """
+    scores: list[float] = []
+    for _ in range(max(1, ctx.runs)):
+        result = await _rag_score_once(item, ctx)
+        if result is None:
+            return None
+        val = _coerce_float(result.get("addresses_question"))
+        if val is not None:
+            scores.append(val)
+    if not scores:
+        return None
+    return round(statistics.median(scores), 4)
+
+
+# ── RAGAS metrics ─────────────────────────────────────────────────────────────────
+# ragas/langchain imports are inline in each helper below, since the evaluation extra is optional.
+
+
+def _make_ragas_sample(item: dict):
+    """Build a RAGAS SingleTurnSample from an item dict (ragas import inline)."""
+    from ragas import SingleTurnSample  # type: ignore[import]  # noqa: PLC0415
+
+    return SingleTurnSample(
+        user_input=item.get("question", ""),
+        response=item.get("answer", ""),
+        reference=item.get("reference", ""),
+        retrieved_contexts=item.get("contexts", []),
+    )
+
+
+def _make_ragas_llm(ctx: EvalContext):
+    """Build a LangChain LLM wrapper for RAGAS (langchain import inline)."""
+    provider, model = ctx.client.provider, ctx.client.model
+    if provider == "anthropic":
+        from langchain_anthropic import ChatAnthropic  # type: ignore[import]  # noqa: PLC0415
+
+        api_key = os.environ.get("ANTHROPIC_API_KEY", "")
+        return ChatAnthropic(model=model, api_key=api_key, temperature=0.0)  # type: ignore[call-arg,arg-type]
+    if provider in ("openai", "azure"):
+        from langchain_openai import ChatOpenAI  # type: ignore[import]  # noqa: PLC0415
+
+        api_key = os.environ.get("OPENAI_API_KEY", "")
+        return ChatOpenAI(model=model, api_key=api_key, temperature=0.0)  # type: ignore[call-arg,arg-type]
+    if provider == "ollama":
+        from langchain_ollama import ChatOllama  # type: ignore[import]  # noqa: PLC0415
+
+        return ChatOllama(model=model, temperature=0.0)
+    raise ValueError(f"RAGAS: unsupported provider {provider!r}")
+
+
+def _make_ragas_embeddings(ctx: EvalContext):
+    """Build LangChain embeddings for RAGAS (langchain import inline)."""
+    if ctx.embedder is not None:
+        from langchain_ollama import OllamaEmbeddings  # type: ignore[import]  # noqa: PLC0415
+
+        return OllamaEmbeddings(model=ctx.embedder._model)
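+    # Anthropic exposes no embeddings API and langchain_anthropic ships no embeddings
+    # class, so fail with a clear message rather than an ImportError.
+    raise ValueError("RAGAS: no embedder configured; set ctx.embedder to run embedding-based metrics")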
+
+
+async def _ragas_score(metric_name: str, item: dict, ctx: EvalContext) -> float | None:
+    """Run a single named RAGAS metric and return its float score (or None)."""
+
+    def _sync():
+        from ragas import evaluate  # type: ignore[import]  # noqa: PLC0415
+        from ragas.dataset_schema import EvaluationDataset  # type: ignore[import]  # noqa: PLC0415
+        from ragas.metrics import (  # type: ignore[import]  # noqa: PLC0415
+            AnswerCorrectness,
+            AnswerRelevancy,
+            ContextPrecision,
+            ContextRecall,
+            Faithfulness,
+        )
+
+        _metrics_map = {
+            "answer_correctness": AnswerCorrectness,
+            "answer_relevancy_ragas": AnswerRelevancy,
+            "ragas_faithfulness": Faithfulness,
+            "context_recall": ContextRecall,
+            "context_precision": ContextPrecision,
+        }
+        metric_cls = _metrics_map.get(metric_name)
+        if metric_cls is None:
+            return None
+
+        llm = _make_ragas_llm(ctx)
+        embeddings = _make_ragas_embeddings(ctx)
+        metric = metric_cls(llm=llm, embeddings=embeddings)
+        sample = _make_ragas_sample(item)
+        dataset = EvaluationDataset(samples=[sample])
+        result = evaluate(dataset=dataset, metrics=[metric])
+        df = result.to_pandas()  # type: ignore[attr-defined]
+        col = df.columns[df.columns.str.contains(metric_name.removeprefix("ragas_").removesuffix("_ragas"), case=False)]
+        if col.empty:
+            return None
+        val = df[col[0]].iloc[0]
+        if val is None or (isinstance(val, float) and math.isnan(val)):
+            return None
+        return round(float(val), 4)
+
+    loop = asyncio.get_running_loop()
+    return await loop.run_in_executor(None, _sync)
+
+
+async def answer_correctness(item: dict, ctx: EvalContext) -> float | None:
+    """RAGAS answer correctness (semantic F1 against reference)."""
+    return await _ragas_score("answer_correctness", item, ctx)
+
+
+async def ragas_faithfulness(item: dict, ctx: EvalContext) -> float | None:
+    """RAGAS faithfulness (answer grounded in retrieved contexts)."""
+    return await _ragas_score("ragas_faithfulness", item, ctx)
+
+
+async def context_recall(item: dict, ctx: EvalContext) -> float | None:
+    """RAGAS context recall (reference coverage by retrieved contexts)."""
+    return await _ragas_score("context_recall", item, ctx)
+
+
+async def context_precision(item: dict, ctx: EvalContext) -> float | None:
+    """RAGAS context precision (retrieved contexts relevant to the question)."""
+    return await _ragas_score("context_precision", item, ctx)
+
+
+# ── median-of-N helpers ──────────────────────────────────────────────────────────
+
+
+def _numeric_leaves(d: dict) -> dict[tuple, float]:
+    """Flatten a metric dict to {path: float} over its FLOAT score-leaves only."""
+    out: dict[tuple, float] = {}
+
+    def walk(node, path: tuple) -> None:
+        if isinstance(node, float):
+            out[path] = node
+        elif isinstance(node, dict):
+            for k, v in node.items():
+                walk(v, path + (k,))
+
+    walk(d, ())
+    return out
+
+
+def _set_leaf(d: dict, path: tuple, value: float) -> None:
+    node = d
+    for key in path[:-1]:
+        node = node[key]
+    node[path[-1]] = value
+
+
+def _median_runs(samples: list[dict]) -> dict:
+    """Median across N metric-dicts: FLOAT score-leaves -> per-key median; rest = first."""
+    samples = [s for s in samples if isinstance(s, dict)]
+    if not samples:
+        return {}
+    base = samples[0]
+    if len(samples) == 1:
+        return base
+    leaf_values: dict[tuple, list[float]] = {}
+    for s in samples:
+        for path, val in _numeric_leaves(s).items():
+            leaf_values.setdefault(path, []).append(val)
+    merged = dict(base)
+    for path, vals in leaf_values.items():
+        try:
+            _set_leaf(merged, path, round(statistics.median(vals), 4))
+        except (KeyError, TypeError):
+            continue
+    return merged
+
+
+# ── orchestrator ─────────────────────────────────────────────────────────────────
+
+
+async def run_judge(
+    item: dict,
+    ctx: EvalContext,
+    *,
+    pipeline_model: str = "",
+) -> AdvisoryReport:
+    """Run all metrics concurrently and return an AdvisoryReport.
+
+    Best-effort: never raises. Failing metrics append to report.errors.
+    """
+    report = AdvisoryReport(
+        judge_model=ctx.client.model_spec,
+        same_provider_caveat=same_provider(pipeline_model, ctx.client.model_spec),
+        calibrated=False,
+        runs=ctx.runs,
+    )
+
+    # [D] deterministic (no LLM)
+    det_metrics: list[tuple[str, Metric]] = [
+        ("source_coverage", source_coverage),
+        ("excerpt_fill_rate", excerpt_fill_rate),
+    ]
+    # [E] embedding
+    emb_metrics: list[tuple[str, Metric]] = [
+        ("semantic_recovery", semantic_recovery),
+    ]
+    # [J] judge metrics (median-of-runs handled externally for single-call ones)
+    judge_metrics: list[tuple[str, Metric]] = [
+        ("faithfulness", faithfulness),
+        ("numeric_temporal_fidelity", numeric_temporal_fidelity),
+        ("citation_relevance", citation_relevance),
+        ("nc_semantic_precision", nc_semantic_precision),
+        ("fabricated_entity", fabricated_entity),
+        ("contradiction", contradiction),
+        ("open_gap", open_gap),
+        ("actionability", actionability),
+        ("severity_calibration", severity_calibration),
+        ("answer_relevancy", answer_relevancy),
+        ("surface_deduplication", surface_deduplication),
+        ("comparative_vs_champion", comparative_vs_champion),
+    ]
+    # flycanon custom
+    flycanon_metrics: list[tuple[str, Metric]] = [
+        ("contains_answer", contains_answer),
+        ("addresses_question", addresses_question),
+    ]
+    # RAGAS
+    ragas_metrics: list[tuple[str, Metric]] = [
+        ("answer_correctness", answer_correctness),
+        ("ragas_faithfulness", ragas_faithfulness),
+        ("context_recall", context_recall),
+        ("context_precision", context_precision),
+    ]
+
+    all_metrics = det_metrics + emb_metrics + judge_metrics + flycanon_metrics + ragas_metrics
+
+    async def _run_one(name: str, fn: Metric) -> None:
+        try:
+            result = await fn(item, ctx)
+            if result is not None:
+                report.metrics[name] = result
+        except Exception as exc:
+            report.errors.append(f"{name}: {type(exc).__name__}: {exc}")
+
+    await asyncio.gather(*[_run_one(name, fn) for name, fn in all_metrics])
+    return report
diff --git a/fireflyframework_agentic/evaluation/judge_client.py b/fireflyframework_agentic/evaluation/judge_client.py
new file mode 100644
index 00000000..7f050d16
--- /dev/null
+++ b/fireflyframework_agentic/evaluation/judge_client.py
@@ -0,0 +1,254 @@
+"""Async LLM scoring client for judge metrics.
+
+Thin httpx-based wrapper over Anthropic / OpenAI / Azure OpenAI / Ollama.
+Reads API keys lazily (per-call) from env so importing never requires secrets.
+Provider/model spec: "<provider>:<model>", e.g. "anthropic:claude-sonnet-4-6".
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import os
+import re
+
+import httpx
+
+_RETRY_STATUS = (429, 500, 502, 503, 504)
+_MAX_RETRY_AFTER = 30.0
+
+
+def _env(name: str, default: str | None = None) -> str | None:
+    value = os.environ.get(name)
+    if value is None:
+        return default
+    value = value.strip()
+    return value if value else default
+
+
+def parse_model(spec: str) -> tuple[str, str]:
+    """Split "provider:model" -> (provider, model). Bare spec -> ("unknown", spec)."""
+    spec = (spec or "").strip()
+    if ":" not in spec:
+        return "unknown", spec
+    provider, model = spec.split(":", 1)
+    return provider.strip().lower(), model.strip()
+
+
+def same_provider(pipeline_model: str, judge_model: str) -> bool:
+    """True iff both specs share the same known provider prefix."""
+    p, _ = parse_model(pipeline_model)
+    j, _ = parse_model(judge_model)
+    if p == "unknown" or j == "unknown":
+        return False
+    return p == j
+
+
+def _first_json_object(text: str) -> dict:
+    """Extract the first balanced JSON object from text (handles prose/code-fence wrapping)."""
+    if not text:
+        raise ValueError("empty model response")
+
+    # Fast path: a clean JSON object with no surrounding prose.  A non-dict
+    # clean parse (e.g. a top-level array) is intentionally ignored so the brace
+    # scanner can still find an embedded object rather than returning arr[0].
+    try:
+        parsed = json.loads(text.strip())
+    except (json.JSONDecodeError, ValueError):
+        parsed = None
+    if isinstance(parsed, dict):
+        return parsed
+
+    start = text.find("{")
+    while start != -1:
+        depth = 0
+        in_string = False
+        escape = False
+        for i in range(start, len(text)):
+            ch = text[i]
+            if in_string:
+                if escape:
+                    escape = False
+                elif ch == "\\":
+                    escape = True
+                elif ch == '"':
+                    in_string = False
+                continue
+            if ch == '"':
+                in_string = True
+            elif ch == "{":
+                depth += 1
+            elif ch == "}":
+                depth -= 1
+                if depth == 0:
+                    candidate = text[start : i + 1]
+                    try:
+                        return json.loads(candidate)
+                    except json.JSONDecodeError:
+                        break  # try the next '{'
+        start = text.find("{", start + 1)
+
+    # Greedy fallback: first '{' .. last '}' across newlines.
+    match = re.search(r"\{.*\}", text, re.DOTALL)
+    if match:
+        return json.loads(match.group(0))
+    raise ValueError("no JSON object found in model response")
+
+
+class JudgeClient:
+    """Async multi-provider chat client returning parsed JSON dicts.
+
+    Dispatch is by the provider prefix of the model spec.  temperature is pinned
+    to 0.0 to reduce run-to-run variance.  Retryable HTTP statuses (429/500/502/503/504)
+    and network errors are retried; max_retries is the total number of attempts.
+
+    The API key / endpoint env vars are read lazily inside chat_json, so
+    constructing a JudgeClient never requires a secret.
+    """
+
+    def __init__(self, model: str, timeout: int = 120, max_retries: int = 3) -> None:
+        self.model_spec = model
+        self.provider, self.model = parse_model(model)
+        self.timeout = timeout
+        self.max_retries = max_retries
+
+    async def chat_json(self, system: str, user: str, max_tokens: int = 1024) -> dict:
+        """Send (system, user) to the provider and parse the first JSON object.
+
+        Raises on exhausted retries / unknown provider / unparseable output.
+        """
+        last_exc: Exception | None = None
+        for attempt in range(self.max_retries):
+            try:
+                if self.provider == "anthropic":
+                    return await self._anthropic(system, user, max_tokens)
+                if self.provider == "openai":
+                    return await self._openai(system, user, max_tokens)
+                if self.provider == "azure":
+                    return await self._azure(system, user, max_tokens)
+                if self.provider == "ollama":
+                    return await self._ollama(system, user, max_tokens)
+                raise ValueError(
+                    f"unknown judge provider {self.provider!r} in {self.model_spec!r}; "
+                    "use anthropic:/openai:/azure:/ollama:"
+                )
+            except httpx.HTTPStatusError as exc:
+                last_exc = exc
+                if exc.response.status_code not in _RETRY_STATUS or attempt == self.max_retries - 1:
+                    raise
+                retry_after_header = exc.response.headers.get("retry-after")
+                if retry_after_header is not None:
+                    try:
+                        delay = min(float(retry_after_header), _MAX_RETRY_AFTER)
+                    except (TypeError, ValueError):
+                        delay = 2.0**attempt
+                else:
+                    delay = 2.0**attempt
+                await asyncio.sleep(delay)
+            except httpx.RequestError as exc:
+                last_exc = exc
+                if attempt == self.max_retries - 1:
+                    raise
+                await asyncio.sleep(2.0)
+        if last_exc is not None:
+            raise last_exc
+        raise RuntimeError("chat_json exhausted retries without a response")
+
+    async def _anthropic(self, system: str, user: str, max_tokens: int) -> dict:
+        api_key = _env("ANTHROPIC_API_KEY")
+        if not api_key:
+            raise RuntimeError("ANTHROPIC_API_KEY not set")
+        body = {
+            "model": self.model,
+            "max_tokens": max_tokens,
+            "temperature": 0.0,
+            "system": system,
+            "messages": [{"role": "user", "content": user}],
+        }
+        headers = {
+            "x-api-key": api_key,
+            "anthropic-version": "2023-06-01",
+            "content-type": "application/json",
+        }
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            resp = await client.post("https://api.anthropic.com/v1/messages", json=body, headers=headers)
+            resp.raise_for_status()
+            data = resp.json()
+        text = next((b.get("text") for b in data.get("content", []) if b.get("type") == "text"), None)
+        if not text:
+            raise RuntimeError(f"judge returned no text: {data}")
+        return _first_json_object(text)
+
+    async def _openai(self, system: str, user: str, max_tokens: int) -> dict:
+        api_key = _env("OPENAI_API_KEY")
+        if not api_key:
+            raise RuntimeError("OPENAI_API_KEY not set")
+        body = {
+            "model": self.model,
+            "max_tokens": max_tokens,
+            "temperature": 0.0,
+            "messages": [
+                {"role": "system", "content": system},
+                {"role": "user", "content": user},
+            ],
+        }
+        headers = {"Authorization": f"Bearer {api_key}", "content-type": "application/json"}
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            resp = await client.post("https://api.openai.com/v1/chat/completions", json=body, headers=headers)
+            resp.raise_for_status()
+            data = resp.json()
+        choices = data.get("choices") or []
+        if choices:
+            text = (choices[0].get("message") or {}).get("content")
+            if text:
+                return _first_json_object(text)
+        raise RuntimeError(f"judge returned no text: {data}")
+
+    async def _azure(self, system: str, user: str, max_tokens: int) -> dict:
+        endpoint = _env("AZURE_OPENAI_ENDPOINT")
+        api_key = _env("AZURE_OPENAI_API_KEY")
+        if not endpoint:
+            raise RuntimeError("AZURE_OPENAI_ENDPOINT not set")
+        if not api_key:
+            raise RuntimeError("AZURE_OPENAI_API_KEY not set")
+        api_version = _env("AZURE_OPENAI_API_VERSION") or "2024-02-01"
+        url = f"{endpoint.rstrip('/')}/openai/deployments/{self.model}/chat/completions?api-version={api_version}"
+        body = {
+            "max_tokens": max_tokens,
+            "temperature": 0.0,
+            "messages": [
+                {"role": "system", "content": system},
+                {"role": "user", "content": user},
+            ],
+        }
+        headers = {"api-key": api_key, "content-type": "application/json"}
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            resp = await client.post(url, json=body, headers=headers)
+            resp.raise_for_status()
+            data = resp.json()
+        choices = data.get("choices") or []
+        if choices:
+            text = (choices[0].get("message") or {}).get("content")
+            if text:
+                return _first_json_object(text)
+        raise RuntimeError(f"judge returned no text: {data}")
+
+    async def _ollama(self, system: str, user: str, max_tokens: int) -> dict:
+        host = _env("OLLAMA_HOST") or "http://localhost:11434"
+        body = {
+            "model": self.model,
+            "stream": False,
+            "options": {"temperature": 0.0, "num_predict": max_tokens},
+            "messages": [
+                {"role": "system", "content": system},
+                {"role": "user", "content": user},
+            ],
+        }
+        async with httpx.AsyncClient(timeout=self.timeout) as client:
+            resp = await client.post(f"{host.rstrip('/')}/api/chat", json=body)
+            resp.raise_for_status()
+            data = resp.json()
+        text = (data.get("message") or {}).get("content")
+        if not text:
+            raise RuntimeError(f"judge returned no text: {data}")
+        return _first_json_object(text)
diff --git a/fireflyframework_agentic/evaluation/retrieval_metrics.py b/fireflyframework_agentic/evaluation/retrieval_metrics.py
new file mode 100644
index 00000000..7c9c5cfe
--- /dev/null
+++ b/fireflyframework_agentic/evaluation/retrieval_metrics.py
@@ -0,0 +1,176 @@
+# Copyright 2026 Firefly Software Foundation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Deterministic IR evaluation metrics for ranked retrieval results (no LLM, no network).
+
+Each metric is a plain function that takes a list of result rows and returns a
+float — the same design as scikit-learn or MS MARCO evaluation scripts.
+
+Result row schema (dict)::
+
+    {
+        "retrieved": [{"rank": int, "source_id": str, "is_gold": bool}, ...],
+        "gold": [str, ...],          # gold source identifiers
+        # optional:
+        "no_answer": bool,           # model refused / produced no answer
+        "answer": str,               # used for no_answer detection when no_answer absent
+        "citations": [{"is_gold": bool}, ...],
+        "search_ms": float,
+        "answer_ms": float,
+    }
+
+Individual metrics::
+
+    hit_at_k(results, k)        -> float
+    recall_at_k(results, k)     -> float
+    precision_at_k(results, k)  -> float
+    mrr(results, k=10)          -> float
+    map_score(results, k=10)    -> float
+    ndcg(results, k=10)         -> float
+    no_answer_rate(results)     -> float | None
+    citation_precision(results) -> float | None
+    mean_latency_ms(results, field) -> float | None
+"""
+
+from __future__ import annotations
+
+import math
+
+
+def _dedup(retrieved: list[dict]) -> list[dict]:
+    """Return one entry per source, first chunk wins, preserving rank order."""
+    seen: set[str] = set()
+    out: list[dict] = []
+    for r in sorted(retrieved, key=lambda x: x["rank"]):
+        key = r.get("source_id") or "|".join(r.get("identities", []))
+        if key not in seen:
+            seen.add(key)
+            out.append(r)
+    return out
+
+
+def _ndcg_single(retrieved: list[dict], n_gold: int, k: int = 10) -> float:
+    dcg = sum(1.0 / math.log2(r["rank"] + 1) for r in retrieved if r.get("is_gold") and r["rank"] <= k)
+    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(n_gold, k)))
+    return dcg / ideal if ideal else 0.0
+
+
+def _ap_single(retrieved: list[dict], n_gold: int, k: int = 10) -> float:
+    hits, precisions = 0, []
+    for r in sorted(retrieved, key=lambda x: x["rank"]):
+        if r["rank"] > k:
+            break
+        if r.get("is_gold"):
+            hits += 1
+            precisions.append(hits / r["rank"])
+    return sum(precisions) / min(n_gold, k) if n_gold else 0.0
+
+
+def hit_at_k(results: list[dict], k: int) -> float:
+    """Fraction of queries where at least one gold document appears in top-k."""
+    if not results:
+        return 0.0
+    hits = 0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")]
+        if any(g <= k for g in gold_ranks):
+            hits += 1
+    return round(hits / len(results), 4)
+
+
+def recall_at_k(results: list[dict], k: int) -> float:
+    """Mean fraction of gold documents found in top-k."""
+    if not results:
+        return 0.0
+    total = 0.0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        n_gold = max(len(set(row["gold"])), 1)
+        gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")]
+        total += len([g for g in gold_ranks if g <= k]) / n_gold
+    return round(total / len(results), 4)
+
+
+def precision_at_k(results: list[dict], k: int) -> float:
+    """Mean fraction of top-k results that are gold."""
+    if not results:
+        return 0.0
+    total = 0.0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")]
+        total += len([g for g in gold_ranks if g <= k]) / k
+    return round(total / len(results), 4)
+
+
+def mrr(results: list[dict], k: int = 10) -> float:
+    """Mean reciprocal rank of the first gold hit (up to k)."""
+    if not results:
+        return 0.0
+    total = 0.0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        gold_ranks = sorted(r["rank"] for r in retrieved if r.get("is_gold") and r["rank"] <= k)
+        total += 1.0 / gold_ranks[0] if gold_ranks else 0.0
+    return round(total / len(results), 4)
+
+
+def map_score(results: list[dict], k: int = 10) -> float:
+    """Mean average precision at k."""
+    if not results:
+        return 0.0
+    total = 0.0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        n_gold = max(len(set(row["gold"])), 1)
+        total += _ap_single(retrieved, n_gold, k)
+    return round(total / len(results), 4)
+
+
+def ndcg(results: list[dict], k: int = 10) -> float:
+    """Mean normalised discounted cumulative gain at k."""
+    if not results:
+        return 0.0
+    total = 0.0
+    for row in results:
+        retrieved = _dedup(row["retrieved"])
+        n_gold = max(len(set(row["gold"])), 1)
+        total += _ndcg_single(retrieved, n_gold, k)
+    return round(total / len(results), 4)
+
+
+def no_answer_rate(results: list[dict]) -> float | None:
+    """Fraction of queries where the model produced no answer. None if no results."""
+    if not results:
+        return None
+    count = sum(1 for row in results if row.get("no_answer") or not (row.get("answer") or "").strip())
+    return round(count / len(results), 4)
+
+
+def citation_precision(results: list[dict]) -> float | None:
+    """Precision of in-answer citations vs gold set. None if no citations present."""
+    num = den = 0.0
+    for row in results:
+        cites = row.get("citations", [])
+        if cites:
+            num += sum(1 for c in cites if c.get("is_gold"))
+            den += len(cites)
+    return round(num / den, 4) if den else None
+
+
+def mean_latency_ms(results: list[dict], field: str) -> float | None:
+    """Mean latency in ms for the given field (``search_ms`` or ``answer_ms``). None if absent."""
+    values = [row[field] for row in results if row.get(field) is not None]
+    return round(sum(values) / len(values)) if values else None
diff --git a/pyproject.toml b/pyproject.toml
index cceaf667..dc6a1507 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -119,6 +119,12 @@ binary = [
 all = [
     "fireflyframework-agentic[postgres,mongodb,security,embeddings,openai-embeddings,cohere-embeddings,google-embeddings,mistral-embeddings,voyage-embeddings,azure-embeddings,bedrock-embeddings,ollama-embeddings,vectorstores-chroma,vectorstores-pinecone,vectorstores-qdrant,vectorstores-pgvector,vectorstores-sqlite-vec,watch,binary]",
 ]
+evaluation = [
+    "numpy>=1.26.0",
+    "ragas>=0.2",
+    "langchain-anthropic>=0.3",
+    "langchain-ollama>=0.3",
+]
 dev = [
     "pytest>=8.3.0",
     "pytest-asyncio>=0.24.0",
diff --git a/tests/unit/evaluation/__init__.py b/tests/unit/evaluation/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/tests/unit/evaluation/test_judge.py b/tests/unit/evaluation/test_judge.py
new file mode 100644
index 00000000..7f27c125
--- /dev/null
+++ b/tests/unit/evaluation/test_judge.py
@@ -0,0 +1,248 @@
+from unittest.mock import MagicMock
+
+import pytest
+
+from fireflyframework_agentic.evaluation.judge import (
+    EvalContext,
+    addresses_question,
+    contains_answer,
+    excerpt_fill_rate,
+    faithfulness,
+    source_coverage,
+)
+from fireflyframework_agentic.evaluation.judge_client import JudgeClient
+
+
+def make_ctx(responses: list[dict]) -> EvalContext:
+    client = MagicMock(spec=JudgeClient)
+    client.model_spec = "anthropic:claude-sonnet-4-6"
+    client.provider = "anthropic"
+    client.model = "claude-sonnet-4-6"
+    call_iter = iter(responses)
+
+    async def mock_chat_json(system, user, max_tokens=1024):
+        return next(call_iter)
+
+    client.chat_json = mock_chat_json
+    return EvalContext(client=client, runs=1)
+
+
+# ── contains_answer ──────────────────────────────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_contains_answer_present():
+    ctx = make_ctx([{"contains_answer": 1.0, "addresses_question": 1.0}])
+    item = {"question": "Q", "reference": "R", "answer": "A"}
+    score = await contains_answer(item, ctx)
+    assert score == 1.0
+
+
+@pytest.mark.asyncio
+async def test_contains_answer_absent():
+    ctx = make_ctx([{"contains_answer": 0.0, "addresses_question": 0.5}])
+    item = {"question": "Q", "reference": "R", "answer": "wrong"}
+    score = await contains_answer(item, ctx)
+    assert score == 0.0
+
+
+@pytest.mark.asyncio
+async def test_contains_answer_partial():
+    ctx = make_ctx([{"contains_answer": 0.5, "addresses_question": 0.8}])
+    item = {"question": "Q", "reference": "R", "answer": "partial"}
+    score = await contains_answer(item, ctx)
+    assert score == 0.5
+
+
+@pytest.mark.asyncio
+async def test_contains_answer_missing_question_returns_none():
+    ctx = make_ctx([])
+    item = {"reference": "R", "answer": "A"}
+    score = await contains_answer(item, ctx)
+    assert score is None
+
+
+# ── addresses_question ───────────────────────────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_addresses_question_yes():
+    ctx = make_ctx([{"contains_answer": 0.5, "addresses_question": 1.0}])
+    item = {"question": "Q", "reference": "R", "answer": "A"}
+    score = await addresses_question(item, ctx)
+    assert score == 1.0
+
+
+@pytest.mark.asyncio
+async def test_addresses_question_no():
+    ctx = make_ctx([{"contains_answer": 0.0, "addresses_question": 0.0}])
+    item = {"question": "Q", "reference": "R", "answer": "irrelevant"}
+    score = await addresses_question(item, ctx)
+    assert score == 0.0
+
+
+@pytest.mark.asyncio
+async def test_addresses_question_missing_answer_returns_none():
+    ctx = make_ctx([])
+    item = {"question": "Q", "reference": "R"}
+    score = await addresses_question(item, ctx)
+    assert score is None
+
+
+# ── faithfulness ─────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_faithfulness_all_supported():
+    # One finding with cited evidence, judge says SUPPORTED.
+    ctx = make_ctx([{"verdict": "SUPPORTED", "reason": "matches"}])
+    item = {
+        "findings": [
+            {
+                "id": "F1",
+                "description": "The process takes 3 days.",
+                "evidence_refs": [{"evidence_id": "E1"}],
+            }
+        ],
+        "evidence_index": [{"id": "E1", "locator": "doc.pdf#1", "excerpt": "The process takes 3 days as documented."}],
+    }
+    result = await faithfulness(item, ctx)
+    assert result["supported"] == 1
+    assert result["total"] == 1
+    assert result["unsupported_ids"] == []
+
+
+@pytest.mark.asyncio
+async def test_faithfulness_not_supported():
+    ctx = make_ctx([{"verdict": "NOT_SUPPORTED", "reason": "contradicts"}])
+    item = {
+        "findings": [
+            {
+                "id": "F1",
+                "description": "The process takes 45 days.",
+                "evidence_refs": [{"evidence_id": "E1"}],
+            }
+        ],
+        "evidence_index": [{"id": "E1", "locator": "doc.pdf#1", "excerpt": "The process takes 3 days."}],
+    }
+    result = await faithfulness(item, ctx)
+    assert result["supported"] == 0
+    assert result["total"] == 1
+    assert "F1" in result["unsupported_ids"]
+
+
+@pytest.mark.asyncio
+async def test_faithfulness_no_cited_evidence():
+    # A finding with no evidence_refs is counted as unsupported without an LLM call.
+    ctx = make_ctx([])
+    item = {
+        "findings": [{"id": "F1", "description": "Something.", "evidence_refs": []}],
+        "evidence_index": [],
+    }
+    result = await faithfulness(item, ctx)
+    assert result["supported"] == 0
+    assert result["total"] == 1
+    assert "F1" in result["unsupported_ids"]
+
+
+# ── source_coverage ───────────────────────────────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_source_coverage_all_cited():
+    ctx = make_ctx([])
+    item = {
+        "findings": [
+            {
+                "id": "F1",
+                "description": "X",
+                "evidence_refs": [{"evidence_id": "E1"}],
+            }
+        ],
+        "evidence_index": [{"id": "E1", "locator": "doc.pdf#section1", "excerpt": "text"}],
+    }
+    result = await source_coverage(item, ctx)
+    assert result["cited"] == 1
+    assert result["total"] == 1
+    assert result["orphaned"] == []
+
+
+@pytest.mark.asyncio
+async def test_source_coverage_orphaned():
+    ctx = make_ctx([])
+    item = {
+        "findings": [{"id": "F1", "description": "X", "evidence_refs": []}],
+        "evidence_index": [
+            {"id": "E1", "locator": "doc1.pdf#p1", "excerpt": "text"},
+            {"id": "E2", "locator": "doc2.pdf#p2", "excerpt": "text2"},
+        ],
+    }
+    result = await source_coverage(item, ctx)
+    assert result["cited"] == 0
+    assert result["total"] == 2
+    assert len(result["orphaned"]) == 2
+
+
+@pytest.mark.asyncio
+async def test_source_coverage_stem_dedup():
+    # Two evidence items from the same file (different fragments) -> 1 source stem.
+    ctx = make_ctx([])
+    item = {
+        "findings": [
+            {
+                "id": "F1",
+                "description": "X",
+                "evidence_refs": [{"evidence_id": "E1"}],
+            }
+        ],
+        "evidence_index": [
+            {"id": "E1", "locator": "doc.pdf#section1", "excerpt": "text1"},
+            {"id": "E2", "locator": "doc.pdf#section2", "excerpt": "text2"},
+        ],
+    }
+    result = await source_coverage(item, ctx)
+    # Both E1 and E2 share "doc.pdf" stem -> 1 total stem.
+    assert result["total"] == 1
+    # E1 is cited -> that stem is covered.
+    assert result["cited"] == 1
+
+
+# ── excerpt_fill_rate ──────────────────────────────────────────────────────────────
+
+
+@pytest.mark.asyncio
+async def test_excerpt_fill_rate_full():
+    ctx = make_ctx([])
+    item = {
+        "evidence_index": [
+            {"id": "E1", "excerpt": "has content"},
+            {"id": "E2", "excerpt": "also has content"},
+        ]
+    }
+    result = await excerpt_fill_rate(item, ctx)
+    assert result["populated"] == 2
+    assert result["total"] == 2
+
+
+@pytest.mark.asyncio
+async def test_excerpt_fill_rate_partial():
+    ctx = make_ctx([])
+    item = {
+        "evidence_index": [
+            {"id": "E1", "excerpt": "has content"},
+            {"id": "E2", "excerpt": ""},
+            {"id": "E3", "excerpt": "   "},
+        ]
+    }
+    result = await excerpt_fill_rate(item, ctx)
+    assert result["populated"] == 1
+    assert result["total"] == 3
+
+
+@pytest.mark.asyncio
+async def test_excerpt_fill_rate_empty():
+    ctx = make_ctx([])
+    item = {"evidence_index": []}
+    result = await excerpt_fill_rate(item, ctx)
+    assert result["populated"] == 0
+    assert result["total"] == 0
diff --git a/tests/unit/evaluation/test_retrieval_metrics.py b/tests/unit/evaluation/test_retrieval_metrics.py
new file mode 100644
index 00000000..fa453e2d
--- /dev/null
+++ b/tests/unit/evaluation/test_retrieval_metrics.py
@@ -0,0 +1,181 @@
+# Copyright 2026 Firefly Software Foundation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for evaluation.retrieval_metrics."""
+
+from __future__ import annotations
+
+from fireflyframework_agentic.evaluation.retrieval_metrics import (
+    citation_precision,
+    hit_at_k,
+    map_score,
+    mean_latency_ms,
+    mrr,
+    ndcg,
+    no_answer_rate,
+    precision_at_k,
+    recall_at_k,
+)
+
+
+def _row(gold_rank: int | None = None, total: int = 5, n_gold: int = 1) -> dict:
+    retrieved = []
+    for rank in range(1, total + 1):
+        retrieved.append({"rank": rank, "source_id": f"doc-{rank}", "is_gold": rank == gold_rank})
+    gold_ids = [f"doc-{gold_rank}"] if gold_rank is not None else []
+    return {"retrieved": retrieved, "gold": gold_ids * n_gold}  # n_gold > 1 repeats the same gold id
+
+
+# ── hit_at_k ──────────────────────────────────────────────────────────────────
+
+
+def test_hit_at_k_gold_at_rank1():
+    assert hit_at_k([_row(gold_rank=1)], k=1) == 1.0
+
+
+def test_hit_at_k_miss_at_rank1():
+    assert hit_at_k([_row(gold_rank=2)], k=1) == 0.0
+
+
+def test_hit_at_k_gold_at_rank5():
+    assert hit_at_k([_row(gold_rank=5)], k=5) == 1.0
+
+
+def test_hit_at_k_gold_at_rank10():
+    assert hit_at_k([_row(gold_rank=10, total=10)], k=10) == 1.0
+
+
+def test_hit_at_k_empty():
+    assert hit_at_k([], k=5) == 0.0
+
+
+# ── recall_at_k ───────────────────────────────────────────────────────────────
+
+
+def test_recall_at_k_full_when_gold_at_rank1():
+    assert recall_at_k([_row(gold_rank=1, n_gold=1)], k=1) == 1.0
+
+
+def test_recall_at_k_zero_when_gold_outside_k():
+    assert recall_at_k([_row(gold_rank=5)], k=1) == 0.0
+
+
+def test_recall_at_k_non_decreasing_with_k():
+    rows = [_row(gold_rank=3)]
+    assert recall_at_k(rows, k=1) <= recall_at_k(rows, k=5) <= recall_at_k(rows, k=10)
+
+
+# ── precision_at_k ────────────────────────────────────────────────────────────
+
+
+def test_precision_at_k_gold_at_rank1():
+    assert precision_at_k([_row(gold_rank=1)], k=1) == 1.0
+
+
+def test_precision_at_k_decreases_when_k_larger():
+    rows = [_row(gold_rank=1)]
+    assert precision_at_k(rows, k=5) < precision_at_k(rows, k=1)
+
+
+# ── mrr ───────────────────────────────────────────────────────────────────────
+
+
+def test_mrr_gold_at_rank1():
+    assert mrr([_row(gold_rank=1)]) == 1.0
+
+
+def test_mrr_gold_at_rank2():
+    assert abs(mrr([_row(gold_rank=2)]) - 0.5) < 1e-9
+
+
+def test_mrr_no_gold():
+    assert mrr([_row(gold_rank=None)]) == 0.0
+
+
+def test_mrr_average_across_queries():
+    rows = [_row(gold_rank=1), _row(gold_rank=2)]
+    assert abs(mrr(rows) - 0.75) < 1e-9
+
+
+# ── ndcg ──────────────────────────────────────────────────────────────────────
+
+
+def test_ndcg_gold_at_rank1():
+    assert abs(ndcg([_row(gold_rank=1, n_gold=1)]) - 1.0) < 1e-9
+
+
+def test_ndcg_less_than_1_when_not_at_rank1():
+    score = ndcg([_row(gold_rank=3, n_gold=1)])
+    assert 0.0 < score < 1.0
+
+
+def test_ndcg_zero_when_no_gold():
+    assert ndcg([_row(gold_rank=None)]) == 0.0
+
+
+# ── map_score ─────────────────────────────────────────────────────────────────
+
+
+def test_map_score_perfect_when_gold_at_rank1():
+    assert map_score([_row(gold_rank=1, n_gold=1)]) == 1.0
+
+
+def test_map_score_zero_when_no_gold():
+    assert map_score([_row(gold_rank=None)]) == 0.0
+
+
+# ── no_answer_rate ────────────────────────────────────────────────────────────
+
+
+def test_no_answer_rate_zero_when_answer_present():
+    rows = [{**_row(gold_rank=1), "answer": "some answer"}]
+    assert no_answer_rate(rows) == 0.0
+
+
+def test_no_answer_rate_one_when_no_answer_field():
+    assert no_answer_rate([_row(gold_rank=1)]) == 1.0
+
+
+def test_no_answer_rate_none_when_empty():
+    assert no_answer_rate([]) is None
+
+
+# ── citation_precision ────────────────────────────────────────────────────────
+
+
+def test_citation_precision_none_when_no_citations():
+    assert citation_precision([_row(gold_rank=1)]) is None
+
+
+def test_citation_precision_1_when_all_gold():
+    rows = [{**_row(gold_rank=1), "citations": [{"is_gold": True}, {"is_gold": True}]}]
+    assert citation_precision(rows) == 1.0
+
+
+def test_citation_precision_half_when_half_gold():
+    rows = [{**_row(gold_rank=1), "citations": [{"is_gold": True}, {"is_gold": False}]}]
+    assert citation_precision(rows) == 0.5
+
+
+# ── mean_latency_ms ───────────────────────────────────────────────────────────
+
+
+def test_mean_latency_none_when_field_absent():
+    assert mean_latency_ms([_row(gold_rank=1)], "search_ms") is None
+
+
+def test_mean_latency_computed_when_present():
+    rows = [{**_row(gold_rank=1), "search_ms": 100.0, "answer_ms": 200.0}]
+    assert mean_latency_ms(rows, "search_ms") == 100
+    assert mean_latency_ms(rows, "answer_ms") == 200
diff --git a/uv.lock b/uv.lock
index 374cca9f..364552a9 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1222,6 +1222,10 @@ dev = [
 embeddings = [
     { name = "numpy" },
 ]
+evaluation = [
+    { name = "numpy" },
+    { name = "scipy" },
+]
 google-embeddings = [
     { name = "google-generativeai" },
 ]
@@ -1292,6 +1296,7 @@ requires-dist = [
     { name = "mistralai", marker = "extra == 'mistral-embeddings'", specifier = ">=1.0.0" },
     { name = "motor", marker = "extra == 'mongodb'", specifier = ">=3.6.0" },
     { name = "numpy", marker = "extra == 'embeddings'", specifier = ">=1.26.0" },
+    { name = "numpy", marker = "extra == 'evaluation'", specifier = ">=1.26.0" },
     { name = "numpy", marker = "extra == 'reasoning-eval'", specifier = ">=2.0.0" },
     { name = "openai", marker = "extra == 'azure-embeddings'", specifier = ">=1.0.0" },
     { name = "openai", marker = "extra == 'openai-embeddings'", specifier = ">=1.0.0" },
@@ -1317,13 +1322,14 @@ requires-dist = [
     { name = "python-dotenv", specifier = ">=1.0.0" },
     { name = "qdrant-client", marker = "extra == 'vectorstores-qdrant'", specifier = ">=1.12.0" },
     { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.9.0" },
+    { name = "scipy", marker = "extra == 'evaluation'", specifier = ">=1.11" },
     { name = "sqlalchemy", marker = "extra == 'postgres'", specifier = ">=2.0.0" },
     { name = "sqlite-vec", marker = "extra == 'vectorstores-sqlite-vec'", specifier = ">=0.1.6" },
     { name = "testcontainers", marker = "extra == 'dev'", specifier = ">=4.10.0" },
     { name = "voyageai", marker = "extra == 'voyage-embeddings'", specifier = ">=0.3.0" },
     { name = "watchfiles", marker = "extra == 'watch'", specifier = ">=0.24.0" },
 ]
-provides-extras = ["postgres", "mongodb", "security", "embeddings", "openai-embeddings", "cohere-embeddings", "google-embeddings", "mistral-embeddings", "voyage-embeddings", "azure-embeddings", "bedrock-embeddings", "ollama-embeddings", "reasoning-eval", "vectorstores-chroma", "vectorstores-sqlite-vec", "vectorstores-pinecone", "vectorstores-qdrant", "vectorstores-pgvector", "watch", "binary", "all", "dev"]
+provides-extras = ["postgres", "mongodb", "security", "embeddings", "openai-embeddings", "cohere-embeddings", "google-embeddings", "mistral-embeddings", "voyage-embeddings", "azure-embeddings", "bedrock-embeddings", "ollama-embeddings", "reasoning-eval", "vectorstores-chroma", "vectorstores-sqlite-vec", "vectorstores-pinecone", "vectorstores-qdrant", "vectorstores-pgvector", "watch", "binary", "all", "evaluation", "dev"]
 
 [[package]]
 name = "flatbuffers"
@@ -4502,6 +4508,57 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/87/72/c6c32d2b657fa3dad1de340254e14390b1e334ce38268b7ad51abda3c8c2/s3transfer-0.17.0-py3-none-any.whl", hash = "sha256:ce3801712acf4ad3e89fb9990df97b4972e93f4b3b0004d214be5bce12814c20", size = 86811, upload-time = "2026-04-29T22:07:34.966Z" },
 ]
 
+[[package]]
+name = "scipy"
+version = "1.17.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "numpy" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/7a/97/5a3609c4f8d58b039179648e62dd220f89864f56f7357f5d4f45c29eb2cc/scipy-1.17.1.tar.gz", hash = "sha256:95d8e012d8cb8816c226aef832200b1d45109ed4464303e997c5b13122b297c0", size = 30573822, upload-time = "2026-02-23T00:26:24.851Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/76/27/07ee1b57b65e92645f219b37148a7e7928b82e2b5dbeccecb4dff7c64f0b/scipy-1.17.1-cp313-cp313-macosx_10_14_x86_64.whl", hash = "sha256:5e3c5c011904115f88a39308379c17f91546f77c1667cea98739fe0fccea804c", size = 31590199, upload-time = "2026-02-23T00:19:17.192Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/ae/db19f8ab842e9b724bf5dbb7db29302a91f1e55bc4d04b1025d6d605a2c5/scipy-1.17.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:6fac755ca3d2c3edcb22f479fceaa241704111414831ddd3bc6056e18516892f", size = 28154001, upload-time = "2026-02-23T00:19:22.241Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/58/3ce96251560107b381cbd6e8413c483bbb1228a6b919fa8652b0d4090e7f/scipy-1.17.1-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:7ff200bf9d24f2e4d5dc6ee8c3ac64d739d3a89e2326ba68aaf6c4a2b838fd7d", size = 20325719, upload-time = "2026-02-23T00:19:26.329Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/83/15087d945e0e4d48ce2377498abf5ad171ae013232ae31d06f336e64c999/scipy-1.17.1-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:4b400bdc6f79fa02a4d86640310dde87a21fba0c979efff5248908c6f15fad1b", size = 22683595, upload-time = "2026-02-23T00:19:30.304Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/e0/e58fbde4a1a594c8be8114eb4aac1a55bcd6587047efc18a61eb1f5c0d30/scipy-1.17.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2b64ca7d4aee0102a97f3ba22124052b4bd2152522355073580bf4845e2550b6", size = 32896429, upload-time = "2026-02-23T00:19:35.536Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/5f/f17563f28ff03c7b6799c50d01d5d856a1d55f2676f537ca8d28c7f627cd/scipy-1.17.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:581b2264fc0aa555f3f435a5944da7504ea3a065d7029ad60e7c3d1ae09c5464", size = 35203952, upload-time = "2026-02-23T00:19:42.259Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/a5/9afd17de24f657fdfe4df9a3f1ea049b39aef7c06000c13db1530d81ccca/scipy-1.17.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:beeda3d4ae615106d7094f7e7cef6218392e4465cc95d25f900bebabfded0950", size = 34979063, upload-time = "2026-02-23T00:19:47.547Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/13/88b1d2384b424bf7c924f2038c1c409f8d88bb2a8d49d097861dd64a57b2/scipy-1.17.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6609bc224e9568f65064cfa72edc0f24ee6655b47575954ec6339534b2798369", size = 37598449, upload-time = "2026-02-23T00:19:53.238Z" },
+    { url = "https://files.pythonhosted.org/packages/35/e5/d6d0e51fc888f692a35134336866341c08655d92614f492c6860dc45bb2c/scipy-1.17.1-cp313-cp313-win_amd64.whl", hash = "sha256:37425bc9175607b0268f493d79a292c39f9d001a357bebb6b88fdfaff13f6448", size = 36510943, upload-time = "2026-02-23T00:20:50.89Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/fd/3be73c564e2a01e690e19cc618811540ba5354c67c8680dce3281123fb79/scipy-1.17.1-cp313-cp313-win_arm64.whl", hash = "sha256:5cf36e801231b6a2059bf354720274b7558746f3b1a4efb43fcf557ccd484a87", size = 24545621, upload-time = "2026-02-23T00:20:55.871Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/6b/17787db8b8114933a66f9dcc479a8272e4b4da75fe03b0c282f7b0ade8cd/scipy-1.17.1-cp313-cp313t-macosx_10_14_x86_64.whl", hash = "sha256:d59c30000a16d8edc7e64152e30220bfbd724c9bbb08368c054e24c651314f0a", size = 31936708, upload-time = "2026-02-23T00:19:58.694Z" },
+    { url = "https://files.pythonhosted.org/packages/38/2e/524405c2b6392765ab1e2b722a41d5da33dc5c7b7278184a8ad29b6cb206/scipy-1.17.1-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:010f4333c96c9bb1a4516269e33cb5917b08ef2166d5556ca2fd9f082a9e6ea0", size = 28570135, upload-time = "2026-02-23T00:20:03.934Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/c3/5bd7199f4ea8556c0c8e39f04ccb014ac37d1468e6cfa6a95c6b3562b76e/scipy-1.17.1-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:2ceb2d3e01c5f1d83c4189737a42d9cb2fc38a6eeed225e7515eef71ad301dce", size = 20741977, upload-time = "2026-02-23T00:20:07.935Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/b8/8ccd9b766ad14c78386599708eb745f6b44f08400a5fd0ade7cf89b6fc93/scipy-1.17.1-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:844e165636711ef41f80b4103ed234181646b98a53c8f05da12ca5ca289134f6", size = 23029601, upload-time = "2026-02-23T00:20:12.161Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/a0/3cb6f4d2fb3e17428ad2880333cac878909ad1a89f678527b5328b93c1d4/scipy-1.17.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:158dd96d2207e21c966063e1635b1063cd7787b627b6f07305315dd73d9c679e", size = 33019667, upload-time = "2026-02-23T00:20:17.208Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/c3/2d834a5ac7bf3a0c806ad1508efc02dda3c8c61472a56132d7894c312dea/scipy-1.17.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:74cbb80d93260fe2ffa334efa24cb8f2f0f622a9b9febf8b483c0b865bfb3475", size = 35264159, upload-time = "2026-02-23T00:20:23.087Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/77/d3ed4becfdbd217c52062fafe35a72388d1bd82c2d0ba5ca19d6fcc93e11/scipy-1.17.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:dbc12c9f3d185f5c737d801da555fb74b3dcfa1a50b66a1a93e09190f41fab50", size = 35102771, upload-time = "2026-02-23T00:20:28.636Z" },
+    { url = "https://files.pythonhosted.org/packages/bd/12/d19da97efde68ca1ee5538bb261d5d2c062f0c055575128f11a2730e3ac1/scipy-1.17.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:94055a11dfebe37c656e70317e1996dc197e1a15bbcc351bcdd4610e128fe1ca", size = 37665910, upload-time = "2026-02-23T00:20:34.743Z" },
+    { url = "https://files.pythonhosted.org/packages/06/1c/1172a88d507a4baaf72c5a09bb6c018fe2ae0ab622e5830b703a46cc9e44/scipy-1.17.1-cp313-cp313t-win_amd64.whl", hash = "sha256:e30bdeaa5deed6bc27b4cc490823cd0347d7dae09119b8803ae576ea0ce52e4c", size = 36562980, upload-time = "2026-02-23T00:20:40.575Z" },
+    { url = "https://files.pythonhosted.org/packages/70/b0/eb757336e5a76dfa7911f63252e3b7d1de00935d7705cf772db5b45ec238/scipy-1.17.1-cp313-cp313t-win_arm64.whl", hash = "sha256:a720477885a9d2411f94a93d16f9d89bad0f28ca23c3f8daa521e2dcc3f44d49", size = 24856543, upload-time = "2026-02-23T00:20:45.313Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/83/333afb452af6f0fd70414dc04f898647ee1423979ce02efa75c3b0f2c28e/scipy-1.17.1-cp314-cp314-macosx_10_14_x86_64.whl", hash = "sha256:a48a72c77a310327f6a3a920092fa2b8fd03d7deaa60f093038f22d98e096717", size = 31584510, upload-time = "2026-02-23T00:21:01.015Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/a6/d05a85fd51daeb2e4ea71d102f15b34fedca8e931af02594193ae4fd25f7/scipy-1.17.1-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:45abad819184f07240d8a696117a7aacd39787af9e0b719d00285549ed19a1e9", size = 28170131, upload-time = "2026-02-23T00:21:05.888Z" },
+    { url = "https://files.pythonhosted.org/packages/db/7b/8624a203326675d7746a254083a187398090a179335b2e4a20e2ddc46e83/scipy-1.17.1-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:3fd1fcdab3ea951b610dc4cef356d416d5802991e7e32b5254828d342f7b7e0b", size = 20342032, upload-time = "2026-02-23T00:21:09.904Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/35/2c342897c00775d688d8ff3987aced3426858fd89d5a0e26e020b660b301/scipy-1.17.1-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:7bdf2da170b67fdf10bca777614b1c7d96ae3ca5794fd9587dce41eb2966e866", size = 22678766, upload-time = "2026-02-23T00:21:14.313Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/f2/7cdb8eb308a1a6ae1e19f945913c82c23c0c442a462a46480ce487fdc0ac/scipy-1.17.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:adb2642e060a6549c343603a3851ba76ef0b74cc8c079a9a58121c7ec9fe2350", size = 32957007, upload-time = "2026-02-23T00:21:19.663Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/2e/7eea398450457ecb54e18e9d10110993fa65561c4f3add5e8eccd2b9cd41/scipy-1.17.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:eee2cfda04c00a857206a4330f0c5e3e56535494e30ca445eb19ec624ae75118", size = 35221333, upload-time = "2026-02-23T00:21:25.278Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/77/5b8509d03b77f093a0d52e606d3c4f79e8b06d1d38c441dacb1e26cacf46/scipy-1.17.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d2650c1fb97e184d12d8ba010493ee7b322864f7d3d00d3f9bb97d9c21de4068", size = 35042066, upload-time = "2026-02-23T00:21:31.358Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/df/18f80fb99df40b4070328d5ae5c596f2f00fffb50167e31439e932f29e7d/scipy-1.17.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:08b900519463543aa604a06bec02461558a6e1cef8fdbb8098f77a48a83c8118", size = 37612763, upload-time = "2026-02-23T00:21:37.247Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/39/f0e8ea762a764a9dc52aa7dabcfad51a354819de1f0d4652b6a1122424d6/scipy-1.17.1-cp314-cp314-win_amd64.whl", hash = "sha256:3877ac408e14da24a6196de0ddcace62092bfc12a83823e92e49e40747e52c19", size = 37290984, upload-time = "2026-02-23T00:22:35.023Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/56/fe201e3b0f93d1a8bcf75d3379affd228a63d7e2d80ab45467a74b494947/scipy-1.17.1-cp314-cp314-win_arm64.whl", hash = "sha256:f8885db0bc2bffa59d5c1b72fad7a6a92d3e80e7257f967dd81abb553a90d293", size = 25192877, upload-time = "2026-02-23T00:22:39.798Z" },
+    { url = "https://files.pythonhosted.org/packages/96/ad/f8c414e121f82e02d76f310f16db9899c4fcde36710329502a6b2a3c0392/scipy-1.17.1-cp314-cp314t-macosx_10_14_x86_64.whl", hash = "sha256:1cc682cea2ae55524432f3cdff9e9a3be743d52a7443d0cba9017c23c87ae2f6", size = 31949750, upload-time = "2026-02-23T00:21:42.289Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/b0/c741e8865d61b67c81e255f4f0a832846c064e426636cd7de84e74d209be/scipy-1.17.1-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:2040ad4d1795a0ae89bfc7e8429677f365d45aa9fd5e4587cf1ea737f927b4a1", size = 28585858, upload-time = "2026-02-23T00:21:47.706Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/1b/3985219c6177866628fa7c2595bfd23f193ceebbe472c98a08824b9466ff/scipy-1.17.1-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:131f5aaea57602008f9822e2115029b55d4b5f7c070287699fe45c661d051e39", size = 20757723, upload-time = "2026-02-23T00:21:52.039Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/19/2a04aa25050d656d6f7b9e7b685cc83d6957fb101665bfd9369ca6534563/scipy-1.17.1-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:9cdc1a2fcfd5c52cfb3045feb399f7b3ce822abdde3a193a6b9a60b3cb5854ca", size = 23043098, upload-time = "2026-02-23T00:21:56.185Z" },
+    { url = "https://files.pythonhosted.org/packages/86/f1/3383beb9b5d0dbddd030335bf8a8b32d4317185efe495374f134d8be6cce/scipy-1.17.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6e3dcd57ab780c741fde8dc68619de988b966db759a3c3152e8e9142c26295ad", size = 33030397, upload-time = "2026-02-23T00:22:01.404Z" },
+    { url = "https://files.pythonhosted.org/packages/41/68/8f21e8a65a5a03f25a79165ec9d2b28c00e66dc80546cf5eb803aeeff35b/scipy-1.17.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a9956e4d4f4a301ebf6cde39850333a6b6110799d470dbbb1e25326ac447f52a", size = 35281163, upload-time = "2026-02-23T00:22:07.024Z" },
+    { url = "https://files.pythonhosted.org/packages/84/8d/c8a5e19479554007a5632ed7529e665c315ae7492b4f946b0deb39870e39/scipy-1.17.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:a4328d245944d09fd639771de275701ccadf5f781ba0ff092ad141e017eccda4", size = 35116291, upload-time = "2026-02-23T00:22:12.585Z" },
+    { url = "https://files.pythonhosted.org/packages/52/52/e57eceff0e342a1f50e274264ed47497b59e6a4e3118808ee58ddda7b74a/scipy-1.17.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:a77cbd07b940d326d39a1d1b37817e2ee4d79cb30e7338f3d0cddffae70fcaa2", size = 37682317, upload-time = "2026-02-23T00:22:18.513Z" },
+    { url = "https://files.pythonhosted.org/packages/11/2f/b29eafe4a3fbc3d6de9662b36e028d5f039e72d345e05c250e121a230dd4/scipy-1.17.1-cp314-cp314t-win_amd64.whl", hash = "sha256:eb092099205ef62cd1782b006658db09e2fed75bffcae7cc0d44052d8aa0f484", size = 37345327, upload-time = "2026-02-23T00:22:24.442Z" },
+    { url = "https://files.pythonhosted.org/packages/07/39/338d9219c4e87f3e708f18857ecd24d22a0c3094752393319553096b98af/scipy-1.17.1-cp314-cp314t-win_arm64.whl", hash = "sha256:200e1050faffacc162be6a486a984a0497866ec54149a01270adc8a59b7c7d21", size = 25489165, upload-time = "2026-02-23T00:22:29.563Z" },
+]
+
 [[package]]
 name = "secretstorage"
 version = "3.5.0"