diff --git a/.github/workflows/pr-gate.yml b/.github/workflows/pr-gate.yml index d9767c01..399b45ca 100644 --- a/.github/workflows/pr-gate.yml +++ b/.github/workflows/pr-gate.yml @@ -57,7 +57,7 @@ jobs: - uses: actions/setup-python@v6 with: python-version: '3.13' - - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings + - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra openai-embeddings --extra evaluation - run: uv run pyright test: @@ -72,7 +72,7 @@ jobs: - uses: actions/setup-python@v6 with: python-version: '3.13' - - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings + - run: uv sync --extra dev --extra binary --extra vectorstores-sqlite-vec --extra vectorstores-pgvector --extra openai-embeddings --extra evaluation - run: uv run pytest -m "not nightly" --cov --cov-report=term-missing build: diff --git a/README.md b/README.md index 9d005b23..904237da 100644 --- a/README.md +++ b/README.md @@ -412,6 +412,12 @@ classDiagram `EvalDataset` loads/saves test cases from JSON. `ModelComparison` runs the same prompts across multiple agents for side-by-side analysis. +- **Evaluation** — Quality gates (G1–G5), LLM-as-judge advisory scoring, + champion/challenger tracking, and deterministic retrieval metrics for assessing + agent and pipeline outputs. The `flyeval` CLI drives the full gate pipeline from + the command line. Install with `pip install "fireflyframework-agentic[evaluation]"`. + See [docs/evaluation.md](docs/evaluation.md) for the full guide. 
+ > **Optional developer tooling.** `fireflyframework_agentic.experiments` (A/B > experiments) and `fireflyframework_agentic.lab` (offline evaluation / > benchmarking) are leaf modules — nothing in the core imports them and they add @@ -817,6 +823,7 @@ Detailed guides for each module: - [Security](docs/security.md) — Prompt/output guards, at-rest encryption - [Experiments](docs/experiments.md) — A/B testing, variant comparison - [Lab](docs/lab.md) — Benchmarks, datasets, evaluators +- [Evaluation](docs/evaluation.md) — Gate pipeline, flyeval CLI, champion/challenger, retrieval metrics - Studio — moved to [fireflyframework-agentic-studio](https://github.com/fireflyframework/fireflyframework-agentic-studio) --- diff --git a/docs/evaluation.md b/docs/evaluation.md new file mode 100644 index 00000000..c2abe319 --- /dev/null +++ b/docs/evaluation.md @@ -0,0 +1,435 @@ +# Evaluation Guide + +Copyright 2026 Firefly Software Foundation. Licensed under the Apache License 2.0. + +The Evaluation subpackage provides gate-based quality gates, LLM-as-judge advisory scoring, +champion/challenger tracking, and deterministic retrieval metrics for assessing agent outputs. + +--- + +## Concepts + +### Gate pipeline + +The evaluation framework runs **five gates** in sequence. Every gate always runs — a failed +gate raises a *flag*, not a veto, so the scorecard always carries the complete picture. + +| Gate | Name | Kind | Description | +|------|------|------|-------------| +| G1 | Structural & Safe | Deterministic | Schema validity, PII non-disclosure, empty-registry guard. | +| G2 | Must-finds & Negative Controls | Deterministic | Lexical/semantic recall against the must-find registry; NC precision. | +| G3 | Evidence (Grounding) | Deterministic | Excerpt-to-corpus anchoring; fabricated-evidence detection. | +| G4 | LLM-as-a-Judge | Advisory (non-blocking) | Semantic faithfulness, entailment, gap detection — never changes the verdict. 
| +| G5 | No-regression / Promotion | Human decision | Champion/challenger comparison with A/A noise band; collects sign-offs. | + +**No gate vetoes.** Failures append to the `GateResult` flags list and scoring continues. +The scorecard carries every signal regardless of which gates fired. + +### GateResult + +`GateResult` is a dataclass returned by each gate: + +```python +@dataclass +class GateResult: + gate: str # "G1", "G2", …, "G5" + passed: bool + reason_code: str = "" # e.g. "SCHEMA_INVALID", "NC_HIT", "UNGROUNDED" + details: dict = field(default_factory=dict) +``` + +`str(gate_result)` prints `[G2] PASS` or `[G2] FLAG:NC_HIT`. + +### Verdict + +`verdict(gate_results)` returns `VERDICT_PROMOTE` or `VERDICT_HOLD`: + +- `VERDICT_PROMOTE` — all gates passed **and** G5 (the human sign-off gate) is present. +- `VERDICT_HOLD` — any gate flagged, or G5 is missing. + +The CLI exits `0` on PROMOTE and `1` on HOLD, so it composes into CI. + +### Must-find registry + +A registry (`lean-1` schema) is a JSON file listing items the discovery output is +expected to surface (`tier` L0–L3) and negative controls (NC) it must *not* assert. + +```json +{ + "schema_version": "lean-1", + "corpus": "banca-cordobesa", + "items": [ + { "id": "ao-pep-4eyes", "tier": "L0", "scope": "decision", + "description": "PEP cases require a second analyst sign-off (4-eyes)", + "keywords": ["PEP", "4-eyes"], + "evidence": ["SOP-002-kyc-edd.md"] }, + { "id": "ao-nc-realtime", "tier": "NC", "scope": "finding", + "description": "KYC-Hub synchronises in real time — factually false" } + ] +} +``` + +Tier semantics: L0 = must-find control (a single miss flags the run), L1 = high-priority, +L2 = important, L3 = nice-to-have (not counted in the recall floor). + +### Advisory judge (G4) + +G4 calls a chat LLM (or local Ollama model) for semantic checks the deterministic gates +cannot perform: faithfulness, entailment, numeric/temporal fidelity, actionability, +fabricated-entity detection, and more. 
It is: + +- **Non-blocking** — `AdvisoryReport` is carried separately and never enters `verdict()`. +- **Non-deterministic** — each metric runs `judge_runs` times (default: 3) and the + median score is reported. +- **Opt-in** — pass `--judge-model provider:model` to activate it; omit the flag to skip. + +### Champion/challenger pattern + +Champions are **per-corpus**. `ChampionRecord` persists the best-known run so that +promotion decisions are made against a stable, signed baseline rather than the last run. + +``` + ┌──────────────────────────────────────────┐ + │ run result JSON (challenger) │ + └──────────────┬───────────────────────────┘ + │ + ┌───────────────▼───────────────┐ + │ G1 · G2 · G3 (deterministic) │ + │ G4 (advisory, opt-in) │ + └───────────────┬───────────────┘ + │ flags + scores + ┌───────────────▼───────────────┐ + │ G5 — no-regression vs │ + │ champion baseline + A/A band │ + └───────────────┬───────────────┘ + │ + ┌───────────────▼───────────────┐ + │ Markdown scorecard │ + │ PROMOTE / HOLD │ + └───────────────────────────────┘ +``` + +`invalidate_champion()` marks a baseline invalid. The `EMPTY_MUST_FIND` guard in G1 +prevents a fake-100% champion being created against an empty registry. + +--- + +## Installation + +The evaluation subpackage requires `scipy` and `numpy`. Install the optional extra: + +```bash +pip install "fireflyframework-agentic[evaluation]" +``` + +The `flyeval` CLI entry-point is registered automatically by the package. Verify: + +```bash +flyeval --version +``` + +--- + +## CLI + +All subcommands exit `0` on PROMOTE and `1` on HOLD. + +### `flyeval gate` + +Run the full gate pipeline against a result JSON and print a Markdown scorecard. 
+ +```bash +flyeval gate \ + --result runs/2026-06-18/output.json \ + --registry registries/banca-cordobesa.json \ + --baseline baselines/banca-cordobesa.json \ + --judge-model anthropic:claude-3-5-haiku \ + --judge-runs 3 +``` + +Key flags: + +| Flag | Default | Description | +|------|---------|-------------| +| `--result` | required | Path to the run's `output.json`. | +| `--registry` | required | Must-find registry (lean-1 JSON). | +| `--baseline` | — | Champion baseline JSON for G5 regression check. | +| `--judge-model` | — | `provider:model` for G4 advisory judge. | +| `--judge-runs` | 3 | Number of independent judge calls (median aggregation). | +| `--no-judge` | — | Skip G4 entirely. | +| `--recall-floor` | 0.70 | Minimum G2 recall before flagging. | +| `--grounding-floor` | 0.90 | Minimum G3 grounding rate before flagging. | +| `--corpus` | — | Path to the evidence corpus bundle for G3 verification. | +| `--pii-list` | — | Path to a JSON array of names to scan for PII leaks (G1). | +| `--embedder` | — | `provider:model` for semantic recall (G2 embedding path). | +| `--model-id` | "unknown" | Identifier of the model under evaluation (for scorecard). | + +### `flyeval aa-band` + +Compute the A/A noise band from multiple repeated runs of the same model to establish +the noise floor before setting up the champion comparison. + +```bash +flyeval aa-band \ + --results runs/aa-run-1/output.json runs/aa-run-2/output.json runs/aa-run-3/output.json \ + --registry registries/banca-cordobesa.json +``` + +The command prints per-metric variance and recommended noise floors. + +### `flyeval day-zero` + +Promote the very first champion for a corpus (Day-Zero protocol). Requires at least +`--signoffs` sign-offs (default: 2) before PROMOTE is issued. 
+ +```bash +flyeval day-zero \ + --result runs/2026-06-18/output.json \ + --registry registries/banca-cordobesa.json \ + --baseline baselines/banca-cordobesa.json \ + --signoffs 2 +``` + +The command writes the new `ChampionRecord` into `--baseline` on success. + +### `flyeval invalidate` + +Mark the current champion invalid with a documented reason. Use this when the registry +changes in a way that makes the existing champion incommensurable. + +```bash +flyeval invalidate \ + --baseline baselines/banca-cordobesa.json \ + --reason "Registry expanded from 39 to 94 items (lean-1 v2)." +``` + +--- + +## Python API + +### Running gates + +```python +import json +from fireflyframework_agentic.evaluation import ( + run_gates, + render_scorecard, + verdict, + load_registry, + VERDICT_PROMOTE, +) + +result = json.loads(open("runs/2026-06-18/output.json").read()) +registry = load_registry("registries/banca-cordobesa.json") + +gate_results = run_gates(result, registry) +scorecard_md = render_scorecard( + gate_results, + corpus="banca-cordobesa", + model_id="anthropic:claude-3-5-sonnet", + run_id="2026-06-18-sonnet-01", +) +print(scorecard_md) + +v = verdict(gate_results) +print("Verdict:", v) # "PROMOTE" or "HOLD" +assert v == VERDICT_PROMOTE +``` + +### Champion management + +```python +from fireflyframework_agentic.evaluation import ( + load_champion, + save_champion, + invalidate_champion, + ChampionRecord, +) + +# Load the current champion (returns None on Day Zero). +champ = load_champion("baselines/banca-cordobesa.json") +if champ is None: + print("Day Zero — no champion yet.") +else: + print(f"Champion: {champ.run_id} | {champ.primary_metric()}={champ.primary_score():.3f}") + +# Save a new champion after a successful PROMOTE. 
+new_champ = ChampionRecord( + corpus="banca-cordobesa", + run_id="2026-06-18-sonnet-01", + model_id="anthropic:claude-3-5-sonnet", + registry_sha256=registry.sha256(), + scores={"lexical_recall": 0.857, "grounding_pct": 0.941}, + human_sign_offs=["alice", "bob"], +) +save_champion("baselines/banca-cordobesa.json", new_champ) + +# Invalidate when the registry changes materially. +invalidate_champion( + "baselines/banca-cordobesa.json", + reason="Registry expanded from 39 to 94 items.", +) +``` + +### EvalConfig + +`EvalConfig` is a Pydantic model that captures the parameters of a single evaluation run. +Use it to build reproducible, serialisable run records. + +```python +from fireflyframework_agentic.evaluation.models import EvalConfig + +cfg = EvalConfig( + model_id="anthropic:claude-3-5-sonnet", + corpus="banca-cordobesa", + run_id="2026-06-18-sonnet-01", + registry_path="registries/banca-cordobesa.json", + corpus_path="corpora/banca-cordobesa/", + baseline_path="baselines/banca-cordobesa.json", + judge_model="anthropic:claude-3-5-haiku", + judge_runs=3, +) +print(cfg.model_dump_json(indent=2)) +``` + +### Advisory judge (G4) + +```python +from fireflyframework_agentic.evaluation import run_judge, JudgeClient, build_embedder + +client = JudgeClient( + chat_fn=my_chat_fn, # callable(system: str, user: str) -> dict + embed_fn=build_embedder("ollama:bge-m3"), +) + +advisory = run_judge( + result=result, + registry=registry, + client=client, + runs=3, + missed_ids=[], # IDs the deterministic G2 missed — judge tries to recover them +) +print(advisory.scores) # dict of metric -> float +print(advisory.errors) # any metrics that failed (best-effort, never raises) +``` + +--- + +## Retrieval Metrics + +The `compute_retrieval_metrics()` function computes standard IR metrics over ranked +retrieval results. It is imported from `fireflyframework_agentic.lab.retrieval_metrics` +and re-exported by the evaluation package. 
+ +Supported metrics at cut-offs k ∈ {1, 5, 10}: + +- **Hit@k** — at least one gold document in top-k. +- **Recall@k** — fraction of gold documents in top-k. +- **Precision@k** — fraction of top-k results that are gold. +- **MRR@10** — mean reciprocal rank of the first gold hit. +- **MAP@10** — mean average precision. +- **nDCG@10** — normalised discounted cumulative gain. + +```python +from fireflyframework_agentic.evaluation import compute_retrieval_metrics, RetrieverMetrics + +# Each row is a query; each row's "retrieved" list is ranked (rank=1 is top). +rows = [ + { + "query": "KYC enhanced due diligence steps", + "gold": ["SOP-002-kyc-edd.md"], + "retrieved": [ + {"rank": 1, "source_id": "SOP-002-kyc-edd.md", "is_gold": True}, + {"rank": 2, "source_id": "SOP-001-account-opening.md", "is_gold": False}, + {"rank": 3, "source_id": "INT-002-KYC-Jaime.md", "is_gold": True}, + ], + }, +] + +metrics: RetrieverMetrics = compute_retrieval_metrics(rows) +print(f"Recall@5: {metrics.recall_5:.3f}") +print(f"nDCG@10: {metrics.ndcg_10:.3f}") +print(f"MRR@10: {metrics.mrr_10:.3f}") +``` + +`RetrieverMetrics` also carries optional fields when the raw rows include them: +`no_answer_rate`, `citation_precision`, `mean_search_ms`, `mean_answer_ms`. 
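The metric definitions listed above can be illustrated for a single query with a small standalone sketch. This is not the library's implementation: the helper names (`recall_at_k_single`, `reciprocal_rank`, `ndcg_at_k_single`) are hypothetical, relevance is assumed to be binary and decided by membership in the `gold` list, and the packaged functions may handle ties, duplicates, or averaging across queries differently.

```python
import math


def recall_at_k_single(gold, retrieved, k):
    """Fraction of gold documents that appear in the top-k results."""
    top_k_ids = {r["source_id"] for r in retrieved if r["rank"] <= k}
    return len(top_k_ids & set(gold)) / len(gold) if gold else 0.0


def reciprocal_rank(gold, retrieved, k=10):
    """1 / rank of the first gold hit within the top-k, or 0.0 if none."""
    gold_set = set(gold)
    for r in sorted(retrieved, key=lambda item: item["rank"]):
        if r["rank"] > k:
            break
        if r["source_id"] in gold_set:
            return 1.0 / r["rank"]
    return 0.0


def ndcg_at_k_single(gold, retrieved, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    gold_set = set(gold)
    dcg = sum(
        1.0 / math.log2(r["rank"] + 1)
        for r in retrieved
        if r["rank"] <= k and r["source_id"] in gold_set
    )
    # Ideal ranking: every gold document placed at the top positions.
    ideal_hits = min(len(gold_set), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Illustrative row (made-up IDs): two gold documents, found at ranks 1 and 3.
row = {
    "gold": ["doc-a", "doc-b"],
    "retrieved": [
        {"rank": 1, "source_id": "doc-a"},
        {"rank": 2, "source_id": "doc-x"},
        {"rank": 3, "source_id": "doc-b"},
    ],
}

print(recall_at_k_single(row["gold"], row["retrieved"], 1))  # 0.5
print(reciprocal_rank(row["gold"], row["retrieved"]))        # 1.0
print(round(ndcg_at_k_single(row["gold"], row["retrieved"]), 4))
```

Averaging `reciprocal_rank` over all queries would give MRR@10 under these assumptions; the same per-query-then-mean pattern applies to the other metrics.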
+ +--- + +## Architecture + +```mermaid +flowchart TD + R["result JSON\n(DiscoveryResult / output.json)"] + REG["Registry JSON\n(lean-1 must-find)"] + CORP["Corpus bundle\n(raw evidence documents)"] + BASE["Baseline JSON\n(champion record)"] + + R --> G1["G1 · Structural & Safe\n(schema, PII, empty-registry)"] + REG --> G1 + R --> G2["G2 · Recall & NC Precision\n(lexical + optional semantic)"] + REG --> G2 + R --> G3["G3 · Grounding\n(excerpt anchoring, fabrication)"] + CORP --> G3 + R --> G4["G4 · LLM Judge advisory\n(faithfulness, entailment, gaps)"] + REG --> G4 + G1 --> SC["Markdown Scorecard\nrender_scorecard()"] + G2 --> SC + G3 --> SC + G4 -.advisory.-> SC + BASE --> G5["G5 · No-regression\n(A/A band, sign-offs)"] + G1 --> G5 + G2 --> G5 + G3 --> G5 + G5 --> SC + SC --> V["verdict()\nPROMOTE / HOLD"] + V --> CHAMP["save_champion()\nor invalidate_champion()"] +``` + +--- + +## Reference + +### Exports + +All symbols below are importable from `fireflyframework_agentic.evaluation`. + +| Symbol | Kind | Description | +|--------|------|-------------| +| `EvalConfig` | Pydantic model | Parameters for a single evaluation run. | +| `GateResult` | Dataclass | Result of one gate: `gate`, `passed`, `reason_code`, `details`. | +| `Verdict` | Constants class | `Verdict.PROMOTE`, `Verdict.HOLD`. | +| `VERDICT_PROMOTE` | `str` | `"PROMOTE"`. | +| `VERDICT_HOLD` | `str` | `"HOLD"`. | +| `run_gates()` | Function | Run all four deterministic gates (G1–G3, G5 shape) and return results. | +| `g2_recall_precision()` | Function | Run only G2 (recall + NC precision) and return `GateResult`. | +| `verdict()` | Function | Derive PROMOTE/HOLD from a list of `GateResult`. | +| `render_scorecard()` | Function | Render a Markdown scorecard from gate results and metadata. | +| `ChampionRecord` | Dataclass | Per-corpus champion metadata and scores. | +| `load_champion()` | Function | Load the current champion from `baseline.json`; returns `None` on Day Zero. 
| +| `save_champion()` | Function | Persist a new champion to `baseline.json`. | +| `invalidate_champion()` | Function | Mark the champion invalid with a reason string. | +| `AdvisoryReport` | Dataclass | G4 judge output: `scores`, `errors`, `raw`. | +| `run_judge()` | Function | Run the LLM-as-a-Judge advisory pass. | +| `JudgeClient` | Dataclass | Holds `chat_fn` and `embed_fn` for the judge. | +| `OllamaEmbedder` | Class | Local Ollama embedding callable (default BGE-M3). | +| `build_embedder()` | Function | Factory: `"ollama:bge-m3"` → `OllamaEmbedder`. | +| `cosine()` | Function | Cosine similarity between two numpy vectors. | +| `Registry` | Dataclass | Parsed must-find registry with real items and NC items. | +| `RegistryItem` | Dataclass | One must-find or NC item: `id`, `tier`, `scope`, `description`, …. | +| `load_registry()` | Function | Parse and validate a lean-1 registry JSON file. | +| `registry_sha256()` | Function | SHA-256 of a registry file path. | +| `load_corpus()` | Function | Load and index a corpus bundle for G3 evidence verification. | +| `corpus_sha256()` | Function | SHA-256 of a corpus directory or bundle. | +| `verify_evidence_index()` | Function | Check each `evidence_index` entry against the corpus. | +| `EMPTY` / `FABRICATED` / `SOURCE_UNKNOWN` / `VERIFIED` | `str` | Evidence verification status constants. | +| `RetrieverMetrics` | Pydantic model | IR metrics: `recall_k`, `precision_k`, `ndcg_10`, `mrr_10`, `map_10`. | +| `compute_retrieval_metrics()` | Function | Compute IR metrics from a list of ranked-retrieval result rows. | +| `anchored()` | Function | True if claim and evidence share at least one non-trivial token. | +| `matches()` | Function | Gate predicate: does a candidate match a registry item? | +| `source_stem()` | Function | Normalise a `locator` path to its file stem for dedup. | +| `tokens()` | Function | Tokenise text to a list of lowercase word strings. 
| +| `aa_band()` | Function | Compute per-metric A/A noise floor from repeated runs. | +| `aggregate_grounding()` | Function | Summarise grounding stats across a result's findings. | +| `left_skew_flag()` | Function | True when the score distribution is left-skewed (over-optimistic). | diff --git a/examples/flycanon_eval_example.py b/examples/flycanon_eval_example.py new file mode 100644 index 00000000..30e66bd1 --- /dev/null +++ b/examples/flycanon_eval_example.py @@ -0,0 +1,376 @@ +# Copyright 2026 Firefly Software Foundation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""FlyCanon evaluation example — RAG retrieval benchmark with champion/challenger tracking. + +Demonstrates how to use ``fireflyframework_agentic.evaluation`` to replicate +the flycanon experiment evaluation workflow: + +1. Load a results JSONL file produced by a flycanon retrieval pipeline. +2. Compute deterministic IR metrics (Recall@k, Precision@k, MRR, nDCG, MAP). +3. Compare against a saved baseline to detect regression. +4. Print a formatted metrics table. +5. Offer to promote the new run to champion when it beats the baseline. + +The champion/challenger pattern mirrors the flycanon_experiments harness: +each run writes metrics to a file; ``approve`` promotes it by repointing +baseline.json. Here we replicate that flow using the framework's +individual retrieval metric functions directly. 
+ +Usage:: + + # Score a results file (no baseline comparison) + python examples/flycanon_eval_example.py --results-file results.jsonl + + # Compare against a saved baseline + python examples/flycanon_eval_example.py \\ + --results-file results.jsonl \\ + --baseline baseline.json + + # Promote if better (write new champion to baseline.json) + python examples/flycanon_eval_example.py \\ + --results-file results.jsonl \\ + --baseline baseline.json \\ + --promote-if-better + +Exit codes: 0 = scored successfully, 1 = empty results file, or regression +detected vs baseline when ``--promote-if-better`` is not set. + +Results JSONL format +-------------------- +Each line is a JSON object representing one query's retrieval result:: + + { + "question": "What was Apple's revenue in Q4 2023?", + "gold": ["AAPL_10K_2023", "AAPL_10Q_Q4_2023"], + "retrieved": [ + {"rank": 1, "source_id": "AAPL_10K_2023", "is_gold": true}, + {"rank": 2, "source_id": "MSFT_10K_2023", "is_gold": false}, + {"rank": 3, "source_id": "AAPL_10Q_Q4_2023", "is_gold": true} + ], + "answer": "Apple's revenue in Q4 2023 was $89.5 billion.", + "no_answer": false, + "citations": [ + {"source_id": "AAPL_10K_2023", "is_gold": true} + ], + "search_ms": 142, + "answer_ms": 2310 + } + +The ``gold`` list contains the source IDs that are considered correct answers. +Each entry in ``retrieved`` must have a 1-based ``rank``, ``source_id`` (or +``identities`` list), and ``is_gold`` bool. + +Baseline JSON format +-------------------- +A flat JSON object with metric names as keys and numeric values:: + + { + "ndcg@10": 0.7234, + "mrr@10": 0.6891, + "recall@10": 0.8120, + "hit@10": 0.9100, + "map@10": 0.6543, + "n_queries": 200 + } + +This is the same flat format written by ``--promote-if-better``. 
+""" + +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path + +from fireflyframework_agentic.evaluation import ( + citation_precision, + hit_at_k, + map_score, + mean_latency_ms, + mrr, + ndcg, + no_answer_rate, + precision_at_k, + recall_at_k, +) + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +# Metrics that form the primary quality signal for champion/challenger +# comparisons. These are listed in priority order: nDCG@10 is the primary +# ranking metric; MRR@10 measures how quickly the first gold result appears; +# Recall@10 measures overall coverage; Hit@10 measures binary success rate; +# MAP@10 measures precision across the ranked list. +PRIMARY_METRICS = ["ndcg@10", "mrr@10", "recall@10", "hit@10", "map@10"] + +# Regression threshold: a metric must drop by more than this fraction of its +# baseline value to be flagged as a regression (guards against noise). 
+REGRESSION_THRESHOLD = 0.01 + + +def _load_jsonl(path: str) -> list[dict]: + """Load a newline-delimited JSON file, one object per line.""" + lines = Path(path).read_text(encoding="utf-8").strip().splitlines() + return [json.loads(line) for line in lines if line.strip()] + + +def _load_baseline(path: str) -> dict | None: + """Load a baseline JSON file, returning None if it does not exist.""" + p = Path(path) + if not p.exists(): + return None + return json.loads(p.read_text(encoding="utf-8")) + + +def _save_baseline(path: str, metrics: dict) -> None: + """Write a flat metrics dict to the baseline JSON file.""" + Path(path).write_text(json.dumps(metrics, indent=2, ensure_ascii=False) + "\n", encoding="utf-8") + + +def _compute_metrics(results: list[dict]) -> dict: + """Compute all IR metrics and return a flat dict.""" + return { + "n_queries": len(results), + "hit@1": hit_at_k(results, 1), + "hit@5": hit_at_k(results, 5), + "hit@10": hit_at_k(results, 10), + "recall@1": recall_at_k(results, 1), + "recall@5": recall_at_k(results, 5), + "recall@10": recall_at_k(results, 10), + "precision@1": precision_at_k(results, 1), + "precision@5": precision_at_k(results, 5), + "precision@10": precision_at_k(results, 10), + "mrr@10": mrr(results), + "map@10": map_score(results), + "ndcg@10": ndcg(results), + "no_answer_rate": no_answer_rate(results), + "citation_precision": citation_precision(results), + "mean_search_ms": mean_latency_ms(results, "search_ms"), + "mean_answer_ms": mean_latency_ms(results, "answer_ms"), + } + + +def _print_metrics_table(flat: dict, baseline: dict | None) -> None: + """Print a formatted table comparing current metrics vs baseline.""" + + col_w = 22 + num_w = 10 + header = f"{'Metric':<{col_w}} {'Current':>{num_w}}" + if baseline: + header += f" {'Baseline':>{num_w}} {'Delta':>{num_w}}" + print(header) + print("-" * (col_w + num_w + (num_w * 2 + 2 if baseline else 0))) + + for key, value in flat.items(): + if value is None: + continue + # Format 
floats as 4 decimal places; ints as plain integers. + cur_str = f"{value:.4f}" if isinstance(value, float) else str(value) + + row = f"{key:<{col_w}} {cur_str:>{num_w}}" + if baseline and key in baseline and isinstance(value, float): + base_val = baseline[key] + delta = value - base_val + delta_str = f"{delta:+.4f}" + row += f" {base_val:>{num_w}.4f} {delta_str:>{num_w}}" + print(row) + + print() + + +def _detect_regressions(flat: dict, baseline: dict) -> list[str]: + """Return the names of primary metrics that regressed vs baseline. + + A regression is flagged when the new value drops by more than + REGRESSION_THRESHOLD * baseline_value (relative threshold). This + guards against flagging noise as a regression. + """ + regressions = [] + for key in PRIMARY_METRICS: + new_val = flat.get(key) + base_val = baseline.get(key) + if new_val is None or base_val is None: + continue + if base_val > 0 and (base_val - new_val) / base_val > REGRESSION_THRESHOLD: + regressions.append(key) + return regressions + + +def _beats_baseline(flat: dict, baseline: dict) -> bool: + """Return True if the new metrics beat the baseline. + + 'Beats' means no primary metric has regressed beyond REGRESSION_THRESHOLD + AND at least one primary metric has strictly improved. A run that merely + equals the baseline does not beat it. + """ + regressions = _detect_regressions(flat, baseline) + if regressions: + return False + # Check for at least one improvement. 
+ for key in PRIMARY_METRICS: + new_val = flat.get(key) + base_val = baseline.get(key) + if new_val is not None and base_val is not None and new_val > base_val: + return True + return False + + +# --------------------------------------------------------------------------- +# Main evaluation flow +# --------------------------------------------------------------------------- + + +def run_evaluation(args: argparse.Namespace) -> int: + """Run retrieval metric scoring and optional champion/challenger comparison.""" + + # ------------------------------------------------------------------ + # Step 1 — Load results from the JSONL file. + # + # Each line is one query's retrieval result. The file is produced by + # a flycanon pipeline run (runner.run_queries writes results.jsonl). + # ------------------------------------------------------------------ + print(f"Loading results : {args.results_file}") + results = _load_jsonl(args.results_file) + print(f" {len(results)} query results loaded.") + + if not results: + print("ERROR: results file is empty.", file=sys.stderr) + return 1 + + # ------------------------------------------------------------------ + # Step 2 — Compute deterministic IR metrics. 
+ # + # Metrics are computed at cut-offs k ∈ {1, 5, 10} and include: + # hit@k -- at least one gold doc in top-k (binary) + # recall@k -- fraction of gold docs found in top-k + # precision@k -- fraction of top-k that are gold + # mrr@10 -- mean reciprocal rank of first gold hit + # map@10 -- mean average precision + # ndcg@10 -- normalised discounted cumulative gain + # ------------------------------------------------------------------ + print("\nComputing retrieval metrics ...") + flat = _compute_metrics(results) + + print(f" nDCG@10 : {flat['ndcg@10']:.4f}") + print(f" MRR@10 : {flat['mrr@10']:.4f}") + print(f" Recall@10 : {flat['recall@10']:.4f}") + print(f" Hit@10 : {flat['hit@10']:.4f}") + print(f" MAP@10 : {flat['map@10']:.4f}") + + # ------------------------------------------------------------------ + # Step 3 — Load the baseline (champion) for regression detection. + # ------------------------------------------------------------------ + baseline = None + if args.baseline: + baseline = _load_baseline(args.baseline) + if baseline: + print(f"\nLoaded baseline : {args.baseline}") + else: + print(f"\nNo baseline found at {args.baseline} — first run, no comparison.") + + # ------------------------------------------------------------------ + # Step 4 — Print the full metrics table. + # ------------------------------------------------------------------ + print("\n" + "=" * 56) + print("Retrieval Metrics") + print("=" * 56) + _print_metrics_table(flat, baseline) + + # ------------------------------------------------------------------ + # Step 5 — Regression check. + # + # Compare against the baseline on primary metrics. A regression yields + # HOLD with exit code 1. With --promote-if-better the script exits 0 + # instead, but a regressed run is still never promoted. 
+ # ------------------------------------------------------------------ + + if baseline: + regressions = _detect_regressions(flat, baseline) + if regressions: + print(f"REGRESSION detected on: {', '.join(regressions)}") + print(f" Threshold: {REGRESSION_THRESHOLD * 100:.0f}% relative drop on any primary metric.") + else: + better = _beats_baseline(flat, baseline) + if better: + print("Challenger BEATS baseline on at least one primary metric.") + else: + print("Challenger is on-par with baseline (no regression, no improvement).") + + if regressions and not args.promote_if_better: + print("\nVerdict: HOLD — regression detected. Tune the pipeline and re-run.") + return 1 + + # ------------------------------------------------------------------ + # Step 6 — Champion promotion. + # + # When --promote-if-better is set and the metrics beat the baseline + # (or no baseline exists yet), save the new metrics as the champion. + # Future runs will compare against this updated record. + # ------------------------------------------------------------------ + if args.promote_if_better and args.baseline: + if baseline is None or _beats_baseline(flat, baseline): + _save_baseline(args.baseline, flat) + print(f"\nChampion PROMOTED — metrics saved to {args.baseline}") + else: + print("\nNot promoted — challenger did not beat baseline on primary metrics.") + + print("\nVerdict: PROMOTE" if not (baseline and _detect_regressions(flat, baseline)) else "\nVerdict: HOLD") + return 0 + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def build_parser() -> argparse.ArgumentParser: + p = argparse.ArgumentParser( + prog="flycanon_eval_example", + description=( + "FlyCanon RAG retrieval benchmark — computes IR metrics from a results JSONL " + "and compares against a champion baseline." 
+ ), + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + ) + p.add_argument( + "--results-file", + required=True, + help="Path to results.jsonl produced by the flycanon pipeline.", + ) + p.add_argument( + "--baseline", + default=None, + help=("Path to baseline.json (champion store). When absent, scores are printed without comparison."), + ) + p.add_argument( + "--promote-if-better", + action="store_true", + help=( + "When set, write new metrics to baseline.json if the challenger beats the " + "champion on primary metrics or no baseline file exists yet. " + "Has no effect when --baseline is omitted." + ), + ) + return p + + +def main() -> None: + parser = build_parser() + args = parser.parse_args() + sys.exit(run_evaluation(args)) + + +if __name__ == "__main__": + main() diff --git a/fireflyframework_agentic/__init__.py b/fireflyframework_agentic/__init__.py index 993b0248..1736f1f4 100644 --- a/fireflyframework_agentic/__init__.py +++ b/fireflyframework_agentic/__init__.py @@ -24,6 +24,13 @@ config = get_config() print(config.default_model) + +Optional subpackages (not imported eagerly at the top level): + fireflyframework_agentic.lab -- sessions, benchmarks, datasets, evaluation orchestration + fireflyframework_agentic.experiments -- experiment tracking and comparison + fireflyframework_agentic.evaluation -- quality gates, LLM-as-judge advisory, + champion/challenger tracking, retrieval metrics + (requires the ``evaluation`` optional extra) """ from importlib.metadata import PackageNotFoundError, version diff --git a/fireflyframework_agentic/evaluation/__init__.py b/fireflyframework_agentic/evaluation/__init__.py new file mode 100644 index 00000000..35dd32f7 --- /dev/null +++ b/fireflyframework_agentic/evaluation/__init__.py @@ -0,0 +1,111 @@ +from fireflyframework_agentic.evaluation.judge import ( + AdvisoryReport as AdvisoryReport, +) +from fireflyframework_agentic.evaluation.judge import ( + EvalContext as EvalContext, +) +from fireflyframework_agentic.evaluation.judge import ( 
+ Metric as Metric, +) +from fireflyframework_agentic.evaluation.judge import ( + actionability as actionability, +) +from fireflyframework_agentic.evaluation.judge import ( + addresses_question as addresses_question, +) +from fireflyframework_agentic.evaluation.judge import ( + answer_correctness as answer_correctness, +) +from fireflyframework_agentic.evaluation.judge import ( + answer_relevancy as answer_relevancy, +) +from fireflyframework_agentic.evaluation.judge import ( + citation_relevance as citation_relevance, +) +from fireflyframework_agentic.evaluation.judge import ( + comparative_vs_champion as comparative_vs_champion, +) +from fireflyframework_agentic.evaluation.judge import ( + contains_answer as contains_answer, +) +from fireflyframework_agentic.evaluation.judge import ( + context_precision as context_precision, +) +from fireflyframework_agentic.evaluation.judge import ( + context_recall as context_recall, +) +from fireflyframework_agentic.evaluation.judge import ( + contradiction as contradiction, +) +from fireflyframework_agentic.evaluation.judge import ( + excerpt_fill_rate as excerpt_fill_rate, +) +from fireflyframework_agentic.evaluation.judge import ( + fabricated_entity as fabricated_entity, +) +from fireflyframework_agentic.evaluation.judge import ( + faithfulness as faithfulness, +) +from fireflyframework_agentic.evaluation.judge import ( + nc_semantic_precision as nc_semantic_precision, +) +from fireflyframework_agentic.evaluation.judge import ( + numeric_temporal_fidelity as numeric_temporal_fidelity, +) +from fireflyframework_agentic.evaluation.judge import ( + open_gap as open_gap, +) +from fireflyframework_agentic.evaluation.judge import ( + ragas_faithfulness as ragas_faithfulness, +) +from fireflyframework_agentic.evaluation.judge import ( + run_judge as run_judge, +) +from fireflyframework_agentic.evaluation.judge import ( + semantic_recovery as semantic_recovery, +) +from fireflyframework_agentic.evaluation.judge import ( + 
severity_calibration as severity_calibration, +) +from fireflyframework_agentic.evaluation.judge import ( + source_coverage as source_coverage, +) +from fireflyframework_agentic.evaluation.judge import ( + surface_deduplication as surface_deduplication, +) +from fireflyframework_agentic.evaluation.judge_client import ( + JudgeClient as JudgeClient, +) +from fireflyframework_agentic.evaluation.judge_client import ( + parse_model as parse_model, +) +from fireflyframework_agentic.evaluation.judge_client import ( + same_provider as same_provider, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + citation_precision as citation_precision, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + hit_at_k as hit_at_k, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + map_score as map_score, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + mean_latency_ms as mean_latency_ms, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + mrr as mrr, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + ndcg as ndcg, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + no_answer_rate as no_answer_rate, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + precision_at_k as precision_at_k, +) +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + recall_at_k as recall_at_k, +) diff --git a/fireflyframework_agentic/evaluation/judge.py b/fireflyframework_agentic/evaluation/judge.py new file mode 100644 index 00000000..d5bcad66 --- /dev/null +++ b/fireflyframework_agentic/evaluation/judge.py @@ -0,0 +1,890 @@ +"""Evaluation judge — async metrics for flyradar and flycanon pipelines. 
+ +Every metric: async def metric_name(item: dict, ctx: EvalContext) -> dict | float | None + +Flyradar item keys: findings, evidence_index, process_graph, proposed_actions, + workspace, reports, lexical_missed_ids, nc_items, champion +Flycanon item keys: question, answer, reference, contexts +""" + +from __future__ import annotations + +import asyncio +import math +import os +import statistics +from collections.abc import Awaitable, Callable +from dataclasses import dataclass, field + +from pydantic import BaseModel, ConfigDict + +from fireflyframework_agentic.embeddings.providers.ollama import OllamaEmbedder +from fireflyframework_agentic.embeddings.similarity import cosine_similarity +from fireflyframework_agentic.evaluation.judge_client import JudgeClient, same_provider + +Metric = Callable[["dict", "EvalContext"], Awaitable["dict | float | None"]] + +SYSTEM = "You are a meticulous evaluator of a process-mining discovery report. Return ONLY a JSON object." + +SYSTEM_RAG = "You are an evaluator of a RAG system's answers. Return ONLY a JSON object." + +RUBRIC = ( + "Score the ANSWER on two metrics:\n" + "- contains_answer (0.0-1.0): Does the answer contain the correct information from the REFERENCE?\n" + "- addresses_question (0.0-1.0): Does the answer directly address what the QUESTION is asking?\n" + 'Reply with ONLY {"contains_answer": , "addresses_question": }.' +) + + +class EvalContext(BaseModel): + model_config = ConfigDict(arbitrary_types_allowed=True) + + client: JudgeClient + embedder: OllamaEmbedder | None = None + runs: int = 3 + + +@dataclass +class AdvisoryReport: + """The G4 output: a plain metrics bag, never a GateResult. + + metrics maps metric-name -> small dict (the per-metric summary). details + carries supporting context (counts, ids). errors lists per-metric failures + captured by run_judge's best-effort try/except so nothing propagates. 
+ """ + + judge_model: str + same_provider_caveat: bool + calibrated: bool # ALWAYS False for now + runs: int + metrics: dict = field(default_factory=dict) + details: dict = field(default_factory=dict) + errors: list[str] = field(default_factory=list) + + +# ── shared accessors ─────────────────────────────────────────────────────────── + + +def _evidence_index(item: dict) -> dict[str, dict]: + return {ev.get("id"): ev for ev in item.get("evidence_index", []) if ev.get("id")} + + +def _cited_excerpts(finding: dict, evidence_index: dict[str, dict]) -> list[str]: + """Excerpts of the evidence a finding cites (via evidence_refs.evidence_id).""" + out: list[str] = [] + for ref in finding.get("evidence_refs", []): + ev = evidence_index.get(ref.get("evidence_id", "")) + if ev: + excerpt = ev.get("excerpt") or "" + if excerpt: + out.append(excerpt) + return out + + +def _output_text(item: dict) -> str: + """All free text the model emitted: finding titles+descriptions + reports.""" + parts: list[str] = [] + for f in item.get("findings", []): + parts.append(f.get("title", "")) + parts.append(f.get("description", "")) + for r in item.get("reports", []): + parts.append(str(r)) + return "\n".join(p for p in parts if p) + + +def _workspace_intention(item: dict) -> str: + ws = item.get("workspace") or {} + return f"{ws.get('name', '')}\n{ws.get('description', '')}".strip() + + +def _coerce_float(value, default=None): + """Coerce a model-returned number/numeric-string to float; total (never raises).""" + try: + return float(value) + except (TypeError, ValueError): + return default + + +def _source_stem(locator: str) -> str: + """Return the part before the first '#', or the full string if no '#'.""" + idx = locator.find("#") + return locator[:idx] if idx != -1 else locator + + +async def _gather_chat(chat_fn, prompts: list[tuple[str, str]]) -> list[dict]: + """Run a list of (system, user) prompts concurrently, returning ordered results.""" + results = await 
asyncio.gather(*[chat_fn(s, u) for s, u in prompts], return_exceptions=True) + return [r if isinstance(r, dict) else {} for r in results] + + +# ── [D] DETERMINISTIC — no LLM, always available ──────────────────────────────── + + +async def source_coverage(item: dict, ctx: EvalContext) -> dict: # noqa: ARG001 + """Distinct source documents cited by >=1 finding vs all source documents. + + Returns {cited, total, orphaned} where orphaned is the sorted list of + source stems present in evidence_index but cited by no finding. + """ + ev_idx = _evidence_index(item) + all_stems = {_source_stem(ev.get("locator", "")) for ev in item.get("evidence_index", []) if ev.get("locator")} + cited_stems: set[str] = set() + for f in item.get("findings", []): + for ref in f.get("evidence_refs", []): + ev = ev_idx.get(ref.get("evidence_id", "")) + if ev and ev.get("locator"): + cited_stems.add(_source_stem(ev["locator"])) + cited_stems &= all_stems + orphaned = sorted(all_stems - cited_stems) + return {"cited": len(cited_stems), "total": len(all_stems), "orphaned": orphaned} + + +async def excerpt_fill_rate(item: dict, ctx: EvalContext) -> dict: # noqa: ARG001 + """Fraction of evidence_index entries with a non-empty excerpt. + + Returns {populated, total}. + """ + entries = item.get("evidence_index", []) + populated = sum(1 for ev in entries if (ev.get("excerpt") or "").strip()) + return {"populated": populated, "total": len(entries)} + + +# ── [E] EMBEDDING — needs embedder ─────────────────────────────────────────────── + + +async def semantic_recovery(item: dict, ctx: EvalContext, tau: float = 0.70) -> dict | None: + """Context-recall: recover lexical misses by embedding similarity. + + Reads item["lexical_missed_ids"] (list of str). + Returns None if ctx.embedder is None. 
+ """ + if ctx.embedder is None: + return None + + lexical_missed_ids: list[str] = item.get("lexical_missed_ids", []) + missed = set(lexical_missed_ids or []) + + # Build the scored items from nc_items (non-NC = real items for recall) + # In the new EvalContext model, nc_items is a list of {"id": ..., "description": ...} + # We treat all item findings as the candidate surface; nc_items stay separate. + # Recompute as: all items scored = those not in nc_items ids. + # If there's no registry concept, we use findings as the denominator proxy. + # But keep the logic simple: just score the missed items against finding descriptions. + ev_idx = _evidence_index(item) + candidate_texts: list[str] = [] + for f in item.get("findings", []): + desc = f.get("description", "") + if desc: + candidate_texts.append(desc) + candidate_texts.extend(_cited_excerpts(f, ev_idx)) + + # missed_items: we only know their IDs; we need descriptions to embed. + # In the new design, if no descriptions available, return minimal result. 
+ all_findings = item.get("findings", []) + denom = max(len(all_findings), 1) + lexical_hits = sum(1 for f in all_findings if f.get("id") not in missed) + + missed_descs: list[tuple[str, str]] = [ + (f.get("id", ""), f.get("description", "")) + for f in all_findings + if f.get("id") in missed and f.get("description") + ] + + if not missed_descs or not candidate_texts: + recovered_recall = lexical_hits / denom + return { + "lexical_recall": round(lexical_hits / denom, 4), + "recovered_recall": round(recovered_recall, 4), + "recovered": [], + "tau": tau, + "scored_denominator": denom, + } + + item_texts = [desc for _fid, desc in missed_descs] + item_vecs = await ctx.embedder._embed_batch(item_texts) + cand_vecs = await ctx.embedder._embed_batch(candidate_texts) + + recovered: list[dict] = [] + for (fid, _desc), ivec in zip(missed_descs, item_vecs, strict=False): + best = max((cosine_similarity(ivec, cvec) for cvec in cand_vecs), default=0.0) + if best >= tau: + recovered.append({"id": fid, "cosine": round(best, 4)}) + + recovered_recall = (lexical_hits + len(recovered)) / denom + return { + "lexical_recall": round(lexical_hits / denom, 4), + "recovered_recall": round(recovered_recall, 4), + "recovered": recovered, + "tau": tau, + "scored_denominator": denom, + } + + +# ── [J] JUDGE — needs chat_fn(system, user) -> dict ────────────────────────────── + + +async def faithfulness(item: dict, ctx: EvalContext) -> dict: + """Entailment: does each finding's cited evidence SUPPORT its claim? + + Returns {supported, total, unsupported_ids}. 
+ """ + ev_idx = _evidence_index(item) + findings = item.get("findings", []) + cited = [(f, _cited_excerpts(f, ev_idx)) for f in findings] + prompts = [ + ( + SYSTEM, + "Does the cited evidence span ENTAIL the claim made in this finding?\n" + 'Reply with ONLY {"verdict": "SUPPORTED" or "NOT_SUPPORTED", "reason": ""}.\n\n' + f"FINDING: {f.get('description', '')}\n" + f"CITED EVIDENCE: {' || '.join(excerpts)}", + ) + for f, excerpts in cited + if excerpts + ] + answers = iter(await _gather_chat(ctx.client.chat_json, prompts)) + supported = 0 + unsupported_ids: list[str] = [] + for f, excerpts in cited: + fid = f.get("id", "?") + if not excerpts: + unsupported_ids.append(fid) + continue + verdict = str(next(answers).get("verdict", "")).upper() + if verdict == "SUPPORTED": + supported += 1 + else: + unsupported_ids.append(fid) + return {"supported": supported, "total": len(findings), "unsupported_ids": unsupported_ids} + + +async def numeric_temporal_fidelity(item: dict, ctx: EvalContext) -> dict: + """Flag numbers/dates asserted in a finding that do NOT match its evidence. + + Returns {mismatches: [{finding_id, value, source}], count}. + """ + ev_idx = _evidence_index(item) + scored = [(f, excerpts) for f in item.get("findings", []) if (excerpts := _cited_excerpts(f, ev_idx))] + prompts = [ + ( + SYSTEM, + "List every specific number or date asserted in the FINDING that does " + "NOT match the CITED EVIDENCE.\n" + 'Reply with ONLY {"mismatches": [{"value": "", "source": ""}]}. 
' + "Empty list if all match.\n\n" + f"FINDING: {f.get('description', '')}\n" + f"CITED EVIDENCE: {' || '.join(excerpts)}", + ) + for f, excerpts in scored + ] + answers = await _gather_chat(ctx.client.chat_json, prompts) + mismatches: list[dict] = [] + for (f, _excerpts), answer in zip(scored, answers, strict=False): + for m in answer.get("mismatches", []) or []: + mismatches.append( + { + "finding_id": f.get("id", "?"), + "value": m.get("value", ""), + "source": m.get("source", ""), + } + ) + return {"mismatches": mismatches, "count": len(mismatches)} + + +async def citation_relevance(item: dict, ctx: EvalContext) -> dict: + """Context precision: fraction of cited passages actually relevant to the claim. + + Returns {precision, relevant, total}. + """ + ev_idx = _evidence_index(item) + prompts: list[tuple[str, str]] = [] + for f in item.get("findings", []): + desc = f.get("description", "") + for ref in f.get("evidence_refs", []): + ev = ev_idx.get(ref.get("evidence_id", "")) + if not ev: + continue + excerpt = ev.get("excerpt") or "" + if not excerpt: + continue + prompts.append( + ( + SYSTEM, + "Is this cited passage actually relevant to / used by this claim?\n" + 'Reply with ONLY {"relevant": "yes" or "no"}.\n\n' + f"CLAIM: {desc}\n" + f"CITED PASSAGE: {excerpt}", + ) + ) + answers = await _gather_chat(ctx.client.chat_json, prompts) + total = len(prompts) + relevant = sum(1 for a in answers if str(a.get("relevant", "")).lower() == "yes") + if not total: + return {"precision": None, "relevant": relevant, "total": total} + return {"precision": round(relevant / total, 4), "relevant": relevant, "total": total} + + +async def nc_semantic_precision(item: dict, ctx: EvalContext) -> dict: + """Count negative-control falsehoods the output asserts or endorses. + + Reads item["nc_items"] as list of {"id": ..., "description": ...} dicts. + Returns {asserted, total, asserted_ids}. 
+ """ + output_text = _output_text(item) + nc_items: list[dict] = item.get("nc_items", []) + prompts = [ + ( + SYSTEM, + "Does the OUTPUT assert or endorse the following FALSE statement?\n" + 'Reply with ONLY {"asserted": "yes" or "no"}.\n\n' + f"FALSE STATEMENT: {nc.get('description', '')}\n" + f"OUTPUT:\n{output_text}", + ) + for nc in nc_items + ] + answers = await _gather_chat(ctx.client.chat_json, prompts) + asserted_ids = [ + nc.get("id", "?") + for nc, a in zip(nc_items, answers, strict=False) + if str(a.get("asserted", "")).lower() == "yes" + ] + return {"asserted": len(asserted_ids), "total": len(nc_items), "asserted_ids": asserted_ids} + + +async def fabricated_entity(item: dict, ctx: EvalContext) -> dict: + """Count systems/orgs/metrics named in the output but absent from the corpus. + + Returns {count, entities}. + """ + output_text = _output_text(item) + corpus = "\n".join(f"{ev.get('locator', '')} :: {ev.get('excerpt', '')}" for ev in item.get("evidence_index", [])) + user = ( + "List any system, organization, or metric NAMED in the OUTPUT that does NOT " + "appear anywhere in the CORPUS EVIDENCE.\n" + 'Reply with ONLY {"fabricated": ["", ...]}. Empty list if none.\n\n' + f"OUTPUT:\n{output_text}\n\n" + f"CORPUS EVIDENCE:\n{corpus}" + ) + answer = await ctx.client.chat_json(SYSTEM, user) + entities = answer.get("fabricated", []) or [] + return {"count": len(entities), "entities": list(entities)} + + +async def contradiction(item: dict, ctx: EvalContext) -> dict: + """Count internally contradictory finding pairs. + + Returns {count, pairs}. + """ + lines = [] + for f in item.get("findings", []): + lines.append(f"{f.get('id', '?')}: {f.get('title', '')} — {f.get('description', '')}") + user = ( + "Are any two of these FINDINGS mutually contradictory? List each contradicting pair.\n" + 'Reply with ONLY {"pairs": [["", ""], ...]}. 
Empty list if none.\n\n' + "\n".join(lines) + ) + answer = await ctx.client.chat_json(SYSTEM, user) + pairs = answer.get("pairs", []) or [] + return {"count": len(pairs), "pairs": [list(p) for p in pairs]} + + +async def open_gap(item: dict, ctx: EvalContext) -> dict: + """G-Eval open probe: the most important process issue the output missed. + + Returns {gap} — a free-text advisory narrative (no score). + """ + pg = item.get("process_graph") or {} + pg_summary = f"process_graph has {len(pg.get('processes', []))} processes" + user = ( + "Given this corpus scope and output, what important process issue did the " + "output FAIL to surface?\n" + 'Reply with ONLY {"gap": ""}.\n\n' + f"WORKSPACE SCOPE: {_workspace_intention(item)}\n" + f"{pg_summary}\n" + f"OUTPUT:\n{_output_text(item)}" + ) + answer = await ctx.client.chat_json(SYSTEM, user) + return {"gap": str(answer.get("gap", ""))} + + +async def actionability(item: dict, ctx: EvalContext) -> dict: + """Average 0-1 rating of whether proposed actions are specific+quantified+linked. + + Returns {score, rated}. 
+ """ + actions = item.get("proposed_actions", []) or [] + finding_ids = {f.get("id") for f in item.get("findings", [])} + prompts = [ + ( + SYSTEM, + "Rate whether this proposed action is SPECIFIC, QUANTIFIED, and LINKED to a " + "finding.\n" + 'Reply with ONLY {"score": }.\n\n' + f"TITLE: {a.get('title', '')}\n" + f"DESCRIPTION: {a.get('description', '')}\n" + f"OWNER: {a.get('owner_persona', '')} HORIZON: {a.get('horizon', '')} " + f"LEVER: {a.get('lever', '')} EFFORT: {a.get('effort', '')}\n" + f"EXPECTED_SAVINGS_FTE: {a.get('expected_savings_fte', '')} " + f"EXPECTED_SAVINGS_USD: {a.get('expected_savings_usd', '')}\n" + f"LINKED_TO_FINDING: {a.get('finding_id') in finding_ids}", + ) + for a in actions + ] + answers = await _gather_chat(ctx.client.chat_json, prompts) + scores: list[float] = [] + for a in answers: + value = _coerce_float(a.get("score")) + if value is None: + continue + scores.append(value) + score = round(sum(scores) / len(scores), 4) if scores else None + return {"score": score, "rated": len(scores)} + + +async def severity_calibration(item: dict, ctx: EvalContext) -> dict: + """Per-finding judgment of whether stated severity matches the evidence. + + Returns {miscalibrated, total, verdicts: {finding_id: under|over|calibrated}}. 
+ """ + ev_idx = _evidence_index(item) + findings = item.get("findings", []) + prompts = [ + ( + SYSTEM, + "Does the STATED SEVERITY match what the CITED EVIDENCE supports?\n" + 'Reply with ONLY {"calibration": "under" or "over" or "calibrated"}.\n\n' + f"STATED SEVERITY: {f.get('severity', '')} SCORE: {f.get('score', '')}\n" + f"FINDING: {f.get('description', '')}\n" + f"CITED EVIDENCE: {' || '.join(_cited_excerpts(f, ev_idx))}", + ) + for f in findings + ] + answers = await _gather_chat(ctx.client.chat_json, prompts) + verdicts: dict[str, str] = {} + miscalibrated = 0 + for f, a in zip(findings, answers, strict=False): + verdict = str(a.get("calibration", "calibrated")).lower() + verdicts[f.get("id", "?")] = verdict + if verdict in ("under", "over"): + miscalibrated += 1 + return {"miscalibrated": miscalibrated, "total": len(findings), "verdicts": verdicts} + + +async def answer_relevancy(item: dict, ctx: EvalContext) -> dict: + """RAGAS-style: does the output address the stated workspace intention? + + Returns {score} in [0,1], or {"score": None} when the vote fails to coerce. + """ + user = ( + "Does the OUTPUT address the stated WORKSPACE INTENTION (on-topic, responsive)?\n" + 'Reply with ONLY {"score": }.\n\n' + f"WORKSPACE INTENTION: {_workspace_intention(item)}\n" + f"OUTPUT:\n{_output_text(item)}" + ) + answer = await ctx.client.chat_json(SYSTEM, user) + return {"score": _coerce_float(answer.get("score"))} + + +async def surface_deduplication(item: dict, ctx: EvalContext) -> dict: + """Fraction of near-duplicate process-graph node pairs that are genuinely distinct. + + Returns {distinct, redundant, total, distinct_rate, redundant_pairs}. 
+ """ + pg = item.get("process_graph", {}) + procs = pg.get("processes", []) + + def _toks(node: dict) -> frozenset[str]: + return frozenset(node.get("name", "").lower().split()) + + per_surface_cap = 10 + candidates: list[tuple[str, dict, dict, str]] = [] + + if len(procs) >= 2: + pairs: list[tuple[float, dict, dict]] = [] + for i in range(len(procs)): + for j in range(i + 1, len(procs)): + a_t, b_t = _toks(procs[i]), _toks(procs[j]) + union = a_t | b_t + if not union: + continue + jac = len(a_t & b_t) / len(union) + if jac >= 0.30: + pairs.append((jac, procs[i], procs[j])) + pairs.sort(key=lambda x: x[0], reverse=True) + for _jac, a, b in pairs[:per_surface_cap]: + candidates.append(("process", a, b, "")) + + for surface_key, attr in (("activity", "activities"), ("decision", "decisions")): + all_pairs: list[tuple[float, dict, dict, str]] = [] + for proc in procs: + nodes = proc.get(attr, []) + proc_name = proc.get("name", "") + if len(nodes) < 2: + continue + for i in range(len(nodes)): + for j in range(i + 1, len(nodes)): + a_t, b_t = _toks(nodes[i]), _toks(nodes[j]) + union = a_t | b_t + if not union: + continue + jac = len(a_t & b_t) / len(union) + if jac >= 0.30: + all_pairs.append((jac, nodes[i], nodes[j], proc_name)) + all_pairs.sort(key=lambda x: x[0], reverse=True) + for _jac, a, b, proc_name in all_pairs[:per_surface_cap]: + candidates.append((surface_key, a, b, proc_name)) + + if not candidates: + return {"distinct": 0, "redundant": 0, "total": 0, "distinct_rate": None, "redundant_pairs": []} + + prompts = [] + for surface, a, b, parent_proc in candidates: + ctx_line = f"\nPARENT PROCESS: {parent_proc}\n" if parent_proc else "" + prompts.append( + ( + SYSTEM, + f"Are these two {surface} nodes genuinely DISTINCT process concepts, or is one a " + f"duplicate / sub-case / restatement of the other?\n" + f"{ctx_line}" + 'Reply with ONLY {"verdict": "DISTINCT" or "DUPLICATE", "reason": ""}.\n\n' + f"{surface.upper()} A: {a.get('name', '')} — 
{a.get('description', '')}\n" + f"{surface.upper()} B: {b.get('name', '')} — {b.get('description', '')}", + ) + ) + + answers = await _gather_chat(ctx.client.chat_json, prompts) + + distinct = 0 + redundant = 0 + redundant_pairs: list[dict] = [] + for (surface, a, b, _parent), answer in zip(candidates, answers, strict=False): + verdict = str(answer.get("verdict", "")).upper() + if verdict == "DISTINCT": + distinct += 1 + else: + redundant += 1 + redundant_pairs.append( + { + "surface": surface, + "a": a.get("name", ""), + "b": b.get("name", ""), + "reason": str(answer.get("reason", "")), + } + ) + + total = distinct + redundant + return { + "distinct": distinct, + "redundant": redundant, + "total": total, + "distinct_rate": round(distinct / total, 4) if total else None, + "redundant_pairs": redundant_pairs, + } + + +async def comparative_vs_champion(item: dict, ctx: EvalContext) -> dict | None: + """Pairwise MT-Bench-style review of candidate vs champion (advisory only). + + Returns None if item["champion"] is not present. + Returns {candidate, champion, more_consistent}. + """ + champion = item.get("champion") + if champion is None: + return None + user = ( + "Score the CANDIDATE and the CHAMPION outputs on five axes (1-5 each): " + "Coverage, Quality, Evidence, Actionability, Regression. 
Then say which is " + "more internally consistent.\n" + "Reply with ONLY " + '{"candidate": {"coverage": x, "quality": x, "evidence": x, "actionability": x, "regression": x}, ' + '"champion": {"coverage": x, "quality": x, "evidence": x, "actionability": x, "regression": x}, ' + '"more_consistent": "candidate" or "champion"}.\n\n' + f"CANDIDATE:\n{_output_text(item)}\n\n" + f"CHAMPION:\n{_output_text(champion)}" + ) + out = await ctx.client.chat_json(SYSTEM, user) + return { + "candidate": out.get("candidate", {}), + "champion": out.get("champion", {}), + "more_consistent": out.get("more_consistent", ""), + } + + +# ── flycanon custom metrics ─────────────────────────────────────────────────────── + + +async def _rag_score_once(item: dict, ctx: EvalContext) -> dict | None: + """Single RAG scoring call: returns {"contains_answer": float, "addresses_question": float}.""" + question = item.get("question", "") + reference = item.get("reference", "") + answer = item.get("answer", "") + if not question or not answer: + return None + user = f"QUESTION: {question}\nREFERENCE: {reference}\nANSWER: {answer}\n\n{RUBRIC}" + result = await ctx.client.chat_json(SYSTEM_RAG, user) + return result + + +async def contains_answer(item: dict, ctx: EvalContext) -> float | None: + """Flycanon: does the answer contain the correct information from the reference? + + Runs ctx.runs times and returns the median score. + Returns None if the item lacks question/answer. + """ + scores: list[float] = [] + for _ in range(max(1, ctx.runs)): + result = await _rag_score_once(item, ctx) + if result is None: + return None + val = _coerce_float(result.get("contains_answer")) + if val is not None: + scores.append(val) + if not scores: + return None + return round(statistics.median(scores), 4) + + +async def addresses_question(item: dict, ctx: EvalContext) -> float | None: + """Flycanon: does the answer directly address what the question is asking? + + Runs ctx.runs times and returns the median score. 
+ Returns None if the item lacks question/answer. + """ + scores: list[float] = [] + for _ in range(max(1, ctx.runs)): + result = await _rag_score_once(item, ctx) + if result is None: + return None + val = _coerce_float(result.get("addresses_question")) + if val is not None: + scores.append(val) + if not scores: + return None + return round(statistics.median(scores), 4) + + +# ── RAGAS metrics ───────────────────────────────────────────────────────────────── +# ragas/langchain imports are inline inside _sync() since ragas is optional. + + +def _make_ragas_sample(item: dict): + """Build a RAGAS SingleTurnSample from an item dict (ragas import inline).""" + from ragas import SingleTurnSample # type: ignore[import] # noqa: PLC0415 + + return SingleTurnSample( + user_input=item.get("question", ""), + response=item.get("answer", ""), + reference=item.get("reference", ""), + retrieved_contexts=item.get("contexts", []), + ) + + +def _make_ragas_llm(ctx: EvalContext): + """Build a LangChain LLM wrapper for RAGAS (langchain import inline).""" + provider, model = ctx.client.provider, ctx.client.model + if provider == "anthropic": + from langchain_anthropic import ChatAnthropic # type: ignore[import] # noqa: PLC0415 + + api_key = os.environ.get("ANTHROPIC_API_KEY", "") + return ChatAnthropic(model=model, api_key=api_key, temperature=0.0) # type: ignore[call-arg,arg-type] + if provider in ("openai", "azure"): + from langchain_openai import ChatOpenAI # type: ignore[import] # noqa: PLC0415 + + api_key = os.environ.get("OPENAI_API_KEY", "") + return ChatOpenAI(model=model, api_key=api_key, temperature=0.0) # type: ignore[call-arg,arg-type] + if provider == "ollama": + from langchain_ollama import ChatOllama # type: ignore[import] # noqa: PLC0415 + + return ChatOllama(model=model, temperature=0.0) + raise ValueError(f"RAGAS: unsupported provider {provider!r}") + + +def _make_ragas_embeddings(ctx: EvalContext): + """Build LangChain embeddings for RAGAS (langchain import 
inline).""" + if ctx.embedder is not None: + from langchain_ollama import OllamaEmbeddings # type: ignore[import] # noqa: PLC0415 + + return OllamaEmbeddings(model=ctx.embedder._model) + from langchain_anthropic import AnthropicEmbeddings # type: ignore[import] # noqa: PLC0415 + + return AnthropicEmbeddings() + + +async def _ragas_score(metric_name: str, item: dict, ctx: EvalContext) -> float | None: + """Run a single named RAGAS metric and return its float score (or None).""" + + def _sync(): + from ragas import evaluate # type: ignore[import] # noqa: PLC0415 + from ragas.dataset_schema import EvaluationDataset # type: ignore[import] # noqa: PLC0415 + from ragas.metrics import ( # type: ignore[import] # noqa: PLC0415 + AnswerCorrectness, + AnswerRelevancy, + ContextPrecision, + ContextRecall, + Faithfulness, + ) + + _metrics_map = { + "answer_correctness": AnswerCorrectness, + "answer_relevancy_ragas": AnswerRelevancy, + "ragas_faithfulness": Faithfulness, + "context_recall": ContextRecall, + "context_precision": ContextPrecision, + } + metric_cls = _metrics_map.get(metric_name) + if metric_cls is None: + return None + + llm = _make_ragas_llm(ctx) + embeddings = _make_ragas_embeddings(ctx) + metric = metric_cls(llm=llm, embeddings=embeddings) + sample = _make_ragas_sample(item) + dataset = EvaluationDataset(samples=[sample]) + result = evaluate(dataset=dataset, metrics=[metric]) + df = result.to_pandas() # type: ignore[attr-defined] + # Strip the "ragas" marker from either end so "ragas_faithfulness" matches the "faithfulness" column. + col = df.columns[df.columns.str.contains(metric_name.removeprefix("ragas_").removesuffix("_ragas"), case=False)] + if col.empty: + return None + val = df[col[0]].iloc[0] + if val is None or (isinstance(val, float) and math.isnan(val)): + return None + return round(float(val), 4) + + loop = asyncio.get_running_loop() + return await loop.run_in_executor(None, _sync) + + +async def answer_correctness(item: dict, ctx: EvalContext) -> float | None: + """RAGAS answer correctness (semantic F1 against reference).""" + return await _ragas_score("answer_correctness", 
item, ctx) + + +async def ragas_faithfulness(item: dict, ctx: EvalContext) -> float | None: + """RAGAS faithfulness (answer grounded in retrieved contexts).""" + return await _ragas_score("ragas_faithfulness", item, ctx) + + +async def context_recall(item: dict, ctx: EvalContext) -> float | None: + """RAGAS context recall (reference coverage by retrieved contexts).""" + return await _ragas_score("context_recall", item, ctx) + + +async def context_precision(item: dict, ctx: EvalContext) -> float | None: + """RAGAS context precision (retrieved contexts relevant to the question).""" + return await _ragas_score("context_precision", item, ctx) + + +# ── median-of-N helpers ────────────────────────────────────────────────────────── + + +def _numeric_leaves(d: dict) -> dict[tuple, float]: + """Flatten a metric dict to {path: float} over its FLOAT score-leaves only.""" + out: dict[tuple, float] = {} + + def walk(node, path: tuple) -> None: + if isinstance(node, float): + out[path] = node + elif isinstance(node, dict): + for k, v in node.items(): + walk(v, path + (k,)) + + walk(d, ()) + return out + + +def _set_leaf(d: dict, path: tuple, value: float) -> None: + node = d + for key in path[:-1]: + node = node[key] + node[path[-1]] = value + + +def _median_runs(samples: list[dict]) -> dict: + """Median across N metric-dicts: FLOAT score-leaves -> per-key median; rest = first.""" + samples = [s for s in samples if isinstance(s, dict)] + if not samples: + return {} + base = samples[0] + if len(samples) == 1: + return base + leaf_values: dict[tuple, list[float]] = {} + for s in samples: + for path, val in _numeric_leaves(s).items(): + leaf_values.setdefault(path, []).append(val) + merged = dict(base) + for path, vals in leaf_values.items(): + try: + _set_leaf(merged, path, round(statistics.median(vals), 4)) + except (KeyError, TypeError): + continue + return merged + + +# ── orchestrator ───────────────────────────────────────────────────────────────── + + +async def run_judge( 
+ item: dict, + ctx: EvalContext, + *, + pipeline_model: str = "", +) -> AdvisoryReport: + """Run all metrics concurrently and return an AdvisoryReport. + + Best-effort: never raises. Failing metrics append to report.errors. + """ + report = AdvisoryReport( + judge_model=ctx.client.model_spec, + same_provider_caveat=same_provider(pipeline_model, ctx.client.model_spec), + calibrated=False, + runs=ctx.runs, + ) + + # [D] deterministic (no LLM) + det_metrics: list[tuple[str, Metric]] = [ + ("source_coverage", source_coverage), + ("excerpt_fill_rate", excerpt_fill_rate), + ] + # [E] embedding + emb_metrics: list[tuple[str, Metric]] = [ + ("semantic_recovery", semantic_recovery), + ] + # [J] judge metrics (median-of-runs handled externally for single-call ones) + judge_metrics: list[tuple[str, Metric]] = [ + ("faithfulness", faithfulness), + ("numeric_temporal_fidelity", numeric_temporal_fidelity), + ("citation_relevance", citation_relevance), + ("nc_semantic_precision", nc_semantic_precision), + ("fabricated_entity", fabricated_entity), + ("contradiction", contradiction), + ("open_gap", open_gap), + ("actionability", actionability), + ("severity_calibration", severity_calibration), + ("answer_relevancy", answer_relevancy), + ("surface_deduplication", surface_deduplication), + ("comparative_vs_champion", comparative_vs_champion), + ] + # flycanon custom + flycanon_metrics: list[tuple[str, Metric]] = [ + ("contains_answer", contains_answer), + ("addresses_question", addresses_question), + ] + # RAGAS + ragas_metrics: list[tuple[str, Metric]] = [ + ("answer_correctness", answer_correctness), + ("ragas_faithfulness", ragas_faithfulness), + ("context_recall", context_recall), + ("context_precision", context_precision), + ] + + all_metrics = det_metrics + emb_metrics + judge_metrics + flycanon_metrics + ragas_metrics + + async def _run_one(name: str, fn: Metric) -> None: + try: + result = await fn(item, ctx) + if result is not None: + report.metrics[name] = result + except 
Exception as exc: + report.errors.append(f"{name}: {type(exc).__name__}: {exc}") + + await asyncio.gather(*[_run_one(name, fn) for name, fn in all_metrics]) + return report diff --git a/fireflyframework_agentic/evaluation/judge_client.py b/fireflyframework_agentic/evaluation/judge_client.py new file mode 100644 index 00000000..7f050d16 --- /dev/null +++ b/fireflyframework_agentic/evaluation/judge_client.py @@ -0,0 +1,254 @@ +"""Async LLM scoring client for judge metrics. + +Thin httpx-based wrapper over Anthropic / OpenAI / Azure OpenAI / Ollama. +Reads API keys lazily (per-call) from env so importing never requires secrets. +Provider/model spec: "provider:model", e.g. "anthropic:claude-sonnet-4-6". +""" + +from __future__ import annotations + +import asyncio +import json +import os +import re + +import httpx + +_RETRY_STATUS = (429, 500, 502, 503, 504) +_MAX_RETRY_AFTER = 30.0 + + +def _env(name: str, default: str | None = None) -> str | None: + value = os.environ.get(name) + if value is None: + return default + value = value.strip() + return value if value else default + + +def parse_model(spec: str) -> tuple[str, str]: + """Split "provider:model" -> (provider, model). Bare spec -> ("unknown", spec).""" + spec = (spec or "").strip() + if ":" not in spec: + return "unknown", spec + provider, model = spec.split(":", 1) + return provider.strip().lower(), model.strip() + + +def same_provider(pipeline_model: str, judge_model: str) -> bool: + """True iff both specs share the same known provider prefix.""" + p, _ = parse_model(pipeline_model) + j, _ = parse_model(judge_model) + if p == "unknown" or j == "unknown": + return False + return p == j + + +def _first_json_object(text: str) -> dict: + """Extract the first balanced JSON object from text (handles prose/code-fence wrapping).""" + if not text: + raise ValueError("empty model response") + + # Fast path: a clean JSON object with no surrounding prose. A non-dict + # clean parse (e.g. 
a top-level array) is intentionally ignored so the brace + # scanner can still find an embedded object rather than returning arr[0]. + try: + parsed = json.loads(text.strip()) + except (json.JSONDecodeError, ValueError): + parsed = None + if isinstance(parsed, dict): + return parsed + + start = text.find("{") + while start != -1: + depth = 0 + in_string = False + escape = False + for i in range(start, len(text)): + ch = text[i] + if in_string: + if escape: + escape = False + elif ch == "\\": + escape = True + elif ch == '"': + in_string = False + continue + if ch == '"': + in_string = True + elif ch == "{": + depth += 1 + elif ch == "}": + depth -= 1 + if depth == 0: + candidate = text[start : i + 1] + try: + return json.loads(candidate) + except json.JSONDecodeError: + break # try the next '{' + start = text.find("{", start + 1) + + # Greedy fallback: first '{' .. last '}' across newlines. + match = re.search(r"\{.*\}", text, re.DOTALL) + if match: + return json.loads(match.group(0)) + raise ValueError("no JSON object found in model response") + + +class JudgeClient: + """Async multi-provider chat client returning parsed JSON dicts. + + Dispatch is by the provider prefix of the model spec. temperature is pinned + to 0.0 for deterministic verdicts. Transient HTTP errors (429/5xx) and network + errors are retried up to max_retries with backoff. + + The API key / endpoint env vars are read lazily inside chat_json, so + constructing a JudgeClient never requires a secret. + """ + + def __init__(self, model: str, timeout: int = 120, max_retries: int = 3) -> None: + self.model_spec = model + self.provider, self.model = parse_model(model) + self.timeout = timeout + self.max_retries = max_retries + + async def chat_json(self, system: str, user: str, max_tokens: int = 1024) -> dict: + """Send (system, user) to the provider and parse the first JSON object. + + Raises on exhausted retries / unknown provider / unparseable output. 
+ """ + last_exc: Exception | None = None + for attempt in range(self.max_retries): + try: + if self.provider == "anthropic": + return await self._anthropic(system, user, max_tokens) + if self.provider == "openai": + return await self._openai(system, user, max_tokens) + if self.provider == "azure": + return await self._azure(system, user, max_tokens) + if self.provider == "ollama": + return await self._ollama(system, user, max_tokens) + raise ValueError( + f"unknown judge provider {self.provider!r} in {self.model_spec!r}; " + "use anthropic:/openai:/azure:/ollama:" + ) + except httpx.HTTPStatusError as exc: + last_exc = exc + if exc.response.status_code not in _RETRY_STATUS or attempt == self.max_retries - 1: + raise + retry_after_header = exc.response.headers.get("retry-after") + if retry_after_header is not None: + try: + delay = min(float(retry_after_header), _MAX_RETRY_AFTER) + except (TypeError, ValueError): + delay = 2.0**attempt + else: + delay = 2.0**attempt + await asyncio.sleep(delay) + except httpx.RequestError as exc: + last_exc = exc + if attempt == self.max_retries - 1: + raise + await asyncio.sleep(2.0) + if last_exc is not None: + raise last_exc + raise RuntimeError("chat_json exhausted retries without a response") + + async def _anthropic(self, system: str, user: str, max_tokens: int) -> dict: + api_key = _env("ANTHROPIC_API_KEY") + if not api_key: + raise RuntimeError("ANTHROPIC_API_KEY not set") + body = { + "model": self.model, + "max_tokens": max_tokens, + "temperature": 0.0, + "system": system, + "messages": [{"role": "user", "content": user}], + } + headers = { + "x-api-key": api_key, + "anthropic-version": "2023-06-01", + "content-type": "application/json", + } + async with httpx.AsyncClient(timeout=self.timeout) as client: + resp = await client.post("https://api.anthropic.com/v1/messages", json=body, headers=headers) + resp.raise_for_status() + data = resp.json() + text = next((b.get("text") for b in data.get("content", []) if b.get("type") 
== "text"), None) + if not text: + raise RuntimeError(f"judge returned no text: {data}") + return _first_json_object(text) + + async def _openai(self, system: str, user: str, max_tokens: int) -> dict: + api_key = _env("OPENAI_API_KEY") + if not api_key: + raise RuntimeError("OPENAI_API_KEY not set") + body = { + "model": self.model, + "max_tokens": max_tokens, + "temperature": 0.0, + "messages": [ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + } + headers = {"Authorization": f"Bearer {api_key}", "content-type": "application/json"} + async with httpx.AsyncClient(timeout=self.timeout) as client: + resp = await client.post("https://api.openai.com/v1/chat/completions", json=body, headers=headers) + resp.raise_for_status() + data = resp.json() + choices = data.get("choices") or [] + if choices: + text = (choices[0].get("message") or {}).get("content") + if text: + return _first_json_object(text) + raise RuntimeError(f"judge returned no text: {data}") + + async def _azure(self, system: str, user: str, max_tokens: int) -> dict: + endpoint = _env("AZURE_OPENAI_ENDPOINT") + api_key = _env("AZURE_OPENAI_API_KEY") + if not endpoint: + raise RuntimeError("AZURE_OPENAI_ENDPOINT not set") + if not api_key: + raise RuntimeError("AZURE_OPENAI_API_KEY not set") + api_version = _env("AZURE_OPENAI_API_VERSION") or "2024-02-01" + url = f"{endpoint.rstrip('/')}/openai/deployments/{self.model}/chat/completions?api-version={api_version}" + body = { + "max_tokens": max_tokens, + "temperature": 0.0, + "messages": [ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + } + headers = {"api-key": api_key, "content-type": "application/json"} + async with httpx.AsyncClient(timeout=self.timeout) as client: + resp = await client.post(url, json=body, headers=headers) + resp.raise_for_status() + data = resp.json() + choices = data.get("choices") or [] + if choices: + text = (choices[0].get("message") or {}).get("content") + if 
text: + return _first_json_object(text) + raise RuntimeError(f"judge returned no text: {data}") + + async def _ollama(self, system: str, user: str, max_tokens: int) -> dict: # noqa: ARG002 + host = _env("OLLAMA_HOST") or "http://localhost:11434" + body = { + "model": self.model, + "stream": False, + "options": {"temperature": 0.0}, + "messages": [ + {"role": "system", "content": system}, + {"role": "user", "content": user}, + ], + } + async with httpx.AsyncClient(timeout=self.timeout) as client: + resp = await client.post(f"{host.rstrip('/')}/api/chat", json=body) + resp.raise_for_status() + data = resp.json() + text = (data.get("message") or {}).get("content") + if not text: + raise RuntimeError(f"judge returned no text: {data}") + return _first_json_object(text) diff --git a/fireflyframework_agentic/evaluation/retrieval_metrics.py b/fireflyframework_agentic/evaluation/retrieval_metrics.py new file mode 100644 index 00000000..7c9c5cfe --- /dev/null +++ b/fireflyframework_agentic/evaluation/retrieval_metrics.py @@ -0,0 +1,176 @@ +# Copyright 2026 Firefly Software Foundation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Deterministic IR evaluation metrics for ranked retrieval results (no LLM, no network). + +Each metric is a plain function that takes a list of result rows and returns a +float — the same design as scikit-learn or MS MARCO evaluation scripts. 
+ +Result row schema (dict):: + + { + "retrieved": [{"rank": int, "source_id": str, "is_gold": bool}, ...], + "gold": [str, ...], # gold source identifiers + # optional: + "no_answer": bool, # model refused / produced no answer + "answer": str, # used for no_answer detection when no_answer absent + "citations": [{"is_gold": bool}, ...], + "search_ms": float, + "answer_ms": float, + } + +Individual metrics:: + + hit_at_k(results, k) -> float + recall_at_k(results, k) -> float + precision_at_k(results, k) -> float + mrr(results, k=10) -> float + map_score(results, k=10) -> float + ndcg(results, k=10) -> float + no_answer_rate(results) -> float | None + citation_precision(results) -> float | None + mean_latency_ms(results, field) -> float | None +""" + +from __future__ import annotations + +import math + + +def _dedup(retrieved: list[dict]) -> list[dict]: + """Return one entry per source, first chunk wins, preserving rank order.""" + seen: set[str] = set() + out: list[dict] = [] + for r in sorted(retrieved, key=lambda x: x["rank"]): + key = r.get("source_id") or "|".join(r.get("identities", [])) + if key not in seen: + seen.add(key) + out.append(r) + return out + + +def _ndcg_single(retrieved: list[dict], n_gold: int, k: int = 10) -> float: + dcg = sum(1.0 / math.log2(r["rank"] + 1) for r in retrieved if r.get("is_gold") and r["rank"] <= k) + ideal = sum(1.0 / math.log2(i + 2) for i in range(min(n_gold, k))) + return dcg / ideal if ideal else 0.0 + + +def _ap_single(retrieved: list[dict], n_gold: int, k: int = 10) -> float: + hits, precisions = 0, [] + for r in sorted(retrieved, key=lambda x: x["rank"]): + if r["rank"] > k: + break + if r.get("is_gold"): + hits += 1 + precisions.append(hits / r["rank"]) + return sum(precisions) / min(n_gold, k) if n_gold else 0.0 + + +def hit_at_k(results: list[dict], k: int) -> float: + """Fraction of queries where at least one gold document appears in top-k.""" + if not results: + return 0.0 + hits = 0 + for row in results: + 
retrieved = _dedup(row["retrieved"]) + gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")] + if any(g <= k for g in gold_ranks): + hits += 1 + return round(hits / len(results), 4) + + +def recall_at_k(results: list[dict], k: int) -> float: + """Mean fraction of gold documents found in top-k.""" + if not results: + return 0.0 + total = 0.0 + for row in results: + retrieved = _dedup(row["retrieved"]) + n_gold = max(len(set(row["gold"])), 1) + gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")] + total += len([g for g in gold_ranks if g <= k]) / n_gold + return round(total / len(results), 4) + + +def precision_at_k(results: list[dict], k: int) -> float: + """Mean fraction of top-k results that are gold.""" + if not results: + return 0.0 + total = 0.0 + for row in results: + retrieved = _dedup(row["retrieved"]) + gold_ranks = [r["rank"] for r in retrieved if r.get("is_gold")] + total += len([g for g in gold_ranks if g <= k]) / k + return round(total / len(results), 4) + + +def mrr(results: list[dict], k: int = 10) -> float: + """Mean reciprocal rank of the first gold hit (up to k).""" + if not results: + return 0.0 + total = 0.0 + for row in results: + retrieved = _dedup(row["retrieved"]) + gold_ranks = sorted(r["rank"] for r in retrieved if r.get("is_gold") and r["rank"] <= k) + total += 1.0 / gold_ranks[0] if gold_ranks else 0.0 + return round(total / len(results), 4) + + +def map_score(results: list[dict], k: int = 10) -> float: + """Mean average precision at k.""" + if not results: + return 0.0 + total = 0.0 + for row in results: + retrieved = _dedup(row["retrieved"]) + n_gold = max(len(set(row["gold"])), 1) + total += _ap_single(retrieved, n_gold, k) + return round(total / len(results), 4) + + +def ndcg(results: list[dict], k: int = 10) -> float: + """Mean normalised discounted cumulative gain at k.""" + if not results: + return 0.0 + total = 0.0 + for row in results: + retrieved = _dedup(row["retrieved"]) + n_gold = 
max(len(set(row["gold"])), 1) + total += _ndcg_single(retrieved, n_gold, k) + return round(total / len(results), 4) + + +def no_answer_rate(results: list[dict]) -> float | None: + """Fraction of queries where the model produced no answer. None if no results.""" + if not results: + return None + count = sum(1 for row in results if row.get("no_answer") or not row.get("answer", "").strip()) + return round(count / len(results), 4) + + +def citation_precision(results: list[dict]) -> float | None: + """Precision of in-answer citations vs gold set. None if no citations present.""" + num = den = 0.0 + for row in results: + cites = row.get("citations", []) + if cites: + num += sum(1 for c in cites if c.get("is_gold")) + den += len(cites) + return round(num / den, 4) if den else None + + +def mean_latency_ms(results: list[dict], field: str) -> float | None: + """Mean latency in ms for the given field (``search_ms`` or ``answer_ms``). None if absent.""" + values = [row[field] for row in results if row.get(field) is not None] + return round(sum(values) / len(values)) if values else None diff --git a/pyproject.toml b/pyproject.toml index cceaf667..dc6a1507 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -119,6 +119,12 @@ binary = [ all = [ "fireflyframework-agentic[postgres,mongodb,security,embeddings,openai-embeddings,cohere-embeddings,google-embeddings,mistral-embeddings,voyage-embeddings,azure-embeddings,bedrock-embeddings,ollama-embeddings,vectorstores-chroma,vectorstores-pinecone,vectorstores-qdrant,vectorstores-pgvector,vectorstores-sqlite-vec,watch,binary]", ] +evaluation = [ + "numpy>=1.26.0", + "ragas>=0.2", + "langchain-anthropic>=0.3", + "langchain-ollama>=0.3", +] dev = [ "pytest>=8.3.0", "pytest-asyncio>=0.24.0", diff --git a/tests/unit/evaluation/__init__.py b/tests/unit/evaluation/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/tests/unit/evaluation/test_judge.py b/tests/unit/evaluation/test_judge.py new file mode 100644 index 
00000000..7f27c125 --- /dev/null +++ b/tests/unit/evaluation/test_judge.py @@ -0,0 +1,248 @@ +from unittest.mock import MagicMock + +import pytest + +from fireflyframework_agentic.evaluation.judge import ( + EvalContext, + addresses_question, + contains_answer, + excerpt_fill_rate, + faithfulness, + source_coverage, +) +from fireflyframework_agentic.evaluation.judge_client import JudgeClient + + +def make_ctx(responses: list[dict]) -> EvalContext: + client = MagicMock(spec=JudgeClient) + client.model_spec = "anthropic:claude-sonnet-4-6" + client.provider = "anthropic" + client.model = "claude-sonnet-4-6" + call_iter = iter(responses) + + async def mock_chat_json(system, user, max_tokens=1024): + return next(call_iter) + + client.chat_json = mock_chat_json + return EvalContext(client=client, runs=1) + + +# ── contains_answer ────────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_contains_answer_present(): + ctx = make_ctx([{"contains_answer": 1.0, "addresses_question": 1.0}]) + item = {"question": "Q", "reference": "R", "answer": "A"} + score = await contains_answer(item, ctx) + assert score == 1.0 + + +@pytest.mark.asyncio +async def test_contains_answer_absent(): + ctx = make_ctx([{"contains_answer": 0.0, "addresses_question": 0.5}]) + item = {"question": "Q", "reference": "R", "answer": "wrong"} + score = await contains_answer(item, ctx) + assert score == 0.0 + + +@pytest.mark.asyncio +async def test_contains_answer_partial(): + ctx = make_ctx([{"contains_answer": 0.5, "addresses_question": 0.8}]) + item = {"question": "Q", "reference": "R", "answer": "partial"} + score = await contains_answer(item, ctx) + assert score == 0.5 + + +@pytest.mark.asyncio +async def test_contains_answer_missing_question_returns_none(): + ctx = make_ctx([]) + item = {"reference": "R", "answer": "A"} + score = await contains_answer(item, ctx) + assert score is None + + +# ── addresses_question 
─────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_addresses_question_yes(): + ctx = make_ctx([{"contains_answer": 0.5, "addresses_question": 1.0}]) + item = {"question": "Q", "reference": "R", "answer": "A"} + score = await addresses_question(item, ctx) + assert score == 1.0 + + +@pytest.mark.asyncio +async def test_addresses_question_no(): + ctx = make_ctx([{"contains_answer": 0.0, "addresses_question": 0.0}]) + item = {"question": "Q", "reference": "R", "answer": "irrelevant"} + score = await addresses_question(item, ctx) + assert score == 0.0 + + +@pytest.mark.asyncio +async def test_addresses_question_missing_answer_returns_none(): + ctx = make_ctx([]) + item = {"question": "Q", "reference": "R"} + score = await addresses_question(item, ctx) + assert score is None + + +# ── faithfulness ───────────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_faithfulness_all_supported(): + # One finding with cited evidence, judge says SUPPORTED. 
+ ctx = make_ctx([{"verdict": "SUPPORTED", "reason": "matches"}]) + item = { + "findings": [ + { + "id": "F1", + "description": "The process takes 3 days.", + "evidence_refs": [{"evidence_id": "E1"}], + } + ], + "evidence_index": [{"id": "E1", "locator": "doc.pdf#1", "excerpt": "The process takes 3 days as documented."}], + } + result = await faithfulness(item, ctx) + assert result["supported"] == 1 + assert result["total"] == 1 + assert result["unsupported_ids"] == [] + + +@pytest.mark.asyncio +async def test_faithfulness_not_supported(): + ctx = make_ctx([{"verdict": "NOT_SUPPORTED", "reason": "contradicts"}]) + item = { + "findings": [ + { + "id": "F1", + "description": "The process takes 45 days.", + "evidence_refs": [{"evidence_id": "E1"}], + } + ], + "evidence_index": [{"id": "E1", "locator": "doc.pdf#1", "excerpt": "The process takes 3 days."}], + } + result = await faithfulness(item, ctx) + assert result["supported"] == 0 + assert result["total"] == 1 + assert "F1" in result["unsupported_ids"] + + +@pytest.mark.asyncio +async def test_faithfulness_no_cited_evidence(): + # Finding with no evidence_refs -> counted as unsupported without LLM call. 
+ ctx = make_ctx([]) + item = { + "findings": [{"id": "F1", "description": "Something.", "evidence_refs": []}], + "evidence_index": [], + } + result = await faithfulness(item, ctx) + assert result["supported"] == 0 + assert result["total"] == 1 + assert "F1" in result["unsupported_ids"] + + +# ── source_coverage ─────────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_source_coverage_all_cited(): + ctx = make_ctx([]) + item = { + "findings": [ + { + "id": "F1", + "description": "X", + "evidence_refs": [{"evidence_id": "E1"}], + } + ], + "evidence_index": [{"id": "E1", "locator": "doc.pdf#section1", "excerpt": "text"}], + } + result = await source_coverage(item, ctx) + assert result["cited"] == 1 + assert result["total"] == 1 + assert result["orphaned"] == [] + + +@pytest.mark.asyncio +async def test_source_coverage_orphaned(): + ctx = make_ctx([]) + item = { + "findings": [{"id": "F1", "description": "X", "evidence_refs": []}], + "evidence_index": [ + {"id": "E1", "locator": "doc1.pdf#p1", "excerpt": "text"}, + {"id": "E2", "locator": "doc2.pdf#p2", "excerpt": "text2"}, + ], + } + result = await source_coverage(item, ctx) + assert result["cited"] == 0 + assert result["total"] == 2 + assert len(result["orphaned"]) == 2 + + +@pytest.mark.asyncio +async def test_source_coverage_stem_dedup(): + # Two evidence items from the same file (different fragments) -> 1 source stem. + ctx = make_ctx([]) + item = { + "findings": [ + { + "id": "F1", + "description": "X", + "evidence_refs": [{"evidence_id": "E1"}], + } + ], + "evidence_index": [ + {"id": "E1", "locator": "doc.pdf#section1", "excerpt": "text1"}, + {"id": "E2", "locator": "doc.pdf#section2", "excerpt": "text2"}, + ], + } + result = await source_coverage(item, ctx) + # Both E1 and E2 share "doc.pdf" stem -> 1 total stem. + assert result["total"] == 1 + # E1 is cited -> that stem is covered. 
+ assert result["cited"] == 1 + + +# ── excerpt_fill_rate ────────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_excerpt_fill_rate_full(): + ctx = make_ctx([]) + item = { + "evidence_index": [ + {"id": "E1", "excerpt": "has content"}, + {"id": "E2", "excerpt": "also has content"}, + ] + } + result = await excerpt_fill_rate(item, ctx) + assert result["populated"] == 2 + assert result["total"] == 2 + + +@pytest.mark.asyncio +async def test_excerpt_fill_rate_partial(): + ctx = make_ctx([]) + item = { + "evidence_index": [ + {"id": "E1", "excerpt": "has content"}, + {"id": "E2", "excerpt": ""}, + {"id": "E3", "excerpt": " "}, + ] + } + result = await excerpt_fill_rate(item, ctx) + assert result["populated"] == 1 + assert result["total"] == 3 + + +@pytest.mark.asyncio +async def test_excerpt_fill_rate_empty(): + ctx = make_ctx([]) + item = {"evidence_index": []} + result = await excerpt_fill_rate(item, ctx) + assert result["populated"] == 0 + assert result["total"] == 0 diff --git a/tests/unit/evaluation/test_retrieval_metrics.py b/tests/unit/evaluation/test_retrieval_metrics.py new file mode 100644 index 00000000..fa453e2d --- /dev/null +++ b/tests/unit/evaluation/test_retrieval_metrics.py @@ -0,0 +1,181 @@ +# Copyright 2026 Firefly Software Foundation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Unit tests for evaluation.retrieval_metrics.""" + +from __future__ import annotations + +from fireflyframework_agentic.evaluation.retrieval_metrics import ( + citation_precision, + hit_at_k, + map_score, + mean_latency_ms, + mrr, + ndcg, + no_answer_rate, + precision_at_k, + recall_at_k, +) + + +def _row(gold_rank: int | None = None, total: int = 5, n_gold: int = 1) -> dict: + retrieved = [] + for rank in range(1, total + 1): + retrieved.append({"rank": rank, "source_id": f"doc-{rank}", "is_gold": rank == gold_rank}) + gold_ids = [f"doc-{gold_rank}"] if gold_rank is not None else [] + return {"retrieved": retrieved, "gold": gold_ids * n_gold} + + +# ── hit_at_k ────────────────────────────────────────────────────────────────── + + +def test_hit_at_k_gold_at_rank1(): + assert hit_at_k([_row(gold_rank=1)], k=1) == 1.0 + + +def test_hit_at_k_miss_at_rank1(): + assert hit_at_k([_row(gold_rank=2)], k=1) == 0.0 + + +def test_hit_at_k_gold_at_rank5(): + assert hit_at_k([_row(gold_rank=5)], k=5) == 1.0 + + +def test_hit_at_k_gold_at_rank10(): + assert hit_at_k([_row(gold_rank=10, total=10)], k=10) == 1.0 + + +def test_hit_at_k_empty(): + assert hit_at_k([], k=5) == 0.0 + + +# ── recall_at_k ─────────────────────────────────────────────────────────────── + + +def test_recall_at_k_full_when_gold_at_rank1(): + assert recall_at_k([_row(gold_rank=1, n_gold=1)], k=1) == 1.0 + + +def test_recall_at_k_zero_when_gold_outside_k(): + assert recall_at_k([_row(gold_rank=5)], k=1) == 0.0 + + +def test_recall_at_k_increases_with_k(): + rows = [_row(gold_rank=3)] + assert recall_at_k(rows, k=1) <= recall_at_k(rows, k=5) <= recall_at_k(rows, k=10) + + +# ── precision_at_k ──────────────────────────────────────────────────────────── + + +def test_precision_at_k_gold_at_rank1(): + assert precision_at_k([_row(gold_rank=1)], k=1) == 1.0 + + +def test_precision_at_k_decreases_when_k_larger(): + rows = [_row(gold_rank=1)] + assert precision_at_k(rows, k=5) < precision_at_k(rows, k=1) + + 
+# ── mrr ─────────────────────────────────────────────────────────────────────── + + +def test_mrr_gold_at_rank1(): + assert mrr([_row(gold_rank=1)]) == 1.0 + + +def test_mrr_gold_at_rank2(): + assert abs(mrr([_row(gold_rank=2)]) - 0.5) < 1e-9 + + +def test_mrr_no_gold(): + assert mrr([_row(gold_rank=None)]) == 0.0 + + +def test_mrr_average_across_queries(): + rows = [_row(gold_rank=1), _row(gold_rank=2)] + assert abs(mrr(rows) - 0.75) < 1e-3 + + +# ── ndcg ────────────────────────────────────────────────────────────────────── + + +def test_ndcg_gold_at_rank1(): + assert abs(ndcg([_row(gold_rank=1, n_gold=1)]) - 1.0) < 1e-9 + + +def test_ndcg_less_than_1_when_not_at_rank1(): + score = ndcg([_row(gold_rank=3, n_gold=1)]) + assert 0.0 < score < 1.0 + + +def test_ndcg_zero_when_no_gold(): + assert ndcg([_row(gold_rank=None)]) == 0.0 + + +# ── map_score ───────────────────────────────────────────────────────────────── + + +def test_map_score_perfect_when_gold_at_rank1(): + assert map_score([_row(gold_rank=1, n_gold=1)]) == 1.0 + + +def test_map_score_zero_when_no_gold(): + assert map_score([_row(gold_rank=None)]) == 0.0 + + +# ── no_answer_rate ──────────────────────────────────────────────────────────── + + +def test_no_answer_rate_zero_when_answer_present(): + rows = [{**_row(gold_rank=1), "answer": "some answer"}] + assert no_answer_rate(rows) == 0.0 + + +def test_no_answer_rate_one_when_no_answer_field(): + assert no_answer_rate([_row(gold_rank=1)]) == 1.0 + + +def test_no_answer_rate_none_when_empty(): + assert no_answer_rate([]) is None + + +# ── citation_precision ──────────────────────────────────────────────────────── + + +def test_citation_precision_none_when_no_citations(): + assert citation_precision([_row(gold_rank=1)]) is None + + +def test_citation_precision_1_when_all_gold(): + rows = [{**_row(gold_rank=1), "citations": [{"is_gold": True}, {"is_gold": True}]}] + assert citation_precision(rows) == 1.0 + + +def 
test_citation_precision_half_when_half_gold(): + rows = [{**_row(gold_rank=1), "citations": [{"is_gold": True}, {"is_gold": False}]}] + assert citation_precision(rows) == 0.5 + + +# ── mean_latency_ms ─────────────────────────────────────────────────────────── + + +def test_mean_latency_none_when_field_absent(): + assert mean_latency_ms([_row(gold_rank=1)], "search_ms") is None + + +def test_mean_latency_computed_when_present(): + rows = [{**_row(gold_rank=1), "search_ms": 100.0, "answer_ms": 200.0}] + assert mean_latency_ms(rows, "search_ms") == 100 + assert mean_latency_ms(rows, "answer_ms") == 200 diff --git a/uv.lock b/uv.lock index 374cca9f..364552a9 100644 --- a/uv.lock +++ b/uv.lock @@ -1222,6 +1222,10 @@ dev = [ embeddings = [ { name = "numpy" }, ] +evaluation = [ + { name = "numpy" }, + { name = "scipy" }, +] google-embeddings = [ { name = "google-generativeai" }, ] @@ -1292,6 +1296,7 @@ requires-dist = [ { name = "mistralai", marker = "extra == 'mistral-embeddings'", specifier = ">=1.0.0" }, { name = "motor", marker = "extra == 'mongodb'", specifier = ">=3.6.0" }, { name = "numpy", marker = "extra == 'embeddings'", specifier = ">=1.26.0" }, + { name = "numpy", marker = "extra == 'evaluation'", specifier = ">=1.26.0" }, { name = "numpy", marker = "extra == 'reasoning-eval'", specifier = ">=2.0.0" }, { name = "openai", marker = "extra == 'azure-embeddings'", specifier = ">=1.0.0" }, { name = "openai", marker = "extra == 'openai-embeddings'", specifier = ">=1.0.0" }, @@ -1317,13 +1322,14 @@ requires-dist = [ { name = "python-dotenv", specifier = ">=1.0.0" }, { name = "qdrant-client", marker = "extra == 'vectorstores-qdrant'", specifier = ">=1.12.0" }, { name = "ruff", marker = "extra == 'dev'", specifier = ">=0.9.0" }, + { name = "scipy", marker = "extra == 'evaluation'", specifier = ">=1.11" }, { name = "sqlalchemy", marker = "extra == 'postgres'", specifier = ">=2.0.0" }, { name = "sqlite-vec", marker = "extra == 'vectorstores-sqlite-vec'", specifier = 
">=0.1.6" }, { name = "testcontainers", marker = "extra == 'dev'", specifier = ">=4.10.0" }, { name = "voyageai", marker = "extra == 'voyage-embeddings'", specifier = ">=0.3.0" }, { name = "watchfiles", marker = "extra == 'watch'", specifier = ">=0.24.0" }, ] -provides-extras = ["postgres", "mongodb", "security", "embeddings", "openai-embeddings", "cohere-embeddings", "google-embeddings", "mistral-embeddings", "voyage-embeddings", "azure-embeddings", "bedrock-embeddings", "ollama-embeddings", "reasoning-eval", "vectorstores-chroma", "vectorstores-sqlite-vec", "vectorstores-pinecone", "vectorstores-qdrant", "vectorstores-pgvector", "watch", "binary", "all", "dev"] +provides-extras = ["postgres", "mongodb", "security", "embeddings", "openai-embeddings", "cohere-embeddings", "google-embeddings", "mistral-embeddings", "voyage-embeddings", "azure-embeddings", "bedrock-embeddings", "ollama-embeddings", "reasoning-eval", "vectorstores-chroma", "vectorstores-sqlite-vec", "vectorstores-pinecone", "vectorstores-qdrant", "vectorstores-pgvector", "watch", "binary", "all", "evaluation", "dev"] [[package]] name = "flatbuffers" @@ -4502,6 +4508,57 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/87/72/c6c32d2b657fa3dad1de340254e14390b1e334ce38268b7ad51abda3c8c2/s3transfer-0.17.0-py3-none-any.whl", hash = "sha256:ce3801712acf4ad3e89fb9990df97b4972e93f4b3b0004d214be5bce12814c20", size = 86811, upload-time = "2026-04-29T22:07:34.966Z" }, ] +[[package]] +name = "scipy" +version = "1.17.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "numpy" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/7a/97/5a3609c4f8d58b039179648e62dd220f89864f56f7357f5d4f45c29eb2cc/scipy-1.17.1.tar.gz", hash = "sha256:95d8e012d8cb8816c226aef832200b1d45109ed4464303e997c5b13122b297c0", size = 30573822, upload-time = "2026-02-23T00:26:24.851Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/76/27/07ee1b57b65e92645f219b37148a7e7928b82e2b5dbeccecb4dff7c64f0b/scipy-1.17.1-cp313-cp313-macosx_10_14_x86_64.whl", hash = "sha256:5e3c5c011904115f88a39308379c17f91546f77c1667cea98739fe0fccea804c", size = 31590199, upload-time = "2026-02-23T00:19:17.192Z" }, + { url = "https://files.pythonhosted.org/packages/ec/ae/db19f8ab842e9b724bf5dbb7db29302a91f1e55bc4d04b1025d6d605a2c5/scipy-1.17.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:6fac755ca3d2c3edcb22f479fceaa241704111414831ddd3bc6056e18516892f", size = 28154001, upload-time = "2026-02-23T00:19:22.241Z" }, + { url = "https://files.pythonhosted.org/packages/5b/58/3ce96251560107b381cbd6e8413c483bbb1228a6b919fa8652b0d4090e7f/scipy-1.17.1-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:7ff200bf9d24f2e4d5dc6ee8c3ac64d739d3a89e2326ba68aaf6c4a2b838fd7d", size = 20325719, upload-time = "2026-02-23T00:19:26.329Z" }, + { url = "https://files.pythonhosted.org/packages/b2/83/15087d945e0e4d48ce2377498abf5ad171ae013232ae31d06f336e64c999/scipy-1.17.1-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:4b400bdc6f79fa02a4d86640310dde87a21fba0c979efff5248908c6f15fad1b", size = 22683595, upload-time = "2026-02-23T00:19:30.304Z" }, + { url = "https://files.pythonhosted.org/packages/b4/e0/e58fbde4a1a594c8be8114eb4aac1a55bcd6587047efc18a61eb1f5c0d30/scipy-1.17.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2b64ca7d4aee0102a97f3ba22124052b4bd2152522355073580bf4845e2550b6", size = 32896429, upload-time = "2026-02-23T00:19:35.536Z" }, + { url = "https://files.pythonhosted.org/packages/f5/5f/f17563f28ff03c7b6799c50d01d5d856a1d55f2676f537ca8d28c7f627cd/scipy-1.17.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:581b2264fc0aa555f3f435a5944da7504ea3a065d7029ad60e7c3d1ae09c5464", size = 35203952, upload-time = "2026-02-23T00:19:42.259Z" }, + { url = 
"https://files.pythonhosted.org/packages/8d/a5/9afd17de24f657fdfe4df9a3f1ea049b39aef7c06000c13db1530d81ccca/scipy-1.17.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:beeda3d4ae615106d7094f7e7cef6218392e4465cc95d25f900bebabfded0950", size = 34979063, upload-time = "2026-02-23T00:19:47.547Z" }, + { url = "https://files.pythonhosted.org/packages/8b/13/88b1d2384b424bf7c924f2038c1c409f8d88bb2a8d49d097861dd64a57b2/scipy-1.17.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6609bc224e9568f65064cfa72edc0f24ee6655b47575954ec6339534b2798369", size = 37598449, upload-time = "2026-02-23T00:19:53.238Z" }, + { url = "https://files.pythonhosted.org/packages/35/e5/d6d0e51fc888f692a35134336866341c08655d92614f492c6860dc45bb2c/scipy-1.17.1-cp313-cp313-win_amd64.whl", hash = "sha256:37425bc9175607b0268f493d79a292c39f9d001a357bebb6b88fdfaff13f6448", size = 36510943, upload-time = "2026-02-23T00:20:50.89Z" }, + { url = "https://files.pythonhosted.org/packages/2a/fd/3be73c564e2a01e690e19cc618811540ba5354c67c8680dce3281123fb79/scipy-1.17.1-cp313-cp313-win_arm64.whl", hash = "sha256:5cf36e801231b6a2059bf354720274b7558746f3b1a4efb43fcf557ccd484a87", size = 24545621, upload-time = "2026-02-23T00:20:55.871Z" }, + { url = "https://files.pythonhosted.org/packages/6f/6b/17787db8b8114933a66f9dcc479a8272e4b4da75fe03b0c282f7b0ade8cd/scipy-1.17.1-cp313-cp313t-macosx_10_14_x86_64.whl", hash = "sha256:d59c30000a16d8edc7e64152e30220bfbd724c9bbb08368c054e24c651314f0a", size = 31936708, upload-time = "2026-02-23T00:19:58.694Z" }, + { url = "https://files.pythonhosted.org/packages/38/2e/524405c2b6392765ab1e2b722a41d5da33dc5c7b7278184a8ad29b6cb206/scipy-1.17.1-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:010f4333c96c9bb1a4516269e33cb5917b08ef2166d5556ca2fd9f082a9e6ea0", size = 28570135, upload-time = "2026-02-23T00:20:03.934Z" }, + { url = 
"https://files.pythonhosted.org/packages/fd/c3/5bd7199f4ea8556c0c8e39f04ccb014ac37d1468e6cfa6a95c6b3562b76e/scipy-1.17.1-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:2ceb2d3e01c5f1d83c4189737a42d9cb2fc38a6eeed225e7515eef71ad301dce", size = 20741977, upload-time = "2026-02-23T00:20:07.935Z" }, + { url = "https://files.pythonhosted.org/packages/d9/b8/8ccd9b766ad14c78386599708eb745f6b44f08400a5fd0ade7cf89b6fc93/scipy-1.17.1-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:844e165636711ef41f80b4103ed234181646b98a53c8f05da12ca5ca289134f6", size = 23029601, upload-time = "2026-02-23T00:20:12.161Z" }, + { url = "https://files.pythonhosted.org/packages/6d/a0/3cb6f4d2fb3e17428ad2880333cac878909ad1a89f678527b5328b93c1d4/scipy-1.17.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:158dd96d2207e21c966063e1635b1063cd7787b627b6f07305315dd73d9c679e", size = 33019667, upload-time = "2026-02-23T00:20:17.208Z" }, + { url = "https://files.pythonhosted.org/packages/f3/c3/2d834a5ac7bf3a0c806ad1508efc02dda3c8c61472a56132d7894c312dea/scipy-1.17.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:74cbb80d93260fe2ffa334efa24cb8f2f0f622a9b9febf8b483c0b865bfb3475", size = 35264159, upload-time = "2026-02-23T00:20:23.087Z" }, + { url = "https://files.pythonhosted.org/packages/4d/77/d3ed4becfdbd217c52062fafe35a72388d1bd82c2d0ba5ca19d6fcc93e11/scipy-1.17.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:dbc12c9f3d185f5c737d801da555fb74b3dcfa1a50b66a1a93e09190f41fab50", size = 35102771, upload-time = "2026-02-23T00:20:28.636Z" }, + { url = "https://files.pythonhosted.org/packages/bd/12/d19da97efde68ca1ee5538bb261d5d2c062f0c055575128f11a2730e3ac1/scipy-1.17.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:94055a11dfebe37c656e70317e1996dc197e1a15bbcc351bcdd4610e128fe1ca", size = 37665910, upload-time = "2026-02-23T00:20:34.743Z" }, + { url = 
"https://files.pythonhosted.org/packages/06/1c/1172a88d507a4baaf72c5a09bb6c018fe2ae0ab622e5830b703a46cc9e44/scipy-1.17.1-cp313-cp313t-win_amd64.whl", hash = "sha256:e30bdeaa5deed6bc27b4cc490823cd0347d7dae09119b8803ae576ea0ce52e4c", size = 36562980, upload-time = "2026-02-23T00:20:40.575Z" }, + { url = "https://files.pythonhosted.org/packages/70/b0/eb757336e5a76dfa7911f63252e3b7d1de00935d7705cf772db5b45ec238/scipy-1.17.1-cp313-cp313t-win_arm64.whl", hash = "sha256:a720477885a9d2411f94a93d16f9d89bad0f28ca23c3f8daa521e2dcc3f44d49", size = 24856543, upload-time = "2026-02-23T00:20:45.313Z" }, + { url = "https://files.pythonhosted.org/packages/cf/83/333afb452af6f0fd70414dc04f898647ee1423979ce02efa75c3b0f2c28e/scipy-1.17.1-cp314-cp314-macosx_10_14_x86_64.whl", hash = "sha256:a48a72c77a310327f6a3a920092fa2b8fd03d7deaa60f093038f22d98e096717", size = 31584510, upload-time = "2026-02-23T00:21:01.015Z" }, + { url = "https://files.pythonhosted.org/packages/ed/a6/d05a85fd51daeb2e4ea71d102f15b34fedca8e931af02594193ae4fd25f7/scipy-1.17.1-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:45abad819184f07240d8a696117a7aacd39787af9e0b719d00285549ed19a1e9", size = 28170131, upload-time = "2026-02-23T00:21:05.888Z" }, + { url = "https://files.pythonhosted.org/packages/db/7b/8624a203326675d7746a254083a187398090a179335b2e4a20e2ddc46e83/scipy-1.17.1-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:3fd1fcdab3ea951b610dc4cef356d416d5802991e7e32b5254828d342f7b7e0b", size = 20342032, upload-time = "2026-02-23T00:21:09.904Z" }, + { url = "https://files.pythonhosted.org/packages/c9/35/2c342897c00775d688d8ff3987aced3426858fd89d5a0e26e020b660b301/scipy-1.17.1-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:7bdf2da170b67fdf10bca777614b1c7d96ae3ca5794fd9587dce41eb2966e866", size = 22678766, upload-time = "2026-02-23T00:21:14.313Z" }, + { url = 
"https://files.pythonhosted.org/packages/ef/f2/7cdb8eb308a1a6ae1e19f945913c82c23c0c442a462a46480ce487fdc0ac/scipy-1.17.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:adb2642e060a6549c343603a3851ba76ef0b74cc8c079a9a58121c7ec9fe2350", size = 32957007, upload-time = "2026-02-23T00:21:19.663Z" }, + { url = "https://files.pythonhosted.org/packages/0b/2e/7eea398450457ecb54e18e9d10110993fa65561c4f3add5e8eccd2b9cd41/scipy-1.17.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:eee2cfda04c00a857206a4330f0c5e3e56535494e30ca445eb19ec624ae75118", size = 35221333, upload-time = "2026-02-23T00:21:25.278Z" }, + { url = "https://files.pythonhosted.org/packages/d9/77/5b8509d03b77f093a0d52e606d3c4f79e8b06d1d38c441dacb1e26cacf46/scipy-1.17.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d2650c1fb97e184d12d8ba010493ee7b322864f7d3d00d3f9bb97d9c21de4068", size = 35042066, upload-time = "2026-02-23T00:21:31.358Z" }, + { url = "https://files.pythonhosted.org/packages/f9/df/18f80fb99df40b4070328d5ae5c596f2f00fffb50167e31439e932f29e7d/scipy-1.17.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:08b900519463543aa604a06bec02461558a6e1cef8fdbb8098f77a48a83c8118", size = 37612763, upload-time = "2026-02-23T00:21:37.247Z" }, + { url = "https://files.pythonhosted.org/packages/4b/39/f0e8ea762a764a9dc52aa7dabcfad51a354819de1f0d4652b6a1122424d6/scipy-1.17.1-cp314-cp314-win_amd64.whl", hash = "sha256:3877ac408e14da24a6196de0ddcace62092bfc12a83823e92e49e40747e52c19", size = 37290984, upload-time = "2026-02-23T00:22:35.023Z" }, + { url = "https://files.pythonhosted.org/packages/7c/56/fe201e3b0f93d1a8bcf75d3379affd228a63d7e2d80ab45467a74b494947/scipy-1.17.1-cp314-cp314-win_arm64.whl", hash = "sha256:f8885db0bc2bffa59d5c1b72fad7a6a92d3e80e7257f967dd81abb553a90d293", size = 25192877, upload-time = "2026-02-23T00:22:39.798Z" }, + { url = 
"https://files.pythonhosted.org/packages/96/ad/f8c414e121f82e02d76f310f16db9899c4fcde36710329502a6b2a3c0392/scipy-1.17.1-cp314-cp314t-macosx_10_14_x86_64.whl", hash = "sha256:1cc682cea2ae55524432f3cdff9e9a3be743d52a7443d0cba9017c23c87ae2f6", size = 31949750, upload-time = "2026-02-23T00:21:42.289Z" }, + { url = "https://files.pythonhosted.org/packages/7c/b0/c741e8865d61b67c81e255f4f0a832846c064e426636cd7de84e74d209be/scipy-1.17.1-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:2040ad4d1795a0ae89bfc7e8429677f365d45aa9fd5e4587cf1ea737f927b4a1", size = 28585858, upload-time = "2026-02-23T00:21:47.706Z" }, + { url = "https://files.pythonhosted.org/packages/ed/1b/3985219c6177866628fa7c2595bfd23f193ceebbe472c98a08824b9466ff/scipy-1.17.1-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:131f5aaea57602008f9822e2115029b55d4b5f7c070287699fe45c661d051e39", size = 20757723, upload-time = "2026-02-23T00:21:52.039Z" }, + { url = "https://files.pythonhosted.org/packages/c0/19/2a04aa25050d656d6f7b9e7b685cc83d6957fb101665bfd9369ca6534563/scipy-1.17.1-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:9cdc1a2fcfd5c52cfb3045feb399f7b3ce822abdde3a193a6b9a60b3cb5854ca", size = 23043098, upload-time = "2026-02-23T00:21:56.185Z" }, + { url = "https://files.pythonhosted.org/packages/86/f1/3383beb9b5d0dbddd030335bf8a8b32d4317185efe495374f134d8be6cce/scipy-1.17.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6e3dcd57ab780c741fde8dc68619de988b966db759a3c3152e8e9142c26295ad", size = 33030397, upload-time = "2026-02-23T00:22:01.404Z" }, + { url = "https://files.pythonhosted.org/packages/41/68/8f21e8a65a5a03f25a79165ec9d2b28c00e66dc80546cf5eb803aeeff35b/scipy-1.17.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a9956e4d4f4a301ebf6cde39850333a6b6110799d470dbbb1e25326ac447f52a", size = 35281163, upload-time = "2026-02-23T00:22:07.024Z" }, + { url = 
"https://files.pythonhosted.org/packages/84/8d/c8a5e19479554007a5632ed7529e665c315ae7492b4f946b0deb39870e39/scipy-1.17.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:a4328d245944d09fd639771de275701ccadf5f781ba0ff092ad141e017eccda4", size = 35116291, upload-time = "2026-02-23T00:22:12.585Z" }, + { url = "https://files.pythonhosted.org/packages/52/52/e57eceff0e342a1f50e274264ed47497b59e6a4e3118808ee58ddda7b74a/scipy-1.17.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:a77cbd07b940d326d39a1d1b37817e2ee4d79cb30e7338f3d0cddffae70fcaa2", size = 37682317, upload-time = "2026-02-23T00:22:18.513Z" }, + { url = "https://files.pythonhosted.org/packages/11/2f/b29eafe4a3fbc3d6de9662b36e028d5f039e72d345e05c250e121a230dd4/scipy-1.17.1-cp314-cp314t-win_amd64.whl", hash = "sha256:eb092099205ef62cd1782b006658db09e2fed75bffcae7cc0d44052d8aa0f484", size = 37345327, upload-time = "2026-02-23T00:22:24.442Z" }, + { url = "https://files.pythonhosted.org/packages/07/39/338d9219c4e87f3e708f18857ecd24d22a0c3094752393319553096b98af/scipy-1.17.1-cp314-cp314t-win_arm64.whl", hash = "sha256:200e1050faffacc162be6a486a984a0497866ec54149a01270adc8a59b7c7d21", size = 25489165, upload-time = "2026-02-23T00:22:29.563Z" }, +] + [[package]] name = "secretstorage" version = "3.5.0"