feat: expose evaluator row status for LLM evaluators by SeCuReDmE-main-dev · Pull Request #11333 · deepset-ai/haystack

SeCuReDmE-main-dev · 2026-05-18T04:04:44Z

Summary

This is Phase 1 for RFC #11332.

It adds a narrow top-level evaluation_statuses output to LLMEvaluator results so callers can distinguish rows that were successfully evaluated from rows that failed during generation or parsing when raise_on_failure=False is used.
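To illustrate the idea, here is a minimal standalone sketch of how per-row statuses could be recorded while continuing past failures. It does not reproduce the actual LLMEvaluator code: `evaluate_rows`, `fake_generate`, and the `None` placeholder for failed rows are assumptions made for illustration. Only the `evaluation_statuses` key and the `evaluated` / `error` values come from this PR.

```python
import json
from typing import Any, Callable, Optional


def evaluate_rows(
    prompts: list[str],
    generate: Callable[[str], str],
    expected_keys: list[str],
) -> dict[str, Any]:
    """Hypothetical helper: evaluate each row and record a status per row."""
    results: list[Optional[dict[str, Any]]] = []
    statuses: list[str] = []
    for prompt in prompts:
        try:
            reply = generate(prompt)      # may raise (generation failure)
            parsed = json.loads(reply)    # may raise (invalid JSON)
            if not isinstance(parsed, dict) or not all(k in parsed for k in expected_keys):
                raise ValueError("reply is missing expected output keys")
        except Exception:
            # Mirrors the raise_on_failure=False path: keep going, flag the row.
            results.append(None)
            statuses.append("error")
            continue
        results.append(parsed)
        statuses.append("evaluated")
    return {"results": results, "evaluation_statuses": statuses}


def fake_generate(prompt: str) -> str:
    """Stand-in for an LLM call, used only to exercise the three paths."""
    if prompt == "boom":
        raise RuntimeError("generation failed")
    return '{"score": 1}' if prompt == "ok" else "not json"


out = evaluate_rows(["ok", "bad", "boom"], fake_generate, expected_keys=["score"])
print(out["evaluation_statuses"])
```

The point of the sketch is that the status list stays index-aligned with `results`, so a caller never has to infer failure from a placeholder value.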

Scope

Included:

LLMEvaluator now returns evaluation_statuses alongside existing results and meta.
Successfully parsed rows are marked as evaluated.
Generation or parsing failures that continue under raise_on_failure=False are marked as error.
ContextRelevanceEvaluator and FaithfulnessEvaluator pass through the new status list while preserving their current nan behavior.
Focused tests cover successful rows, invalid JSON, and generation failures.
A release note is added under releasenotes/notes.
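On the caller side, the list above could be consumed roughly as follows. This is a sketch under assumptions: the output dict is hand-written to resemble a pass-through evaluator result with a nan score on the failed row, and `split_by_status` and `mean_score_over_evaluated` are hypothetical helpers, not part of Haystack.

```python
import math
from typing import Any


def split_by_status(output: dict[str, Any]) -> tuple[list[tuple[int, Any]], list[int]]:
    """Return (index, result) pairs for evaluated rows and indexes of error rows."""
    results = output["results"]
    statuses = output["evaluation_statuses"]
    if len(results) != len(statuses):
        raise ValueError("results and evaluation_statuses must be the same length")
    evaluated = [(i, r) for i, (r, s) in enumerate(zip(results, statuses)) if s == "evaluated"]
    failed = [i for i, s in enumerate(statuses) if s == "error"]
    return evaluated, failed


def mean_score_over_evaluated(output: dict[str, Any], key: str = "score") -> float:
    """Average a score over evaluated rows only; nan if none qualify."""
    evaluated, _ = split_by_status(output)
    scores = [r[key] for _, r in evaluated if not math.isnan(r[key])]
    return sum(scores) / len(scores) if scores else float("nan")


# Hand-written example output (assumed shape, for illustration only).
example_output = {
    "results": [{"score": 1.0}, {"score": float("nan")}, {"score": 0.5}],
    "evaluation_statuses": ["evaluated", "error", "evaluated"],
}

evaluated_rows, failed_indexes = split_by_status(example_output)
mean_score = mean_score_over_evaluated(example_output)
print(failed_indexes, mean_score)
```

Filtering on the status rather than on nan keeps the existing nan behavior untouched while giving callers an explicit signal for which rows to retry or exclude.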

Intentionally excluded:

no EvaluationRunResult reporting changes;
no retriever, agent, router, HITL, or governance changes;
no indeterminate status yet;
no public neutrosophic naming in code.

Phase 2 and Phase 3 remain dependent on maintainer feedback from the RFC and this PR.

Tests

C:\Users\jeans\.local\bin\uv.exe run pytest test/components/evaluators/test_llm_evaluator.py test/components/evaluators/test_context_relevance_evaluator.py test/components/evaluators/test_faithfulness_evaluator.py -q
C:\Users\jeans\.local\bin\uv.exe run ruff check haystack/components/evaluators test/components/evaluators
C:\Users\jeans\.local\bin\uv.exe run ruff format --check haystack/components/evaluators test/components/evaluators
C:\Users\jeans\.local\bin\uv.exe run reno lint .

Results:

38 passed, 2 skipped
ruff check: passed
ruff format --check: passed
reno lint: passed

vercel · 2026-05-18T04:04:51Z

@SeCuReDmE-main-dev is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2026-05-18T04:04:52Z

All committers have signed the CLA.

SeCuReDmE-main-dev · 2026-05-18T04:27:26Z

Phase 2 base work is complete and stacked on top of this Phase 1 branch.

Downstream stacked PR: SeCuReDmE-main-dev#1
Outcome: BM25 now has an opt-in retrieval confidence metadata path via Document.meta.
Boundary: this stays BM25-only and does not redefine global retriever score semantics.
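As a rough illustration of what an opt-in metadata path might look like, the sketch below attaches a value under a meta key only when a flag is enabled and leaves the score itself untouched. Everything specific here is an assumption: the `Doc` stand-in class, the `retrieval_confidence` key, and the top-score normalization are hypothetical and are not taken from the stacked PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Stand-in for a retrieved document (not the Haystack Document class)."""
    content: str
    score: float
    meta: dict[str, Any] = field(default_factory=dict)


def attach_confidence(
    docs: list[Doc],
    enabled: bool = False,
    meta_key: str = "retrieval_confidence",  # hypothetical key name
) -> list[Doc]:
    """Return docs with a confidence value in meta when enabled; scores stay as-is."""
    if not enabled or not docs:
        return list(docs)
    top = max(d.score for d in docs)
    annotated = []
    for d in docs:
        # Assumed normalization: score relative to the best hit in this result set.
        confidence = d.score / top if top > 0 else 0.0
        annotated.append(Doc(d.content, d.score, {**d.meta, meta_key: confidence}))
    return annotated


hits = [Doc("first", 4.0), Doc("second", 2.0)]
default_hits = attach_confidence(hits)               # opt-in flag off: unchanged
annotated_hits = attach_confidence(hits, enabled=True)
print([d.meta for d in annotated_hits])
```

Writing to meta rather than rescaling `score` is what keeps the change additive for callers that never enable the flag.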

SeCuReDmE-main-dev · 2026-05-18T04:30:41Z

Phase 3 base work is complete and the stacked progression now reaches explicit HITL decision contracts.

Downstream stacked PR: SeCuReDmE-main-dev#2
Outcome: ToolExecutionDecision now carries explicit approved / modified / rejected status semantics.
Boundary: this stops at HITL contract enrichment and does not introduce Phase 4 runtime/governance changes.
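A minimal sketch of what explicit approved / modified / rejected semantics could look like is shown below. It is a standalone illustration rather than the ToolExecutionDecision class itself: `DecisionStatus`, `DecisionSketch`, `classify`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class DecisionStatus(str, Enum):
    APPROVED = "approved"
    MODIFIED = "modified"
    REJECTED = "rejected"


@dataclass
class DecisionSketch:
    """Hypothetical decision record carrying an explicit status."""
    tool_name: str
    status: DecisionStatus
    final_params: Optional[dict[str, Any]] = None
    feedback: Optional[str] = None

    @property
    def should_execute(self) -> bool:
        # Approved and modified calls both run; rejected calls do not.
        return self.status is not DecisionStatus.REJECTED


def classify(
    tool_name: str,
    original_params: dict[str, Any],
    reviewed_params: Optional[dict[str, Any]],
) -> DecisionSketch:
    """Derive a status from a reviewer's response (None is treated as a rejection)."""
    if reviewed_params is None:
        return DecisionSketch(tool_name, DecisionStatus.REJECTED)
    if reviewed_params == original_params:
        return DecisionSketch(tool_name, DecisionStatus.APPROVED, reviewed_params)
    return DecisionSketch(tool_name, DecisionStatus.MODIFIED, reviewed_params)


approved = classify("search", {"q": "cats"}, {"q": "cats"})
modified = classify("search", {"q": "cats"}, {"q": "dogs"})
rejected = classify("search", {"q": "cats"}, None)
print(approved.status.value, modified.status.value, rejected.status.value)
```

An explicit status lets downstream code tell a modified call from an approved one without comparing parameters again.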

SeCuReDmE-main-dev · 2026-05-18T04:58:38Z

Pause boundary — awaiting maintainer feedback

All three stacked phases are now validated and documented:

Phase 1 (this PR): evaluation_statuses for LLMEvaluator — 16 tests passing
Phase 2 (fork PR SeCuReDmE-main-dev/haystack_case_study#1): opt-in BM25 retrieval confidence metadata — 19 tests passing
Phase 3 (fork PR SeCuReDmE-main-dev/haystack_case_study#2): explicit HITL decision statuses — 26 tests passing

All branches pass ruff check and ruff format. All new fields are opt-in or additive with no breaking changes.

No further work will be performed until maintainers respond. The next action depends entirely on your feedback — whether that is naming guidance, scope adjustment, acceptance, or rejection.

The corresponding RFC is #11332. Happy to answer any questions.

SeCuReDmE-main-dev · 2026-05-18T08:16:59Z

@coderabbitai review

feat: expose LLM evaluator row statuses

7816680

SeCuReDmE-main-dev requested a review from a team as a code owner May 18, 2026 04:04

SeCuReDmE-main-dev requested review from anakin87 and removed request for a team May 18, 2026 04:04

github-actions Bot added the topic:tests label May 18, 2026

chore: add evaluator status release note

b281672

SeCuReDmE-main-dev mentioned this pull request May 18, 2026 in RFC: Structured Evaluator Uncertainty and Error Semantics #11332 (Open)

feat: expose evaluator row status for LLM evaluators #11333
SeCuReDmE-main-dev wants to merge 2 commits into deepset-ai:main from SeCuReDmE-main-dev:feature/haystack-evaluator-uncertainty-phase1


2 participants
