feat: Add GroundednessChecker — runtime groundedness guardrail for RAG pipelines#11031
JohnnyTarrr wants to merge 2 commits into `deepset-ai:main`
Conversation
Adds a new validator component that verifies generated replies are grounded in retrieved documents at runtime, not just during offline evaluation. Addresses deepset-ai#10973.

The component:

- Sits after a Generator in a pipeline
- Extracts factual claims from the generated reply using an LLM
- Cross-references each claim against retrieved documents
- Returns per-claim verdicts (supported/contradicted/unverifiable)
- Computes an overall trust score (0-1)
- Optionally strips contradicted claims from the output

Works with any ChatGenerator (OpenAI, Ollama, Azure, etc.).
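The overall trust score described above could be aggregated along these lines (a minimal sketch; the PR's exact formula is not shown in this thread, and only the verdict labels are taken from the bullet list):

```python
def trust_score(verdicts: list[str]) -> float:
    """Fraction of claims marked 'supported'; 1.0 when no claims were extracted.

    Sketch only: the actual aggregation used by GroundednessChecker may differ.
    """
    if not verdicts:
        return 1.0
    supported = sum(1 for v in verdicts if v == "supported")
    return supported / len(verdicts)

print(trust_score(["supported", "contradicted", "unverifiable"]))  # 1/3 supported
```

A reply with a single contradicted claim thus scores 0.0, which matches the example output later in the PR description.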
@JohnnyTarrr is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it.
Hi @JohnnyTarrr, this is a very clean initial implementation. Expanding Haystack's guardrails this quickly is exactly why I opened the original issue! I took a look through `groundedness_checker.py` and I want to flag a critical limitation in context handling that I think we need to solve before this is merged.

On line 124, the component blindly concatenates all documents:

`context = "\n\n".join(doc.content for doc in documents if doc.content)`

And on line 191, it enforces an arbitrary hard truncation:

`context=context[:8000]`

The issue: this linear concatenation is highly vulnerable to the "Lost in the Middle" degradation phenomenon (Liu et al., 2023). If the Generator outputs a valid claim but the supporting Document happens to sit in the dead center of a 7,500-character prompt, the LLM judge will frequently fail to recognize the evidence and issue a false negative (contradicted or unverifiable). Furthermore, truncating rigidly at `[:8000]` guarantees data destruction whenever the Retriever passes a large context window.

Proposed solution (positional context batching): in the custom pipeline I built that sparked this feature request, I solved this by treating the chunks non-linearly: first, rank the documents array by relevance score. I strongly recommend we update the `_verify_claims` method to implement contextual batching rather than hard truncation before this goes to main. Let me know if you'd like me to draft that logic!
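The data-destruction half of this concern is easy to reproduce with toy inputs (the 8000-character limit comes from the flagged line; the document sizes here are illustrative):

```python
# Two retrieved documents, concatenated as on line 124, then hard-sliced as on
# line 191. The second document is cut mid-way, losing real evidence.
docs = ["x" * 5000, "y" * 5000]
context = "\n\n".join(docs)          # 10_002 characters in total
truncated = context[:8000]

print(len(context), len(truncated))  # 10002 8000
print(truncated.count("y"))          # 2998 (2002 of doc 2's 5000 chars are gone)
```

Note that which document gets cut depends only on concatenation order, not on relevance, which is exactly what the batching proposal addresses.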
|
@DYNOSuprovo Spot on: the hard truncation is a known limitation, and your batching approach is the right fix. Since you've already built and tested this pattern, could you draft that logic?
Thanks @JohnnyTarrr! Happy to collaborate on this and get it bulletproof for the maintainers.

To implement the positional context batching cleanly without over-complicating the Haystack integration, here is a lightweight, production-ready Python implementation of the positional batching logic. Instead of arbitrarily truncating at 8000 characters, we sort by relevance, dynamically select the chunks, and explicitly construct the string to exploit LLM primacy/recency bias:

```python
def _build_positional_context(self, documents: list[Document], max_chars: int = 8000) -> str:
    """
    Builds a context string from a list of documents, prioritizing the most relevant
    documents by placing them at the extreme ends of the prompt (position 0 and -1)
    to mitigate Lost-in-the-Middle context degradation.
    """
    if not documents:
        return ""

    # Step 1: Ensure documents are sorted by relevance.
    # sorted() is stable, so documents with a None score keep the retriever's order.
    ranked_docs = sorted(documents, key=lambda d: getattr(d, "score", 0.0) or 0.0, reverse=True)

    # Step 2: Select documents until we hit the char limit
    # (preventing arbitrary mid-sentence truncation where possible).
    selected_docs = []
    current_len = 0
    for doc in ranked_docs:
        content = doc.content or ""
        doc_len = len(content)
        # Always include at least one document, but stop if adding the next breaches the limit.
        if current_len + doc_len > max_chars and selected_docs:
            break
        selected_docs.append(doc)
        current_len += doc_len + 2  # +2 accounts for the "\n\n" separator

    # Step 3: Positional reordering.
    # Structure: [Most relevant] -> [Least relevant...] -> [Second most relevant]
    if len(selected_docs) >= 3:
        best_doc = selected_docs[0]
        second_best_doc = selected_docs[1]
        middle_docs = selected_docs[2:]
        ordered_docs = [best_doc] + middle_docs + [second_best_doc]
    else:
        ordered_docs = selected_docs

    context_str = "\n\n".join(d.content for d in ordered_docs if d.content)
    # Hard fallback truncation in case the first document alone exceeds max_chars.
    return context_str[:max_chars]
```

If you add this as a helper method on the class, you can just replace your `context = "\n\n".join(...)` line in `run()` with `context = self._build_positional_context(documents)`, and update `_VERIFY_PROMPT` to remove the hard slice. It natively guarantees that the two most important documents will always anchor the extremities of the prompt! Let me know if you run into any issues integrating this.

(Note for co-author creds, if you use them: `Co-authored-by: DYNOSuprovo <DYNOSuprovo@users.noreply.github.com>` in your commit message!)
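To see the reordering step in isolation, here is a self-contained toy run (the `Doc` dataclass is a stand-in for Haystack's `Document`, used only so the snippet runs without Haystack installed):

```python
from dataclasses import dataclass

@dataclass
class Doc:  # minimal stand-in for haystack.dataclasses.Document
    content: str
    score: float

docs = [Doc("least", 0.2), Doc("best", 0.9), Doc("second", 0.7), Doc("mid", 0.5)]

# Step 1: rank by relevance -> best, second, mid, least
ranked = sorted(docs, key=lambda d: d.score, reverse=True)

# Step 3: anchor the top two documents at the extremities of the prompt
ordered = [ranked[0]] + ranked[2:] + [ranked[1]]

print([d.content for d in ordered])  # ['best', 'mid', 'least', 'second']
```

The two highest-scoring documents end up at position 0 and -1, with the weaker evidence in the middle where degradation matters least.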
Replace hard truncation with positional context batching to mitigate Lost-in-the-Middle degradation (Liu et al., 2023). Documents are sorted by relevance score and reordered so the most relevant docs sit at the start and end of the context window, exploiting LLM primacy/recency bias. Co-authored-by: DYNOSuprovo <DYNOSuprovo@users.noreply.github.com>
@DYNOSuprovo Integrated: just pushed your positional context batching logic. Clean implementation, works perfectly. Changes are in the latest commit.

You're credited as co-author on the commit. Thanks for the collaboration; the component is much stronger with this. @sjrl @julian-risch this should be ready for review now.
Context
Addresses #10973 — runtime groundedness verification for RAG pipelines.
The existing `FaithfulnessEvaluator` is designed for offline batch evaluation. This PR adds a runtime validator that sits inside a live pipeline and actively intervenes on each query: extracting claims, cross-referencing against retrieved documents, and flagging or stripping unsupported content before it reaches the user.

Built by the team at VeroQ, where we work on LLM output verification. We also offer a self-hosted Shield that provides both groundedness and factual verification as a Docker container: veroq-ai/veroq-self-hosted
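The offline-versus-runtime distinction can be sketched as a per-query gate (an illustrative pseudo-flow: `answer` and the `retrieve`/`generate`/`check` callables are hypothetical; only the `verified_replies` and `is_trusted` output keys come from this PR):

```python
def answer(query, retrieve, generate, check):
    """Return a verified reply, or None when the checker blocks it (sketch only)."""
    docs = retrieve(query)
    reply = generate(query, docs)
    result = check(reply, docs)  # runtime intervention, once per query
    return result["verified_replies"][0] if result["is_trusted"] else None
```

An offline evaluator, by contrast, scores (reply, documents) pairs in a batch after the fact, with no chance to block a reply before it reaches the user.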
What it does
`GroundednessChecker` is a new component in `haystack.components.validators` that:

- Takes `replies` (from a Generator) and `documents` (from a Retriever)
- Returns per-claim verdicts (`supported`/`contradicted`/`unverifiable`)
- Computes a `trust_score` (0-1)
- Optionally strips contradicted claims (`block_contradicted=True`)

Output
```python
{
    "verified_replies": ["Revenue was [CORRECTED: $2.1B] in Q3."],
    "trust_score": 0.0,
    "verdict": "has_contradictions",
    "claims": [
        {
            "claim": "Revenue was $2.4B",
            "verdict": "contradicted",
            "explanation": "Context says $2.1B",
            "correction": "Revenue was $2.1B",
        }
    ],
    "is_trusted": False,
}
```

Design decisions
- Lives in `validators/` because it's a runtime guardrail that modifies pipeline output, not an offline metric calculator.
- Defaults to `gpt-4o-mini` with JSON mode.
- Reuses the `_is_warmed_up` pattern from `LLMRanker`.

Tests
18 test methods covering:
- Serialization round-trip (`to_dict` + `from_dict`)
- `warm_up` delegation and idempotency
- `raise_on_failure` (both True and False paths)

Checklist
- `haystack/components/validators/`
- `validators/__init__.py`