Context
Our current LoCoMo benchmark measures retrieval quality (R@5, R@10, MRR). Competitors like Backboard report end-to-end answer accuracy using LLM-as-Judge (GPT-4.1), scoring 90.1% overall. We need the same metric to compare directly.
What to Build
Adapt the evaluation pipeline to add an LLM-as-Judge step after retrieval:
- Retrieve context with Basic Memory's existing search
- Pass retrieved context + question to an LLM (GPT-4.1 or Claude)
- LLM generates an answer
- A judge LLM evaluates: CORRECT or WRONG against ground truth
- Report accuracy by category (single_hop, multi_hop, open_domain, temporal, and adversarial)
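The steps above can be sketched as a small evaluation loop. This is a hypothetical outline, not the final implementation: the sample schema (`question`, `context`, `ground_truth`, `category`) and the injected `generate`/`judge` callables are assumptions; a real run would wire those callables to GPT-4.1 or Claude instead of the stubs shown.

```python
from collections import defaultdict

def evaluate(samples, generate, judge):
    """Run the LLM-as-Judge pipeline over pre-retrieved samples and
    report accuracy per category plus overall.

    `generate(question, context) -> answer` and
    `judge(question, answer, ground_truth) -> "CORRECT" | "WRONG"`
    are injected LLM callables (hypothetical interface)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        answer = generate(s["question"], s["context"])            # answering LLM
        verdict = judge(s["question"], answer, s["ground_truth"]) # judge LLM
        total[s["category"]] += 1
        if verdict == "CORRECT":
            correct[s["category"]] += 1
    report = {c: correct[c] / total[c] for c in total}
    report["overall"] = sum(correct.values()) / sum(total.values())
    return report

# Demo with stub LLMs (a real run would call GPT-4.1 / Claude here):
samples = [
    {"category": "single_hop", "question": "q1", "context": "...", "ground_truth": "a1"},
    {"category": "multi_hop",  "question": "q2", "context": "...", "ground_truth": "a2"},
]
generate = lambda q, ctx: "a1" if q == "q1" else "wrong"
judge = lambda q, ans, gt: "CORRECT" if ans == gt else "WRONG"
print(evaluate(samples, generate, judge))
# → {'single_hop': 1.0, 'multi_hop': 0.0, 'overall': 0.5}
```

Keeping the two LLM calls behind plain callables makes it cheap to swap the answering model (GPT-4.1 vs Claude) without touching the scoring logic.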
Reference Implementation
Backboard's open benchmark: https://github.com/Backboard-io/Backboard-Locomo-Benchmark
- Uses GPT-4.1 as judge with fixed prompts and seed
- Publishes logs, prompts, and verdicts for every question
- Skips category 5 (adversarial) — we should include it
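For the judge step, a minimal sketch of a fixed prompt plus verdict parsing might look like the following. The prompt wording is an assumption, not Backboard's actual prompt; in a real API call you would also pin `temperature=0` and a fixed `seed` for reproducibility, matching their fixed-prompt-and-seed setup.

```python
# Hypothetical judge prompt template (not Backboard's actual prompt).
JUDGE_PROMPT = """You are grading an answer against a ground truth.
Question: {question}
Ground truth: {ground_truth}
Candidate answer: {answer}
Reply with exactly one word: CORRECT or WRONG."""

def parse_verdict(raw: str) -> str:
    """Normalize a judge completion to CORRECT/WRONG.

    Anything that is not clearly CORRECT counts as WRONG — a
    conservative choice (our assumption) so chatty or malformed
    judge output never inflates accuracy."""
    words = raw.strip().split()
    if not words:
        return "WRONG"
    first = words[0].strip(".,:").upper()
    return "CORRECT" if first == "CORRECT" else "WRONG"
```

Logging the raw judge completion alongside the parsed verdict for every question would let us publish per-question verdicts the same way Backboard does.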
Expected Outcome
Direct comparison table:
| Method       | Single-Hop | Multi-Hop | Open Domain | Temporal | Overall |
|--------------|------------|-----------|-------------|----------|---------|
| Backboard    | 89.4%      | 75.0%     | 91.2%       | 91.9%    | 90.0%   |
| Basic Memory | ?          | ?         | ?           | ?        | ?       |
| Mem0         | 67.1%      | 51.2%     | 72.9%       | 55.5%    | 66.9%   |
Our retrieval is already strong (86% R@5 vs Mem0's 66%), so with a capable LLM generating answers from our retrieved context, we should be competitive with Backboard.
Milestone
v0.19.0