Description
Problem
On the full LoCoMo run, BM's content hit rate is 15.5% vs Mem0's 34.3% — despite BM winning on every retrieval metric (R@5, R@10, MRR).
This means BM finds the right document more often, but the retrieved text less often contains the exact answer string.
Root cause analysis
Content hit is measured by checking if expected_answer appears as a literal substring in the concatenated hit.text of the top results.
Three factors hurt BM here:
1. BM returns matched_chunk not full note content
bm tool search-notes returns matched_chunk (the specific chunk that matched) plus truncated content. The expected_answer might be in a different part of the same note that isn't in the returned text. The correct document is found (recall is high) but the answer text isn't in the returned snippet.
Fix opportunity (basic-memory core): Return more context around matched chunks, or return full note content when notes are small enough.
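A minimal sketch of that fix, assuming the search layer knows the character offsets of the matched chunk within the note. The function name `expand_hit_text`, the offset parameters, and both size limits are hypothetical and are not part of basic-memory's actual API:

```python
def expand_hit_text(
    note_text: str,
    chunk_start: int,
    chunk_end: int,
    small_note_limit: int = 2000,  # assumed cutoff for "small enough" notes
    margin: int = 500,             # assumed context added on each side
) -> str:
    """Return the whole note if it is small, else the chunk plus a margin."""
    if len(note_text) <= small_note_limit:
        return note_text
    # Widen the matched span, clamped to the bounds of the note.
    start = max(0, chunk_start - margin)
    end = min(len(note_text), chunk_end + margin)
    return note_text[start:end]
```

With a character-based margin the returned text can start or end mid-word. That does not affect substring scoring, but a real implementation would probably snap to chunk or sentence boundaries.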
2. Mem0 stores atomic extracted memories
Mem0 extracts "important sentences" during ingestion, creating small atomic memory units. These are closer to answer phrasing by design. BM stores full conversation sessions and relies on chunk matching.
This is a fundamental architectural difference — Mem0 trades context for precision, BM preserves full context. But we could improve by:
- Extracting observations/facts at ingestion time (which BM already does via the observation system)
- Ensuring search returns observation-level hits, not just entity-level
3. Scoring methodology
The content_hit function does exact substring matching:

    needle = expected_answer.strip().lower()
    haystack = '\n'.join((hit.text or '') for hit in hits).lower()
    return needle in haystack

This is brittle: semantically correct answers with different wording score as misses. Could supplement with fuzzy/semantic matching.
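One possible shape for a supplementary fuzzy metric is a token overlap ratio: the fraction of distinct answer tokens that appear anywhere in the retrieved text. The sketch below is illustrative only. The names `token_overlap` and `fuzzy_content_hit` and the 0.8 threshold are assumptions, not part of the benchmark repo:

```python
import re


def _tokens(text: str) -> list[str]:
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap(expected_answer: str, hit_texts: list[str]) -> float:
    """Fraction of distinct answer tokens found anywhere in the hits."""
    needle = set(_tokens(expected_answer))
    if not needle:
        return 0.0
    haystack = set(_tokens("\n".join(hit_texts)))
    return len(needle & haystack) / len(needle)


def fuzzy_content_hit(
    expected_answer: str, hit_texts: list[str], threshold: float = 0.8
) -> bool:
    # Assumed threshold; it would need tuning against labelled examples.
    return token_overlap(expected_answer, hit_texts) >= threshold
```

For example, the answer "7 May 2023" fully overlaps with a hit reading "They met on May 7, 2023." even though the exact substring check misses it. The metric ignores word order, and very short answers match easily, so it is better reported alongside the strict content hit than as a replacement.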
Benchmark evidence
Full LoCoMo run (locomo-full-20260226T055634Z):
| Provider | R@5 | Content Hit |
|---|---|---|
| bm-local | 74.3% | 15.5% |
| mem0-local | 64.6% | 34.3% |
Potential improvements
- Benchmark repo: Add fuzzy content matching (e.g., token overlap ratio) as supplementary metric
- BM core: Return more text context per search hit (full note for small notes, larger chunks for large notes)
- BM core: Ensure observation-level entities surface in search results with their full text
- BM benchmark provider: Try fetching the full note via read-note for top-K hits to get complete content
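
The last item could be prototyped in the benchmark provider roughly as follows. This is a sketch under stated assumptions: the `Hit` shape, the `note_id` field, and the `fetch_note` callable (which would wrap whatever read-note call the provider uses) are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Hit:
    note_id: str          # hypothetical identifier of the source note
    text: Optional[str]   # snippet returned by search


def enrich_top_k(
    hits: list[Hit],
    fetch_note: Callable[[str], Optional[str]],  # assumed read-note wrapper
    k: int = 5,
) -> list[Hit]:
    """Swap the snippet for the full note text on the first k hits."""
    cache: dict[str, Optional[str]] = {}
    enriched: list[Hit] = []
    for rank, hit in enumerate(hits):
        if rank < k:
            # Fetch each note at most once, even if several chunks matched it.
            if hit.note_id not in cache:
                cache[hit.note_id] = fetch_note(hit.note_id)
            full_text = cache[hit.note_id]
            if full_text:
                hit = replace(hit, text=full_text)
        enriched.append(hit)
    return enriched
```

Hits whose note cannot be fetched keep their original snippet, so the enrichment step can only add text to the haystack, never remove it.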