Investigate low content hit rate for bm-local (15.5% vs Mem0 34.3%) #2

@bm-clawd

Problem

On the full LoCoMo run, BM's content hit rate is 15.5% vs Mem0's 34.3% — despite BM winning on every retrieval metric (R@5, R@10, MRR).

This means BM finds the right document more often, but the retrieved text less often contains the exact answer string.

Root cause analysis

Content hit is measured by checking if expected_answer appears as a literal substring in the concatenated hit.text of the top results.

Two factors hurt BM here, and a third comes from the scoring method itself:

1. BM returns matched_chunk not full note content

bm tool search-notes returns matched_chunk (the specific chunk that matched) plus truncated content. The expected_answer might be in a different part of the same note that isn't in the returned text. The correct document is found (recall is high) but the answer text isn't in the returned snippet.

Fix opportunity (basic-memory core): Return more context around matched chunks, or return full note content when notes are small enough.
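A rough sketch of what "more context around matched chunks" could look like. This is not basic-memory code: `expand_hit_text`, `max_full_chars`, and `window` are hypothetical names, and the size limits are arbitrary placeholders that would need tuning.

```python
def expand_hit_text(note_content: str, matched_chunk: str,
                    max_full_chars: int = 2000, window: int = 300) -> str:
    """Return wider context for a search hit (illustrative only).

    All names and defaults here are assumptions, not basic-memory APIs.
    """
    # Small notes: return the whole note so no part of the answer is cut off.
    if len(note_content) <= max_full_chars:
        return note_content

    start = note_content.find(matched_chunk)
    if start == -1:
        # The chunk was not found verbatim (e.g. whitespace was normalized),
        # so fall back to the chunk itself.
        return matched_chunk

    # Large notes: pad the matched chunk with `window` characters on each side.
    lo = max(0, start - window)
    hi = min(len(note_content), start + len(matched_chunk) + window)
    return note_content[lo:hi]
```

A character window is the simplest option; padding by neighbouring chunks or by sentence boundaries would probably read better, but depends on how chunks are stored.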

2. Mem0 stores atomic extracted memories

Mem0 extracts "important sentences" during ingestion, creating small atomic memory units. These are closer to answer phrasing by design. BM stores full conversation sessions and relies on chunk matching.

This is a fundamental architectural difference — Mem0 trades context for precision, BM preserves full context. But we could improve by:

  • Extracting observations/facts at ingestion time (which BM already does via the observation system)
  • Ensuring search returns observation-level hits, not just entity-level

3. Scoring methodology

The content_hit function does exact substring matching:

needle = expected_answer.strip().lower()
haystack = '\n'.join((hit.text or '') for hit in hits).lower()
return needle in haystack

This is brittle — semantically correct answers with different wording score as misses. Could supplement with fuzzy/semantic matching.
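One possible supplementary metric is token overlap: the fraction of answer tokens that appear anywhere in the retrieved text. The sketch below is an assumption about how that could be done, not existing benchmark code; `token_overlap`, `fuzzy_content_hit`, and the 0.8 threshold are all hypothetical.

```python
import re


def _tokens(text):
    # Lowercase alphanumeric tokens; punctuation and word order are ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_overlap(expected_answer, hit_texts):
    """Fraction of distinct answer tokens found anywhere in the hit texts."""
    answer = set(_tokens(expected_answer))
    if not answer:
        return 0.0
    found = set(_tokens("\n".join(t or "" for t in hit_texts)))
    return len(answer & found) / len(answer)


def fuzzy_content_hit(expected_answer, hit_texts, threshold=0.8):
    # The threshold is an arbitrary placeholder, not a tuned value.
    return token_overlap(expected_answer, hit_texts) >= threshold
```

For example, `token_overlap("7 May 2023", ["They met on May 7, 2023."])` is 1.0 even though the exact substring check fails. The trade-off is false positives: because order is ignored, common tokens scattered across long hits can add up to a "match", so this is better reported alongside the exact metric than instead of it.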

Benchmark evidence

Full LoCoMo run (locomo-full-20260226T055634Z):

| Provider   | R@5   | Content Hit |
|------------|-------|-------------|
| bm-local   | 74.3% | 15.5%       |
| mem0-local | 64.6% | 34.3%       |

Potential improvements

  1. Benchmark repo: Add fuzzy content matching (e.g., token overlap ratio) as supplementary metric
  2. BM core: Return more text context per search hit (full note for small notes, larger chunks for large notes)
  3. BM core: Ensure observation-level entities surface in search results with their full text
  4. BM benchmark provider: Try fetching full note via read-note for top-K hits to get complete content
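Item 4 could be prototyped in the provider roughly as follows. This is only a sketch under assumptions: the `Hit` shape, the `note_id` field, and the `read_note` callable are hypothetical stand-ins, and how `read_note` actually reaches the read-note tool is deliberately left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Hit:
    note_id: str            # hypothetical identifier for the source note
    text: Optional[str]     # snippet returned by search


def with_full_notes(hits: List[Hit],
                    read_note: Callable[[str], Optional[str]],
                    top_k: int = 5) -> List[Hit]:
    """Swap the snippet for the full note text on the top-k hits."""
    enriched = []
    for rank, hit in enumerate(hits):
        if rank < top_k:
            full = read_note(hit.note_id)
            if full:
                # Keep the original snippet if the note could not be read.
                hit = replace(hit, text=full)
        enriched.append(hit)
    return enriched
```

Passing `read_note` in as a callable keeps the sketch independent of how the note is fetched. Note that this changes what content hit measures (full note text rather than what search returned), so results from such a run should be labelled separately.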
