
Retrieval pipeline improvements: reranking, length normalization, noise filtering, adaptive recall #666

@bm-clawd

Description

Context

Analyzed memory-lancedb-pro (a LanceDB-based OpenClaw memory plugin) and its demo. While architecturally different from BM (opaque vectors vs. human-readable plain text), its retrieval pipeline has several techniques worth adopting.

Our LoCoMo benchmark baseline: Recall@5 76.4%, Recall@10 85.5%, MRR 0.658. Weakest areas: single_hop 57.7%, temporal 59.1%. Root cause identified in #577: RRF scoring flattens results, and FTS outperforms vector search for observations.
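To make the flattening concrete: Reciprocal Rank Fusion scores by rank position only, so a large raw-score gap between two candidates collapses into a tiny fused-score gap. A minimal sketch (the `k=60` constant is the conventional RRF default, assumed here, not taken from our config):

```python
# Why RRF "flattens": fusion sees only ranks, never raw scores, so two
# documents that were far apart in raw relevance end up nearly tied.

def rrf(ranks, k=60):
    """Fuse one document's ranks from multiple retrievers (vector, FTS)."""
    return sum(1.0 / (k + r) for r in ranks)

# Doc A: rank 1 in both the vector and FTS lists.
# Doc B: rank 2 in both, even if its raw scores were far lower.
score_a = rrf([1, 1])  # 2/61 ~= 0.0328
score_b = rrf([2, 2])  # 2/62 ~= 0.0323 -- a gap of less than 0.001
```

This is why a second-pass reranker (below) that looks at the query-document pair directly can recover the signal RRF throws away.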

Proposed Improvements

1. Cross-encoder reranking (highest impact)

After initial hybrid retrieval (vector + FTS), run a second pass through a cross-encoder reranker (e.g., Jina reranker-v3, Voyage rerank-2.5). This would re-score candidates based on query-document semantic relevance rather than just embedding similarity.

memory-lancedb-pro uses: 60% cross-encoder score + 40% original fused score, with graceful fallback to cosine similarity on API failure.

Implementation options:

  • Cloud API: Jina, Voyage, Pinecone (cheap per-query cost)
  • Local: BAAI/bge-reranker via Ollama-compatible endpoint
  • Config-driven: optional, off by default, provider-agnostic

This directly addresses #577 (RRF flattening) and should significantly improve Recall@5 and MRR.
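A minimal sketch of the blended scoring described above (60% cross-encoder, 40% original fused score, cosine fallback on API failure). `call_reranker` and the candidate dict fields are hypothetical names for illustration, not an existing API:

```python
# Sketch: blend cross-encoder scores with the original fused scores,
# degrading gracefully to cosine similarity if the reranker call fails.

def blend_scores(candidates, query, call_reranker, alpha=0.6):
    """candidates: dicts with 'text', 'fused_score', 'cosine' (assumed shape).
    call_reranker(query, texts) -> list of relevance scores (hypothetical)."""
    try:
        ce_scores = call_reranker(query, [c["text"] for c in candidates])
        for c, ce in zip(candidates, ce_scores):
            c["score"] = alpha * ce + (1 - alpha) * c["fused_score"]
    except Exception:
        # Graceful fallback: rank by the raw cosine similarity we already have.
        for c in candidates:
            c["score"] = c["cosine"]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)
```

Making `alpha` and the provider config-driven keeps this optional and provider-agnostic, as proposed.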

2. Length normalization

Long notes/observations currently have an outsized influence on search results. A length normalization step (anchored around 500 chars) would prevent verbose entries from dominating precise, short observations.
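One possible shape for the penalty, using the 500-char anchor from the proposal (the logarithmic curve itself is an assumption, chosen so the damping grows slowly rather than cliff-dropping):

```python
import math

def length_penalty(text, anchor=500):
    """Score multiplier in (0, 1]: 1.0 at or below the anchor length,
    then shrinking logarithmically as the entry grows past it."""
    n = max(len(text), 1)
    if n <= anchor:
        return 1.0
    return 1.0 / (1.0 + math.log(n / anchor))
```

A 5,000-char note would be damped to roughly 0.3x while a 400-char observation keeps its full score, which matches the "precise beats verbose" intent.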

3. Noise filtering / adaptive retrieval

Skip memory retrieval entirely for queries that don't need it: greetings, slash commands, simple confirmations, emoji-only messages. Also filter low-quality content at capture time: agent refusals, meta-questions, boilerplate.

This reduces wasted retrieval cycles and keeps the index cleaner.
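A heuristic gate for the query side could look like the following. The pattern list is illustrative, not exhaustive; a production version would be config-driven:

```python
import re

# Illustrative skip-lists for the categories above (assumed, not final).
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you", "ok", "okay", "yes", "no"}
NON_WORD_ONLY = re.compile(r"^[\W\s]+$")  # emoji/punctuation only, no word chars

def should_retrieve(query: str) -> bool:
    """Return False for queries that can't benefit from memory retrieval."""
    q = query.strip().lower()
    if not q or q in GREETINGS:
        return False
    if q.startswith("/"):        # slash command
        return False
    if NON_WORD_ONLY.match(q):   # emoji-only / punctuation-only message
        return False
    return True
```

The capture-side filter (refusals, meta-questions, boilerplate) would be a separate, similar predicate applied before indexing.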

4. MMR diversity filtering

When multiple results are very similar (cosine > 0.85), demote duplicates to increase result diversity. Prevents near-duplicate observations from consuming all top-K slots.
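A greedy MMR-style pass implementing the demotion (the 0.85 threshold comes from the proposal; plain-list vectors are used here for clarity, where a real implementation would reuse the store's vectors):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def diversify(results, top_k, threshold=0.85):
    """results: (score, vector, payload) tuples, sorted by score desc.
    Near-duplicates of an already-kept result are pushed to the back
    instead of being dropped, so they can still fill unused slots."""
    selected, demoted = [], []
    for item in results:
        _, vec, _ = item
        if any(cosine(vec, kept[1]) > threshold for kept in selected):
            demoted.append(item)
        else:
            selected.append(item)
    return (selected + demoted)[:top_k]
```

Demoting rather than dropping keeps recall intact when the result set is small, while freeing top-K slots for distinct observations.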

What we already do better

  • Human-readable plain text (their memories are opaque vector rows)
  • Knowledge graph with relational structure (observations + relations)
  • Bidirectional editing (humans can correct memories by editing files)
  • Schema system for structured note types
  • Git history as provenance trail

Priority

  1. Cross-encoder reranking (biggest recall improvement per effort)
  2. Length normalization (quick win)
  3. MMR diversity (moderate effort)
  4. Noise filtering / adaptive retrieval (nice to have)
