Context
MemMachine's benchmark tests two modes:
- Memory mode: single retrieval, context injected, LLM answers (87.5%)
- Agent mode: LLM uses memory as a tool, can do multiple retrieval rounds (88.1%)
Agent mode scores higher because the LLM can refine its queries: ask a broad question, inspect the results, then ask a more specific follow-up.
Why this matters for BM
BM already supports this naturally via MCP. An LLM using BM tools can:
1. `search_notes('Sarah restaurant')`
2. Look at the results and realize it needs temporal context
3. `search_notes('Sarah lunch May 2023')`
4. Combine both result sets to answer
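As a toy illustration of that refinement loop (the note names and the keyword-match `search_notes` below are stand-ins; in BM, `search_notes` is the real MCP tool):

```python
# Hypothetical notes corpus; keyword match stands in for BM's real search.
NOTES = {
    "2023-05-12-lunch.md": "Lunch with Sarah at a restaurant, May 2023.",
    "2024-01-03-dinner.md": "Dinner with Sarah downtown.",
}

def search_notes(query: str) -> list[str]:
    """Return note names whose body contains every query term."""
    terms = query.lower().split()
    return sorted(name for name, body in NOTES.items()
                  if all(t in body.lower() for t in terms))

# Round 1: broad query matches both notes.
broad = search_notes("Sarah")
# Round 2: refined query with temporal context narrows to one note.
narrow = search_notes("Sarah lunch May 2023")
# The agent answers from the combined result sets.
context = set(broad) | set(narrow)
```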
We should benchmark both modes:
- Single-shot: one search call, inject context, LLM answers (comparable to MemMachine's memory mode)
- Agent (MCP): LLM has access to search_notes + read_note + build_context tools, can do multiple rounds
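A sketch of how the two harnesses could differ. The `search`, `llm`, and `llm_step` callables are placeholders for the real retrieval and model calls, not an existing API:

```python
def single_shot(question, search, llm):
    """One retrieval, context injected, one answer."""
    context = search(question)
    return llm(f"Context: {context}\nQuestion: {question}")

def agent_mode(question, search, llm_step, max_rounds=4):
    """LLM picks actions each round: ('search', query) or ('answer', text)."""
    transcript = [("question", question)]
    for _ in range(max_rounds):
        kind, payload = llm_step(transcript)
        if kind == "answer":
            return payload
        transcript.append(("results", search(payload)))
    # Out of rounds: force a final answer from what was gathered so far.
    return llm_step(transcript + [("forced", "answer now")])[1]
```

The only structural difference is the loop: single-shot commits to its first retrieval, while agent mode lets the model see results before deciding whether to search again.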
The agent mode result shows what BM can do when paired with a capable LLM — which is the real-world usage pattern.
Implementation
- Single-shot: existing benchmark + LLM-as-Judge (#615)
- Agent mode: give the eval LLM MCP tool access to BM, let it search freely, then judge the answer
- Report both scores separately
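One way the separate reporting could look. The exact-match `judge` below is a toy stand-in for the LLM-as-Judge from #615, and the sample answers are invented:

```python
def judge(answer: str, gold: str) -> bool:
    # Toy stand-in: the real harness asks an LLM-as-Judge instead.
    return gold.lower() in answer.lower()

def score(pairs: list) -> float:
    """Fraction of (answer, gold) pairs the judge accepts."""
    return sum(judge(a, g) for a, g in pairs) / len(pairs)

# Report both modes separately, never a blended number.
report = {
    "single_shot": score([("Luigi's", "luigi's"), ("no idea", "luigi's")]),
    "agent_mcp":   score([("Luigi's", "luigi's"), ("Luigi's", "luigi's")]),
}
```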
Related
Milestone
v0.19.0