feat: LoCoMo accuracy improvements — auto-read + multi-word OR by kaghni · Pull Request #67 · activeloopai/hivemind

kaghni · 2026-04-21T19:53:44Z

Summary

Two focused improvements to memory retrieval accuracy, benchmarked against the LoCoMo conversational memory benchmark on a balanced 60-question subset (15 per category × 4 non-adversarial categories, drawn across all 10 conversations).

Benchmark results

Model: claude -p --model haiku. Judge: openai/gpt-5-mini. See deeplake-blitz PR #1 for methodology and raw results.

Config	Overall	single-hop	temporal	multi-hop	open-domain
`origin/main` (this branch's base)	46.7%	—	—	—	—
+ auto-read (`64f2246`)	58.3%	—	—	—	—
+ multi-word OR (`04bf1d5`, this PR)	59.2%	—	—	—	—

Net: +12.5% absolute over origin/main on a category-balanced 60-question sample.

All 918 existing tests pass.

What's in the PR

1. `feat(pre-tool-use): auto-rewrite unsupported interpreter reads to cat` (`64f2246`)

When the agent invokes an unsupported interpreter (python/python3/node/deno/bun/ruby/perl) on a single memory path with no shell metacharacters, rewrite the call to cat '<path>' instead of returning the generic [RETRY REQUIRED] guidance. Avoids burning a turn when the intent is clearly a plain file read.

Narrow safety conditions:

Command starts with a recognized read-like interpreter.
No shell metacharacters ($, `, ;, |, &, <, >, (), \).
Memory path matches conservative [\w./_-]+ regex (no trailing quotes/parens can glue onto it).
Final cat command wraps the path in single quotes with '\''-escape.

Composite commands (curl, wget, chains, pipes, substitutions) still fall through to the RETRY guidance.

2. `feat(grep-core): multi-word OR split for non-regex patterns` (`04bf1d5`)

When a non-regex grep pattern contains multiple words, split into per-word OR filters instead of requiring the full literal phrase to appear verbatim. Natural-language queries like Melanie kids dinosaurs now match sessions that mention any of the three tokens, not only those that contain all three in that exact order.

Single-word and regex patterns are unchanged.

Adds an optional multiWordPatterns field on SearchOptions. buildGrepSearchOptions populates it when the pattern splits into >1 word-long tokens. searchDeeplakeTables consumes it in the non-contentScanOnly branch, falling back to escapedPattern. All downstream OR joining is handled by the existing buildContentFilter helper.

What was considered and dropped

During development, a third change — prepending extracted date_time to the first grep match per file — was committed and tested. It regressed from 59.2% → 50.8%, so it was removed from this PR. Session date extraction may still be a valid idea in a different form (e.g. as a separate structured field instead of inline prefix), but the naive prefix hurts more than it helps on this benchmark.

A previous branch (closed PR #55) had 9 commits based on the pre-refactor codebase. Only these two improvements survived the port to main's new grep-core.ts/pre-tool-use.ts architecture and actually improved accuracy under proper benchmarking conditions. The rest were either obsolete (already done by grep-core.ts UNION ALL, PR #64's index fix) or didn't carry over cleanly.

Methodology note

Early benchmark runs (on PR #55) had two bugs that inflated apparent gains:

Benchmark repo checked out to wrong branch, so --subset flag was silently ignored and benchmark ran all 1540 questions with many timeouts → 0-5% false "results".
Benchmark script set DEEPLAKE_CAPTURE=false but main's new code reads HIVEMIND_CAPTURE, so session writes leaked into locomo_sessions mid-run, polluting search results.

Both fixes live in the deeplake-blitz PR. All numbers above are post-fix.

Test plan

npm test passes (918/918 tests, including 32 pre-tool-use.test.ts tests that specifically cover the RETRY-vs-cat rewrite boundary for curl, command substitution, backticks, interpreter commands, etc.)
npm run build produces matching bundles
Balanced 60q LoCoMo benchmark: +12.5% over main

When the agent invokes an unsupported interpreter (python/python3/node/ deno/bun/ruby/perl) on a single memory path with no shell metacharacters, rewrite the call to cat <path> instead of returning the generic RETRY guidance. This avoids burning a turn when the intent is clearly a plain file read. Narrow safety conditions: - Command starts with a recognized read-like interpreter. - No shell metacharacters (dollar-paren, backtick, semicolon, pipe, ampersand, angle brackets, parens, backslash). - Memory path matches a conservative [word/./_/-] regex (no trailing quotes or parens can glue onto the path). - Final cat command wraps the path in single quotes and escapes embedded single quotes, preventing injection. Composite or non-interpreter commands (curl, wget, etc.) still get the RETRY guidance. Tests: 918/918 passing.

When a non-regex grep pattern contains multiple words, split into per-word OR filters instead of requiring the full literal phrase to appear verbatim in the summary. Natural-language queries like Melanie kids dinosaurs now match sessions that mention any of the three tokens, not only those that contain all three in that exact order. Single-word and regex patterns are unchanged. Implementation: buildGrepSearchOptions now populates a new optional multiWordPatterns field on SearchOptions when the pattern splits into >1 word-long tokens. searchDeeplakeTables consumes that field in the non-contentScanOnly branch, falling back to the single escapedPattern otherwise. All downstream OR joining is handled by the existing buildContentFilter helper. Tests: 918/918 passing.

claude · 2026-04-21T19:54:07Z

Claude finished @kaghni's task in 4m 3s —— View job

Left 2 inline comments:

src/hooks/pre-tool-use.ts:238 — path regex misses absolute-path invocations (/home/…/.deeplake/memory/…) that touchesMemory() would catch, silently falling back to RETRY guidance even though the rewrite would be safe.
src/hooks/pre-tool-use.ts:244 — no positive-path test: the existing tests only confirm that commands with shell metacharacters still return RETRY guidance; there's no assertion that a clean python3 ~/.deeplake/memory/data.json actually produces cat '/data.json'.

github-actions · 2026-04-21T19:54:33Z

Coverage Report

Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via vitest.config.ts).

Status	Category	Percentage	Covered / Total
🟢	Lines	99.50% (🎯 90%)	399 / 401
🟢	Statements	97.75% (🎯 90%)	478 / 489
🟢	Functions	96.15% (🎯 90%)	50 / 52
🟢	Branches	90.73% (🎯 90%)	470 / 518

File Coverage — 2 files changed

File	Stmts	Branches	Functions	Lines
`src/hooks/pre-tool-use.ts`	🟢 98.7%	🟢 90.4%	🟢 93.8%	🟢 99.0%
`src/shell/grep-core.ts`	🟢 97.0%	🟢 91.0%	🟢 97.2%	🟢 100.0%

_{Generated for commit f6852f7.}

claude · 2026-04-21T19:58:12Z

+    const hasShellMeta = /[$`;|&<>()\\]/.test(cmd);
+    if (isReadLike && !hasShellMeta) {
+      const pathMatch = cmd.match(/~\/\.deeplake\/memory\/[\w./_-]+/)
+        || toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/);


The path regex only matches ~/ prefix, so python3 /home/runner/.deeplake/memory/data.json (an absolute path) passes touchesMemory() and isReadLike, but silently falls through to RETRY guidance because neither cmd.match(...) nor toolPath.match(...) fires. Extend the regex to also match the MEMORY_PATH absolute prefix, or use rewritePaths on the whole cmd/toolPath and inspect the result.

Suggested change

|| toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/);

const pathMatch = cmd.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/)

|| toolPath.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/);

claude · 2026-04-21T19:58:18Z

+      if (cleanPath && !cleanPath.endsWith("/")) {
+        logFn(`unsupported command on file, converting to cat: ${cleanPath}`);
+        return buildAllowDecision(
+          `cat '${cleanPath.replace(/'/g, "'\\''")}'`,


No test exercises the new happy path — the existing tests only verify that commands with shell metacharacters still produce RETRY guidance (the negative path). Add a test that python3 ~/.deeplake/memory/data.json → updatedCommand contains cat '/data.json' to prevent silent regressions in the rewrite logic.

Addresses two PR review comments on the auto-read conversion: 1. Path regex only matched tilde-prefixed memory paths. When haiku generated an absolute path (e.g. an expanded home path pointing into the memory tree), touchesMemory() returned true and isReadLike passed, but pathMatch never fired, so cleanPath was empty and auto-read silently fell through to the RETRY guidance — dead code for the most common case. Fix: run rewritePaths() on both cmd and toolPath first. rewritePaths already normalizes all three variants (tilde, dollar-HOME, absolute home dir). Then extract the normalized /-leading path from the result. 2. No happy-path test for the rewrite itself. The existing 32 pre-tool-use tests all covered the negative path — commands with shell metacharacters still returning RETRY. Nothing asserted that python3 on a clean memory file actually becomes cat. The regex bug above could have silently regressed with no test failure. Added: - 9 pre-tool-use.test.ts cases covering python3/node/bun/deno/ruby happy path, absolute-path variant, trailing-slash rejection, shell-meta fall-through, dollar-HOME variable fall-through (must NOT rewrite because dollar is a shell metacharacter), single-quote escape check. - 5 grep-core.test.ts cases covering buildGrepSearchOptions multi-word splitting: basic, single-word, short-token filter, regex fallback, 4-word cap. Tests: 934 passing total (up from 918). Branch coverage for grep-core 89.88% and pre-tool-use 87.25% — improved but still below the 90% thresholds enforced by the coverage CI job.

Coverage CI was failing at 87.25% (pre-tool-use) and 89.88% (grep-core), both below the 90% threshold. The subprocess-based e2e tests in pre-tool-use.test.ts exercise the rewrite-to-cat path but vitest c8 does not instrument subprocesses, so those branches looked uncovered. Added two targeted test batches that call the exported functions directly: - pre-tool-use-branches.test.ts: 7 new cases for processPreToolUse covering the auto-read happy path (python3/node/ruby on tilde and absolute memory paths, description assertion) and its guards (trailing-slash rejection, shell-meta fall-through, non-interpreter command fall-through). - grep-core.test.ts: 6 new cases for extractRegexAlternationPrefilters covering the null-return branches (char class, anchor, parens, quantifier, empty branch, leading/trailing pipe, trailing escape) plus dedupe and no-pipe fast-return. Results: 947 tests (up from 934). Branch coverage pre-tool-use 87.25% -> 90.43%, grep-core 89.88% -> 91.01%.

efenocchi · 2026-04-21T22:00:10Z

LGTM

kaghni added 2 commits April 21, 2026 18:35

claude Bot reviewed Apr 21, 2026

View reviewed changes

kaghni added 2 commits April 21, 2026 21:10

efenocchi approved these changes Apr 21, 2026

View reviewed changes

kaghni mentioned this pull request Apr 21, 2026

Revert PR #66 (OpenClaw coexist) — merged by mistake #68

Merged

3 tasks

kaghni merged commit cada222 into main Apr 21, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67

feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67
kaghni merged 4 commits intomainfrom
feat/locomo-accuracy-v3

kaghni commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

claude Bot Apr 21, 2026

Uh oh!

claude Bot Apr 21, 2026

Uh oh!

efenocchi commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	\|\| toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/);
	const pathMatch = cmd.match(/(?:~\/\.deeplake\/memory\|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/)
	\|\| toolPath.match(/(?:~\/\.deeplake\/memory\|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/);

Conversation

kaghni commented Apr 21, 2026

Summary

Benchmark results

What's in the PR

1. feat(pre-tool-use): auto-rewrite unsupported interpreter reads to cat (64f2246)

2. feat(grep-core): multi-word OR split for non-regex patterns (04bf1d5)

What was considered and dropped

Methodology note

Test plan

Uh oh!

claude Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report

Uh oh!

claude Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

efenocchi commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `feat(pre-tool-use): auto-rewrite unsupported interpreter reads to cat` (`64f2246`)

2. `feat(grep-core): multi-word OR split for non-regex patterns` (`04bf1d5`)

claude Bot commented Apr 21, 2026 •

edited

Loading

github-actions Bot commented Apr 21, 2026 •

edited

Loading