feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67
feat: LoCoMo accuracy improvements — auto-read + multi-word OR#67
Conversation
When the agent invokes an unsupported interpreter (python/python3/node/
deno/bun/ruby/perl) on a single memory path with no shell metacharacters,
rewrite the call to cat <path> instead of returning the generic
RETRY guidance. This avoids burning a turn when the intent is clearly a
plain file read.
Narrow safety conditions:
- Command starts with a recognized read-like interpreter.
- No shell metacharacters (dollar-paren, backtick, semicolon, pipe,
ampersand, angle brackets, parens, backslash).
- Memory path matches a conservative [word/./_/-] regex (no trailing
quotes or parens can glue onto the path).
- Final cat command wraps the path in single quotes and escapes embedded
single quotes, preventing injection.
Composite or non-interpreter commands (curl, wget, etc.) still get the
RETRY guidance.
Tests: 918/918 passing.
When a non-regex grep pattern contains multiple words, split into per-word OR filters instead of requiring the full literal phrase to appear verbatim in the summary. Natural-language queries like Melanie kids dinosaurs now match sessions that mention any of the three tokens, not only those that contain all three in that exact order. Single-word and regex patterns are unchanged. Implementation: buildGrepSearchOptions now populates a new optional multiWordPatterns field on SearchOptions when the pattern splits into >1 word-long tokens. searchDeeplakeTables consumes that field in the non-contentScanOnly branch, falling back to the single escapedPattern otherwise. All downstream OR joining is handled by the existing buildContentFilter helper. Tests: 918/918 passing.
|
Claude finished @kaghni's task in 4m 3s —— View job Left 2 inline comments:
|
Coverage ReportScope: files changed in this PR. Enforced threshold: 90% per metric (per file via
File Coverage — 2 files changed
Generated for commit f6852f7. |
| const hasShellMeta = /[$`;|&<>()\\]/.test(cmd); | ||
| if (isReadLike && !hasShellMeta) { | ||
| const pathMatch = cmd.match(/~\/\.deeplake\/memory\/[\w./_-]+/) | ||
| || toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/); |
There was a problem hiding this comment.
The path regex only matches ~/ prefix, so python3 /home/runner/.deeplake/memory/data.json (an absolute path) passes touchesMemory() and isReadLike, but silently falls through to RETRY guidance because neither cmd.match(...) nor toolPath.match(...) fires. Extend the regex to also match the MEMORY_PATH absolute prefix, or use rewritePaths on the whole cmd/toolPath and inspect the result.
| || toolPath.match(/~\/\.deeplake\/memory\/[\w./_-]+/); | |
| const pathMatch = cmd.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/) | |
| || toolPath.match(/(?:~\/\.deeplake\/memory|\/[^\s]*\.deeplake\/memory)\/[\w./_-]+/); |
| if (cleanPath && !cleanPath.endsWith("/")) { | ||
| logFn(`unsupported command on file, converting to cat: ${cleanPath}`); | ||
| return buildAllowDecision( | ||
| `cat '${cleanPath.replace(/'/g, "'\\''")}'`, |
There was a problem hiding this comment.
No test exercises the new happy path — the existing tests only verify that commands with shell metacharacters still produce RETRY guidance (the negative path). Add a test that python3 ~/.deeplake/memory/data.json → updatedCommand contains cat '/data.json' to prevent silent regressions in the rewrite logic.
Addresses two PR review comments on the auto-read conversion:
1. Path regex only matched tilde-prefixed memory paths. When haiku
generated an absolute path (e.g. an expanded home path pointing into
the memory tree), touchesMemory() returned true and isReadLike
passed, but pathMatch never fired, so cleanPath was empty and
auto-read silently fell through to the RETRY guidance — dead code
for the most common case.
Fix: run rewritePaths() on both cmd and toolPath first. rewritePaths
already normalizes all three variants (tilde, dollar-HOME, absolute
home dir). Then extract the normalized /-leading path from the result.
2. No happy-path test for the rewrite itself. The existing 32
pre-tool-use tests all covered the negative path — commands with
shell metacharacters still returning RETRY. Nothing asserted that
python3 on a clean memory file actually becomes cat. The regex bug
above could have silently regressed with no test failure.
Added:
- 9 pre-tool-use.test.ts cases covering python3/node/bun/deno/ruby
happy path, absolute-path variant, trailing-slash rejection,
shell-meta fall-through, dollar-HOME variable fall-through (must
NOT rewrite because dollar is a shell metacharacter), single-quote
escape check.
- 5 grep-core.test.ts cases covering buildGrepSearchOptions
multi-word splitting: basic, single-word, short-token filter,
regex fallback, 4-word cap.
Tests: 934 passing total (up from 918). Branch coverage for grep-core
89.88% and pre-tool-use 87.25% — improved but still below the 90%
thresholds enforced by the coverage CI job.
Coverage CI was failing at 87.25% (pre-tool-use) and 89.88% (grep-core), both below the 90% threshold. The subprocess-based e2e tests in pre-tool-use.test.ts exercise the rewrite-to-cat path but vitest c8 does not instrument subprocesses, so those branches looked uncovered. Added two targeted test batches that call the exported functions directly: - pre-tool-use-branches.test.ts: 7 new cases for processPreToolUse covering the auto-read happy path (python3/node/ruby on tilde and absolute memory paths, description assertion) and its guards (trailing-slash rejection, shell-meta fall-through, non-interpreter command fall-through). - grep-core.test.ts: 6 new cases for extractRegexAlternationPrefilters covering the null-return branches (char class, anchor, parens, quantifier, empty branch, leading/trailing pipe, trailing escape) plus dedupe and no-pipe fast-return. Results: 947 tests (up from 934). Branch coverage pre-tool-use 87.25% -> 90.43%, grep-core 89.88% -> 91.01%.
|
LGTM |
Summary
Two focused improvements to memory retrieval accuracy, benchmarked against the LoCoMo conversational memory benchmark on a balanced 60-question subset (15 per category × 4 non-adversarial categories, drawn across all 10 conversations).
Benchmark results
Model:
claude -p --model haiku. Judge:openai/gpt-5-mini. See deeplake-blitz PR #1 for methodology and raw results.origin/main(this branch's base)64f2246)04bf1d5, this PR)Net: +12.5% absolute over
origin/mainon a category-balanced 60-question sample.All 918 existing tests pass.
What's in the PR
1.
feat(pre-tool-use): auto-rewrite unsupported interpreter reads to cat(64f2246)When the agent invokes an unsupported interpreter (
python/python3/node/deno/bun/ruby/perl) on a single memory path with no shell metacharacters, rewrite the call tocat '<path>'instead of returning the generic[RETRY REQUIRED]guidance. Avoids burning a turn when the intent is clearly a plain file read.Narrow safety conditions:
$,`,;,|,&,<,>,(),\).[\w./_-]+regex (no trailing quotes/parens can glue onto it).catcommand wraps the path in single quotes with'\''-escape.Composite commands (curl, wget, chains, pipes, substitutions) still fall through to the RETRY guidance.
2.
feat(grep-core): multi-word OR split for non-regex patterns(04bf1d5)When a non-regex grep pattern contains multiple words, split into per-word OR filters instead of requiring the full literal phrase to appear verbatim. Natural-language queries like
Melanie kids dinosaursnow match sessions that mention any of the three tokens, not only those that contain all three in that exact order.Single-word and regex patterns are unchanged.
Adds an optional
multiWordPatternsfield onSearchOptions.buildGrepSearchOptionspopulates it when the pattern splits into >1 word-long tokens.searchDeeplakeTablesconsumes it in the non-contentScanOnly branch, falling back toescapedPattern. All downstream OR joining is handled by the existingbuildContentFilterhelper.What was considered and dropped
During development, a third change — prepending extracted
date_timeto the first grep match per file — was committed and tested. It regressed from 59.2% → 50.8%, so it was removed from this PR. Session date extraction may still be a valid idea in a different form (e.g. as a separate structured field instead of inline prefix), but the naive prefix hurts more than it helps on this benchmark.A previous branch (closed PR #55) had 9 commits based on the pre-refactor codebase. Only these two improvements survived the port to
main's newgrep-core.ts/pre-tool-use.tsarchitecture and actually improved accuracy under proper benchmarking conditions. The rest were either obsolete (already done bygrep-core.tsUNION ALL, PR #64's index fix) or didn't carry over cleanly.Methodology note
Early benchmark runs (on PR #55) had two bugs that inflated apparent gains:
--subsetflag was silently ignored and benchmark ran all 1540 questions with many timeouts → 0-5% false "results".DEEPLAKE_CAPTURE=falsebutmain's new code readsHIVEMIND_CAPTURE, so session writes leaked intolocomo_sessionsmid-run, polluting search results.Both fixes live in the deeplake-blitz PR. All numbers above are post-fix.
Test plan
npm testpasses (918/918 tests, including 32 pre-tool-use.test.ts tests that specifically cover the RETRY-vs-cat rewrite boundary for curl, command substitution, backticks, interpreter commands, etc.)npm run buildproduces matching bundlesmain