Skip to content

Feat/locomo accuracy improvements#55

Closed
kaghni wants to merge 9 commits intomainfrom
feat/locomo-accuracy-improvements
Closed

Feat/locomo accuracy improvements#55
kaghni wants to merge 9 commits intomainfrom
feat/locomo-accuracy-improvements

Conversation

@kaghni
Copy link
Copy Markdown
Collaborator

@kaghni kaghni commented Apr 16, 2026

Summary

Version Bump

To trigger a release, bump "version" in package.json before merging.

Change type Version bump Example
Bug fix patch (1.2.0 → 1.2.1) "version": "1.2.1"
New feature minor (1.2.0 → 1.3.0) "version": "1.3.0"
Breaking change major (1.2.0 → 2.0.0) "version": "2.0.0"

If you don't bump the version, no release will be created.

Test plan

  • Tests pass locally (npm test)
  • Relevant new tests added
  • Version bumped in package.json, or no release needed for this change

kaghni added 6 commits April 16, 2026 09:20
When the primary table name ends with _sessions, automatically search
the companion _memory table (wiki-style summaries) and prepend matches.
This gives Claude structured context alongside raw session data.

LoCoMo benchmark (20 questions, conv 0):
  Before: 45.0% (stock 0.6.32, sessions-only grep)
  After:  62.5% (cross-table enrichment)
  Delta:  +17.5%

Improvements: Q7 Q8 Q9 Q15 Q17 flipped WRONG->CORRECT
Regressions: Q14 Q18 (2 questions regressed)
Net: +5 correct, -2 regressed = +3 net correct answers

Still below Zep (75.1%) and Memobase (75.8%) targets.
The virtual index.md was querying paths LIKE /summaries/% which only
matched the default hivemind table layout. For benchmark tables like
locomo_sessions, paths are /locomo_sessions/sessions/... so the index
was always empty.

Now auto-detects companion memory table (X_sessions -> X_memory) and
falls back to the primary table. Descriptions from the memory table
provide richer index entries.

LoCoMo benchmark (30 questions):
  Before: 38.3% (cross-table only, broken index)
  After:  41.7% (cross-table + working index)
  Delta:  +3.4% (6 improved, 4 regressed)
When Claude tries an unsupported command (python3, node, etc.) on a
specific memory file path, automatically convert it to a cat/read
instead of returning a hard RETRY error. This reduces retrieval
failures since haiku often gives up after seeing the retry message.

LoCoMo benchmark (30 questions):
  Before: 41.7% (cross-table + index fix)
  After:  45.0% (+ auto-read conversion)
  Delta:  +3.3%
  Retrieval failures: 8 -> 5
When grep matches entries in the companion memory table, return the
full summary content (not just matching lines). Summaries are compact
structured wiki entries (Key Facts, People, Entities sections) that
give Claude complete context in a single grep call, reducing the need
for multiple follow-up reads.

Also moved output array declaration before cross-table section to fix
scope issue.

LoCoMo benchmark (30 questions):
  Before: 45.0% (auto-read conversion)
  After:  51.7% (full summaries in grep)
  Delta:  +6.7% (5 improved, 3 regressed)
  Retrieval failures: 5 (unchanged)

Passed baseline (64.4% flat files). Still below Zep (75.1%).
When grep pattern contains multiple words (e.g. "Caroline birthday"),
split into individual words and use SQL OR to match any word. This
casts a wider net and finds sessions that mention any of the search
terms, improving recall for complex queries.

Single-word patterns remain unchanged (exact LIKE match).

LoCoMo benchmark (30 questions):
  Before: 51.7% (full summaries in grep)
  After:  56.7% (+ smart multi-word OR grep)
  Delta:  +5.0%

Cumulative improvement from stock 0.6.32: 38.3% -> 56.7% (+18.4%)
Approaching Zep (75.1%) and baseline flat files (64.4%).
Extract date_time from matched session JSON and prepend to the first
matching line per file. This gives Claude temporal context without
needing to read the full session, significantly helping time-based
questions like "When did X happen?".

LoCoMo benchmark (30 questions):
  Before: 56.7% (smart multi-word OR grep)
  After:  61.7% (+ session date in grep output)
  Delta:  +5.0%

Cumulative from stock 0.6.32: 38.3% -> 61.7% (+23.4%)
Approaching baseline flat files (64.4%). Closing on Zep (75.1%).
@claude
Copy link
Copy Markdown

claude Bot commented Apr 16, 2026

Claude finished @kaghni's task in 1m 41s —— View job


Review

  • Command injection (pre-tool-use.ts:202): cat ${cleanPath} is unsafe — \S+ can capture ;cmd or $(cmd) suffixes, enabling shell injection. Suggest quoting the path.
  • Inconsistent multi-word search (grep-direct.ts:160): companion memory table uses full-phrase LIKE while primary table uses per-word OR expansion — multi-word queries won't benefit from OR matching in the companion table.
  • Spurious @ts-ignore (grep-direct.ts:207): the expression is valid TS; the suppressor could silently mask a real error.

Comment thread src/hooks/pre-tool-use.ts Outdated
hookEventName: "PreToolUse",
permissionDecision: "allow",
updatedInput: {
command: `cat ${cleanPath}`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command injection: cleanPath is extracted via \S+ which matches non-whitespace including ;, |, $(...) — a crafted command like ~/.deeplake/memory/foo;touch /tmp/x would become cat /foo;touch /tmp/x. Wrap in single quotes and escape embedded single quotes.

Suggested change
command: `cat ${cleanPath}`,
command: `cat '${cleanPath.replace(/'/g, "'\\''")}'`,

Comment thread src/hooks/grep-direct.ts Outdated
: (sessionsTable !== table ? sessionsTable : null);
if (memoryTable && memoryTable !== table) {
try {
const contentFilter = ` AND summary ${likeOp} '%${escapedLike}%'`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent search logic: the primary table uses per-word OR filters for multi-word queries (line 131), but the companion memory table always uses escapedLike (full phrase), so multi-word searches won't find matches in the companion table that the primary search would catch. Apply the same word-filter logic here.

Suggested change
const contentFilter = ` AND summary ${likeOp} '%${escapedLike}%'`;
const words2 = pattern.split(/\s+/).filter((w: string) => w.length > 2);
const contentFilter = words2.length > 1
? ` AND (${words2.slice(0, 4).map((w: string) => `summary ${likeOp} '%${sqlLike(w)}%'`).join(" OR ")})`
: ` AND summary ${likeOp} '%${escapedLike}%'`;

Comment thread src/hooks/grep-direct.ts
const prefix = multi ? `${p}:` : "";
const ln = lineNumber ? `${i + 1}:` : "";
matched.push(`${prefix}${ln}${lines[i]}`);
matched.push(`${prefix}${sessionDate}${ln}${lines[i]}`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @ts-ignore is unnecessary — sessionDate && (sessionDate = "") is valid TypeScript; the suppressor could hide a real type error on the next line. Use a plain if instead.

Suggested change
matched.push(`${prefix}${sessionDate}${ln}${lines[i]}`);
if (sessionDate) sessionDate = "";

kaghni added 3 commits April 17, 2026 10:50
Two fixes from PR #55 review:

1. SECURITY: Quote cleanPath in auto-read conversion to prevent command
   injection. Previously cleanPath (from user-provided memory path)
   was interpolated raw into `cat ${cleanPath}`, allowing crafted
   paths like "foo;touch /tmp/x" to execute arbitrary commands.
   Now wraps in single quotes and escapes embedded quotes.

2. CORRECTNESS: Apply same multi-word OR logic to companion memory
   table search. Primary table split multi-word queries into per-word
   OR filters, but companion memory table always used the full phrase,
   causing multi-word searches to miss matches in the companion table.

LoCoMo benchmark impact: within variance (63-70% on 30q).
Security fix essential; consistency fix makes search behavior uniform.
- Replace @ts-ignore comment with a plain if guard (no need to
  suppress the diagnostic — assigning to a let is valid TypeScript).
- Rebuild claude-code and codex bundles so they match the committed
  source. CI was failing on bundle/ drift — the accuracy commits
  landed source changes but never re-ran npm run build.
The earlier auto-read change was too broad: it converted any unsupported
command containing a memory-path into a cat, even when the path was
embedded inside command substitution, backticks, or attached to a
network request such as curl -d @<path> or wget. That hid the RETRY
guidance for commands that are genuinely unsafe to run and broke 4
pre-tool-use tests.

Tightened the check to two conditions:
- Command starts with a known read-like interpreter (python/python3/
  node/deno/bun/ruby/perl) — not curl, wget, or arbitrary tools.
- No shell metacharacters (dollar-paren, backtick, semicolon, pipe,
  ampersand, angle brackets, parens, backslash).

Also tightened the path regex from non-whitespace to word/dot/slash/
underscore/dash so trailing metacharacters cannot glue onto the
extracted path.

Tests: 353/353 passing (was 349/353 with 4 pre-tool-use failures).
@kaghni
Copy link
Copy Markdown
Collaborator Author

kaghni commented Apr 21, 2026

Superseded by a fresh branch on top of main. The first 3 commits of this branch (dual-table grep, index.md fix) have been effectively merged via #64 and the grep-core.ts refactor. Only 4 unique accuracy improvements remain, which we'll re-apply to main's new architecture in a new PR.

Balanced 60-question benchmark:

  • origin/main (latest): 13.3%
  • this branch: 40.8% (+27.5%)

See deeplake-blitz PR #1 for the full comparison and methodology.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant