feat: scoped retrieval for monorepos (fix low recall on large repos) #61

@rajkumarsakthivel

Problem

Recall@10 drops to 0.07 on monorepos (benchmarked on fiber, 396 files, 4,382 chunks). The retriever's top-10 gets diluted across the entire search space instead of surfacing the specific file.

Proposed fix: scoped retrieval

When a query mentions or implies a subdirectory, scope the search to that subdirectory first. Fall back to full-repo search if scoped results are insufficient.

Approach

  1. Query-time scope detection: extract subdirectory hints from the query (e.g., "how does middleware/logger work?" → scope to middleware/)
  2. File-path prefix filter on vector search: filter chunks by file_path LIKE 'middleware/%' before ranking
  3. Fallback: if the scoped search returns fewer than top_k results above a confidence threshold, expand to the full repo
  4. Optional explicit scope: allow context_search("logger", scope="middleware/") as an MCP tool parameter
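
A minimal sketch of the four steps above, to make the control flow concrete. Everything here is hypothetical and not taken from the codebase: the `Chunk` type, `detect_scope`, `scoped_search`, the `known_dirs` set, and the `min_score` threshold are illustrative names, and chunks are assumed to arrive already scored against the query. In a real implementation the prefix filter would be pushed down into the vector store rather than applied in Python.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Chunk:
    file_path: str
    score: float  # assumed: similarity to the query, higher is better


def detect_scope(query: str, known_dirs: Set[str]) -> Optional[str]:
    """Step 1: return the longest known directory prefix named in the query."""
    best = None
    for raw in query.split():
        token = raw.strip("?.,:;!\"'()`")
        if "/" not in token:
            continue
        parts = [p for p in token.split("/") if p]
        # Try the longest prefix first, e.g. "middleware/logger/" then "middleware/".
        for i in range(len(parts), 0, -1):
            prefix = "/".join(parts[:i]) + "/"
            if prefix in known_dirs:
                if best is None or len(prefix) > len(best):
                    best = prefix
                break
    return best


def scoped_search(
    chunks: List[Chunk],
    query: str,
    known_dirs: Set[str],
    top_k: int = 10,
    min_score: float = 0.5,  # hypothetical confidence threshold
    scope: Optional[str] = None,  # step 4: explicit scope overrides detection
) -> List[Chunk]:
    scope = scope or detect_scope(query, known_dirs)
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    if scope:
        # Step 2: prefix filter, the in-memory analogue of file_path LIKE 'scope%'.
        scoped = [
            c for c in ranked
            if c.file_path.startswith(scope) and c.score >= min_score
        ]
        if len(scoped) >= top_k:
            return scoped[:top_k]
    # Step 3: no scope, or too few confident scoped hits; use the full repo.
    return ranked[:top_k]
```

This sketch falls back to a plain full-repo ranking. An alternative would be to keep the scoped hits at the top and fill the remaining slots from the full-repo results; which one works better is something the fiber benchmark could decide.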

Expected impact

  • Monorepo recall should jump from 0.07 to 0.5+ (most queries target a specific package/directory)
  • No impact on single-repo performance (scope detection returns nothing, falls through to full search)

Benchmark plan

Re-run fiber benchmark with scoped retrieval. Target: Recall@10 > 0.50.
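
For reference when checking the target, here is a small sketch of how Recall@10 could be computed. The issue does not state the benchmark's exact definition, so this assumes the common one: the fraction of a query's relevant files that appear in the top k retrieved results, averaged over queries. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int = 10) -> float:
    """Fraction of relevant files that appear among the first k retrieved files."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = set(retrieved[:k]) & relevant_set
    return len(hits) / len(relevant_set)


def mean_recall_at_k(cases: List[Tuple[Sequence[str], Sequence[str]]], k: int = 10) -> float:
    """Average recall_at_k over (retrieved, relevant) pairs, one pair per query."""
    if not cases:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in cases) / len(cases)
```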

Related

  • fiber benchmark: benchmarks/results/fiber.md (Recall@10 = 0.07)
  • chi benchmark: benchmarks/results/chi.md (Recall@10 = 0.67, less affected)
  • PR #36 ("benchmark: add Go benchmarks for chi and fiber") analysis: "Scoping retrieval to a subdirectory based on query context would address this"
