refactor(ingestion): prune vendored dirs in source walk (no rglob descent) #55
Open
cdeust wants to merge 1 commit into
codebase_analyze's collect_source_files walked the whole tree via
root.rglob("*") and rejected IGNORE_DIRS entries only AFTER enumeration.
rglob can't prune mid-iteration, so a repo carrying a vendored subtree (a
154M deps/ of ~8K files, node_modules, site-packages) stalled the walk for
minutes on the event loop — the same asymmetry that caused the wiki_drift
hang (fixed in 619bf9a), here on the ingestion side.
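
To make the asymmetry concrete, here is a minimal sketch of the filter-after-enumeration pattern described above. It is not the project's actual collector: `collect_by_rglob`, `demo`, and the three-entry `IGNORE_DIRS` set are hypothetical stand-ins, and the tree is a throwaway temp directory.

```python
import tempfile
from pathlib import Path

# Hypothetical subset of the project's IGNORE_DIRS constant (assumption).
IGNORE_DIRS = {"deps", "node_modules", "site-packages"}


def collect_by_rglob(root: Path) -> tuple[list[Path], int]:
    """Filter-after-enumeration: return (kept files, entries enumerated)."""
    kept: list[Path] = []
    enumerated = 0
    for path in root.rglob("*"):
        # By the time we see this entry, rglob has already paid for the descent.
        enumerated += 1
        if any(part in IGNORE_DIRS for part in path.relative_to(root).parts):
            continue  # rejected, but only after it was enumerated
        if path.is_file():
            kept.append(path)
    return kept, enumerated


def demo() -> tuple[int, int]:
    """Build a tiny repo with one source file and a 50-file vendored subtree."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src").mkdir()
        (root / "src" / "app.py").write_text("print('hi')\n")
        vendored = root / "deps" / "pkg"
        vendored.mkdir(parents=True)
        for i in range(50):
            (vendored / f"mod_{i}.py").write_text("")
        kept, enumerated = collect_by_rglob(root)
        return len(kept), enumerated


if __name__ == "__main__":
    print(demo())
```

Here `demo()` returns `(1, 54)`: one file is kept, yet all 54 entries were enumerated, 52 of them under `deps/`. The wasted work scales with the size of the vendored subtree, not with the number of files returned.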
- Extract the canonical pruned-walk idiom (os.walk(followlinks=False) +
in-place dirnames[:] filter on IGNORE_DIRS) into one module,
handlers/source_walk.py::walk_pruned — single source of truth.
- Route both _collect_unbounded and _collect_bounded through walk_pruned;
ignored subtrees are now never descended into. _file_matches keeps its
IGNORE_DIRS/lang/size checks as defense-in-depth.
- Replace seed_project_stages' private _walk_pruned with the shared one
(removes the duplicate; drops the now-unused os import).
Preserves behavior: same files returned, bounded-candidate memory property
(ADR-0045 §R2) intact. New tests_py/handlers/test_source_walk.py proves
ignored subtrees (deps/node_modules/site-packages/.venv, nested + symlinked)
are pruned. 17 passed (incl. existing collector suite).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
codebase_analyze's collect_source_files walked the whole tree via root.rglob("*") and rejected IGNORE_DIRS entries only after enumeration. rglob can't prune mid-iteration, so a repo carrying a vendored subtree (a 154M deps/ of ~8K files, node_modules, site-packages) stalled the walk for minutes on the asyncio event loop, blocking every concurrent tool call.

This is the ingestion-side counterpart of the wiki-drift hang fixed in 619bf9a (PR #53): same rglob-vs-pruned-walk asymmetry, different code path. It is a latent prod bug: any user repo with a venv / node_modules / deps at the scan root triggers it.

How (root-cause, one source of truth)
- Extract the canonical pruned-walk idiom (os.walk(followlinks=False) + in-place dirnames[:] filter on IGNORE_DIRS) into handlers/source_walk.py::walk_pruned.
- Route both _collect_unbounded and _collect_bounded through it; ignored subtrees are now never descended into. _file_matches keeps its IGNORE_DIRS/lang/size checks as defense-in-depth.
- Replace seed_project_stages' private _walk_pruned with the shared function (removes the duplicate; drops the now-unused os import).

IGNORE_DIRS already contained deps / site-packages / node_modules / etc., so no constant change was needed.

Behavior is preserved: same files returned, the bounded-candidate memory property (ADR-0045 §R2) intact. Only the descent is pruned.
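
The pruned-walk idiom named above can be sketched as follows. This is an illustration under stated assumptions, not the contents of handlers/source_walk.py: the signature of `walk_pruned`, its return type, and the four-entry `IGNORE_DIRS` set are guesses made for the example.

```python
import os
from pathlib import Path
from typing import Iterator

# Assumed subset of the real IGNORE_DIRS constant.
IGNORE_DIRS = frozenset({"deps", "node_modules", "site-packages", ".venv"})


def walk_pruned(
    root: Path, ignore_dirs: frozenset[str] = IGNORE_DIRS
) -> Iterator[Path]:
    """Yield files under root without ever descending into ignored dirs."""
    # followlinks=False: symlinked directories are listed but not entered.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Slice assignment mutates the list os.walk itself holds; in the
        # default top-down mode it consults that list to decide where to
        # descend next, so removed names are never visited.
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for name in filenames:
            yield Path(dirpath) / name
```

The slice assignment is the load-bearing detail. Writing `dirnames = [...]` would only rebind a local name and leave the list that `os.walk` consults untouched, so nothing would be pruned.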

Verification
- New tests_py/handlers/test_source_walk.py: proves ignored subtrees (deps / node_modules / site-packages / .venv, nested + symlinked) are pruned and not followed.
- 17 passed (including the existing test_codebase_analyze_rglob.py collector suite); ruff format + check clean.

Scope note
This is item 3 of 3 of the Phase-5 throttling follow-on (docs/program/phase-5-pool-admission-design.md). The remaining two, asyncio.to_thread offload at the handler boundary and per-tool admission semaphores, are gated on the BLOCKING benchmark suite (§7) and are deferred to a follow-up PR where the DB/benchmark harness is available. This PR is the standalone, fully-verifiable latent-bug fix.

🤖 Generated with Claude Code
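
For readers unfamiliar with the two deferred items in the scope note, the sketch below shows the general shape of a thread offload combined with an admission semaphore. It is purely illustrative and does not reflect the design in the referenced document: `blocking_scan`, `handle_analyze`, `main`, and the cap `ANALYZE_ADMISSION = 2` are all hypothetical.

```python
import asyncio
import time

# Hypothetical per-tool cap; a real value would come from benchmarking.
ANALYZE_ADMISSION = 2


def blocking_scan(delay: float) -> str:
    """Stand-in for a synchronous filesystem walk."""
    time.sleep(delay)
    return "scanned"


async def handle_analyze(sem: asyncio.Semaphore, delay: float) -> str:
    # Admission: at most ANALYZE_ADMISSION scans run at once.
    async with sem:
        # Offload: the blocking call runs in a worker thread, so the
        # event loop stays free to serve other coroutines.
        return await asyncio.to_thread(blocking_scan, delay)


async def main() -> tuple[list[str], int]:
    sem = asyncio.Semaphore(ANALYZE_ADMISSION)
    ticks = 0

    async def heartbeat() -> None:
        # Counts loop iterations to show the loop is not blocked.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    results = await asyncio.gather(*(handle_analyze(sem, 0.1) for _ in range(4)))
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return list(results), ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The heartbeat keeps ticking while the scans run, which is the property a blocked event loop loses. The semaphore bounds how many worker threads a single tool can occupy at once.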