poc: WASM/wazero tree-sitter backend (speed + stability vs cgo PR #80) #81
Draft
dvcdsys wants to merge 4 commits into
Alternative to feat/chunker-cgo-treesitter: the official tree-sitter C runtime + TypeScript grammar compiled to a standalone wasm32-wasi reactor module (build.sh, via zig cc) and driven from Go through wazero. No cgo, no JS, no third-party parser. Only the wazero host (wasmts.go) is bespoke; the parser is unmodified upstream C. wasm_store.c is gated by TREE_SITTER_FEATURE_WASM (we don't define it), so the stock amalgamation compiles to wasi with no stubs.

Measured on the same 852-file vscode TypeScript corpus (full-tree walk):

| backend | wall | files/s | ERROR trees | editorOptions.ts |
| --- | --- | --- | --- | --- |
| gotreesitter (pure-Go) | 13.83s | 62 | 13 | 8.77s -> ERROR |
| WASM (wazero, pure-Go) | ~2.5s | ~330 | 0 | 49ms |
| cgo (native) | 1.26s | 675 | 0 | 17ms |

- WASM is ~2x slower than cgo, ~5x faster than gotreesitter, and correct (0 errors).
- Overhead is the per-node host<->guest call boundary (~3 calls/node x 2.68M nodes), not memory; slot-pooling barely moved it. A batched "serialize subtree" export would close most of the gap (future work).
- Stability: tree-sitter is robust on adversarial input under both backends; WASM additionally CONTAINS faults (resource/guest trap -> recoverable Go error, host alive) where cgo would SIGSEGV the whole process. Insurance vs unknown C bugs, not a fix for an observed crash.

Trade-off vs cgo: ~2x parse cost (largely invisible end-to-end since embeddings dominate) in exchange for CGO_ENABLED=0 builds, crash-isolation, and a likely smaller binary; the cost is the engineering effort to build/bundle all 31 grammars and flesh out the node API.

README.md has the full comparison.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
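A back-of-envelope check of the boundary-overhead claim, as a minimal sketch. It takes the figures quoted in the commit message (~3 calls/node, 2.68M nodes, ~2.5s vs 1.26s wall) and assumes, for illustration only, that the whole WASM-vs-cgo gap is call-boundary cost. The function names are hypothetical.

```go
package main

import "fmt"

// boundaryCalls returns the number of host<->guest calls made during a
// full-tree walk, given the node count and the calls issued per node.
func boundaryCalls(nodes, callsPerNode int64) int64 {
	return nodes * callsPerNode
}

// perCallNanos spreads the wall-clock gap between two backends evenly over
// the boundary calls. Assumption: the entire gap is boundary overhead.
func perCallNanos(slowSec, fastSec float64, calls int64) float64 {
	return (slowSec - fastSec) * 1e9 / float64(calls)
}

func main() {
	calls := boundaryCalls(2_680_000, 3) // ~3 calls/node x 2.68M nodes
	ns := perCallNanos(2.5, 1.26, calls) // WASM ~2.5s vs cgo 1.26s
	fmt.Printf("%d boundary calls, ~%.0f ns each\n", calls, ns)
}
```

Under that assumption the gap works out to roughly 150 ns per crossing over about 8 million crossings. A batched export that returned one serialized subtree per file would need on the order of one crossing per file (852 here), which is why it is expected to close most of the gap.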
…d skip, doc-comment attachment

Replace gotreesitter with the official tree-sitter C runtime + 31 grammars compiled to one wasm32-wasi module (ts-core.wasm.br, brotli ~3MB) driven via wazero. No cgo: traps are contained (parse falls back to sliding window, the process survives), and the binary stays CGO_ENABLED=0.

Memory design (measured on the prod-shaped churn workload):
- linear memory is mmap-backed (experimental.WithMemoryAllocator) instead of wazero's default Go-heap append-grow: no realloc-copy garbage on growth, and munmap-on-close returns recycled instances' memory to the OS immediately. Churn heapSys 1135→391MB, peak RSS 1070→535MB; full-repo chunking peak RSS 1516→787MB.
- engine pool: hard concurrency cap (dashboard-tunable), 256MiB per-instance linear-memory ceiling (2× headroom over the worst measured instance at the indexer's 512KiB file cap), high-water-mark recycling, 1 idle instance.

Chunker quality fixes:
- minified/bundled js/ts/css (.min., .bundle.js, >2KiB lines) skip the parser and go straight to sliding window. This is the pathological input class that ballooned instances for near-zero semantic value.
- a declaration's doc comment now attaches to its chunk (language-agnostic via tree-sitter's extra flag + same-row wrapper climb; verified for Go, TS, C, Python, Rust, Java). Generated files stop spraying comment-only micro chunks: openapi.gen.go 893→517 chunks, median 114→256B, symbols/refs byte-identical.

Memory-stress harnesses are committed but gated behind CIX_MEMSTRESS=1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
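The minified/bundled skip rule can be sketched as a small predicate. This is an illustration under assumptions, not the code from the commit: the helper name `looksMinified`, the exact extension list, and the choice to trigger on any single line over 2KiB are all hypothetical readings of ".min., .bundle.js, >2KiB lines".

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// maxLineBytes mirrors the ">2KiB lines" rule from the commit message.
const maxLineBytes = 2 * 1024

// looksMinified reports whether a js/ts/css file should bypass the parser
// and go straight to sliding-window chunking. Hypothetical helper.
func looksMinified(path string, src []byte) bool {
	lower := strings.ToLower(path)
	isWebAsset := false
	for _, ext := range []string{".js", ".ts", ".css"} {
		if strings.HasSuffix(lower, ext) {
			isWebAsset = true
			break
		}
	}
	if !isWebAsset {
		return false
	}
	// Name-based signals: foo.min.js, app.bundle.js.
	if strings.Contains(lower, ".min.") || strings.HasSuffix(lower, ".bundle.js") {
		return true
	}
	// Content-based signal: any line longer than maxLineBytes (assumption).
	for len(src) > 0 {
		line := src
		if i := bytes.IndexByte(src, '\n'); i >= 0 {
			line, src = src[:i], src[i+1:]
		} else {
			src = nil
		}
		if len(line) > maxLineBytes {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksMinified("vendor/jquery.min.js", nil))
	fmt.Println(looksMinified("src/editorOptions.ts", []byte("const a = 1;\n")))
}
```

Checking the file name first keeps the common case cheap; the line scan only runs for js/ts/css files whose names give no hint.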
…am (OOM fix)

Two new runtime-config fields, end to end (DB migrations 16/17 → runtimecfg → admin API → openapi → dashboard):
- chunk_max_concurrent: the wasm chunker's instance-concurrency cap, decoupled from embedding concurrency; resizes the live limiter without a restart. Env: CIX_CHUNK_MAX_CONCURRENT; per-instance memory knobs stay env-only (CIX_CHUNK_MEM_LIMIT_PAGES, CIX_CHUNK_RECYCLE_GROWTH_MB, CIX_CHUNK_MAX_IDLE).
- llama_cache_ram_mib: llama-server's HOST prompt cache cap (--cache-ram). Upstream defaults this to 8 GiB (ggml-org/llama.cpp#16391), which is pure waste for an embeddings-only sidecar: prompts are never reused, but the cache fills anyway. Observed on prod: llama-server RSS 365MB→11.3GB within minutes of indexing vscode@main, then cgroup OOM kill (twice at the 10G limit, again at 16G). With --cache-ram 0 (our default; -1 = unlimited) it plateaus at ~900MB under the same load. Env: CIX_LLAMA_CACHE_RAM; shown in the dashboard's Runtime parameters card, applied via Save & Restart.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
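One way a concurrency cap can be resized live, as a minimal sketch. The `Limiter` type and its methods are hypothetical and are not the repository's implementation; the idea shown is a counter guarded by a mutex and condition variable, so changing the limit needs no restart and never interrupts current holders.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter caps the number of concurrent holders; the cap can change at runtime.
type Limiter struct {
	mu    sync.Mutex
	cond  *sync.Cond
	limit int
	inUse int
}

func NewLimiter(limit int) *Limiter {
	l := &Limiter{limit: limit}
	l.cond = sync.NewCond(&l.mu)
	return l
}

// Acquire blocks until a slot is free under the current limit.
func (l *Limiter) Acquire() {
	l.mu.Lock()
	for l.inUse >= l.limit {
		l.cond.Wait()
	}
	l.inUse++
	l.mu.Unlock()
}

// TryAcquire takes a slot only if one is free right now.
func (l *Limiter) TryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inUse >= l.limit {
		return false
	}
	l.inUse++
	return true
}

// Release frees a slot and wakes any waiters.
func (l *Limiter) Release() {
	l.mu.Lock()
	l.inUse--
	l.mu.Unlock()
	l.cond.Broadcast()
}

// Resize changes the cap. Lowering it does not evict holders; new
// acquisitions simply wait until inUse drops below the new limit.
func (l *Limiter) Resize(n int) {
	if n < 1 {
		n = 1
	}
	l.mu.Lock()
	l.limit = n
	l.mu.Unlock()
	l.cond.Broadcast()
}

func main() {
	l := NewLimiter(1)
	fmt.Println(l.TryAcquire(), l.TryAcquire()) // second attempt exceeds the cap
	l.Resize(2)
	fmt.Println(l.TryAcquire()) // succeeds after the live resize
}
```

A buffered channel is the more common Go semaphore, but its capacity is fixed at creation, which is why this sketch uses a counter instead.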
…vation

A full-reindex wipe ran as ONE transaction: DELETE of all refs/symbols/file_hashes plus the trigram-FTS rows. On a vscode-sized project (~445k refs, tens of thousands of FTS rows; each FTS delete re-tokenizes its content) that held SQLite's single writer for minutes, starving every concurrent writer past busy_timeout. Prod symptom: the jobs worker logged `claim failed: SQLITE_BUSY` on every 5s poll tick for the whole wipe.

- BeginIndexing full wipe: file_hashes first (its own statement: once gone, every file looks dirty, so a crash mid-wipe just resumes on the next run), then symbols/refs in 20k-row batches, then chunks_fts/chunks_meta via the batched chunksfts.DeleteByProject (500 rows per tx, since FTS deletes are the expensive ones). The writer is released between batches.
- projects.Delete: same batched FTS wipe, project row deleted last so a failed wipe is resumable.
- jobs worker: SQLITE_BUSY on claim is expected contention, not a fault. Log the streak start as WARN with a once-a-minute heartbeat instead of an ERROR per tick, and log when it clears.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
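The batched wipe boils down to a loop of short transactions. The sketch below shows that loop shape only: `deleteBatch`, `deleteInBatches`, and `fakeTable` are hypothetical names, the SQL in the comment is an assumed form (the `project_id` column is a guess), and an in-memory counter stands in for SQLite so the example runs without a driver.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteBatch removes up to limit rows and reports how many it removed.
// In a real store each call would be its own short transaction, e.g.
// (assumed SQL, not taken from the commit):
//
//	DELETE FROM refs WHERE rowid IN
//	  (SELECT rowid FROM refs WHERE project_id = ? LIMIT ?)
type deleteBatch func(limit int) (int, error)

// deleteInBatches calls del until a batch comes back short. The write lock
// is only held inside each del call, so other writers can run in between.
func deleteInBatches(del deleteBatch, batchSize int) (int, error) {
	if batchSize <= 0 {
		return 0, errors.New("batchSize must be positive")
	}
	total := 0
	for {
		n, err := del(batchSize)
		total += n
		if err != nil {
			return total, err // rows deleted so far stay deleted; rerun to resume
		}
		if n < batchSize {
			return total, nil
		}
	}
}

// fakeTable stands in for a table holding the given number of rows.
func fakeTable(rows int, calls *int) deleteBatch {
	return func(limit int) (int, error) {
		*calls++
		n := limit
		if rows < n {
			n = rows
		}
		rows -= n
		return n, nil
	}
}

func main() {
	calls := 0
	total, err := deleteInBatches(fakeTable(445_000, &calls), 20_000)
	fmt.Println(total, calls, err)
}
```

Because every batch commits on its own, a crash mid-wipe leaves fewer rows behind rather than an all-or-nothing rollback, which matches the "resumable" framing in the commit message.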
Draft / PoC for comparison — not for merge. Alternative to the cgo backend in #80, to decide direction.
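Official tree-sitter C runtime + TypeScript grammar → standalone wasm32-wasi module (zig cc), driven from Go via wazero. No cgo, no JS, no third-party parser; only the wazero host (poc/wasm-treesitter/wasmts.go) is ours.

Speed: same 852-file vscode TS corpus, full-tree walk

| backend | wall | files/s | ERROR trees | editorOptions.ts |
| --- | --- | --- | --- | --- |
| gotreesitter (pure-Go) | 13.83s | 62 | 13 | 8.77s -> ERROR |
| WASM (wazero, pure-Go) | ~2.5s | ~330 | 0 | 49ms |
| cgo (native) | 1.26s | 675 | 0 | 17ms |

WASM is ~2× slower than cgo, ~5× faster than gotreesitter, and correct. Overhead is the per-node host↔guest call boundary (mitigable with a batched subtree export).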
Stability
tree-sitter is robust on adversarial input under both backends. WASM additionally contains guest faults (resource/trap → recoverable Go error, host alive) where cgo would SIGSEGV the whole process. Insurance vs unknown C bugs.
Decision framing
~2× parse cost (largely invisible end-to-end, since embeddings dominate) in exchange for CGO_ENABLED=0 builds, crash-isolation, and a likely smaller binary. Cost: engineering effort to build/bundle all 31 grammars + flesh out the node API. Full write-up in poc/wasm-treesitter/README.md.

🤖 Generated with Claude Code