simjoin: exact weighted-cosine similarity join (L2AP, CPU+GPU) + Python bindings (v0.3.0) #1
Merged
…Python bindings (v0.3.0)
simjoin is an exact all-pairs weighted-cosine similarity join over sparse non-negative
vectors — every pair with cos >= t, no approximation — the principled exact replacement
for shingle-candidate + verify near-duplicate detection (Type-3 code clones).
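To make the definition concrete, here is a minimal sketch of the O(n^2) reference join that the description treats as ground truth. The names `SparseVec`, `sparse_dot`, and `brute_force_cosine_join` are hypothetical and chosen for illustration; they are not the crate's API, and the crate's own oracle may differ in detail.

```rust
use std::cmp::Ordering;

/// A sparse vector as (dimension, weight) pairs, sorted by dimension.
/// (Hypothetical representation, assumed for this sketch.)
type SparseVec = Vec<(u32, f64)>;

/// Dot product of two sparse vectors by merging their sorted dimension lists.
fn sparse_dot(a: &[(u32, f64)], b: &[(u32, f64)]) -> f64 {
    let (mut i, mut j, mut acc) = (0usize, 0usize, 0.0f64);
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                acc += a[i].1 * b[j].1;
                i += 1;
                j += 1;
            }
        }
    }
    acc
}

/// Every pair (i, j, cos) with i < j and cos >= t, found by checking all pairs.
fn brute_force_cosine_join(rows: &[SparseVec], t: f64) -> Vec<(usize, usize, f64)> {
    let norms: Vec<f64> = rows.iter().map(|r| sparse_dot(r, r).sqrt()).collect();
    let mut out = Vec::new();
    for i in 0..rows.len() {
        for j in (i + 1)..rows.len() {
            let denom = norms[i] * norms[j];
            if denom == 0.0 {
                continue; // an empty vector has no defined cosine
            }
            let cos = sparse_dot(&rows[i], &rows[j]) / denom;
            if cos >= t {
                out.push((i, j, cos));
            }
        }
    }
    out
}
```

An indexed join is exact when it returns the same pairs as this loop for every input and threshold.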
- src/simjoin.rs: SOTA L2AP (inverted index + Cauchy-Schwarz prefix prune) + branchless
sorted-merge dot + cache-packed prune state + rayon parallel probe. Asserted bit-identical
to an O(n^2) brute-force oracle on fuzzed corpora. API: Corpus::{from_rows, from_token_docs},
cosine_join, cosine_join_with(Concurrency), CosineJoiner handle.
- src/simjoin_gpu.rs: Metal batch_cosine kernel. The verify step is memory-bandwidth-bound; the
Apple GPU runs the random-gather dot products at ~3x the CPU's throughput (53 vs 22 GB/s).
Metal is f32-only, so f64 byte-parity is preserved via GPU-filter + CPU-exact-reverify
(gpu+cpu), with a pure-f32 fast path (gpu, <=1 differing pair per millions). ~1.8-2x
end-to-end speedup on the real top-300 PyPI corpus.
- Three backends behind the existing Concurrency enum (cpu / gpu+cpu / gpu), CPU fallback.
- Python: difflib_fast.cosine_join(docs, t, concurrency, threads) + CosineJoiner — token docs
to TF-IDF in Rust, auto-parallel (rayon, GIL released) like ratio_many.
- examples/{simjoin_bench, simjoin_pypi, simjoin_gpu_bench}; profiling cargo feature.
- README + benchmarks.md: simjoin section. Bump 0.2.0 -> 0.3.0.
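The Cauchy-Schwarz prefix prune mentioned in the list above can be illustrated with a simplified sketch. For unit-normalized vectors x and y, the dot product of any prefix of x with y is at most the prefix's L2 norm, so a prefix whose norm stays below t cannot reach the threshold on its own and need not be placed in the inverted index. The functions `prefix_norm` and `unindexed_prefix_len` are hypothetical names for this illustration; the actual L2AP implementation combines further bounds and is not reproduced here.

```rust
/// L2 norm of the first `len` entries of a sparse vector.
fn prefix_norm(v: &[(u32, f64)], len: usize) -> f64 {
    v[..len].iter().map(|&(_, w)| w * w).sum::<f64>().sqrt()
}

/// Number of leading entries of a unit-normalized sparse vector whose
/// cumulative L2 norm stays strictly below `t`.
///
/// By Cauchy-Schwarz, dot(prefix, y) <= ||prefix|| * ||y|| = ||prefix|| < t
/// for any unit vector y, so a pair with cos >= t must also share at least
/// one feature outside this prefix. Only the remaining suffix needs indexing.
fn unindexed_prefix_len(unit_vec: &[(u32, f64)], t: f64) -> usize {
    let mut sq = 0.0f64;
    for (k, &(_, w)) in unit_vec.iter().enumerate() {
        sq += w * w;
        if sq.sqrt() >= t {
            return k; // entry k is the first one that must be indexed
        }
    }
    unit_vec.len()
}
```

A higher threshold leaves a longer prefix unindexed, which is why the prune becomes more effective as t approaches 1.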
Adds simjoin: an exact all-pairs weighted-cosine similarity join over sparse non-negative vectors (every pair with cos ≥ t, no approximation). The principled exact replacement for shingle-candidate + verify near-duplicate detection (Type-3 code clones: functions × IDF-weighted canonical lines).

What's in it
- src/simjoin.rs: SOTA L2AP (inverted index + Cauchy–Schwarz prefix prune) + branchless sorted-merge dot + cache-packed prune state + rayon parallel probe. Asserted bit-identical to an O(n²) brute-force oracle on fuzzed corpora (the crate's "two implementations, one answer" gate). API: Corpus::{from_rows, from_token_docs}, cosine_join, cosine_join_with(Concurrency), CosineJoiner handle.
- src/simjoin_gpu.rs: Metal batch_cosine kernel. The verify step is memory-bandwidth-bound; the Apple GPU runs the random-gather dot products at ~3× the CPU's throughput (53 vs 22 GB/s). Metal is f32-only, so f64 byte-parity is preserved via GPU-filter + CPU-exact-reverify (gpu+cpu), plus a pure-f32 fast path (gpu, ≤1 differing pair per millions). ~1.8–2× end-to-end speedup on the real top-300 PyPI corpus (287k functions, 3.1M clone pairs).
- Three backends behind the existing Concurrency enum (cpu / gpu+cpu / gpu), transparent CPU fallback.
- Python: difflib_fast.cosine_join(docs, t, concurrency, threads) + CosineJoiner; token docs → TF-IDF in Rust, auto-parallel (rayon, GIL released) like ratio_many.
- examples/{simjoin_bench, simjoin_pypi, simjoin_gpu_bench}; profiling cargo feature; README + benchmarks.md §6.

Verification
- cargo test (CPU parity vs brute force) + cargo test --features gpu (gpu_hybrid_matches_cpu: exact backends byte-identical): green.
- cargo publish --dry-run clean.

Version bump 0.2.0 → 0.3.0.
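The filter-then-reverify idea behind the gpu+cpu backend can be sketched on the CPU alone. In this illustration an f32 cosine stands in for the Metal kernel and an f64 cosine stands in for the exact CPU verify. The function names, the dense-vector representation, and the `slack` margin are all assumptions made for this sketch; the PR does not state how the real filter threshold is chosen.

```rust
/// Exact cosine in f64 (stand-in for the CPU verify).
fn cosine_f64(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// The same computation in f32 (stand-in for an f32-only GPU kernel).
fn cosine_f32(a: &[f64], b: &[f64]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| (*x as f32) * (*y as f32)).sum();
    let na = a.iter().map(|x| (*x as f32) * (*x as f32)).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| (*x as f32) * (*x as f32)).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Indices of candidate pairs with f64 cosine >= t.
///
/// Stage 1 drops pairs whose f32 score is below `t - slack`; stage 2 decides
/// the survivors in f64. ASSUMPTION: `slack` must exceed the worst-case f32
/// rounding error, otherwise stage 1 could drop a true match.
fn hybrid_join(pairs: &[(Vec<f64>, Vec<f64>)], t: f64, slack: f32) -> Vec<usize> {
    let mut kept = Vec::new();
    for (i, (a, b)) in pairs.iter().enumerate() {
        if cosine_f32(a, b) < (t as f32) - slack {
            continue; // cheap reject in reduced precision
        }
        if cosine_f64(a, b) >= t {
            kept.push(i); // final decision always made in f64
        }
    }
    kept
}
```

Because the accept decision is made only in f64, the hybrid output matches a pure f64 pass as long as the f32 stage never rejects a true match; that equality is the kind of property a parity test such as gpu_hybrid_matches_cpu would assert.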