
simjoin: exact weighted-cosine similarity join (L2AP, CPU+GPU) + Python bindings (v0.3.0)#1

Merged: prostomarkeloff merged 1 commit into main from simjoin-l2ap on May 30, 2026

@prostomarkeloff
Owner

Adds simjoin, an exact all-pairs weighted-cosine similarity join over sparse non-negative vectors: it returns every pair with cos ≥ t, with no approximation. It is meant as the principled exact replacement for shingle-candidate + verify near-duplicate detection (Type-3 code clones: functions × IDF-weighted canonical lines).
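To make the contract concrete, here is a minimal pure-Python sketch of what an O(n²) brute-force oracle for this join could look like. It is an illustration only, not the crate's code: the names `cosine` and `brute_force_join` and the `{dimension: weight}` row layout are assumptions made for the example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of two sparse non-negative vectors given as {dimension: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def brute_force_join(rows, t):
    """Return every pair (i, j), i < j, whose cosine is >= t (hypothetical helper)."""
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(rows), 2)
        if cosine(u, v) >= t
    ]

rows = [{0: 1.0, 1: 2.0}, {0: 1.0, 1: 2.0, 2: 0.1}, {3: 1.0}]
pairs = brute_force_join(rows, 0.9)
```

Any faster join is only useful if it returns exactly the same pair set as a check of this kind.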

What's in it

  • src/simjoin.rs: SOTA L2AP (inverted index + Cauchy–Schwarz prefix prune) + branchless sorted-merge dot + cache-packed prune state + rayon parallel probe. Asserted bit-identical to an O(n²) brute-force oracle on fuzzed corpora (the crate's "two implementations, one answer" gate). API: Corpus::{from_rows, from_token_docs}, cosine_join, cosine_join_with(Concurrency), and a CosineJoiner handle.
  • src/simjoin_gpu.rs: Metal batch_cosine kernel. The verify step is memory-bandwidth-bound, and the Apple GPU clears the random-gather dot products at ~3× the CPU's rate (53 vs 22 GB/s). Metal is f32-only, so f64 byte-parity is preserved by filtering on the GPU and re-verifying exactly on the CPU (gpu+cpu); there is also a pure-f32 fast path (gpu, ≤1 differing pair per millions). ~1.8–2× end-to-end on the real top-300 PyPI corpus (287k functions, 3.1M clone pairs).
  • Three backends behind the existing Concurrency enum (cpu / gpu+cpu / gpu), with transparent CPU fallback.
  • Python: difflib_fast.cosine_join(docs, t, concurrency, threads) + CosineJoiner. Token docs are converted to TF-IDF in Rust, and the join is auto-parallel (rayon, GIL released) like ratio_many.
  • examples/{simjoin_bench, simjoin_pypi, simjoin_gpu_bench}; profiling cargo feature; README + benchmarks.md §6.
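The Cauchy–Schwarz prefix prune mentioned above can be illustrated with a simplified Python sketch. It is not the crate's L2AP implementation (full L2AP layers further bounds on top), and it assumes t > 0, non-negative weights, and dimensions ordered by sorted id. The names `normalize` and `prefix_pruned_join` are hypothetical. The idea: for unit vectors, a prefix of x with norm below t can contribute at most ‖prefix‖·‖y‖ < t to any dot product, so that prefix never needs to be indexed.

```python
import math
from collections import defaultdict

def normalize(row):
    """Scale a {dimension: weight} vector to unit L2 norm."""
    n = math.sqrt(sum(w * w for w in row.values()))
    return {d: w / n for d, w in row.items()} if n else {}

def prefix_pruned_join(rows, t):
    """All pairs (i, j), i < j, with cosine >= t, using a prefix-pruned inverted index."""
    units = [normalize(r) for r in rows]
    index = defaultdict(list)  # dimension -> ids of earlier vectors that indexed it
    out = []
    for j, y in enumerate(units):
        # Probe: any earlier x with dot(x, y) >= t must share an indexed dimension.
        candidates = set()
        for d in y:
            candidates.update(index.get(d, ()))
        # Verify: exact dot product on the surviving candidates only.
        for i in sorted(candidates):
            x = units[i]
            if sum(w * y[d] for d, w in x.items() if d in y) >= t:
                out.append((i, j))
        # Index: skip the longest prefix whose norm stays below t (Cauchy-Schwarz bound).
        sq = 0.0
        for d in sorted(y):
            sq += y[d] * y[d]
            if math.sqrt(sq) >= t:
                index[d].append(j)
    return out

rows = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 1: 1.0, 2: 0.2},
    {2: 1.0, 3: 1.0},
    {1: 0.1, 3: 1.0},
    {0: 3.0},
]
pairs = sorted(prefix_pruned_join(rows, 0.6))
```

In this toy corpus the fourth vector leaves its small-weight dimension 1 unindexed, which is the kind of posting-list saving the prune is after. A production version would also need to guard the bound against floating-point rounding at the threshold.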

Verification

  • cargo test (CPU parity vs brute force) + cargo test --features gpu (gpu_hybrid_matches_cpu: exact backends byte-identical) — green.
  • clippy clean on default / profiling / gpu / python,gpu; cargo publish --dry-run clean.
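The gpu+cpu backend's "GPU-filter + CPU-exact-reverify" idea, and why it can stay byte-identical to the CPU path, can be sketched in plain Python by emulating f32 arithmetic. This is an assumption-laden illustration rather than the Metal kernel: `f32`, `dot32`, `dot64`, `hybrid_verify`, and in particular the `slack` margin are hypothetical names and choices, since the PR does not say how the filter margin is picked.

```python
import struct

def f32(x):
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot64(u, v):
    """Exact-path dot product in f64 over {dimension: weight} dicts."""
    return sum(w * v[d] for d, w in u.items() if d in v)

def dot32(u, v):
    """Dot product with every operand and partial sum rounded to f32."""
    acc = 0.0
    for d, w in u.items():
        if d in v:
            acc = f32(acc + f32(f32(w) * f32(v[d])))
    return acc

def hybrid_verify(pairs, units, t, slack=1e-4):
    """Cheap f32 filter with a safety margin, then exact f64 re-check of survivors."""
    survivors = [(i, j) for i, j in pairs if dot32(units[i], units[j]) >= t - slack]
    return [(i, j) for i, j in survivors if dot64(units[i], units[j]) >= t]

units = [{0: 0.6, 1: 0.8}, {0: 0.8, 1: 0.6}, {0: 1.0}]
candidates = [(0, 1), (0, 2), (1, 2)]
kept = hybrid_verify(candidates, units, 0.8)
exact = [(i, j) for i, j in candidates if dot64(units[i], units[j]) >= 0.8]
```

As long as the margin exceeds the f32 rounding error, the filter only discards pairs the f64 check would also reject, so the final pair set matches the CPU-only result; dropping the second pass gives the faster but inexact pure-f32 behavior.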

Version bump 0.2.0 → 0.3.0.

…Python bindings (v0.3.0)

simjoin is an exact all-pairs weighted-cosine similarity join over sparse non-negative
vectors — every pair with cos >= t, no approximation — the principled exact replacement
for shingle-candidate + verify near-duplicate detection (Type-3 code clones).

- src/simjoin.rs: SOTA L2AP (inverted index + Cauchy-Schwarz prefix prune) + branchless
  sorted-merge dot + cache-packed prune state + rayon parallel probe. Asserted bit-identical
  to an O(n^2) brute-force oracle on fuzzed corpora. API: Corpus::{from_rows, from_token_docs},
  cosine_join, cosine_join_with(Concurrency), CosineJoiner handle.
- src/simjoin_gpu.rs: Metal batch_cosine kernel. The verify is memory-bandwidth-bound; the
  Apple GPU clears the random-gather dots ~3x the CPU (53 vs 22 GB/s). Metal is f32-only, so
  f64 byte-parity is preserved via GPU-filter + CPU-exact-reverify (gpu+cpu), with a pure-f32
  fast path (gpu, <=1 differing pair per millions). ~1.8-2x end-to-end on real top-300 PyPI.
- Three backends behind the existing Concurrency enum (cpu / gpu+cpu / gpu), CPU fallback.
- Python: difflib_fast.cosine_join(docs, t, concurrency, threads) + CosineJoiner — token docs
  to TF-IDF in Rust, auto-parallel (rayon, GIL released) like ratio_many.
- examples/{simjoin_bench, simjoin_pypi, simjoin_gpu_bench}; profiling cargo feature.
- README + benchmarks.md: simjoin section. Bump 0.2.0 -> 0.3.0.
@prostomarkeloff prostomarkeloff merged commit 17d1f0d into main May 30, 2026
8 checks passed
@prostomarkeloff prostomarkeloff deleted the simjoin-l2ap branch May 30, 2026 13:38