Release v0.3.5: simjoin CPU hot-path speedup (accumulate + verify), bit-identical by prostomarkeloff · Pull Request #2 · prostomarkeloff/difflib-fast

prostomarkeloff · 2026-06-02T10:26:27Z

What

Speeds up the simjoin CPU join hot path, byte-for-byte identical output. Bumps the crate to v0.3.5.

Profiling the real find-dup-defs patternology join (n≈3216 functions, dense bi/tri-gram vectors, θ=0.85) with samply phase-split + Apple-Silicon PMU counters showed the bottleneck there is not cos_full (that's the sparse PyPI-287k regime) but the ~5M candidate touches in accumulate and their per-candidate bound check — throughput was held down by branch mispredicts, not memory.

Changes (all output-preserving)

accumulate — split the acc/touched borrows so the acc base pointer is no longer reloaded from the Scratch struct on every posting entry; and make first-touch branchless (unconditional store into touched, committed only by bumping the length). Same accumulator values + same touched order ⇒ bit-identical. Removes the data-dependent mispredict that capped IPC (3.0 → 5.1, cond-mispred 5.8 → 1.2 /1k insns).
verify_pruned / collect_survivors / cosine_join_counts — a cheap monotone pre-bound: since xpn is monotonic, xpn[kstar] ≤ ‖probe‖, so a candidate failing a + ‖probe‖·pnorm < need also fails the exact L2AP bound — prune it without the per-candidate partition_point. Survivor set is bit-identical. Gated on di.len() >= PREBOUND_MIN_DIMS so the sparse regime (short rows, cheap binary search, high survivor rate) pays nothing.
examples/simjoin_pypi: add SJ_NSUB env to bench a row subset (small-n regime).
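
A minimal sketch of the first-touch change described above, not the crate's actual code. The `Scratch` layout, the posting format `(candidate id, weight)`, and the convention that an accumulator of exactly `0.0` means "untouched" (which requires every contribution to be non-zero) are all assumptions made for illustration. Only the names `acc`, `touched`, and `Scratch` come from the PR text.

```rust
/// Hypothetical scratch space; only the field names follow the PR text.
struct Scratch {
    acc: Vec<f64>,     // per-candidate accumulator; 0.0 is assumed to mean "untouched"
    touched: Vec<u32>, // candidate ids in first-touch order, one spare slot at the end
    tlen: usize,       // number of committed entries in `touched`
}

impl Scratch {
    fn new(n: usize) -> Self {
        // n + 1 slots so the unconditional store below always has a valid target.
        Scratch { acc: vec![0.0; n], touched: vec![0; n + 1], tlen: 0 }
    }
}

/// Baseline: a data-dependent branch on every posting entry.
fn accumulate_branchy(s: &mut Scratch, postings: &[(u32, f64)], q: f64) {
    for &(c, w) in postings {
        if s.acc[c as usize] == 0.0 {
            s.touched[s.tlen] = c;
            s.tlen += 1;
        }
        s.acc[c as usize] += q * w;
    }
}

/// Branchless first touch: always store the id, commit it only by bumping the length.
fn accumulate_branchless(s: &mut Scratch, postings: &[(u32, f64)], q: f64) {
    // Split the borrows once so the loop works on plain slices and a local
    // counter instead of going back through the struct on every entry.
    let Scratch { acc, touched, tlen } = s;
    let acc = acc.as_mut_slice();
    let touched = touched.as_mut_slice();
    let mut t = *tlen;
    for &(c, w) in postings {
        let slot = &mut acc[c as usize];
        let first = usize::from(*slot == 0.0); // 1 on first touch, else 0
        touched[t] = c; // unconditional store; overwritten later if not committed
        t += first;     // the commit
        *slot += q * w;
    }
    *tlen = t;
}
```

Both functions perform the same additions in the same order and commit ids in the same order, which is why the accumulator values and `touched[..tlen]` match exactly. An uncommitted store only writes to the slot just past the committed prefix, so it never disturbs an earlier entry.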

Results (byte-for-byte unchanged)

workload	speedup
find-dup-defs patternology join (cycles 309M → 180M)	~1.4–1.55×
PyPI type3 287k join	~1.05× (no regression)

Correctness

Parity is the crate's hard gate — all green:

indexed_join_matches_bruteforce (O(n²) oracle, 400 fuzzed corpora)
gpu_hybrid_matches_cpu (Metal f32-filter / CPU-f64-reverify)
Both real corpora: emitted pair sets identical (1849 / 4 427 097)
cargo clippy strict (all + pedantic, incl. gpu) clean; full test suite green

…entical

Profiling the real find-dup-defs *patternology* join (n≈3216, dense bi/tri-gram vectors, θ=0.85) with samply + Apple-PMU counters showed the bottleneck is NOT cos_full (the PyPI-287k regime) but the 5M candidate touches in accumulate + their per-candidate bound check: IPC was held down by branch mispredicts, not memory.

Three output-preserving changes:

- accumulate: split the acc/touched borrows so the `acc` base pointer is no longer reloaded from the Scratch struct on every posting entry; and make first-touch branchless, i.e. an unconditional store into `touched` committed only by bumping the length (`tlen += first`). Same accumulator values and same touched order, so bit-identical; the data-dependent mispredict that capped throughput is gone (IPC 3.0 -> 5.1, cond-mispred 5.8 -> 1.2 /1k).
- verify_pruned / collect_survivors / cosine_join_counts: a cheap monotone pre-bound. Since xpn is monotonic, xpn[kstar] <= ‖probe‖, so a candidate failing `a + ‖probe‖·pnorm < need` also fails the exact L2AP bound; prune it without the per-candidate `partition_point`. Survivor set is bit-identical. Gated on `di.len() >= PREBOUND_MIN_DIMS` so the sparse regime (short rows, cheap partition_point, high survivor rate) pays nothing.

Result, byte-for-byte unchanged (fuzz + GPU-hybrid parity gates pass):

- find-dup-defs patternology join: ~1.4-1.55x (cycles 309M -> 180M)
- PyPI type3 287k join: ~1.05x (no regression)

examples/simjoin_pypi: add SJ_NSUB env to bench a row subset (small-n regime).
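
The pre-bound argument can be illustrated with a small sketch. Everything beyond the names `xpn`, `kstar`, `pnorm`, `need`, `di`, and `PREBOUND_MIN_DIMS` is an assumption: the `Cand` struct, the way `kstar` is derived from a `split` dimension, the reading of `xpn` as prefix norms of the probe, and the threshold value `16` are hypothetical and not taken from the crate.

```rust
/// Assumed value; the PR does not state the real threshold.
const PREBOUND_MIN_DIMS: usize = 16;

/// Assumed meaning of `xpn`: xpn[k] is the Euclidean norm of the first k probe
/// weights, so xpn[0] = 0, xpn is non-decreasing, and xpn[len] = ‖probe‖.
fn prefix_norms(w: &[f64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(w.len() + 1);
    let mut sum_sq = 0.0_f64;
    out.push(0.0);
    for &x in w {
        sum_sq += x * x;
        out.push(sum_sq.sqrt());
    }
    out
}

/// Hypothetical candidate record.
struct Cand {
    a: f64,     // dot product accumulated so far
    pnorm: f64, // norm of the candidate's unseen part, assumed >= 0
    split: u32, // hypothetical dimension id that determines kstar
}

/// Exact check: a binary search per candidate, then the bound.
fn exact_keep(di: &[u32], xpn: &[f64], c: &Cand, need: f64) -> bool {
    let kstar = di.partition_point(|&d| d < c.split);
    c.a + xpn[kstar] * c.pnorm >= need
}

/// Same decision, with the cheap pre-bound tried first on long probe rows.
fn keep(di: &[u32], xpn: &[f64], c: &Cand, need: f64) -> bool {
    if di.len() >= PREBOUND_MIN_DIMS {
        let probe_norm = xpn[di.len()];
        // xpn[kstar] <= probe_norm and pnorm >= 0, so failing this looser
        // bound implies failing the exact one; no binary search needed.
        if c.a + probe_norm * c.pnorm < need {
            return false;
        }
    }
    exact_keep(di, xpn, c, need)
}
```

Under these assumptions `keep` and `exact_keep` agree on every candidate, including in floating point: multiplying by a non-negative `pnorm` and adding `a` are both monotone under IEEE rounding, so the looser bound is never smaller than the exact one. The gate on `di.len()` only decides whether the shortcut is attempted, never the outcome.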

prostomarkeloff added 2 commits June 2, 2026 13:16

Release v0.3.5: simjoin CPU hot-path speedup (accumulate + verify), bit-identical (596d285)

prostomarkeloff merged commit d6b0c99 into main Jun 2, 2026
7 checks passed

prostomarkeloff deleted the simjoin-cpu-grind branch June 2, 2026 10:27

prostomarkeloff merged 2 commits into main from simjoin-cpu-grind