Release v0.3.5: simjoin CPU hot-path speedup (accumulate + verify), bit-identical #2
Merged
…entical

Profiling the real find-dup-defs *patternology* join (n≈3216, dense bi/tri-gram vectors, θ=0.85) with samply + Apple-PMU counters showed the bottleneck is NOT cos_full (the PyPI-287k regime) but the 5M candidate touches in accumulate + their per-candidate bound check. IPC was held down by branch mispredicts, not memory.

Three output-preserving changes:

- accumulate: split the acc/touched borrows so the `acc` base pointer is no longer reloaded from the Scratch struct on every posting entry; and make first-touch branchless: an unconditional store into `touched`, committed only by bumping the length (`tlen += first`). Same accumulator values and same touched order, so bit-identical; the data-dependent mispredict that capped throughput is gone (IPC 3.0 -> 5.1, cond-mispred 5.8 -> 1.2 /1k).
- verify_pruned / collect_survivors / cosine_join_counts: a cheap monotone pre-bound. Since xpn is monotonic, xpn[kstar] <= ‖probe‖, so a candidate failing `a + ‖probe‖·pnorm < need` also fails the exact L2AP bound; prune it without the per-candidate `partition_point`. Survivor set is bit-identical. Gated on `di.len() >= PREBOUND_MIN_DIMS` so the sparse regime (short rows, cheap partition_point, high survivor rate) pays nothing.

Result, byte-for-byte unchanged (fuzz + GPU-hybrid parity gates pass):

- find-dup-defs patternology join: ~1.4-1.55x (cycles 309M -> 180M)
- PyPI type3 287k join: ~1.05x (no regression)

examples/simjoin_pypi: add SJ_NSUB env to bench a row subset (small-n regime).
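As a rough illustration of the branchless first-touch in `accumulate`, the sketch below contrasts a branchy loop with the store-then-commit form. The `Scratch` layout, the `(u32, f64)` posting type, the convention that `0.0` means "untouched", and both function names are assumptions made for this example, not the crate's actual code. It also assumes every contribution `w * v` is strictly positive, so a touched slot never reads as `0.0` again.

```rust
/// Hypothetical scratch space; field names follow the PR text, the layout is assumed.
struct Scratch {
    acc: Vec<f64>,     // one accumulator per candidate row, 0.0 = untouched
    touched: Vec<u32>, // length n + 1 so the speculative store stays in bounds
    tlen: usize,       // number of committed entries in `touched`
}

impl Scratch {
    fn new(n: usize) -> Self {
        Scratch { acc: vec![0.0; n], touched: vec![0; n + 1], tlen: 0 }
    }
}

/// Baseline: a data-dependent branch on every posting entry.
fn accumulate_branchy(s: &mut Scratch, postings: &[(u32, f64)], w: f64) {
    for &(id, v) in postings {
        if s.acc[id as usize] == 0.0 {
            s.touched[s.tlen] = id;
            s.tlen += 1;
        }
        s.acc[id as usize] += w * v;
    }
}

/// Branchless first-touch: always store, commit only by bumping the length.
fn accumulate_branchless(s: &mut Scratch, postings: &[(u32, f64)], w: f64) {
    // Split the borrows once, so the loop works on plain slices and a local
    // counter instead of going back through `s` on every entry.
    let Scratch { acc, touched, tlen } = s;
    let acc: &mut [f64] = acc.as_mut_slice();
    let touched: &mut [u32] = touched.as_mut_slice();
    let mut n = *tlen;
    for &(id, v) in postings {
        let slot = &mut acc[id as usize];
        let first = (*slot == 0.0) as usize; // 1 on first touch, else 0
        touched[n] = id; // unconditional store, overwritten if not committed
        n += first;      // commit only when this was the first touch
        *slot += w * v;
    }
    *tlen = n;
}
```

Both variants leave the same `acc` values and the same `touched[..tlen]` order. Sizing `touched` to `n + 1` is what keeps the uncommitted store in bounds once every row has already been touched.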
What
Speeds up the `simjoin` CPU join hot path with byte-for-byte identical output. Bumps the crate to v0.3.5.

Profiling the real find-dup-defs patternology join (n≈3216 functions, dense bi/tri-gram vectors, θ=0.85) with samply phase-split + Apple-Silicon PMU counters showed the bottleneck there is not `cos_full` (that's the sparse PyPI-287k regime) but the ~5M candidate touches in `accumulate` and their per-candidate bound check: throughput was held down by branch mispredicts, not memory.

Changes (all output-preserving)

- `accumulate`: split the `acc`/`touched` borrows so the `acc` base pointer is no longer reloaded from the `Scratch` struct on every posting entry, and make first-touch branchless (unconditional store into `touched`, committed only by bumping the length). Same accumulator values + same touched order ⇒ bit-identical. Removes the data-dependent mispredict that capped IPC (3.0 → 5.1, cond-mispred 5.8 → 1.2 /1k insns).
- `verify_pruned` / `collect_survivors` / `cosine_join_counts`: a cheap monotone pre-bound. Since `xpn` is monotonic, `xpn[kstar] ≤ ‖probe‖`, so any candidate with `a + ‖probe‖·pnorm < need` also fails the exact L2AP bound and can be pruned without the per-candidate `partition_point`. Survivor set is bit-identical. Gated on `di.len() >= PREBOUND_MIN_DIMS` so the sparse regime (short rows, cheap binary search, high survivor rate) pays nothing.
- `examples/simjoin_pypi`: add `SJ_NSUB` env to bench a row subset (small-n regime).

Results (byte-for-byte unchanged)

- find-dup-defs patternology join: ~1.4-1.55x (cycles 309M -> 180M)
- PyPI type3 287k join: ~1.05x (no regression)
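The monotone pre-bound from the Changes list can be sketched as follows. This is a minimal illustration under stated assumptions: the `Cand` struct, its `cut` field, the way `kstar` is derived, the `prefix_norms` helper, and the value chosen for `PREBOUND_MIN_DIMS` are all hypothetical, since the PR text names the constant and the bound but does not show the code. The sketch assumes `xpn[k]` is the norm of the probe's first `k` values (so `xpn` is non-decreasing and its last entry is `‖probe‖`) and that `pnorm >= 0`.

```rust
/// Hypothetical value; the PR names the constant but does not state it here.
const PREBOUND_MIN_DIMS: usize = 32;

/// Hypothetical per-candidate state.
struct Cand {
    a: f64,     // accumulated partial dot product
    pnorm: f64, // non-negative norm of the candidate's remaining part
    cut: u32,   // dimension id used to locate kstar in the probe
}

/// Prefix norms of the probe values: out[k] = norm of vals[..k].
/// Non-decreasing by construction, and out[vals.len()] = ‖probe‖.
fn prefix_norms(vals: &[f64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(vals.len() + 1);
    let mut sq = 0.0;
    out.push(0.0);
    for &v in vals {
        sq += v * v;
        out.push(sq.sqrt());
    }
    out
}

/// Exact check: binary-search for kstar, then apply the bound.
fn survives_exact(c: &Cand, di: &[u32], xpn: &[f64], need: f64) -> bool {
    let kstar = di.partition_point(|&d| d < c.cut);
    c.a + xpn[kstar] * c.pnorm >= need
}

/// Same answer as `survives_exact`, but dense probes first try the cheap
/// pre-bound, which needs no binary search.
fn survives(c: &Cand, di: &[u32], xpn: &[f64], need: f64) -> bool {
    let probe_norm = xpn[di.len()];
    // xpn[kstar] <= probe_norm and pnorm >= 0, so failing this looser bound
    // implies failing the exact one; the survivor set cannot change.
    if di.len() >= PREBOUND_MIN_DIMS && c.a + probe_norm * c.pnorm < need {
        return false;
    }
    survives_exact(c, di, xpn, need)
}
```

The pre-bound only ever rejects candidates that the exact check would also reject, so it trades one multiply-add and compare for a skipped binary search. The length gate keeps short probes on the original path.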
Correctness
Parity is the crate's hard gate, and all checks are green:

- `indexed_join_matches_bruteforce` (O(n²) oracle, 400 fuzzed corpora)
- `gpu_hybrid_matches_cpu` (Metal f32-filter / CPU-f64-reverify)
- `cargo clippy` strict (all + pedantic, incl. `gpu`) clean; full test suite green
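For readers unfamiliar with this style of gate, the idea behind an oracle parity check can be sketched as below. This is a toy under stated assumptions, not the crate's `indexed_join_matches_bruteforce` test: the `Row` type, the inverted-index join, the xorshift generator, and every function name are hypothetical. The toy join gathers candidates from an index and then verifies each with the same `cosine` function the oracle uses, so for any θ > 0 the two pair lists should match exactly.

```rust
use std::collections::BTreeSet;

/// Sparse row as sorted (dimension, value) pairs. Hypothetical representation.
type Row = Vec<(u32, f64)>;

fn dot(a: &Row, b: &Row) -> f64 {
    let (mut i, mut j, mut s) = (0, 0, 0.0);
    while i < a.len() && j < b.len() {
        if a[i].0 == b[j].0 {
            s += a[i].1 * b[j].1;
            i += 1;
            j += 1;
        } else if a[i].0 < b[j].0 {
            i += 1;
        } else {
            j += 1;
        }
    }
    s
}

fn cosine(a: &Row, b: &Row) -> f64 {
    let (na, nb) = (dot(a, a).sqrt(), dot(b, b).sqrt());
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot(a, b) / (na * nb) }
}

/// O(n^2) oracle: test every pair.
fn brute_force(rows: &[Row], theta: f64) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for i in 0..rows.len() {
        for j in 0..i {
            if cosine(&rows[i], &rows[j]) >= theta {
                out.push((j, i));
            }
        }
    }
    out.sort_unstable();
    out
}

/// Toy indexed join: candidates are earlier rows sharing a dimension,
/// each re-verified with the same `cosine` as the oracle.
fn indexed_join(rows: &[Row], theta: f64, dims: usize) -> Vec<(usize, usize)> {
    let mut index: Vec<Vec<usize>> = vec![Vec::new(); dims];
    let mut out = Vec::new();
    for (i, row) in rows.iter().enumerate() {
        let mut cands = BTreeSet::new();
        for &(d, _) in row {
            cands.extend(index[d as usize].iter().copied());
        }
        for j in cands {
            if cosine(row, &rows[j]) >= theta {
                out.push((j, i));
            }
        }
        for &(d, _) in row {
            index[d as usize].push(i);
        }
    }
    out.sort_unstable();
    out
}

/// Deterministic xorshift64 step, so fuzz corpora are reproducible.
fn next(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn fuzz_corpus(seed: u64, n: usize, dims: u32) -> Vec<Row> {
    let mut s = seed | 1; // xorshift state must be non-zero
    let mut rows = Vec::with_capacity(n);
    for _ in 0..n {
        let mut row = Row::new();
        for d in 0..dims {
            let r = next(&mut s);
            if r % 3 == 0 {
                row.push((d, 1.0 + ((r >> 8) % 4) as f64));
            }
        }
        rows.push(row);
    }
    rows
}
```

A parity test then loops over seeds and asserts that `indexed_join` and `brute_force` return the same list for each corpus. Reusing the oracle's scoring function for verification is what lets the comparison be exact rather than tolerance-based.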