
Release v0.3.5: simjoin CPU hot-path speedup (accumulate + verify), bit-identical #2

Merged: prostomarkeloff merged 2 commits into main from simjoin-cpu-grind on Jun 2, 2026

@prostomarkeloff
Owner

What

Speeds up the simjoin CPU join hot path with byte-for-byte identical output. Bumps the crate to v0.3.5.

Profiling the real find-dup-defs patternology join (n≈3216 functions, dense bi/tri-gram vectors, θ=0.85) with samply phase-split and Apple-Silicon PMU counters showed that the bottleneck there is not cos_full (which is the bottleneck in the sparse PyPI-287k regime) but the ~5M candidate touches in accumulate and their per-candidate bound check. Throughput was held down by branch mispredicts, not memory.

Changes (all output-preserving)

  • accumulate — split the acc/touched borrows so the acc base pointer is no longer reloaded from the Scratch struct on every posting entry; and make first-touch branchless (unconditional store into touched, committed only by bumping the length). Same accumulator values + same touched order ⇒ bit-identical. Removes the data-dependent mispredict that capped IPC (3.0 → 5.1, cond-mispred 5.8 → 1.2 /1k insns).
  • verify_pruned / collect_survivors / cosine_join_counts: a cheap monotone pre-bound. Since xpn is monotonic, xpn[kstar] ≤ ‖probe‖, so a candidate for which a + ‖probe‖·pnorm < need holds also fails the exact L2AP bound and can be pruned without the per-candidate partition_point. Survivor set is bit-identical. Gated on di.len() >= PREBOUND_MIN_DIMS so the sparse regime (short rows, cheap binary search, high survivor rate) pays nothing.
  • examples/simjoin_pypi: add SJ_NSUB env to bench a row subset (small-n regime).
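
The branchless first-touch idea from the accumulate change can be sketched as follows. This is a minimal illustration, not the crate's code: the `Scratch` layout, the `(candidate, weight)` posting shape, the function names, and the use of `acc == 0.0` as the "untouched" marker (which assumes strictly positive contributions) are all assumptions. The store into `touched` always happens; it only becomes visible when the length is bumped by a 0/1 flag.

```rust
/// Hypothetical scratch space. Field names follow the PR text; the crate's
/// real `Scratch` layout is not shown there, so this shape is an assumption.
pub struct Scratch {
    /// Partial dot product per candidate row. 0.0 means "not touched yet",
    /// which assumes every posting contributes a strictly positive amount.
    pub acc: Vec<f64>,
    /// Candidate ids in first-touch order, with one spare slot so the
    /// unconditional store below is always in bounds.
    pub touched: Vec<u32>,
    /// Number of committed entries in `touched`.
    pub tlen: usize,
}

impl Scratch {
    pub fn new(n_rows: usize) -> Self {
        Scratch { acc: vec![0.0; n_rows], touched: vec![0; n_rows + 1], tlen: 0 }
    }
}

/// Reference version: a data-dependent branch on "is this the first touch?".
pub fn accumulate_branchy(s: &mut Scratch, postings: &[(u32, f64)], w: f64) {
    for &(cand, pw) in postings {
        let c = cand as usize;
        if s.acc[c] == 0.0 {
            s.touched[s.tlen] = cand;
            s.tlen += 1;
        }
        s.acc[c] += w * pw;
    }
}

/// Sketch of the branchless variant: same accumulator values, same touched order.
pub fn accumulate_branchless(s: &mut Scratch, postings: &[(u32, f64)], w: f64) {
    // Split the borrows once, so the loop works on two independent slices
    // and a local length instead of going through `s` on every entry.
    let Scratch { acc, touched, tlen } = s;
    let acc = acc.as_mut_slice();
    let touched = touched.as_mut_slice();
    let mut n = *tlen;
    for &(cand, pw) in postings {
        let slot = &mut acc[cand as usize];
        let first = usize::from(*slot == 0.0); // 1 on first touch, else 0
        touched[n] = cand; // unconditional store into the next free slot
        n += first;        // committed only if this was a first touch
        *slot += w * pw;
    }
    *tlen = n;
}
```

Whether the compiler actually emits branch-free machine code for the flag is not guaranteed by the source alone; the bounds checks also remain, although those branches are well predicted. The point of the sketch is only that both versions produce the same `acc` values and the same `touched[..tlen]` prefix.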

Results (byte-for-byte unchanged)

| Workload | Speedup |
| --- | --- |
| find-dup-defs patternology join (cycles 309M → 180M) | ~1.4–1.55× |
| PyPI type3 287k join | ~1.05× (no regression) |

Correctness

Parity is the crate's hard gate — all green:

  • indexed_join_matches_bruteforce (O(n²) oracle, 400 fuzzed corpora)
  • gpu_hybrid_matches_cpu (Metal f32-filter / CPU-f64-reverify)
  • Both real corpora: emitted pair sets identical (1849 / 4 427 097)
  • cargo clippy strict (all + pedantic, incl. gpu) clean; full test suite green
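
For readers unfamiliar with oracle-style parity gates, the shape of such a check can be sketched as below. This is a hypothetical illustration, not the crate's `indexed_join_matches_bruteforce` test: the `SparseRow` type, the function names, and the fixed tiny corpus used in place of fuzzed corpora are assumptions. The oracle scores every pair directly, and the join under test must emit exactly the same pair set.

```rust
/// Hypothetical sparse row: (dimension id, value), sorted by dimension id.
pub type SparseRow = Vec<(u32, f64)>;

/// Dot product of two sorted sparse rows via a linear merge.
fn dot(a: &[(u32, f64)], b: &[(u32, f64)]) -> f64 {
    let (mut i, mut j, mut sum) = (0usize, 0usize, 0.0f64);
    while i < a.len() && j < b.len() {
        if a[i].0 < b[j].0 {
            i += 1;
        } else if a[i].0 > b[j].0 {
            j += 1;
        } else {
            sum += a[i].1 * b[j].1;
            i += 1;
            j += 1;
        }
    }
    sum
}

/// L2 norm of a sparse row.
fn norm(a: &[(u32, f64)]) -> f64 {
    a.iter().map(|&(_, v)| v * v).sum::<f64>().sqrt()
}

/// O(n^2) oracle: every pair (i, j), i < j, with cosine similarity >= theta.
pub fn brute_force_join(rows: &[SparseRow], theta: f64) -> Vec<(usize, usize)> {
    let norms: Vec<f64> = rows.iter().map(|r| norm(r)).collect();
    let mut out = Vec::new();
    for i in 0..rows.len() {
        for j in (i + 1)..rows.len() {
            let denom = norms[i] * norms[j];
            if denom > 0.0 && dot(&rows[i], &rows[j]) / denom >= theta {
                out.push((i, j));
            }
        }
    }
    out
}

/// Parity check: the pair sets must match exactly, regardless of emit order.
pub fn assert_same_pairs(got: &[(usize, usize)], want: &[(usize, usize)]) {
    let mut got = got.to_vec();
    let mut want = want.to_vec();
    got.sort_unstable();
    want.sort_unstable();
    assert_eq!(got, want, "join output diverged from the brute-force oracle");
}
```

In a real gate, `got` would come from the indexed join under test and the corpus would be generated by a fuzzer; here both are left to the caller.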

…entical

Profiling the real find-dup-defs *patternology* join (n≈3216, dense bi/tri-gram
vectors, θ=0.85) with samply + Apple-PMU counters showed the bottleneck is NOT
cos_full (the PyPI-287k regime) but the 5M candidate touches in accumulate +
their per-candidate bound check — IPC was held down by branch mispredicts, not
memory. Three output-preserving changes:

- accumulate: split the acc/touched borrows so the `acc` base pointer is no
  longer reloaded from the Scratch struct on every posting entry; and make
  first-touch branchless — an unconditional store into `touched` committed only
  by bumping the length (`tlen += first`). Same accumulator values and same
  touched order, so bit-identical; the data-dependent mispredict that capped
  throughput is gone (IPC 3.0 -> 5.1, cond-mispred 5.8 -> 1.2 /1k).

- verify_pruned / collect_survivors / cosine_join_counts: a cheap monotone
  pre-bound. Since xpn is monotonic, xpn[kstar] <= ‖probe‖, so a candidate
  failing `a + ‖probe‖·pnorm < need` also fails the exact L2AP bound — prune it
  without the per-candidate `partition_point`. Survivor set is bit-identical.
  Gated on `di.len() >= PREBOUND_MIN_DIMS` so the sparse regime (short rows,
  cheap partition_point, high survivor rate) pays nothing.
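
A minimal sketch of the pre-bound, under stated assumptions: `xpn[k]` is read here as the L2 norm of the probe's first `k` values, `kstar` as the `partition_point` of the probe's sorted dimension ids against a hypothetical `split_dim`, and the exact bound as `a + xpn[kstar] * pnorm >= need`. The value of `PREBOUND_MIN_DIMS` is also a placeholder. None of these details are confirmed by the text above; only the inequality `xpn[kstar] <= ‖probe‖` and the gate are.

```rust
/// Assumed gate value. The PR names `PREBOUND_MIN_DIMS` but does not state it.
pub const PREBOUND_MIN_DIMS: usize = 16;

/// Prefix norms of a probe: `xpn[k]` is the L2 norm of its first `k` values,
/// so `xpn` is non-decreasing and its last entry is the full probe norm.
/// (This reading of `xpn` is an assumption based on the PR's description.)
pub fn prefix_norms(values: &[f64]) -> Vec<f64> {
    let mut xpn = Vec::with_capacity(values.len() + 1);
    let mut sum_sq = 0.0_f64;
    xpn.push(0.0);
    for &v in values {
        sum_sq += v * v;
        xpn.push(sum_sq.sqrt());
    }
    xpn
}

/// Exact check (assumed form): find `kstar` by binary search over the probe's
/// sorted dimension ids, then bound the remaining dot product by
/// `xpn[kstar] * pnorm`. Requires `xpn.len() == di.len() + 1`.
pub fn survives_exact(
    a: f64, di: &[u32], xpn: &[f64], split_dim: u32, pnorm: f64, need: f64,
) -> bool {
    let kstar = di.partition_point(|&d| d < split_dim);
    a + xpn[kstar] * pnorm >= need
}

/// Same decision, with the cheap pre-bound in front of the binary search.
pub fn survives_prebound(
    a: f64, di: &[u32], xpn: &[f64], split_dim: u32, pnorm: f64, need: f64,
) -> bool {
    let probe_norm = xpn[di.len()]; // full probe norm, >= xpn[k] for every k
    // For pnorm >= 0, `a + xpn[kstar] * pnorm <= a + probe_norm * pnorm`, so a
    // candidate rejected here would also be rejected by the exact check.
    if di.len() >= PREBOUND_MIN_DIMS && a + probe_norm * pnorm < need {
        return false;
    }
    survives_exact(a, di, xpn, split_dim, pnorm, need)
}
```

Because IEEE-754 rounding of multiplication and addition is monotone, the inequality in the comment holds for the computed floating-point values as well, which is why the early exit cannot change which candidates survive as long as `pnorm` is non-negative.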

Result, byte-for-byte unchanged (fuzz + GPU-hybrid parity gates pass):
  - find-dup-defs patternology join: ~1.4-1.55x (cycles 309M -> 180M)
  - PyPI type3 287k join: ~1.05x (no regression)

examples/simjoin_pypi: add SJ_NSUB env to bench a row subset (small-n regime).
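
One way an example binary might honor `SJ_NSUB` is sketched below. The parsing rules here (take the first `k` rows, clamp to the corpus size, fall back to the full corpus on a missing, zero, or malformed value) and the function names are assumptions; the PR only states that the variable selects a row subset.

```rust
use std::env;

/// Resolve the subset size from an optional raw setting.
/// Hypothetical policy: unset, zero, or unparsable means "use all rows".
pub fn subset_len(raw: Option<&str>, n_rows: usize) -> usize {
    match raw.and_then(|s| s.trim().parse::<usize>().ok()) {
        Some(k) if k > 0 => k.min(n_rows),
        _ => n_rows,
    }
}

/// Same as `subset_len`, reading the value from the `SJ_NSUB` environment variable.
pub fn subset_len_from_env(n_rows: usize) -> usize {
    subset_len(env::var("SJ_NSUB").ok().as_deref(), n_rows)
}
```

With this policy, running the example with `SJ_NSUB=3000` set in the environment would bench only the first 3000 rows, which approximates the small-n regime on a large corpus.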
@prostomarkeloff merged commit d6b0c99 into main on Jun 2, 2026
7 checks passed
@prostomarkeloff deleted the simjoin-cpu-grind branch on June 2, 2026 10:27