[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe#2161
yew1eb wants to merge 1 commit into apache:master
Conversation
[AURON-2160] Optimize the join hash map probe by checking `hash_matched` first, before computing the empty mask. This cuts SIMD comparison instructions roughly in half when the hash hit rate is high (the typical join scenario). Before: both the `hash_matched` and `empty` SIMD masks were always computed. After: the `empty` mask is computed only when `hash_matched` has no hits.
Pull request overview
Optimizes the SIMD-based probe path in the native-engine join hash map by short-circuiting the “empty slot” SIMD comparison when a hash match is found, targeting reduced instruction count in typical high-hit-rate join workloads.
Changes:
- Splits the probe condition into a fast-path (hash match) and slow-path (empty slot) to avoid an unconditional empty-mask SIMD compare.
- Returns `MapValue::EMPTY` directly when an empty slot is detected in the probed group.
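The short-circuit described above can be illustrated with a scalar sketch (assumed structure for illustration only; the real code operates on `std::simd` masks). Counting mask computations shows why a hash hit skips the empty-mask compare entirely:

```rust
// Scalar stand-in for one SIMD probe step over an 8-lane group.
// Names (`Probe`, `probe`, `lane_hashes`) are illustrative, not Auron's.
#[derive(Debug)]
enum Probe {
    Hit(usize), // matching hash found at this lane
    Empty,      // an empty slot found: key is absent
    Continue,   // group full with no match: probe the next group
}

fn probe(lane_hashes: &[u32; 8], probe_hash: u32, mask_computes: &mut u32) -> Probe {
    *mask_computes += 1; // "hash_matched" mask: always computed
    if let Some(pos) = lane_hashes.iter().position(|&h| h == probe_hash) {
        return Probe::Hit(pos); // fast path: common in high-hit-rate joins
    }
    *mask_computes += 1; // "empty" mask: computed only when no hash matched
    if lane_hashes.iter().any(|&h| h == 0) {
        return Probe::Empty;
    }
    Probe::Continue
}

fn main() {
    let group: [u32; 8] = [5, 7, 0, 0, 0, 0, 0, 0];
    let mut masks = 0u32;
    assert!(matches!(probe(&group, 7, &mut masks), Probe::Hit(1)));
    assert_eq!(masks, 1); // hit: one mask computation
    assert!(matches!(probe(&group, 9, &mut masks), Probe::Empty));
    assert_eq!(masks, 3); // miss: both masks computed
    println!("ok");
}
```

At a 100% hit rate every probe takes the fast path, so the per-probe mask computations drop from two to one, matching the ~50% reduction claimed in the description.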
```diff
  let hash_matched = self.map[e].hashes.simd_eq(Simd::splat(hashes[i]));
- let empty = self.map[e].hashes.simd_eq(Simd::splat(0));
-
- if let Some(pos) = (hash_matched | empty).first_set() {
+ // Fast path: check hash match first (common case)
+ if let Some(pos) = hash_matched.first_set() {
      hashes[i] = unsafe {
          // safety: transmute MapValue(u32) to u32
          std::mem::transmute(self.map[e].values[pos])
      };
      break;
  }
+
+ // Slow path: check empty slot only when no match
+ let empty = self.map[e].hashes.simd_eq(Simd::splat(0));
+ if empty.any() {
```
The correctness of checking hash_matched.first_set() before computing the empty mask relies on an invariant that, within a MapValueGroup, all occupied lanes are packed from the beginning (i.e., there cannot be an empty lane before a later occupied lane). That invariant currently holds because insertion always uses empty.first_set(), but it's not stated here and a future change (e.g., tombstones/deletes or a different insertion strategy) could break this lookup logic. Please document this invariant explicitly (or add a debug-only assertion) so this fast-path doesn't become subtly incorrect later.
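A debug-only check of that packed-occupancy invariant could look like the following sketch (names such as `lanes_are_packed` are illustrative, not part of the Auron codebase):

```rust
// Invariant from the review: within a group, once an empty lane
// (hash == 0) appears, every later lane must also be empty.
fn lanes_are_packed(lane_hashes: &[u32]) -> bool {
    let first_empty = lane_hashes
        .iter()
        .position(|&h| h == 0)
        .unwrap_or(lane_hashes.len());
    // All lanes at or after the first empty lane must be empty too.
    lane_hashes[first_empty..].iter().all(|&h| h == 0)
}

fn main() {
    // In the probe loop this would run only in debug builds, e.g.:
    // debug_assert!(lanes_are_packed(&self.map[e].hashes.to_array()));
    assert!(lanes_are_packed(&[7, 9, 0, 0]));   // occupied lanes packed at front
    assert!(!lanes_are_packed(&[7, 0, 9, 0]));  // hole before an occupied lane
    println!("ok");
}
```

A `debug_assert!` like this costs nothing in release builds but would catch a future insertion strategy (e.g. tombstones) that silently breaks the fast path.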
```rust
hashes[i] = unsafe {
    // safety: transmute MapValue(u32) to u32
    std::mem::transmute(MapValue::EMPTY)
};
```
This adds another std::mem::transmute(MapValue::EMPTY) to convert a MapValue into u32. Since MapValue is not marked #[repr(transparent)], transmutes between MapValue and u32 are not layout-guaranteed by Rust's spec. Consider replacing this with a safe conversion (e.g., accessing the inner .0 field within the module, or adding an as_u32() method / From<MapValue> for u32), or mark MapValue as #[repr(transparent)] to make the conversion sound.
Suggested change:

```diff
-hashes[i] = unsafe {
-    // safety: transmute MapValue(u32) to u32
-    std::mem::transmute(MapValue::EMPTY)
-};
+hashes[i] = MapValue::EMPTY.0;
```
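Beyond the one-line suggestion, the other alternatives mentioned in the review could look like this sketch (`MapValue` here is a stand-in wrapping a `u32`; the `EMPTY` bit pattern of 0 is an assumption for illustration):

```rust
#[repr(transparent)] // layout now guaranteed identical to u32
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct MapValue(u32);

impl MapValue {
    // Assumed placeholder value; the real EMPTY encoding may differ.
    const EMPTY: MapValue = MapValue(0);

    const fn as_u32(self) -> u32 {
        self.0 // plain field access: no transmute, no unsafe
    }
}

impl From<MapValue> for u32 {
    fn from(v: MapValue) -> u32 {
        v.0
    }
}

fn main() {
    assert_eq!(MapValue::EMPTY.as_u32(), 0);
    assert_eq!(u32::from(MapValue(42)), 42);
    println!("ok");
}
```

With `#[repr(transparent)]` the existing transmutes also become layout-sound, but the safe accessors remove the need for them entirely.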
ShreyeshArangath left a comment
How was the performance tested? Can you share some logs/numbers in the PR description as well?
We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?
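A stdlib-only sketch of that microbenchmark idea, with key sets constructed for controlled hit rates (`lookup_many` here is a plain-`HashMap` stand-in for the real Auron routine; a Criterion benchmark would replace the manual timing):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Stand-in for the probe routine under test.
fn lookup_many(map: &HashMap<u32, u32>, keys: &[u32]) -> usize {
    keys.iter().filter(|k| map.contains_key(k)).count()
}

fn main() {
    let n: u32 = 100_000;
    let map: HashMap<u32, u32> = (0..n).map(|k| (k, k)).collect();

    for (label, hit_rate) in [("0%", 0.0_f64), ("50%", 0.5), ("100%", 1.0)] {
        // Keys below `cutoff` are present; keys shifted past `n` always miss.
        let cutoff = (n as f64 * hit_rate) as u32;
        let keys: Vec<u32> = (0..n)
            .map(|i| if i < cutoff { i } else { n + i })
            .collect();

        let start = Instant::now();
        let hits = lookup_many(&map, &keys);
        assert_eq!(hits as u32, cutoff); // sanity-check the constructed hit rate
        println!("{label} hit rate: {hits} hits in {:?}", start.elapsed());
    }
}
```

The 100% case should show the largest win from the short-circuit, and the 0% case would reveal any regression, since every probe there must still compute the empty mask.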
Which issue does this PR close?
Closes #2160
Rationale for this change
Optimize the join hash map probe by checking `hash_matched` first, before computing the empty mask. This cuts SIMD comparison instructions roughly in half when the hash hit rate is high (the typical join scenario).
What changes are included in this PR?
Are there any user-facing changes?
How was this patch tested?