
[AURON #2160] perf: SIMD short-circuit in JoinHashMap probe #2161

Open
yew1eb wants to merge 1 commit into apache:master from yew1eb:AURON_2160

Conversation

Contributor

@yew1eb yew1eb commented Apr 3, 2026

Which issue does this PR close?

Closes #2160

Rationale for this change

Optimize the join hash map probe by checking `hash_matched` before computing the empty mask. This cuts the SIMD comparison work roughly in half when the hash hit rate is high (the typical join scenario).

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

[AURON-2160] Optimize join hash map probe by checking hash_matched
first before computing empty mask. This reduces ~50% SIMD instructions
when hash hit rate is high (typical join scenarios).

Before: Always compute both hash_matched and empty SIMD masks.
After: Only compute empty mask when hash_matched has no hits.
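The control flow of the short-circuit can be modeled on stable Rust without the nightly `portable_simd` feature. The sketch below uses a fixed-size array as a scalar stand-in for the SIMD lane group; `Group`, `probe_group`, and `LANES` are illustrative names, not the PR's actual API (the real code compares all lanes at once with `Simd::simd_eq`):

```rust
// Scalar model of the short-circuited probe over one hash group.
// A lane with hash == 0 marks an empty slot, matching the PR's
// simd_eq(Simd::splat(0)) empty check.
const LANES: usize = 8;

#[derive(Clone, Copy, PartialEq, Debug)]
struct MapValue(u32);

impl MapValue {
    const EMPTY: MapValue = MapValue(0);
}

struct Group {
    hashes: [u32; LANES],
    values: [MapValue; LANES],
}

/// Returns Some(value) on a hash match, Some(EMPTY) when an empty slot
/// proves the key is absent, or None when the caller must probe the
/// next group.
fn probe_group(group: &Group, hash: u32) -> Option<MapValue> {
    // Fast path: look for a matching hash first (the common case when
    // the hit rate is high), skipping the empty-lane scan entirely.
    if let Some(pos) = group.hashes.iter().position(|&h| h == hash) {
        return Some(group.values[pos]);
    }
    // Slow path: only now scan for an empty lane.
    if group.hashes.iter().any(|&h| h == 0) {
        return Some(MapValue::EMPTY);
    }
    None // group is full with no match: continue probing
}
```

On a hit, only the first scan runs; the pre-change behavior corresponds to always performing both scans before branching.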

Copilot AI left a comment


Pull request overview

Optimizes the SIMD-based probe path in the native-engine join hash map by short-circuiting the “empty slot” SIMD comparison when a hash match is found, targeting reduced instruction count in typical high-hit-rate join workloads.

Changes:

  • Splits the probe condition into a fast-path (hash match) and slow-path (empty slot) to avoid an unconditional empty-mask SIMD compare.
  • Returns MapValue::EMPTY directly when an empty slot is detected in the probed group.


Comment on lines 255 to +268

     let hash_matched = self.map[e].hashes.simd_eq(Simd::splat(hashes[i]));
-    let empty = self.map[e].hashes.simd_eq(Simd::splat(0));
-
-    if let Some(pos) = (hash_matched | empty).first_set() {
+    // Fast path: check hash match first (common case)
+    if let Some(pos) = hash_matched.first_set() {
         hashes[i] = unsafe {
             // safety: transmute MapValue(u32) to u32
             std::mem::transmute(self.map[e].values[pos])
         };
         break;
     }
+
+    // Slow path: check empty slot only when no match
+    let empty = self.map[e].hashes.simd_eq(Simd::splat(0));
+    if empty.any() {

Copilot AI Apr 3, 2026


The correctness of checking hash_matched.first_set() before computing the empty mask relies on an invariant that, within a MapValueGroup, all occupied lanes are packed from the beginning (i.e., there cannot be an empty lane before a later occupied lane). That invariant currently holds because insertion always uses empty.first_set(), but it's not stated here and a future change (e.g., tombstones/deletes or a different insertion strategy) could break this lookup logic. Please document this invariant explicitly (or add a debug-only assertion) so this fast-path doesn't become subtly incorrect later.
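One way to pin the invariant down is a debug-only check over a group's hash lanes. This is a hypothetical sketch (the function name and slice-based signature are assumed, not from the PR); real code would likely wrap the check in `debug_assert!` so it compiles out of release builds:

```rust
// Checks the "occupied lanes are packed from the front" invariant that
// the match-first fast path relies on: once an empty lane (hash == 0)
// appears, every later lane in the group must also be empty.
fn assert_group_packed(hashes: &[u32]) {
    // First empty lane, or the group length if no lane is empty.
    let first_empty = hashes.iter().position(|&h| h == 0).unwrap_or(hashes.len());
    // If an occupied lane followed an empty one, a key stored past the
    // empty lane could be missed by a probe that stops at the empty slot.
    assert!(
        hashes[first_empty..].iter().all(|&h| h == 0),
        "JoinHashMap group has an occupied lane after an empty lane"
    );
}
```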

Comment on lines +269 to +272

+    hashes[i] = unsafe {
+        // safety: transmute MapValue(u32) to u32
+        std::mem::transmute(MapValue::EMPTY)
+    };

Copilot AI Apr 3, 2026


This adds another std::mem::transmute(MapValue::EMPTY) to convert a MapValue into u32. Since MapValue is not marked #[repr(transparent)], transmutes between MapValue and u32 are not layout-guaranteed by Rust's spec. Consider replacing this with a safe conversion (e.g., accessing the inner .0 field within the module, or adding an as_u32() method / From<MapValue> for u32), or mark MapValue as #[repr(transparent)] to make the conversion sound.

Suggested change

-    hashes[i] = unsafe {
-        // safety: transmute MapValue(u32) to u32
-        std::mem::transmute(MapValue::EMPTY)
-    };
+    hashes[i] = MapValue::EMPTY.0;
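The `#[repr(transparent)]` route the review mentions could look like the sketch below. The type and field are assumed from the diff (`MapValue(u32)`); the `as_u32` and `From` conversions are illustrative additions, not existing API:

```rust
// #[repr(transparent)] guarantees MapValue has exactly the layout of its
// single u32 field, and named conversions remove transmute at call sites.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct MapValue(u32);

impl MapValue {
    const EMPTY: MapValue = MapValue(0);

    // Safe, zero-cost replacement for std::mem::transmute(MapValue) -> u32.
    fn as_u32(self) -> u32 {
        self.0
    }
}

impl From<MapValue> for u32 {
    fn from(v: MapValue) -> u32 {
        v.0
    }
}
```

Either conversion compiles to the same no-op as the transmute, but without relying on unspecified layout.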

Contributor

@ShreyeshArangath ShreyeshArangath left a comment


How was the performance tested? Can you share some logs/numbers in the PR description as well?

We should probably set up a microbenchmark for lookup_many with controlled hit rates (0%, 50%, 100%) if possible, WDYT?
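A benchmark along those lines could take the following shape. This is a sketch only: a std `HashMap` stands in for `JoinHashMap::lookup_many` (which is internal to the native engine), and a real benchmark would use criterion against the actual map type:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Probes the map with a key set whose hit rate is controlled by
// hit_rate_pct (0..=100); returns the hit count and elapsed time.
fn bench_hit_rate(hit_rate_pct: u32) -> (usize, Duration) {
    let n = 100_000u32;
    let map: HashMap<u32, u32> = (0..n).map(|k| (k, k)).collect();

    // hit_rate_pct% of probe keys are present; the rest (shifted by n)
    // are guaranteed absent.
    let keys: Vec<u32> = (0..n)
        .map(|i| if i % 100 < hit_rate_pct { i } else { i + n })
        .collect();

    let start = Instant::now();
    let hits = keys.iter().filter(|&&k| map.contains_key(&k)).count();
    (hits, start.elapsed())
}
```

Running it at 0%, 50%, and 100% would show whether the short-circuit helps at high hit rates without regressing the miss-heavy case.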



Development

Successfully merging this pull request may close these issues.

JoinHashMap probe SIMD short-circuit optimization

3 participants