Align HyperLogLog add() bit handling with Spark #792

sleeepyjack · 2026-01-23T00:52:47Z

Closes #696

Summary

Align HyperLogLog register selection and leading‑zero computation in add() with Spark’s
HyperLogLog++ helper logic.
Fix a leading‑zero discrepancy that could shift estimates by 1 for small sketches (e.g.,
standard_deviation=0.3, precision p=4).
Align bias interpolation anchor selection with Spark (use insertion‑point anchor for bias
correction rather than “closest” entry).

Changes

Register index: Use the top p bits of the hash (h >> (bits - p)) to select the register,
matching Spark’s idx = x >>> (64 - p).
Leading zeros: Compute leading zeros on the remaining bits using Spark’s padded approach:
clz((h << p) | (1 << (p - 1))) + 1. The padding bit prevents the shifted value from becoming
all‑zero, which would otherwise inflate the zero count.
Bias anchor: Use Spark’s insertion‑point anchor for bias interpolation (the index returned
by binary search when no exact match exists). cuco previously snapped to the nearest neighbor,
which shifts the low..high bias window by one entry and can move the rounded estimate by 1.
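
The two anchor choices described under "Bias anchor" can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the code from cuco or Spark: the function names and the toy table are hypothetical, and std::lower_bound stands in for the insertion point that a binary search reports when there is no exact match.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Insertion-point anchor: index of the first table entry that is >= e.
// For an exact match this is the matching index; otherwise it is the
// position where e would be inserted to keep the table sorted.
// (Hypothetical helper name; `sorted` must be in ascending order.)
inline std::size_t insertion_point_anchor(const std::vector<double>& sorted, double e) {
  auto it = std::lower_bound(sorted.begin(), sorted.end(), e);
  return static_cast<std::size_t>(it - sorted.begin());
}

// Nearest-neighbor anchor: index of the table entry closest to e.
// This illustrates the "snap to the closest entry" idea only; it is an
// assumption about the general approach, not the previous cuco code.
inline std::size_t nearest_anchor(const std::vector<double>& sorted, double e) {
  const std::size_t ip = insertion_point_anchor(sorted, e);
  if (ip == 0) { return 0; }                               // e is below every entry
  if (ip == sorted.size()) { return sorted.size() - 1; }   // e is above every entry
  // Pick whichever neighbor of the insertion point is closer to e.
  return (e - sorted[ip - 1] <= sorted[ip] - e) ? ip - 1 : ip;
}
```

With a toy table {10, 20, 30, 40} and e = 22, the insertion point is index 2 while the nearest entry is index 1. Any window derived from the anchor therefore starts one entry apart, which is the one-entry shift described above.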

The previous implementation mixed low‑bit register indexing with a different zero‑count scheme.
When the remaining bits were all zero after shifting, clz(h << p) + 1 could over‑count leading
zeros relative to Spark’s (h << p) | wPadding method, where wPadding = 1 << (p - 1). This mismatch
changes the register values and can shift the final estimate by 1, as in the small‑sketch case
noted in the summary. Using Spark’s MSB‑based index and padded leading‑zero count brings cuco’s
results in line with Spark for the reported reproducer.
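
The effect of the padding bit can be checked with a small host-side C++ sketch. It assumes a 64-bit hash and uses a portable loop in place of a hardware clz; all function names are hypothetical and the code only mirrors the expressions quoted in this description, not the actual cuco kernels.

```cpp
#include <cassert>
#include <cstdint>

// Portable count of leading zeros in a 64-bit word; returns 64 for x == 0.
inline int clz64(std::uint64_t x) {
  int n = 0;
  for (std::uint64_t mask = std::uint64_t{1} << 63; mask != 0 && (x & mask) == 0; mask >>= 1) {
    ++n;
  }
  return n;
}

// Register index from the top p bits of the hash: h >> (64 - p).
inline std::uint32_t register_index(std::uint64_t h, int p) {
  assert(p >= 1 && p < 64);
  return static_cast<std::uint32_t>(h >> (64 - p));
}

// Padded count: clz((h << p) | (1 << (p - 1))) + 1.
// The padding bit guarantees the argument to clz64 is never zero.
inline int rank_padded(std::uint64_t h, int p) {
  assert(p >= 1 && p < 64);
  const std::uint64_t padding = std::uint64_t{1} << (p - 1);
  return clz64((h << p) | padding) + 1;
}

// Unpadded count: clz(h << p) + 1, shown only for comparison.
inline int rank_unpadded(std::uint64_t h, int p) {
  assert(p >= 1 && p < 64);
  return clz64(h << p) + 1;
}
```

For p = 4 and the hash 0xA000000000000000 (register 10, remaining 60 bits all zero), rank_padded gives 61, which is the maximum 64 - p + 1, while rank_unpadded gives 65 because clz64(0) is 64 in this sketch. For hashes with at least one set bit among the remaining bits, the two counts agree.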

PointKernel

@sleeepyjack can you please add a unit test to exercise this res matching?

sleeepyjack · 2026-01-24T02:39:46Z

@res-life would you mind running your Spark experiments with the updated cuco HLL to see if this PR fixes the issue?

Align HyperLogLog add() bit handling with Spark

9188983

sleeepyjack self-assigned this Jan 23, 2026

sleeepyjack requested a review from PointKernel as a code owner January 23, 2026 00:52

sleeepyjack added the type: bug, helps: rapids, P0: Must have, and topic: hyperloglog labels Jan 23, 2026

sleeepyjack mentioned this pull request Jan 23, 2026

Migrate cuco HLL NVIDIA/cccl#6666

Open

6 tasks

PointKernel reviewed Jan 23, 2026

sleeepyjack added 3 commits January 23, 2026 18:31

Use correct hasher

d5050ec

Fix insertion point anchor to match Spark's behavior

4c55330

Add unit test

c2c4a8c

sleeepyjack requested a review from PointKernel January 24, 2026 02:40
