Align HyperLogLog add() bit handling with Spark #792
+53
−40
Closes #696
Summary
Align add() with Spark’s HyperLogLog++ helper logic.
standard_deviation=0.3, precision p=4).
correction rather than “closest” entry).
Changes
Use the top p bits of the hash (h >> (bits - p)) to select the register, matching Spark’s idx = x >>> (64 - p).
Compute the zero count as clz((h << p) | (1 << (p - 1))) + 1. The padding bit prevents the shifted value from becoming all‑zero, which would otherwise inflate the zero count.
by binary search when no exact match exists). cuco previously snapped to the nearest neighbor, which shifts the low..high bias window by one entry and can move the rounded estimate by 1.

The previous implementation mixed low‑bit register indexing with a different zero‑count scheme.
When the remaining bits were all zero after shifting, clz(h << p) + 1 could over‑count leading zeros relative to Spark’s (h << p) | wPadding method. This mismatch changes the register values and can bias the final estimate by a small amount. Using Spark’s MSB‑based index and padded leading‑zero count brings cuco’s results in line with Spark for the reported reproducer.
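
The difference between the padded and unpadded zero counts can be seen in a small standalone sketch. This is not cuco's or Spark's actual code: it assumes a 64-bit hash and 1 <= p <= 63, and the names clz64, register_index, padded_rank, unpadded_rank and demo are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Portable leading-zero count for a 64-bit value; returns 64 when x == 0.
inline int clz64(std::uint64_t x) {
    int n = 0;
    for (std::uint64_t mask = 1ULL << 63; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// MSB-based register index: the top p bits of the hash (assumes 1 <= p <= 63).
inline std::uint32_t register_index(std::uint64_t h, int p) {
    return static_cast<std::uint32_t>(h >> (64 - p));
}

// Padded zero count: OR in bit (p - 1) so the shifted value is never all-zero.
inline int padded_rank(std::uint64_t h, int p) {
    const std::uint64_t w_padding = 1ULL << (p - 1);
    return clz64((h << p) | w_padding) + 1;
}

// Unpadded zero count, shown only for comparison.
inline int unpadded_rank(std::uint64_t h, int p) {
    return clz64(h << p) + 1;
}

// Example: a hash whose low 64 - p bits are all zero (p = 4, top bits 1010).
inline void demo() {
    const std::uint64_t h = 0xA000000000000000ULL;
    assert(register_index(h, 4) == 10u);  // 0b1010
    assert(padded_rank(h, 4) == 61);      // capped at (64 - p) + 1
    assert(unpadded_rank(h, 4) == 65);    // counts p bits that are not part of the remainder
}
```

For hashes whose remaining bits contain at least one set bit, both functions return the same value; they differ only in the all-zero case, where the unpadded count exceeds the padded one by p.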