Collaborator

@sleeepyjack sleeepyjack commented Jan 23, 2026

Closes #696

Summary

  • Align HyperLogLog register selection and leading‑zero computation in add() with Spark’s
    HyperLogLog++ helper logic.
  • Fix a leading‑zero discrepancy that could shift estimates by 1 for small sketches (e.g.,
    standard_deviation=0.3, precision p=4).
  • Align bias interpolation anchor selection with Spark (use insertion‑point anchor for bias
    correction rather than “closest” entry).
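
For context on the example parameters above, here is a minimal sketch of how a relative standard deviation could map to a precision p. It assumes the precision is derived as ceil(2 * log2(1.106 / sd)), which is how I understand Spark's helper to do it; the function name `precision_from_sd` is hypothetical and not part of cuco or Spark.

```cpp
#include <cassert>
#include <cmath>

// Assumed mapping (not confirmed by this PR): p = ceil(2 * log2(1.106 / sd)).
// `precision_from_sd` is a hypothetical helper name used only for illustration.
inline int precision_from_sd(double relative_sd)
{
  assert(relative_sd > 0.0);
  return static_cast<int>(std::ceil(2.0 * std::log2(1.106 / relative_sd)));
}

// Number of registers for a given precision: m = 2^p.
inline int num_registers(int p)
{
  assert(p >= 0 && p < 31);
  return 1 << p;
}
```

Under that assumption, standard_deviation=0.3 gives p=4, i.e. a 16-register sketch, which is why a one-register difference is visible in the rounded estimate.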

Changes

  • Register index: Use the top p bits of the hash (h >> (bits - p)) to select the register,
    matching Spark’s idx = x >>> (64 - p).
  • Leading zeros: Compute leading zeros on the remaining bits using Spark’s padded approach:
    clz((h << p) | (1 << (p - 1))) + 1. The padding bit prevents the shifted value from becoming
    all‑zero, which would otherwise inflate the zero count.
  • Bias anchor: Use Spark’s insertion‑point anchor for bias interpolation (the index returned
    by binary search when no exact match exists). cuco previously snapped to the nearest neighbor,
    which shifts the low..high bias window by one entry and can move the rounded estimate by 1.
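
The difference between the two anchor rules in the last bullet can be sketched as follows. This is an illustration only: the function names are hypothetical, the table values in the usage note are made up, and `nearest_anchor` shows one possible nearest-neighbor rule rather than cuco's actual previous code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Insertion-point anchor: index of the first table entry >= e.
// For a key that is not in the table, this is the insertion point a binary
// search reports. It may equal estimates.size() when e exceeds every entry.
inline std::size_t insertion_anchor(std::vector<double> const& estimates, double e)
{
  return static_cast<std::size_t>(
    std::lower_bound(estimates.begin(), estimates.end(), e) - estimates.begin());
}

// Nearest-neighbor anchor (hypothetical rule): index of the entry closest to e,
// preferring the lower neighbor on ties.
inline std::size_t nearest_anchor(std::vector<double> const& estimates, double e)
{
  assert(!estimates.empty());
  std::size_t const hi = insertion_anchor(estimates, e);
  if (hi == 0) { return 0; }
  if (hi == estimates.size()) { return hi - 1; }
  return (e - estimates[hi - 1] <= estimates[hi] - e) ? hi - 1 : hi;
}
```

With a sorted table {10, 20, 30, 40} and e = 22, the insertion-point anchor is 2 while the nearest-neighbor anchor is 1, so an interpolation window built around the anchor starts one entry earlier in the second case.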

The previous implementation mixed low‑bit register indexing with a different zero‑count scheme.
When the remaining bits were all zero after shifting, clz(h << p) + 1 could over‑count leading
zeros relative to Spark’s (h << p) | wPadding method: for a 64‑bit hash with clz(0) = 64, the
unpadded form yields 65, whereas the padded form caps the count at 65 - p. This mismatch changes
the register values and can shift the final estimate by 1 for small sketches. Using Spark’s
MSB‑based index and padded leading‑zero count brings cuco’s results in line with Spark for the
reported reproducer.
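
The index and leading‑zero formulas discussed above can be compared side by side in a small host‑side sketch. It assumes a 64‑bit hash and a count‑leading‑zeros that returns 64 for a zero input (as Java's Long.numberOfLeadingZeros does). All function names here are hypothetical; this is not the code in the PR, and `rank_unpadded` only illustrates the unpadded formula, not cuco's exact previous implementation.

```cpp
#include <cassert>
#include <cstdint>

// Portable 64-bit count-leading-zeros; returns 64 when x == 0.
inline int clz64(std::uint64_t x)
{
  int n = 0;
  for (std::uint64_t mask = std::uint64_t{1} << 63; mask != 0 && (x & mask) == 0; mask >>= 1) {
    ++n;
  }
  return n;
}

// Register index from the top p bits of the hash: h >> (64 - p).
inline std::uint64_t register_index(std::uint64_t h, int p)
{
  assert(p >= 1 && p < 64);
  return h >> (64 - p);
}

// Padded form: clz((h << p) | (1 << (p - 1))) + 1.
// The padding bit bounds the result at 65 - p.
inline int rank_padded(std::uint64_t h, int p)
{
  assert(p >= 1 && p < 64);
  std::uint64_t const padding = std::uint64_t{1} << (p - 1);
  return clz64((h << p) | padding) + 1;
}

// Unpadded form: clz(h << p) + 1.
// Returns 65 when all remaining bits are zero.
inline int rank_unpadded(std::uint64_t h, int p)
{
  assert(p >= 1 && p < 64);
  return clz64(h << p) + 1;
}
```

For p = 4 and h = 0xA000000000000000, the register index is 10 and the remaining 60 bits are all zero: the padded form gives 61 while the unpadded form gives 65. For any hash with at least one set bit among the remaining bits, both forms agree.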

@sleeepyjack sleeepyjack self-assigned this Jan 23, 2026
@sleeepyjack sleeepyjack added type: bug Something isn't working helps: rapids Helps or needed by RAPIDS P0: Must have Critical feature or bug fix topic: hyperloglog Issue related to hyperloglog labels Jan 23, 2026
@sleeepyjack sleeepyjack mentioned this pull request Jan 23, 2026
Member

@PointKernel PointKernel left a comment

@sleeepyjack can you please add a unit test to exercise this res matching?

@sleeepyjack
Collaborator Author

@res-life would you mind running your Spark experiments with the updated cuco HLL to see if this PR fixes the issue?

Development

Successfully merging this pull request may close these issues.

[BUG]: HLLPP in cuCo has different behavior with Spark

2 participants