Apache Iceberg version
None
Please describe the bug 🐞
Describe the bug
bytes_required (pyiceberg/utils/decimal.py) is documented to
return the minimum number of bytes for a value, but it
returns one byte too many for negatives equal to -2^(8k-1) (e.g.
-128, -32768).
You can see the contradiction directly: it claims "minimum" but
returns 2 for a value that clearly fits in 1 byte:
from pyiceberg.utils.decimal import bytes_required
print(bytes_required(-128)) # 2
print((-128).to_bytes(1, "big", signed=True)) # b'\x80', so -128 fits in 1 byte
This matters because the decimal bucket transform hashes these
bytes, so the extra byte makes PyIceberg pick a different bucket
than Spark/Java for the same value. As a result, the same row can
land in a different partition depending on the engine.
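To make the effect on the hash input concrete, here is a small standard-library sketch (it does not call PyIceberg or compute the actual bucket hash). It assumes, per the Iceberg spec's hashing rules for decimals, that the bytes being hashed are the big-endian two's-complement encoding of the unscaled value. The variable names are illustrative only.

```python
from decimal import Decimal

# Unscaled integer of Decimal("-1.28") at scale 2 is -128.
unscaled = int(Decimal("-1.28").scaleb(2))

# Minimal two's-complement encoding: one byte.
minimal = unscaled.to_bytes(1, "big", signed=True)

# Encoding with one byte too many: a sign-extension byte is prepended.
padded = unscaled.to_bytes(2, "big", signed=True)

print(unscaled)  # -128
print(minimal)   # b'\x80'
print(padded)    # b'\xff\x80'

# Both decode to the same number, but they are different byte strings,
# so any hash computed over them will generally differ.
print(int.from_bytes(minimal, "big", signed=True) == int.from_bytes(padded, "big", signed=True))  # True
print(minimal == padded)  # False
```

The two encodings represent the same value, which is why the mismatch is easy to miss until the bytes are hashed.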
Reproducer
from decimal import Decimal
from pyiceberg.types import DecimalType
from pyiceberg.transforms import BucketTransform
dt = DecimalType(precision=5, scale=2)
print(BucketTransform(num_buckets=16).transform(dt)(Decimal("-1.28"))) # 12; should be 13
Expected behavior
Minimal byte encoding (-128 → 1 byte), matching the Iceberg spec
and Spark/Java, so bucketing agrees across engines (bucket 13
here).
The bug is low frequency (it affects only values equal to -2^(8k-1)),
but it is a genuine cross-engine correctness bug.
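As a sketch of the expected behavior, the minimal signed byte count for an integer can be computed as below. `min_signed_bytes` is a hypothetical helper written for this report, not PyIceberg's code and not necessarily how a fix should be implemented; it only illustrates that the boundary values -2^(8k-1) fit in k bytes.

```python
def min_signed_bytes(value: int) -> int:
    """Hypothetical helper: smallest n such that `value` fits in n bytes
    of big-endian two's complement."""
    # For a negative value, the bits that matter are those of ~value
    # (which equals -value - 1), so -128 needs 7 magnitude bits, not 8.
    magnitude = ~value if value < 0 else value
    # Add one bit for the sign, then round up to whole bytes.
    return (magnitude.bit_length() + 8) // 8


for v in (127, 128, -128, -129, -32768):
    print(v, min_signed_bytes(v))
# 127 1
# 128 2
# -128 1
# -129 2
# -32768 2
```

The result can be cross-checked against `int.to_bytes(n, "big", signed=True)`, which raises `OverflowError` when `n` is too small.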
Willingness to contribute