feat(encoding): support u128 bitpacking for decimal128 columns#6858
feat(encoding): support u128 bitpacking for decimal128 columns#6858LuciferYang wants to merge 4 commits into
Conversation
decimal128(7,2) columns were stored at full 128-bit width without compression because BitPacking only supported u8/u16/u32/u64. This adds scalar u128 bitpacking, reducing decimal128 storage from 131 bits/value to ~24 bits/value (5.6x compression on TPC-DS store_sales). File size: 34 GiB → 16 GiB for store_sales SF=100.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Decoding performance has degraded. I need to explore possible optimizations to decide whether to move forward with this change. |
Per-chunk u128 inline bitpacking dispatch for decimal128 columns.
Each chunk's `Stat::BitWidth` selects one of five kernels via the
single source of truth `u128_kernel_for(bit_width)`:
| bit_width | kernel |
|-----------|------------------------------------------------|
| 0 | zero-fill, no body bytes |
| 1..=32 | reinterpret body as &[u32], FastLanes u32 SIMD |
| 33..=64 | reinterpret body as &[u64], FastLanes u64 SIMD |
| 65..=127 | scalar sequential u128 bit-stream |
| 128 | memcpy identity (one u128 per value) |
Encoder (`pack_u128_chunk`) and decoder (`unpack_u128_chunk`) both
dispatch through the same enum, so the dispatch table is a single
place to change. The 16-byte chunk header layout is unchanged and
the body byte count stays at `bit_width * ELEMS_PER_CHUNK / 8` for
every kernel; what changes at the same `bit_width` is the body
byte interpretation. This is documented in the module rustdoc as a
deliberate format-contract design — `bit_width` IS the kernel
selector, no separate negotiation channel — and is safe because
u128 inline bitpacking is unreleased so no in-the-wild reader can
mis-decode.
# Performance
q4 (6× decimal128 sum) on store_sales SF=100, MinIO 100ms RTT,
JMH avgt=10:
| Format | q4 wall-time | vs PR#6858 baseline |
|-------------------------|--------------------------|---------------------|
| parquet | 2667.779 ± 139.766 ms | — |
| lance_bp128 (this PR) | 5573.007 ± 127.962 ms | 3.7× speedup; -73% |
| lance (plain MiniBlock) | 6420.842 ± 118.629 ms | — |
Disk size unchanged at 16 GiB / 305 fragments (header + body byte
count are identical to the PR#6858 baseline; only the kernel
choice changes per chunk). The IO savings (-53% vs plain
MiniBlock) now translate into a net win on high-RTT object
storage instead of being dwarfed by scalar u128 unpack CPU cost.
# Sign safety (load-bearing invariant)
Narrow branches reinterpret `&[u128]` as `&[u32]` / `&[u64]` and
require every value to satisfy `v >> bit_width == 0`. This is
guaranteed by `FixedWidthDataBlock::max_bit_widths`, whose
OR-fold + `leading_zeros` algorithm guarantees the high lanes are
all zero for any `bit_width < 128`. Negative i128 values (after
`bytemuck::cast_slice` to u128) have the high bit set and force
`bit_width = 128`, routing them to the memcpy branch. Defenses:
- Module-level rustdoc statement (`bitpacking.rs:64-71`).
- `pack_u128_chunk` `# Safety` block citing the upstream source.
- Release-build `assert!(src[0] >> bit_width == 0)` sentinel on
both NarrowU32 and NarrowU64 entry points (defense-in-depth
against a hypothetical future `Stat::BitWidth` regression that
would otherwise silently truncate).
- Per-value `debug_assert!` in dev/test builds.
- Pinned by `test_bit_width_or_fold_invariant_for_u128_narrow_dispatch`
in `statistics.rs` covering all five regimes including `-1i128`.
# Soundness
`append_u128_chunk` writes the body via safe `Vec::resize(..,
0u128)` + `&mut [u128]` slice indexing. The remaining `unsafe {
pack_u128_chunk(..) }` call has an inline SAFETY comment proving
all three preconditions (`src.len() == ELEMS_PER_CHUNK`,
`bit_width <= 128`, `dest.len() * 16 == bit_width * ELEMS_PER_CHUNK
/ 8`). Panic safety: u128 has no invalid bit patterns and no Drop,
so a panic mid-pack leaves the Vec at its declared length holding
well-defined u128s. Compile-time `assert!(ELEMS_PER_CHUNK
.is_multiple_of(128))` catches future drift in the framing
constant.
# Decoder boundary validation
`unchunk_u128_dispatch` validates 4 corruption modes before any
`unsafe` call:
- under-length buffer (< 16 bytes header) → `InvalidInput`
- `num_values > ELEMS_PER_CHUNK` → `InvalidInput`
- header out of range (`raw > 128`) → `InvalidInput`
- body length mismatch → `InvalidInput`
`num_values == 0` short-circuits *after* validation to skip the
16 KiB scratch unpack. All 4 paths plus the short-circuit have
explicit negative tests + 1 positive happy-path test.
# Test coverage
- 5 per-regime hand tests (widths 24, 40, 80, 128 all-negatives,
mixed-sign-small-magnitude memcpy routing).
- Exhaustive `test_u128_kernel_for_exhaustive_table_and_roundtrip`
iterating `0..=128`, forces high-bit injection per width.
- `test_inline_bitpack_u128_mixed_widths_per_chunk_dispatch` for
cross-chunk independence.
- 4 negative tests for decoder entry-point validation.
- 1 positive test for `num_values == 0` short-circuit.
- 1 sign-safety regression test in `statistics.rs`.
- 2 proptests:
- `proptest_inline_bitpack_u128_roundtrip`: single-width × full
range × 1..=3072 values × 128-case proptest budget.
- `proptest_inline_bitpack_u128_mixed_widths`: 1..=3 chunks ×
independent widths × xorshift seed=0 fixed-point salt.
# Scope
This dispatch lives only in the `InlineBitpacking` (miniblock)
path. `OutOfLineBitpacking` (full-zip) is intentionally NOT
touched — full-zip frames whole pages with a single bit_width
and has no per-chunk header to carry a kernel selector, so the
same machinery would either need a new format field or change
cross-page framing semantics. Documented at the
`OutOfLineBitpacking` rustdoc block.
# Out-of-scope (not included; deferred to follow-ups)
- Stack scratch sizes (16 KiB worst case acceptable for now).
- Per-chunk `vec![]` allocations.
- `kernel_body_bytes(kernel, bit_width)` helper to centralize the
framing math currently duplicated across 3 sites.
- `assert_eq!(bit_widths_array.len(), data.num_values
.div_ceil(ELEMS_PER_CHUNK))` end-to-end contract pin.
- Per-call SAFETY comment on the decoder `unpack_u128_chunk` call
(function-level `# Safety` block already covers it).
- `BlockDecompressor` u128 arm not directly tested (routes
through the same `unchunk_u128_dispatch` as `MiniBlockDecompressor`).
- File size 1916 lines (cohesive single-module dispatch).
# Verification
- `cargo build -p lance-encoding --tests` ✓
- `cargo test -p lance-encoding --lib` ✓ (389 passed, 0 failed,
5 ignored — full unfiltered suite)
- `cargo clippy --release -p lance-encoding --tests -- -D warnings` ✓
- `cargo fmt -p lance-encoding -- --check` ✓
Adds two criterion benches that characterize the per-chunk u128 bitpacking dispatch end-to-end through the public encode_batch / decode_batch APIs. Each bench scans four bit_width regimes corresponding to the four non-trivial dispatch arms (the bw=0 Zero arm has no observable per-value cost and is omitted): | bit_width | dispatch arm | typical workload | |-----------|------------------|----------------------------------| | 24 | NarrowU32 (SIMD) | decimal128(7,2)-typical | | 40 | NarrowU64 (SIMD) | decimal128(12,2) / store_sales | | 100 | SequentialU128 | scalar-fallback (rare in OLTP) | | 128 | Memcpy | full-precision / signed columns | Per-chunk Stat::BitWidth is pinned to the target bw deterministically: - bw < 128: every value forced to [2^(bw-1), 2^bw - 1] by setting bit (bw-1) and masking the higher bits → per-chunk OR-fold = bw exactly. - bw == 128: bounded negatives in [-(2^64), -1]. Every negative i128 has bit 127 set in its u128 reinterpretation → OR-fold = u128::MAX → bw = 128. The bounded magnitude (≤ 2^64) keeps every value within Decimal128(38, 0) precision regardless of arrow's data-validation policy across versions. Style matches the existing bench_decode_compressed / bench_encode_compressed: single product code path, parameter-scan over bit_width buckets, public API (no dispatch table reflection), 5M rows × 16 bytes = 80 MB throughput per case. Fixed seed 0xDEAD_BEEF (with `| 1` salt to defend against the xorshift seed=0 fixed point) for reproducibility. Local numbers on Apple M-series, criterion 100 samples × 5s/case: decode_decimal128: bw024_narrow_u32: 15.55 ms / 4.79 GiB/s bw040_narrow_u64: 16.63 ms / 4.48 GiB/s bw100_sequential_u128: 15.38 ms / 4.84 GiB/s bw128_memcpy: 6.37 ms / 11.70 GiB/s encode_decimal128: bw024_narrow_u32: 11.60 ms / 6.43 GiB/s bw040_narrow_u64: 14.82 ms / 5.03 GiB/s bw100_sequential_u128: 38.32 ms / 1.94 GiB/s bw128_memcpy: 29.40 ms / 2.53 GiB/s The encode side cleanly shows SequentialU128 as the slowest arm (≈2× NarrowU64, ≈3× NarrowU32), motivating the per-chunk dispatch: data with bit_width ≤ 64 routes to FastLanes SIMD instead of the scalar bit-stream kernel. Decode is dominated by pipeline overhead (scheduler, repdef, runtime) so kernel differences are compressed, matching the end-to-end JMH numbers (-10% q4 wall-time on 100ms-RTT MinIO vs the kernel-level ~3× speedup observed at the BitPacking trait boundary). Run with: cargo bench -p lance-encoding --bench decoder -- decode_decimal128 cargo bench -p lance-encoding --bench encoder -- encode_decimal128
Two CI failures on PR#6858: 1. typos check fails on `mis-decoded` (line 61) — the `mis` token trips crate-ci/typos. Reword to `decode incorrectly`. 2. rustdoc -D warnings fails on `[crate::statistics::FixedWidthDataBlock::max_bit_widths]` (line 68) — `max_bit_widths` is a private fn (no `pub`), so the intra-doc link cannot resolve. Drop the `[]` link syntax and keep the reference as plain code formatting; reader can grep the codebase if curious. Verification: - `RUSTDOCFLAGS="-D warnings" cargo doc -p lance-encoding --no-deps` ✓ - `cargo fmt -p lance-encoding -- --check` ✓
Reproducer + bench-tool source for the End-to-end A/B in the descriptionThe numbers in the # drop the source below at rust/examples/src/decimal128_bp128_ab.rs,
# add the [[example]] entry below to rust/examples/Cargo.toml,
# then:
cargo build --release --example decimal128_bp128_ab -p lance-examples
B=./target/release/examples/decimal128_bp128_ab
"$B" write file:///tmp/dec128.lance --rows 5000000 --iters 5
"$B" read file:///tmp/dec128.lance --iters 5
"$B" write s3://benchmark/dec128.lance --s3-endpoint http://localhost:9000 --rows 5000000 --iters 5
"$B" read s3://benchmark/dec128.lance --s3-endpoint http://localhost:9000 --iters 5The
|
Spark end-to-end Query (TPC-DS SF=100
|
| Format | Query mean (ms) | 99.9 % CI | On-disk size | Δ vs Lance baseline |
|---|---|---|---|---|
| Lance (baseline, no bp128) | 6410.229 | ± 113.802 | 34 GiB | — |
| Lance with bp128 + dispatch | 5750.606 | ± 475.850 | 16 GiB | −10.3 % |
The −10 % wall-time win is more modest than the −76 % we see on the lance Rust API MinIO read of the same shape — bp128 only optimizes the decode kernel, and the Spark end-to-end stack has additional overhead outside that scope which this PR doesn't touch (we haven't fully attributed where it lives). The disk-side win (34 GiB → 16 GiB, a 2.1× reduction; matches the −53 % already in the PR description) is the same regardless of who's reading. For S3-style deployments where compute is decoupled from storage, the size cut alone usually pays for itself before any decode delta is considered.
Squash of upstream PR lance-format#6858 (4 commits): - feat(encoding): add u128 bitpacking support for decimal128 columns - feat(encoding): per-chunk u128 bitpacking dispatch (u32/u64/u128/memcpy) - test(encoding): add Decimal128 inline-bitpacking decode/encode benches - fix(encoding): rustdoc intra-doc link + typo in u128 dispatch module Per-chunk Stat::BitWidth selects one of five kernels via u128_kernel_for(bit_width): 0 -> zero-fill 1..=32 -> reinterpret as &[u32], FastLanes u32 SIMD 33..=64 -> reinterpret as &[u64], FastLanes u64 SIMD 65..=127-> scalar sequential u128 bit-stream 128 -> memcpy identity Co-Authored-By: yangjie01 <yangjie01@baidu.com> Change-Id: I7744f629f880d397fac1d104f1db61755f21ab6b
Closes #6857.
What
Extends the miniblock inline-bitpacking chooser to also consider
bits = 128and adds u128 BitPacking with per-chunk SIMD dispatch (NarrowU32 / NarrowU64 / SequentialU128 / Memcpy keyed byStat::BitWidth), sodecimal128columns whose values fit in <128 bits encode/decode through FastLanes lanes instead of a scalar bit-stream kernel.Impact
On-disk size
Measured on TPC-DS SF=100
store_sales(288 M rows, 12 ×decimal128(7,2)columns):~53 % reduction, schema / row count / file format version (v2.1) unchanged.
Decode / encode throughput
Apple M-series, criterion 100 samples × 5s / case, single
Decimal128(38, 0)column × 5 M rows (80 MiB logical i128). Each variant pins every chunk'sStat::BitWidthto the target dispatch arm.Encode (
cargo bench -p lance-encoding --bench encoder -- encode_decimal128):Decode (
cargo bench -p lance-encoding --bench decoder -- decode_decimal128):Decode adds the cost of decompression that didn't exist on the plain-MiniBlock path; encode-narrow regimes are net faster because the SIMD pack replaces the generic-fixed-width path's ~2 GiB/s ceiling. The −53 % disk size typically wins net wall-time when storage / network bandwidth is the bottleneck; pure CPU-bound in-memory decode is the explicit trade-off.
End-to-end (read + write, local-disk + MinIO)
Apple M-series, lance Rust public API end-to-end. 5,000,000 rows × 6 ×
decimal128(7,2), deterministic xorshift values within precision-7 range so every chunk'sStat::BitWidthlands on the NarrowU32 dispatch arm. 5 iterations per scenario; iter 0 is cold, warm mean is over the 4 subsequent iterations. All 8 runs produced an identical per-column sum, so this is a correctness-preserving A/B. Read times are end-to-end (Dataset::open+ scan + per-columnarrow::compute::sum).Local-disk read is mostly CPU-bound on warm page cache, and is still net faster — chunked decode plus 5.2× less data on disk outweighs the slower pure-kernel decode in practice. MinIO read is bandwidth-bound and the size reduction maps directly into the speedup. Both write regimes are dominated by the SIMD pack replacing the generic-fixed-width path's ~2 GiB/s ceiling; the wider gap on local-disk write reflects that local fsync is not the bottleneck. Reproducer + the bench tool's full source are in a PR comment below.
Changes
rust/compression/bitpacking/src/lib.rs— scalar u128 BitPacking kernel.rust/lance-encoding/src/encodings/physical/bitpacking.rs— u128 miniblock encode/decode wiring + per-chunk dispatch viau128_kernel_for(bit_width)(single source of truth for both pack and unpack; one match arm per kernel).rust/lance-encoding/src/compression.rs— chooser now matchesbits ∈ {8, 16, 32, 64, 128}.rust/lance-encoding/src/statistics.rs— stat plumbing for the 128 case + OR-fold sign-safety contract test.rust/lance-encoding/benches/{decoder,encoder}.rs—bench_decode_decimal128/bench_encode_decimal128covering all four non-trivial dispatch arms.No new public API, no on-wire format change (the new bit-width is already valid for v2.1 readers — the encoder just didn't previously emit it).
Testing
width = 63boundary.0..=128u128_kernel_fortable + roundtrip.Stat::BitWidthOR-fold contract.unchunk_u128_dispatchcorruption guards: under-length buffer, oversizednum_values, header out-of-range, body length mismatch, and thenum_values == 0short-circuit (both negative and positive coverage).compress → decompresstest through the miniblock path.store_salesverifies row count, schema, and reads back identically.