From 18e2ea668c4184305f247800d92b5af495f0f705 Mon Sep 17 00:00:00 2001
From: Davide Angelocola
Date: Sat, 13 Jun 2026 23:23:22 +0200
Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR=200010=20=E2=80=94=20pushdown?=
 =?UTF-8?q?=20math,=20per-encoding=20table,=20PoC=20findings?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds the concrete math behind compute pushdown: encode the threshold
into the encoding's integer domain, integer-compare, decode only
matches; plus a boundary re-check for FP edge cases. Per-encoding table
identifies which encodings support pushdown (ALP, FoR, Bitpacked, Dict —
order-preserving invertible) and which do not (ZigZag —
order-non-preserving; Pco/Zstd/FSST — opaque blocks).

Adds PoC findings from worktree lazy-alp-f64-poc:
- lazy ALP is +10.6% on full fold (drops the hasFilter() gate)
- ALP-level pushdown is +100% at sel=0.1
- residual 2-3× JNI gap at low sel is Bitpacked SIMD, ADR 0005 territory

Drops the hasFilter() gate from the final design and notes the
supersession inline on the earlier sections so the reasoning chain
stays readable.

Co-Authored-By: Claude Sonnet 4.6
---
 docs/adr/0010-lazy-decode.md | 149 ++++++++++++++++++++++++++++-------
 1 file changed, 120 insertions(+), 29 deletions(-)

diff --git a/docs/adr/0010-lazy-decode.md b/docs/adr/0010-lazy-decode.md
index 71c29dda..ca5d685d 100644
--- a/docs/adr/0010-lazy-decode.md
+++ b/docs/adr/0010-lazy-decode.md
@@ -121,6 +121,12 @@ Two observations:

 ### API gate — eager unless a filter is present

+> **Update (Phase 3 PoC):** measurement invalidated this gate. Lazy is
+> *strictly faster* than eager even on full-fold workloads because the
+> materialization pass disappears entirely. Keep this section for the
+> reasoning chain, but the final design drops the `hasFilter()` gate —
+> see the PoC findings under Phase 3.
+
 ```
 ScanOptions.hasFilter() == false → eager path (today), zero change
 ScanOptions.hasFilter() == true → lazy + compute pushdown
@@ -207,6 +213,11 @@ keeps third-party encodings on equal footing with built-in ones.

 ### Phase 2 — Lazy ALP variant + filter gate

+> **Update (Phase 3 PoC):** the `hasFilter()` gate is dropped in the
+> final design — `AlpDoubleArray` is returned unconditionally whenever
+> the encoded source is full-size writable and there are no patches.
+> Section preserved for the reasoning chain.
+
 Add the first lazy implementation:

 ```java
@@ -252,15 +263,71 @@ boolean signal.

 ### Phase 3 — compute pushdown

+#### The core trick — encode the threshold, integer-compare
+
+Pushdown works whenever the encoding is an **order-preserving
+invertible** function of one integer source. For ALP,
+`value = encoded * scale` with `scale > 0`, so:
+
+```
+value > threshold
+⇔ encoded * scale > threshold
+⇔ encoded > threshold / scale
+```
+
+Encode the threshold once, then the inner loop is pure integer compare
+on the source longs — no `cvt+mul` per rejected row. Decode happens
+only for matches:
+
+```java
+long enc_lo = (long) Math.floor(threshold / scale);
+for (long i = 0; i < n; i++) {
+    long lv = encoded.getAtIndex(LE_LONG, i);
+    if (lv > enc_lo) {
+        double v = (double) lv * scale;
+        if (v > threshold) { // rare boundary check, see below
+            sum += v;
+        }
+    }
+}
+```
+
+The boundary check exists because `floor(threshold / scale) * scale`
+may not equal `threshold` exactly in FP: values whose encoded form is
+`enc_lo + 1` could decode to either side of the threshold. Only those
+boundary rows can fail the check, so it is effectively free.
+
+Patches break the encoding invariant: rows that ALP couldn't encode
+are stored as raw doubles. Either compare them separately (rare, small
+count) or fall back to full materialization when the patch count
+exceeds some cutoff.
+
+#### Per-encoding applicability
+
+| Encoding | Encoded form | Pushdown predicate | Order-preserving? |
+|----------|--------------|--------------------|-------------------|
+| ALP | `int * scale` | `enc > floor(threshold / scale)` | yes (scale > 0) |
+| FoR | `int + ref` | `enc > threshold - ref` | yes |
+| Bitpacked | `int` in k bits | `enc > threshold` | yes |
+| Dict | `index → values[index]` | resolve match-set of indices once, then `idx ∈ set` | yes (via indirection) |
+| ZigZag | `(u >> 1) ^ -(u & 1)` | none; decode required | **no** (encoded order ≠ decoded order) |
+| Pco / Zstd / FSST | compressed block | none; must decompress | no (opaque block) |
+
+**Composition.** `ALP(FoR(Bitpacked))` is the common chain in this
+project's OHLC data. The composed pushdown threshold is
+`enc_bp = floor(threshold / scale) - ref`, compared directly against the
+raw bitpacked output. Each step is monotonic invertible, so the chain
+composes — but only if every link supports pushdown.
+
+#### Kernel SPI
+
 `ScanIterator` routes `ScanOptions.rowFilter()` through a kernel SPI
 before falling back to materialization. Initial kernels:

-- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. For
-  `AlpDoubleArray`, encode the scalar to the int domain
-  (`enc = round(scalar / scale)`) and compare ints. For
-  `ForLongArray` (when it lands), subtract the reference and compare
-  ints. Falls back to materialization when the scalar does not
-  round-trip through the encoding.
+- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. Uses the
+  encoding-specific encoded-threshold formula. Falls back to
+  materialization when the threshold can't be encoded (NaN / Inf /
+  out-of-range / non-order-preserving encoding).
 - `BetweenKernel`: same approach for two scalars.
 - `TakeKernel`: `take(arr, indices)` — decode only the requested
   indices. Unblocks the take/slice/projection wins from phase 0.
@@ -284,6 +351,53 @@ intersecting selection vectors; `OR` unions them. Columns referenced
 only by the filter (not by projection) are decoded just enough to test
 and are not delivered to the consumer.

+#### PoC findings (worktree `lazy-alp-f64-poc`)
+
+A throwaway prototype on the `lazy-alp-f64-poc` worktree dropped
+`final` from `DoubleArray`, added `AlpDoubleArray extends DoubleArray`
+with overridden `getDouble` / `forEachDouble` / `fold` and a
+`sumWhereGt` pushdown method. The ALP decoder returned the lazy variant
+unconditionally (no `hasFilter()` gate) whenever there were no patches
+and the source segment was full-size writable.
+
+Measured at 10M rows on the OHLC fixture:
+
+| Workload | Eager baseline | Lazy + ALP pushdown | Δ | JNI |
+|----------|---------------:|--------------------:|--:|----:|
+| `javaReadClose` (full fold, no filter) | 68.5 | **75.8** | **+10.6%** | 53 (Java already wins) |
+| `javaFilterClose` sel=0.001 | 96 | 110 | +15% | 363 |
+| `javaFilterClose` sel=0.01 | 96 | 107 | +11% | 347 |
+| `javaFilterClose` sel=0.1 | 49 | **98** | **+100%** | 196 |
+| `javaFilterClose` sel=1.0 | 82 | 87 | +6% | 58 (Java wins) |
+
+**Takeaways:**
+
+1. Lazy on full fold is **strictly faster** than eager — eliminating
+   the materialization pass halves memory traffic. This invalidates
+   the earlier "gate lazy behind `hasFilter()`" rule from phase 2;
+   the gate is unnecessary and lazy can be the default.
+2. ALP-level pushdown captures a real win (especially the +100% at
+   sel=0.1), but **Bitpacked unpack is now the dominant cost** at low
+   selectivity (~24% of runnable time per JFR). It runs for every row
+   regardless of how selective the predicate is.
+3. The residual 2-3× JNI advantage at sel ≤ 0.01 is **SIMD bitpacked
+   unpack** in fastlanes, not the lack of more compute pushdown. To
+   close it the predicate must reach the bitpacked decoder *and* the
+   unpack must vectorize — which is the territory of
+   [ADR 0005](0005-vector-api-adoption.md), not this ADR.
+
+**Design implications carried forward to the final phases:**
+
+- Drop the `hasFilter()` gate. Lazy is the default for encodings that
+  support it; eager is just `LazyXxx.materialize(arena)`.
+- The pushdown method on `AlpDoubleArray` takes the predicate, not
+  just the threshold — `sumWhereGt` was a PoC shortcut. Production
+  shape is `compareGt(threshold) → BoolArray` or
+  `filterMap(pred, fn) → DoubleArray`.
+- The PoC subclass-with-buffer-field trick works fine in practice (no
+  field tax measured), but the public shape stays as the open
+  interface from phase 1 — the PoC just took the shortest path.
+
 ### Future — extend the lazy family

 Once ALP proves the shape, add (no API change, just new permits):
@@ -294,29 +408,6 @@ Once ALP proves the shape, add (no API change, just new permits):
 - Composed: `AlpForBitpackedDoubleArray` fuses three transforms into one
   expression evaluated per access

-### Phase 2 — compute pushdown
-
-`ScanIterator` routes `ScanOptions.rowFilter()` through a kernel SPI
-before falling back to materialization. Initial kernels:
-
-- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. For ALP,
-  encode the scalar to the int domain (`enc = round(scalar / scale)`)
-  and compare ints. For FoR, subtract the reference and compare ints.
-  Falls back to materialization when the scalar does not round-trip
-  through the encoding (e.g. ALP threshold that is not representable as
-  `int * 10^(f-e)` exactly).
-- `BetweenKernel`: same approach for two scalars.
-- `TakeKernel`: `take(arr, indices)` — decode only the requested
-  indices. Unblocks the take/slice/projection wins from phase 0.
-- `SumKernel`, `MinKernel`, `MaxKernel`: deferred. `sum(ALP) =
-  sum(int) * scale + patch_correction` is straightforward but not on the
-  critical path.
-
-For multi-column filters: `AND` evaluates kernels in column order,
-intersecting selection vectors; `OR` unions them. Columns referenced
-only by the filter (not by projection) are decoded just enough to test
-and are not delivered to the consumer.
-
 ## Consequences

 ### Positive