diff --git a/docs/adr/0010-lazy-decode.md b/docs/adr/0010-lazy-decode.md
index 71c29dda..ca5d685d 100644
--- a/docs/adr/0010-lazy-decode.md
+++ b/docs/adr/0010-lazy-decode.md
@@ -121,6 +121,12 @@ Two observations:
 
 ### API gate — eager unless a filter is present
 
+> **Update (Phase 3 PoC):** measurement invalidated this gate. Lazy is
+> *strictly faster* than eager even on full-fold workloads because the
+> materialization pass disappears entirely. Keep this section for the
+> reasoning chain, but the final design drops the `hasFilter()` gate —
+> see the PoC findings under Phase 3.
+
 ```
 ScanOptions.hasFilter() == false  → eager path (today), zero change
 ScanOptions.hasFilter() == true   → lazy + compute pushdown
@@ -207,6 +213,11 @@ keeps third-party encodings on equal footing with built-in ones.
 
 ### Phase 2 — Lazy ALP variant + filter gate
 
+> **Update (Phase 3 PoC):** the `hasFilter()` gate is dropped in the
+> final design — `AlpDoubleArray` is returned unconditionally whenever
+> the encoded source is full-size writable and there are no patches.
+> Section preserved for the reasoning chain.
+
 Add the first lazy implementation:
 
 ```java
@@ -252,15 +263,71 @@ boolean signal.
 
 ### Phase 3 — compute pushdown
 
+#### The core trick — encode the threshold, integer-compare
+
+Pushdown works whenever the encoding is an **order-preserving
+invertible** function of one integer source. For ALP,
+`value = encoded * scale` with `scale > 0`, so:
+
+```
+value > threshold
+⇔ encoded * scale > threshold
+⇔ encoded > threshold / scale
+```
+
+Encode the threshold once, then the inner loop is pure integer compare
+on the source longs — no `cvt+mul` per rejected row. Decode happens
+only for matches:
+
+```java
+long enc_lo = (long) Math.floor(threshold / scale);
+for (long i = 0; i < n; i++) {
+    long lv = encoded.getAtIndex(LE_LONG, i);
+    if (lv > enc_lo) {
+        double v = (double) lv * scale;
+        if (v > threshold) {   // rare boundary check, see below
+            sum += v;
+        }
+    }
+}
+```
+
+The boundary check exists because `floor(threshold / scale) * scale`
+may not equal `threshold` exactly in FP — values whose encoded form is
+`enc_lo + 1` could decode to either side of the threshold. The check
+can only fail at that boundary value and is effectively free.
+
+Patches break the encoding invariant: rows that ALP couldn't encode
+are stored as raw doubles. Either compare them separately (rare, small
+count) or fall back to full materialization when the patch count
+exceeds some cutoff.
+
+#### Per-encoding applicability
+
+| Encoding | Encoded form | Pushdown predicate | Order-preserving? |
+|----------|--------------|--------------------|-------------------|
+| ALP | `int * scale` | `enc > floor(threshold / scale)` | yes (scale > 0) |
+| FoR | `int + ref` | `enc > threshold - ref` | yes |
+| Bitpacked| `int` in k bits | `enc > threshold` | yes |
+| Dict | `index → values[index]` | resolve match-set of indices once, then `idx ∈ set` | yes (via indirection) |
+| ZigZag | `(u >> 1) ^ -(u & 1)` | none; decode required | **no** (encoded order ≠ decoded order) |
+| Pco / Zstd / FSST | compressed block | none; must decompress | no |
+
+**Composition.** `ALP(FoR(Bitpacked))` is the common chain in this
+project's OHLC data. The composed pushdown threshold is
+`enc_bp = floor(threshold / scale) - ref`, compared directly against the
+raw bitpacked output. Each step is monotonic and invertible, so the chain
+composes — but only if every link supports pushdown.
+
+#### Kernel SPI
+
 `ScanIterator` routes `ScanOptions.rowFilter()` through a kernel SPI
 before falling back to materialization. Initial kernels:
 
-- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. For
-  `AlpDoubleArray`, encode the scalar to the int domain
-  (`enc = round(scalar / scale)`) and compare ints. For
-  `ForLongArray` (when it lands), subtract the reference and compare
-  ints. Falls back to materialization when the scalar does not
-  round-trip through the encoding.
+- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. Uses the
+  encoding-specific encoded-threshold formula. Falls back to
+  materialization when the threshold can't be encoded (NaN / Inf /
+  out-of-range / non-order-preserving encoding).
 - `BetweenKernel`: same approach for two scalars.
 - `TakeKernel`: `take(arr, indices)` — decode only the requested
   indices. Unblocks the take/slice/projection wins from phase 0.
@@ -284,6 +351,53 @@ intersecting selection vectors; `OR` unions them. Columns referenced
 only by the filter (not by projection) are decoded just enough to test
 and are not delivered to the consumer.
 
+#### PoC findings (worktree `lazy-alp-f64-poc`)
+
+A throwaway prototype on the `lazy-alp-f64-poc` worktree dropped
+`final` from `DoubleArray`, added `AlpDoubleArray extends DoubleArray`
+with overridden `getDouble` / `forEachDouble` / `fold` and a
+`sumWhereGt` pushdown method. The ALP decoder returned the lazy variant
+unconditionally (no `hasFilter()` gate) whenever there were no patches
+and the source segment was full-size writable.
+
+Measured at 10M rows on the OHLC fixture:
+
+| Workload | Eager baseline | Lazy + ALP pushdown | Δ | JNI |
+|----------|---------------:|--------------------:|--:|----:|
+| `javaReadClose` (full fold, no filter) | 68.5 | **75.8** | **+10.6%** | 53 (Java already wins) |
+| `javaFilterClose` sel=0.001 | 96 | 110 | +15% | 363 |
+| `javaFilterClose` sel=0.01 | 96 | 107 | +11% | 347 |
+| `javaFilterClose` sel=0.1 | 49 | **98** | **+100%** | 196 |
+| `javaFilterClose` sel=1.0 | 82 | 87 | +6% | 58 (Java wins) |
+
+**Takeaways:**
+
+1. Lazy on full fold is **strictly faster** than eager — eliminating
+   the materialization pass halves memory traffic. This invalidates
+   the earlier "gate lazy behind `hasFilter()`" rule from phase 2;
+   the gate is unnecessary and lazy can be the default.
+2. ALP-level pushdown captures a real win (especially the +100% at
+   sel=0.1), but **Bitpacked unpack is now the dominant cost** at low
+   selectivity (~24% of runnable time per JFR). It runs for every row
+   regardless of how selective the predicate is.
+3. The residual ~3× JNI advantage at sel ≤ 0.01 is **SIMD bitpacked
+   unpack** in fastlanes, not the lack of more compute pushdown. To
+   close it, the predicate must reach the bitpacked decoder *and* the
+   unpack must vectorize — which is the territory of
+   [ADR 0005](0005-vector-api-adoption.md), not this ADR.
+
+**Design implications carried forward to the final phases:**
+
+- Drop the `hasFilter()` gate. Lazy is the default for encodings that
+  support it; eager is just `LazyXxx.materialize(arena)`.
+- The pushdown method on `AlpDoubleArray` takes the predicate, not
+  just the threshold — `sumWhereGt` was a PoC shortcut. Production
+  shape is `compareGt(threshold) → BoolArray` or
+  `filterMap(pred, fn) → DoubleArray`.
+- The PoC subclass-with-buffer-field trick works fine in practice (no
+  field tax measured), but the public shape stays as the open
+  interface from phase 1 — the PoC just took the shortest path.
+
 ### Future — extend the lazy family
 
 Once ALP proves the shape, add (no API change, just new permits):
@@ -294,29 +408,6 @@ Once ALP proves the shape, add (no API change, just new permits):
 - Composed: `AlpForBitpackedDoubleArray` fuses three transforms into one
   expression evaluated per access
 
-### Phase 2 — compute pushdown
-
-`ScanIterator` routes `ScanOptions.rowFilter()` through a kernel SPI
-before falling back to materialization. Initial kernels:
-
-- `CompareKernel`: `compare(arr, scalar, op) → BoolArray`. For ALP,
-  encode the scalar to the int domain (`enc = round(scalar / scale)`)
-  and compare ints. For FoR, subtract the reference and compare ints.
-  Falls back to materialization when the scalar does not round-trip
-  through the encoding (e.g. ALP threshold that is not representable as
-  `int * 10^(f-e)` exactly).
-- `BetweenKernel`: same approach for two scalars.
-- `TakeKernel`: `take(arr, indices)` — decode only the requested
-  indices. Unblocks the take/slice/projection wins from phase 0.
-- `SumKernel`, `MinKernel`, `MaxKernel`: deferred. `sum(ALP) =
-  sum(int) * scale + patch_correction` is straightforward but not on the
-  critical path.
-
-For multi-column filters: `AND` evaluates kernels in column order,
-intersecting selection vectors; `OR` unions them. Columns referenced
-only by the filter (not by projection) are decoded just enough to test
-and are not delivered to the consumer.
-
 ## Consequences
 
 ### Positive