feat(cubestore): pre-filter in-memory chunks on worker before IPC by waralexrom · Pull Request #11040 · cube-js/cube

waralexrom · 2026-06-08T22:53:18Z

Summary

Trims in-memory chunks on the worker by the dedup-safe unique-key pushable predicate before they are serialized and shipped over IPC to the select subprocess. Parquet data is already pruned by the predicate; in-memory chunks previously crossed IPC whole, so a partition that survives range-pruning still shipped rows of many keys that the query immediately discards. This closes that gap.

Changes

Extract the dedup-safe pushable-filter selection into a shared dedup_safe_unique_key_filter reused by the scan-time FilterExec and the new worker pre-filter, so the two cannot diverge.
Compute the per-index pushable predicate once at planning time (choose_index_ext, from the same filters that gate partition pruning), strip column qualifiers, and carry it proto-encoded in PlanningMeta.pushable_chunk_filters (1:1 with indices, #[serde(default)] for back-compat).
On the worker, decode the predicate and trim the loaded chunk batches between load_in_memory_chunks and serialization (trace op chunks.prefilter). The subprocess still re-applies the predicate, so this only reduces IPC payload and is never relied on for correctness.
A chunk referenced by more than one scan (self-join / self-union of one index) is left untrimmed: the subprocess shares batches by chunk id, and a scan with a different or no predicate would otherwise lose rows. Reference counting spans all index snapshots, not just predicate-bearing ones.
Best-effort: any decode/build/eval failure is logged and skipped rather than failing the query.

Testing

New regression test prefilter_chunks_shared_scan (self-union of one index, one branch key-filtered, the other not) — reproduced the shared-chunk bug (60 vs expected 130) before the cross-snapshot reference-count fix, green after.
filter_pushdown_unique_key, unique_key_and_multi_*, *_stream_table, limit_pushdown_unique_key — pass in-process and multi-process (real select-subprocess IPC), no regressions.
cargo build -p cubestore clean; cargo fmt --check passes (pre-commit hook).

claude · 2026-06-08T22:53:34Z

Claude finished @waralexrom's task in 1m 37s —— View job

Code review

Solid, well-scoped change. The shared-chunk reference count across all index snapshots (not just predicate-bearing ones) is the key correctness insight, and the new prefilter_chunks_shared_scan test pins it down precisely. Defense-in-depth (subprocess re-applies the predicate; planner-side encode failures degrade to no pre-filter) keeps the surface area for correctness regressions small.

A few minor observations — nothing blocking:

1. in_memory_chunk_filter_groups doc is slightly misleading — serialized_plan.rs:1086

The doc says the result is "grouped by the proto-encoded pushable predicate", but the implementation produces one entry per index snapshot — two snapshots with identical predicate bytes yield two entries, not one merged group. Not a correctness issue (downstream just iterates), but the comment overstates the grouping. Consider either deduping by bytes or rewording (e.g., "one entry per index snapshot that carries a pre-filter").

2. phys.as_ref().unwrap() — query_executor.rs:2433

After the explicit if phys.is_none() build block, the immediate phys.as_ref().unwrap() is safe but slightly awkward. Binding the freshly built Arc directly avoids the unwrap:

let p = match phys.as_ref() {
    Some(p) => p.clone(),
    None => { /* build, set phys, return p */ }
};

Pure style.

3. Backward-compat direction is one-way

#[serde(default)] on pushable_chunk_filters covers old plan → new worker. The reverse — new router → old worker — relies on serde's default of ignoring unknown fields, which is true for bincode-derived serializers but worth confirming for whatever transport actually serializes PlanningMeta in your mixed-version rollouts. If a rolling upgrade is in scope, a quick check would be cheap insurance.

4. Filter cost for low-selectivity predicates

filter_record_batch reallocates the entire batch even when the predicate selects most rows. For the intended use case (selective unique-key filters in multi-tenant scans) this is a clear win, but if predicates ever land here that match >90% of rows, the IPC-saving math flips. Not worth gating on, but a future selectivity heuristic could short-circuit when the mask popcount is close to row count.

5. Test gap: pre-filter actually fires when it should

prefilter_chunks_shared_scan proves the skip path is correct (shared chunks aren't wrongly trimmed). It doesn't directly verify the active path trims anything — both branches produce the same final result whether the pre-filter ran or not, since the subprocess re-applies the predicate. A counter on chunks.prefilter (or an EXPLAIN ANALYZE DETAILED assertion now that #11016 landed) would catch a future refactor that silently disables the pre-filter. Not required for this PR.

Nits

query_executor.rs:2359 doc comment for dedup_safe_unique_key_filter is very good — the "commutes with last-row dedup" framing is exactly the right invariant to call out.
in_memory_chunk_filter_groups walks index_snapshots twice (once for chunk_counts, once for emit). Fine at current sizes; mentionable only if profiling ever points here.

LGTM modulo the doc tweak on item 1.

codecov · 2026-06-08T23:17:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.51%. Comparing base (d91df4b) to head (a12c836).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #11040   +/-   ##
=======================================
  Coverage   58.51%   58.51%           
=======================================
  Files         216      216           
  Lines       17265    17265           
  Branches     3524     3524           
=======================================
  Hits        10102    10102           
  Misses       6648     6648           
  Partials      515      515

Flag	Coverage Δ
cube-backend	`58.51% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

claude · 2026-06-09T06:47:24Z

Claude finished @waralexrom's task in 40s —— View job

Code review (re-review after sync)

Two new commits since the prior review:

392e472 — doc rewording on in_memory_chunk_filter_groups (addresses item ReferenceError: self is not defined - Issue with whatwg-fetch when using the core lib with Node.js #1 from my prior review — accurately describes "one entry per index snapshot that carries a pre-filter" rather than the misleading "grouped by predicate"). ✅
fb10cf6 — prefilter_chunks_shared_scan excluded from the migration suite. Reasonable: the test relies on a specific shared-snapshot planner shape, and migration replays don't need to re-validate the prefilter correctness path. No production code touched.

The substantive change (392e472's parent) is unchanged from my earlier review — still LGTM. Already approved by @paveltiunov.

Items #2–#5 from my prior review remain non-blocking polish, not regressions:

phys.as_ref().unwrap() after the build block in query_executor.rs:2433 — pure style.
New-router → old-worker compat depends on the transport's tolerance for unknown serde fields; worth a sanity check if mixed-version rollouts are in scope.
Selectivity heuristic for filter_record_batch — future work, only matters if low-selectivity predicates ever land here.
No assertion that the prefilter actually fires (skip path is covered; active path is not directly observable from the test result since the subprocess re-applies). An EXPLAIN ANALYZE DETAILED assertion on chunks.prefilter (now that feat(cubestore): EXPLAIN ANALYZE DETAILED per-query trace #11016 landed) would close that gap in a follow-up.

Trim in-memory chunks by the dedup-safe unique-key pushable predicate on the worker before they are serialized and shipped over IPC to the select subprocess. The predicate is computed once at planning time (from the same filters that gate partition pruning), carried in PlanningMeta, and applied to the loaded batches; the subprocess still re-applies it, so this only reduces IPC payload and is never relied on for correctness. A chunk referenced by more than one scan (self-join/self-union of one index) is left untrimmed: the subprocess shares batches by chunk id and a scan with a different or no predicate would otherwise lose rows. The pre-filter is best-effort - any failure is logged and skipped rather than failing the query.

…suite Migration tests run each registered case against a pre-baked data dir from the fixtures tarball; a brand-new test has none. Also clarify the in_memory_chunk_filter_groups doc (one entry per snapshot, not merged by predicate).

claude · 2026-06-09T09:17:36Z

Claude finished @waralexrom's task in 33s —— View job

Code review (sync re-check)

Diffed origin/master...HEAD, listed review threads, inspected new commits.

No new substantive changes since the prior re-review:

a5b9014 — feature commit (unchanged from the LGTM'd version).
a12c836 — excludes prefilter_chunks_shared_scan from the migration suite. Test-only; reasonable given the test depends on a specific shared-snapshot planner shape that migration replays don't need to re-validate.

Already approved by @paveltiunov; no new threads from human reviewers to consider. Prior non-blocking observations (items #2–#5 from my earlier review — phys.as_ref().unwrap() style, new-router→old-worker serde compat sanity check, low-selectivity heuristic for filter_record_batch, and an assertion that the active prefilter path fires) still stand as optional follow-ups and don't gate this PR.

LGTM.
· branch: cubestore-mem-chunks-filter

waralexrom requested a review from a team as a code owner June 8, 2026 22:53

github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 8, 2026

vercel Bot deployed to Preview June 8, 2026 22:55 View deployment

paveltiunov approved these changes Jun 8, 2026

View reviewed changes

waralexrom added 2 commits June 9, 2026 11:14

waralexrom force-pushed the cubestore-mem-chunks-filter branch from fb10cf6 to a12c836 Compare June 9, 2026 09:17

vercel Bot deployed to Preview June 9, 2026 09:19 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cubestore): pre-filter in-memory chunks on worker before IPC#11040

feat(cubestore): pre-filter in-memory chunks on worker before IPC#11040
waralexrom wants to merge 2 commits into
masterfrom
cubestore-mem-chunks-filter

waralexrom commented Jun 8, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

waralexrom commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Uh oh!

claude Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review (re-review after sync)

Uh oh!

claude Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review (sync re-check)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

waralexrom commented Jun 8, 2026 •

edited

Loading

claude Bot commented Jun 8, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading

claude Bot commented Jun 9, 2026 •

edited

Loading

claude Bot commented Jun 9, 2026 •

edited

Loading