feat(cubestore): pre-filter in-memory chunks on worker before IPC #11040
waralexrom wants to merge 2 commits into
Claude finished @waralexrom's task in 1m 37s

Code review

Solid, well-scoped change. The shared-chunk reference count across all index snapshots (not just predicate-bearing ones) is the key correctness insight, and the new

A few minor observations, nothing blocking:

1. The doc says the result is "grouped by the proto-encoded pushable predicate", but the implementation produces one entry per index snapshot: two snapshots with identical predicate bytes yield two entries, not one merged group. Not a correctness issue (downstream just iterates), but the comment overstates the grouping. Consider either deduping by

2. After the explicit

    let p = match phys.as_ref() {
        Some(p) => p.clone(),
        None => { /* build, set phys, return p */ }
    };

Pure style.

3. Backward-compat direction is one-way
4. Filter cost for low-selectivity predicates
5. Test gap: pre-filter actually fires when it should
Nits
LGTM modulo the doc tweak on item 1.
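The distinction raised in item 1 can be made concrete with a minimal sketch. The `Snapshot` type and both function names below are hypothetical stand-ins, not identifiers from the cubestore code base; the sketch only contrasts "one entry per snapshot" with "merged by identical predicate bytes".

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one index snapshot: the ids of its in-memory
/// chunks plus the proto-encoded pushable predicate (opaque bytes here).
pub struct Snapshot {
    pub chunk_ids: Vec<u64>,
    pub predicate: Vec<u8>,
}

/// One entry per snapshot, even when two snapshots carry identical
/// predicate bytes (the behavior the review describes).
pub fn per_snapshot_entries(snapshots: &[Snapshot]) -> Vec<(Vec<u8>, Vec<u64>)> {
    snapshots
        .iter()
        .map(|s| (s.predicate.clone(), s.chunk_ids.clone()))
        .collect()
}

/// Alternative: merge snapshots whose predicate bytes are identical,
/// which is what "grouped by predicate" would literally mean.
pub fn grouped_by_predicate(snapshots: &[Snapshot]) -> BTreeMap<Vec<u8>, Vec<u64>> {
    let mut groups: BTreeMap<Vec<u8>, Vec<u64>> = BTreeMap::new();
    for s in snapshots {
        groups
            .entry(s.predicate.clone())
            .or_default()
            .extend(s.chunk_ids.iter().copied());
    }
    groups
}
```

For two snapshots with the same predicate bytes, the first function returns two entries and the second returns one merged group. Since downstream only iterates, either shape works; the doc comment just needs to match whichever is implemented.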
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #11040 +/- ##
=======================================
Coverage 58.51% 58.51%
=======================================
Files 216 216
Lines 17265 17265
Branches 3524 3524
=======================================
Hits 10102 10102
Misses 6648 6648
Partials 515 515
Claude finished @waralexrom's task in 40s

Code review (re-review after sync)

Two new commits since the prior review:
The substantive change (
Items #2–#5 from my prior review remain non-blocking polish, not regressions:
Trim in-memory chunks by the dedup-safe unique-key pushable predicate on the worker before they are serialized and shipped over IPC to the select subprocess. The predicate is computed once at planning time (from the same filters that gate partition pruning), carried in PlanningMeta, and applied to the loaded batches; the subprocess still re-applies it, so this only reduces IPC payload and is never relied on for correctness. A chunk referenced by more than one scan (self-join/self-union of one index) is left untrimmed: the subprocess shares batches by chunk id and a scan with a different or no predicate would otherwise lose rows. The pre-filter is best-effort - any failure is logged and skipped rather than failing the query.
…suite Migration tests run each registered case against a pre-baked data dir from the fixtures tarball; a brand-new test has none. Also clarify the in_memory_chunk_filter_groups doc (one entry per snapshot, not merged by predicate).
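The shared-chunk rule from the first commit message can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's code: `Scan`, `chunk_ref_counts`, and `prefilter` are hypothetical names, rows are reduced to a single `i64` key column, and the predicate is reduced to an equality on that key.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one scan of an index snapshot.
pub struct Scan {
    pub chunk_ids: Vec<u64>,
    /// Simplified pushable predicate: keep rows whose key equals this value.
    /// `None` means the scan carries no pushable predicate.
    pub key_equals: Option<i64>,
}

/// Count references to each chunk id across ALL scans, including scans
/// that carry no predicate.
pub fn chunk_ref_counts(scans: &[Scan]) -> HashMap<u64, usize> {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for scan in scans {
        for id in &scan.chunk_ids {
            *counts.entry(*id).or_insert(0) += 1;
        }
    }
    counts
}

/// Trim chunk rows in place, but only for chunks referenced exactly once.
/// A chunk shared by several scans is left whole, because another scan
/// (with a different or no predicate) may still need the dropped rows.
pub fn prefilter(chunks: &mut HashMap<u64, Vec<i64>>, scans: &[Scan]) {
    let counts = chunk_ref_counts(scans);
    for scan in scans {
        if let Some(key) = scan.key_equals {
            for id in &scan.chunk_ids {
                if counts.get(id).copied() != Some(1) {
                    continue; // shared chunk: leave untrimmed
                }
                if let Some(rows) = chunks.get_mut(id) {
                    rows.retain(|k| *k == key);
                }
            }
        }
    }
}
```

Counting over predicate-bearing scans only would miss the case where the second reference comes from an unfiltered scan, which is the self-union scenario the commit message describes.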
fb10cf6 to a12c836
Claude finished @waralexrom's task in 33s

Code review (sync re-check)

No new substantive changes since the prior re-review:
Already approved by @paveltiunov; no new threads from human reviewers to consider. Prior non-blocking observations (items #2–#5 from my earlier review
LGTM.
Summary
Trims in-memory chunks on the worker by the dedup-safe unique-key pushable predicate before they are serialized and shipped over IPC to the select subprocess. Parquet data is already pruned by the predicate; in-memory chunks previously crossed IPC whole, so a partition that survives range-pruning still shipped rows of many keys that the query immediately discards. This closes that gap.
Changes
dedup_safe_unique_key_filter reused by the scan-time FilterExec and the new worker pre-filter, so the two cannot diverge.
choose_index_ext, from the same filters that gate partition pruning), strip column qualifiers, and carry it proto-encoded in PlanningMeta.pushable_chunk_filters (1:1 with indices, #[serde(default)] for back-compat).
load_in_memory_chunks and serialization (trace op chunks.prefilter). The subprocess still re-applies the predicate, so this only reduces IPC payload and is never relied on for correctness.
Testing
prefilter_chunks_shared_scan (self-union of one index, one branch key-filtered, the other not): reproduced the shared-chunk bug (60 vs expected 130) before the cross-snapshot reference-count fix, green after.
filter_pushdown_unique_key, unique_key_and_multi_*, *_stream_table, limit_pushdown_unique_key: pass in-process and multi-process (real select-subprocess IPC), no regressions.
cargo build -p cubestore clean; cargo fmt --check passes (pre-commit hook).
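The "never relied on for correctness" property can be sketched in a few lines. The names `worker_prefilter` and `subprocess_filter` are hypothetical, rows are modeled as a plain `i64` key column, and the failure path is simulated with a closure that returns `Err`; the actual code operates on loaded batches and a proto-encoded predicate.

```rust
/// Subprocess side: always applies the predicate, whether or not the
/// worker trimmed the rows beforehand.
pub fn subprocess_filter(rows: &[i64], key: i64) -> Vec<i64> {
    rows.iter().copied().filter(|r| *r == key).collect()
}

/// Worker side: best effort. If the filter fails for any reason, log the
/// error and ship the rows unfiltered instead of failing the query.
pub fn worker_prefilter<F>(rows: Vec<i64>, filter: F) -> Vec<i64>
where
    F: Fn(&[i64]) -> Result<Vec<i64>, String>,
{
    match filter(&rows) {
        Ok(trimmed) => trimmed,
        Err(err) => {
            eprintln!("chunk pre-filter skipped: {}", err);
            rows
        }
    }
}
```

Because the subprocess filter is idempotent, its output is the same whether the worker trimmed the rows or fell back to shipping them whole; only the size of the IPC payload differs.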