
Antalya 26.3 Hybrid: added support for segment pruning#1788

Open
mkmkme wants to merge 3 commits into antalya-26.3 from mkmkme/antalya-26.3/hybrid-segment-pruning

Conversation

@mkmkme
Collaborator

@mkmkme mkmkme commented May 14, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Hybrid tables can now ignore segments that cannot be matched by the combination of the WHERE clause and the segment predicate

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@mkmkme mkmkme added antalya port-antalya PRs to be ported to all new Antalya releases hybrid antalya-26.3 labels May 14, 2026
@github-actions

github-actions Bot commented May 14, 2026

Workflow [PR], commit [00ca0b4]

@svb-alt svb-alt requested a review from arthurpassos May 14, 2026 12:13
arthurpassos
arthurpassos previously approved these changes May 14, 2026
Collaborator

@arthurpassos arthurpassos left a comment


I can't help much here tbh

Comment on lines +101 to +102

`if (col_ranges.empty())`
Collaborator


Why is it that when there are no ranges you can actually prune the segment?

Collaborator Author


col_ranges being empty at this stage (after extractPlainRangesForColumn has run) means that the segment's own predicate, projected onto column i, admits no values.
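The soundness argument (per-column emptiness implies whole-segment emptiness) can be sketched with plain interval arithmetic. A hypothetical, self-contained illustration; `Range`, `projectColumn`, and `canPruneSegment` are stand-ins, not the actual KeyCondition API:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// A closed interval over a single column's values (assume left <= right).
struct Range { double left; double right; };

// Intersect two closed intervals; an empty intersection is nullopt.
std::optional<Range> intersect(const Range & a, const Range & b)
{
    double l = std::max(a.left, b.left);
    double r = std::min(a.right, b.right);
    if (l > r)
        return std::nullopt;
    return Range{l, r};
}

// Project the segment predicate and the user's WHERE clause onto one column,
// keeping only the ranges that survive the intersection.
std::vector<Range> projectColumn(const std::vector<Range> & segment_ranges,
                                 const std::vector<Range> & where_ranges)
{
    std::vector<Range> col_ranges;
    for (const auto & s : segment_ranges)
        for (const auto & w : where_ranges)
            if (auto r = intersect(s, w))
                col_ranges.push_back(*r);
    return col_ranges;
}

// Per-column emptiness implies whole-segment emptiness: if no value of this
// column can satisfy both predicates, no row of the segment can match.
bool canPruneSegment(const std::vector<Range> & segment_ranges,
                     const std::vector<Range> & where_ranges)
{
    return projectColumn(segment_ranges, where_ranges).empty();
}
```

If even one column projects to an empty range set, no row can satisfy both predicates, so the segment can be skipped without being read.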

@mkmkme
Collaborator Author

mkmkme commented May 15, 2026

The failing tests seem unrelated to my changes.

@alsugiliazova
Member

Verification: PR #1788

PR-added tests — all GREEN

03646_hybrid_segment_pruning is 6/6 OK across every stateless job that picked it up:

| Job | Status |
| --- | --- |
| Stateless tests (amd_asan, distributed plan, parallel, 4/4) | OK |
| Stateless tests (amd_debug, distributed plan, s3 storage, parallel) | OK |
| Stateless tests (amd_debug, parallel) | OK |
| Stateless tests (arm_asan, azure, parallel, 4/4) | OK |
| Stateless tests (arm_asan, targeted) | OK |
| Stateless tests (arm_binary, parallel) | OK |
| Fast test | SKIPPED (expected, no-fasttest) |

The new gtest gtest_hybrid_segment_pruner is built into the unit-tests binary; the unit-tests job is green.

The Hybrid segment-pruning path has clean positive coverage at both unit and stateless levels.

CI overview (head commit)

  • PR test workflow: 46 success / 48 skipped / 3 failure
  • Regression workflow: 28 success / 69 skipped / 3 failure (chronic baseline; actually one fewer suite than typical)

PR test-workflow failures

| Check | Verdict |
| --- | --- |
| Integration tests (amd_asan, db disk, old analyzer, 1/6) — 9 × test_storage_kafka/test_batch_fast.py::* + Timeout | Job-level timeout cascade. The Kafka batch-fast subset failed together and the job reported Timeout. The shared Timeout symptom has 51 fails across 9 PRs in the last 30d — a recurring integration-job timeout, not PR-caused. The PR diff is in Hybrid/Distributed, not Kafka. |
| Stateless tests (arm_asan, azure, sequential, 1/2) — 03760_backup_tar_archive | Pre-existing flake (18 fails / 9 PRs in 30d). Unrelated to segment pruning. |
| Stress test (arm_asan, s3) — Cannot start clickhouse-server | Stress infra flake (39 fails / 15 PRs in 30d). The server fails to come up in the stress harness; not a product-test FAIL. |

Regression-workflow failures (chronic baseline on antalya-26.3)

| Suite | Fails |
| --- | --- |
| Swarms (Aarch64 + Release) | 230 |
| Parquet (Aarch64 + Release) | 18 |
| S3Export part (Aarch64 + Release) | 14 |

Same fingerprint as sibling antalya-26.3 PRs (1776, 1782, 1783, 1775, 1773, …). On this run S3Export partition was actually green; no new failure modes introduced.

Caveat — partial frontport

PR lands on antalya-26.3 while companion features from antalya-26.1 are still being frontported in parallel. Final re-verify recommended once the rest of the bundle lands.

Verdict

Safe to merge.

@alsugiliazova
Member

Audit: PR #1788 — Antalya 26.3 Hybrid: added support for segment pruning

AI audit note: This review comment was generated by AI (Cursor agent, audit-review skill).

Scope reviewed

Head acccd3f9 against base antalya-26.3. Feature scope: new HybridSegmentPruner (range-based unsat detection over the Hybrid identity key), extension of KeyCondition::extractPlainRangesForColumn for multi-column projection, integration into StorageDistributed::{getStorageSnapshot, getQueryProcessingStage, computeHybridPruningVerdict, substituteHybridWatermarks, getHybridWatermarkSnapshot, read}, new local-plan code path that bypasses ClusterProxy::executeQuery when the base segment is pruned but additional segments survive, and the new gtest plus the SQL test 03646_hybrid_segment_pruning. Method: static analysis (call graph + transition matrix + logical fault injection). No local build or test execution.

Confirmed defects

Low: Misleading parameter comment in createLocalPlan call

  • Impact: Documentation drift; future maintainers will be misled about which createLocalPlan parameter they are setting. Runtime behavior is correct because false happens to be the intended value of the actual parameter.
  • Anchor: src/Storages/StorageDistributed.cpp, StorageDistributed::read, the new cluster.empty() && !additional_query_infos.empty() block (around lines 1556-1558).
  • Trigger: Code review of the new local-plan code path.
  • Why defect: The 7th positional argument of createLocalPlan is bool build_logical_plan (see src/Processors/QueryPlan/DistributedCreateLocalPlan.h), but the call labels it /*has_missing_objects=*/false. There is no has_missing_objects parameter on this overload — that name belongs to createLocalPlanForParallelReplicas, not createLocalPlan. The mirroring block in ClusterProxy::executeQuery passes a plain false, consistent with the actual signature.
  • Fix direction: Replace the inline comment with /*build_logical_plan=*/false, or drop the named comment entirely to match the existing call in ClusterProxy::executeQuery.
  • Regression test direction: No runtime test needed; a single-line edit verified by build.
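The defect above is an instance of a general C++ pitfall: a `/*name=*/` comment on a positional bool argument is invisible to the compiler, so it can silently drift away from the real parameter name. A minimal sketch under hypothetical names (makePlan is not the real createLocalPlan signature), including the scoped-enum alternative that makes the call site compiler-checkable:

```cpp
// A positional bool parameter, like the 7th argument of createLocalPlan.
int makePlan(bool build_logical_plan)
{
    return build_logical_plan ? 1 : 0;
}

// The compiler cannot catch a mislabeled comment:
//   makePlan(/*has_missing_objects=*/false);  // compiles fine, comment is wrong
// A scoped enum moves the intent into the type system instead.
enum class BuildLogicalPlan : bool { No = false, Yes = true };

int makePlanChecked(BuildLogicalPlan build)
{
    return build == BuildLogicalPlan::Yes ? 1 : 0;
}
```

With the enum, writing the wrong name at the call site is a compile error rather than silent documentation drift.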

Low: HybridSegmentPruner builds the identity key over getColumns().getAll(), which includes ALIAS and EPHEMERAL columns

  • Impact: When the Hybrid table has alias / ephemeral columns, those columns are added to the identity key, but the user-side filter_actions_dag references the underlying expression (the Planner has resolved the alias). The user KeyCondition then cannot match those alias entries in the DAG and they sit unused; only the analysis cost is paid. Worst realistic outcome: an exception in HybridSegmentPruner's constructor (e.g. an ALIAS expression failing in KeyDescription::getKeyFromAST) — which getQueryProcessingStage does not catch — would convert a previously-working SELECT from a Hybrid table with such an alias into a hard error.
  • Anchor: src/Storages/StorageDistributed.cpp, StorageDistributed::computeHybridPruningVerdict: NamesAndTypesList hybrid_columns = storage_snapshot->metadata->getColumns().getAll();
  • Trigger: Hybrid table whose CREATE includes an ALIAS column (<name> <type> ALIAS <expr>) or an EPHEMERAL column.
  • Why defect: getAll() returns ordinary + materialized + alias + ephemeral. Only getAllPhysical() (or a similarly filtered list) is appropriate for an "identity key over comparable physical columns".
  • Fix direction: Replace getAll() with getAllPhysical() (or equivalent) before passing to filterComparable. Optionally also drop MATERIALIZED columns whose expression is non-trivial.
  • Regression test direction: Add a query-level test creating a Hybrid table that has an ALIAS column on top of a watermarked column; run a SELECT with a WHERE that should prune; assert the plan does not throw and pruning still applies on the watermark column.
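The fix direction amounts to a simple filter over column kinds. A hypothetical, self-contained illustration (ColumnKind, ColumnDesc, and getAllPhysical here are stand-ins, not the actual ClickHouse ColumnsDescription API): Materialized columns stay because they are physically stored; Alias and Ephemeral are dropped because they have no stored values the read path could compare against.

```cpp
#include <string>
#include <vector>

// Stand-in for ClickHouse's column default kinds (names are illustrative).
enum class ColumnKind { Ordinary, Materialized, Alias, Ephemeral };

struct ColumnDesc
{
    std::string name;
    ColumnKind kind;
};

// Materialized columns are physically stored (their expression only runs at
// INSERT time), so they are kept; Alias/Ephemeral are not stored at read time.
bool isPhysical(const ColumnDesc & col)
{
    return col.kind == ColumnKind::Ordinary || col.kind == ColumnKind::Materialized;
}

// Filter the full column list down to what an identity key may be built over.
std::vector<ColumnDesc> getAllPhysical(const std::vector<ColumnDesc> & all)
{
    std::vector<ColumnDesc> out;
    for (const auto & c : all)
        if (isPhysical(c))
            out.push_back(c);
    return out;
}
```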

Coverage summary

  • Scope reviewed: New HybridSegmentPruner class (src/Storages/HybridSegmentPruner.{h,cpp}); KeyCondition::extractPlainRangesForColumn (multi-column range extraction); StorageDistributed::{getStorageSnapshot, getQueryProcessingStage, computeHybridPruningVerdict, substituteHybridWatermarks, getHybridWatermarkSnapshot, read} changes; the new local-plan code path for empty cluster + surviving additional segments; the gtest (gtest_hybrid_segment_pruner.cpp); and the SQL test (03646_hybrid_segment_pruning).
  • Categories failed: Misleading parameter comment in createLocalPlan; identity key built from getAll() rather than physical columns.
  • Categories passed: Lifetime of KeyCondition after the source DAG goes out of scope (RPN is self-contained, no dangling pointers); thread-safety of MultiVersion<WatermarkParams> snapshot freezing via HybridSnapshotData; consistency of the pruning verdict between getQueryProcessingStage and read (same query_info.filter_actions_dag and same storage_snapshot); soundness of per-column projection in extractPlainRangesForColumn for atoms / AND / OR (existential projection over-approximation); reachability of the FUNCTION_NOT branch (only top-level not(...) wrappers around opaque sub-expressions appear, which surface as FUNCTION_UNKNOWN and short-circuit return false before NOT is processed, so the multi-column NOT-after-projection unsoundness is not exploitable in practice); col_ranges.empty() -> prune is sound (per-column emptiness implies whole-segment emptiness); rect bounds-check in checkInHyperrectangle (rect length equals identity_key.column_names.size(), which equals user_condition.num_key_columns); empty-cluster early-returns (analyzer returns an empty plan, legacy adds NullSource, "case 2" handles surviving additional segments without dereferencing cluster.getShardsAddresses().front() from updateSettingsAndClientInfoForCluster); processing-stage math after base pruning (nodes recomputed from empty cluster + surviving segment count, Complete stage chosen iff one survivor); exception safety on missing watermark (substitution throws BAD_ARGUMENTS consistently in both phases); double pruning-verdict computation matches because the filter DAG, storage snapshot, segments, and watermark snapshot are all stable across the two calls; gtest covers range contradiction, overlap, OR-with-non-mandatory-conjunct, bounded DNF with constant folding, and unsupported-atom-in-OR (conservative keep); SQL test covers baseline, cold/hot pruned individually, all-pruned, three-segment table, OR with non-mandatory conjunct, JOIN (conservative skip), PREWHERE participation, and alias-shadows-Hybrid-column.
  • Assumptions / limits: Static reasoning only; the gtest and SQL test were not executed in this audit. The "FUNCTION_NOT unreachable" argument relies on the current cloneDAGWithInversionPushDown behavior (a not(...) wrapper is added only around non-relational, non-and/or/cast/materialize/indexHint subtrees, which surface as FUNCTION_UNKNOWN); a future change that emits FUNCTION_NOT around a known atom would expose the multi-column NOT-after-projection bug and should be revisited.

mkmkme added 2 commits May 15, 2026 15:39
This was suggested during the verification stage. Replacing getAll()
with getAllPhysical() avoids potentially wasted work.
@mkmkme
Collaborator Author

mkmkme commented May 15, 2026

@alsugiliazova, I've pushed two commits resolving both issues.

Low: Misleading parameter comment in createLocalPlan call

This issue was unrelated to this PR, but I fixed it nonetheless to avoid any confusion in the future.

Low: HybridSegmentPruner builds the identity key over getColumns().getAll(), which includes ALIAS and EPHEMERAL columns

Replaced getAll() with getAllPhysical(). I must say, though, that the audit report slightly overstated the worst-case scenario here. An exception cannot happen in the pruner's constructor, because the alias's expression never reaches KeyDescription; only the column's name and type do. The worst-case scenario would be just wasted work; there wouldn't be any correctness issue.
The optional suggestion to drop MATERIALIZED columns whose expression is non-trivial is not needed. MATERIALIZED columns are physically stored; the expression only runs at INSERT to populate the stored value. From the read path's perspective they are the same as ordinary columns, and the user's filter can reference them directly. Excluding them would be wrong.

@alsugiliazova
Member

Resolved (verification update on 00ca0b48)

Both Low-severity findings from the previous round have been addressed.

Resolved: Misleading parameter comment in createLocalPlan call → fixed in 04758b6a (/*has_missing_objects=*/false → /*build_logical_plan=*/false). The new comment matches the actual 7th positional argument (bool build_logical_plan) of createLocalPlan in src/Processors/QueryPlan/DistributedCreateLocalPlan.h. Verified by re-reading StorageDistributed::read, the cluster.empty() && !additional_query_infos.empty() block.

Resolved: Identity key built from getAll() instead of physical columns → fixed in 00ca0b48 (getAll() → getAllPhysical() in StorageDistributed::computeHybridPruningVerdict). getAllPhysical() correctly keeps Ordinary and Materialized columns and drops Alias/Ephemeral, so the identity key now reflects only what the user filter can actually reference at read time.

Audit corrections accepted

  • Worst-case impact was overstated: my earlier write-up suggested an exception could escape HybridSegmentPruner's constructor when an alias expression failed analysis. That is not reachable: buildIdentityKey calls KeyDescription::getKeyFromAST(tuple(...), ColumnsDescription{NamesAndTypesList}). Constructing ColumnsDescription from a NamesAndTypesList materializes plain columns (ColumnsDescription::add(ColumnDescription(name, type))), so the alias's source expression is dropped at the getAll() boundary. The actual worst case is exactly what the commit message says: wasted analysis work on alias entries that the user filter cannot bind to. No correctness or exception path.
  • MATERIALIZED exclusion suggestion was wrong: my earlier note ("optionally drop MATERIALIZED columns whose expression is non-trivial") was incorrect. MATERIALIZED columns are physically stored — the expression only runs at INSERT time to populate the value — so user filters can reference them directly and pruning on them is valid. getAllPhysical() keeps them, and that is the right behavior; excluding them would have been a missed-pruning regression.

Confirmed defects

No confirmed defects in reviewed scope on head 00ca0b48.

@alsugiliazova alsugiliazova added the verified Approved for release label May 15, 2026

Labels

antalya antalya-26.3 hybrid port-antalya PRs to be ported to all new Antalya releases verified Approved for release
