feat: clustered segments + MSQ compaction #19597

Open
clintropolis wants to merge 2 commits into apache:master from clintropolis:clustered-segment-compaction

Conversation

@clintropolis

Member

Description

Follow-up to #19579. This PR adds MSQ compaction support for clustered segments when using 'inline' or reindexing-template-based compaction.

changes:

  • CompactionTask can now specify a baseTable spec to create clustered segments
  • DataSourceMSQDestination can now specify a baseTable so MSQ can generate clustered segments (or any other future baseTable spec)
  • adds baseTable to the 'inline' and reindexing template compaction configs, which feed it to the compaction task for auto-compaction
  • adds baseTable and segmentGranularitySpec to CompactionState; CompactionStatus is now baseTable-aware in its checks
  • adds guards that prevent baseTable from being used with 'native' compaction and direct users towards MSQ compaction

@github-actions (bot) added the labels Area - Batch Ingestion, Area - Querying, Area - Ingestion, and Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262) on Jun 17, 2026
capistrant previously approved these changes Jun 17, 2026

@capistrant capistrant left a comment

Contributor

Looks good. I re-read some of the paths before hitting submit, and I think the other top-level configs that the base table now owns are silently ignored, which seems kind of sad but at least not destructive. I think it is worth forcing an operator to acknowledge what they are doing by cleaning up their config (or rules).

}
return CompactionConfigValidationResult.success();
} else {
return compactionConfigSupportedByMSQEngine(newConfig);

Contributor

What about adding a baseTable != null block to this MSQ validation method that blows up if the config tries to set the top-level data schema configs that are now handled by the base table? It is fuzzy to me whether that situation would be handled okay (base table wins, or it blows up later on), but blowing up eagerly feels safer and will prompt the operator to consciously modify the config (or rules?) to make it play nice.
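A minimal sketch of the eager check suggested above, not Druid's actual validation code. `SketchCompactionConfig`, `BaseTableConflictCheck`, and the particular list of top-level fields (`dimensionsSpec`, `metricsSpec`, `transformSpec`) are assumptions for illustration; the PR does not spell out exactly which configs the base table takes over.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a compaction config (not Druid's class). */
final class SketchCompactionConfig {
    final Object baseTable;      // non-null when clustered segments are requested
    final Object dimensionsSpec; // legacy top-level schema fields (assumed list)
    final Object metricsSpec;
    final Object transformSpec;

    SketchCompactionConfig(Object baseTable, Object dimensionsSpec,
                           Object metricsSpec, Object transformSpec) {
        this.baseTable = baseTable;
        this.dimensionsSpec = dimensionsSpec;
        this.metricsSpec = metricsSpec;
        this.transformSpec = transformSpec;
    }
}

final class BaseTableConflictCheck {
    /** Names of top-level fields set alongside baseTable; empty when there is no conflict. */
    static List<String> conflictingFields(SketchCompactionConfig config) {
        List<String> conflicts = new ArrayList<>();
        if (config.baseTable == null) {
            return conflicts; // legacy config: nothing for this check to do
        }
        if (config.dimensionsSpec != null) {
            conflicts.add("dimensionsSpec");
        }
        if (config.metricsSpec != null) {
            conflicts.add("metricsSpec");
        }
        if (config.transformSpec != null) {
            conflicts.add("transformSpec");
        }
        return conflicts;
    }

    /** Fails at config time so the operator must clean up before any task is submitted. */
    static void validate(SketchCompactionConfig config) {
        List<String> conflicts = conflictingFields(config);
        if (!conflicts.isEmpty()) {
            throw new IllegalArgumentException(
                "baseTable cannot be combined with top-level " + conflicts
                + "; move these settings into baseTable or remove them");
        }
    }
}
```

Returning the list of offending field names, rather than a bare boolean, lets the error message tell the operator exactly what to remove.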

@FrankChen021 FrankChen021 left a comment

Member

| Severity | Findings |
|----------|----------|
| P0       | 0        |
| P1       | 2        |
| P2       | 1        |
| P3       | 0        |
| Total    | 3        |

Reviewed 55 of 55 changed files.

Findings that could not be attached inline:

  • server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java:127 - [P2] Validate range partitions against baseTable columns. The affected call is existing code made incomplete by the new baseTable config field. compactionConfigSupportedByMSQEngine still passes dimensionSchemas from newConfig.getDimensionsSpec(), so baseTable configs with no legacy dimensionsSpec skip range-partition column validation at config time. The same config is later validated against dataSchema.getDimensionsSpec() inside MSQCompactionRunner and can fail only after the coordinator submits tasks. Use newConfig.getBaseTable().getDimensionsSpec()/virtual columns when baseTable is present.
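One way to read the P2 finding, as a hedged sketch rather than the real `ClientCompactionRunnerInfo` logic: choose the column source once (baseTable columns when present, otherwise the legacy dimensions) and run the range-partition check against that. `RangePartitionColumnCheck` and its methods are hypothetical, and the simple "column must be known" test stands in for whatever the actual validation does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RangePartitionColumnCheck {
    /** Columns to validate against: baseTable columns when present, else legacy dimensions. */
    static List<String> effectiveColumns(List<String> baseTableColumns,
                                         List<String> legacyDimensions) {
        if (baseTableColumns != null) {
            return baseTableColumns;
        }
        return legacyDimensions == null ? Collections.<String>emptyList() : legacyDimensions;
    }

    /** Range-partition dimensions that the effective schema does not define. */
    static List<String> missingPartitionColumns(List<String> partitionDimensions,
                                                List<String> baseTableColumns,
                                                List<String> legacyDimensions) {
        Set<String> known = new HashSet<>(effectiveColumns(baseTableColumns, legacyDimensions));
        List<String> missing = new ArrayList<>();
        for (String dimension : partitionDimensions) {
            if (!known.contains(dimension)) {
                missing.add(dimension);
            }
        }
        return missing;
    }
}
```

With this shape, a config that sets only baseTable is checked at config time instead of failing after the coordinator has already submitted tasks.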

This is an automated review by Codex GPT-5.5

.withDataSource(destination.getDataSource())
.withTimestamp(new TimestampSpec(ColumnHolder.TIME_COLUMN_NAME, "millis", null))
.withTransform(transformSpec)
.withBaseTable(baseTable)

Member

[P1] Carry projections into base-table segment generation

The baseTable branch builds the worker DataSchema without withProjections(destination.getProjections()), unlike the legacy branch. MSQCompactionRunner passes projections through DataSourceMSQDestination, but they are dropped before segment generation, so baseTable compaction configured with projections writes segments without them and stores lastCompactionState.projections as null; CompactionStatus then keeps scheduling the same compaction. Add the projections to this builder path and cover baseTable+projections.

Contributor

oh this is a good catch.
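To illustrate the projections finding in this thread, here is a simplified sketch in which projections are set before the baseTable/legacy branch, so neither path can drop them. `SketchDataSchema` and its builder are hypothetical stand-ins, not Druid's `DataSchema`; projections are reduced to a list of names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical, stripped-down schema used only to show where projections are attached. */
final class SketchDataSchema {
    final String dataSource;
    final String baseTable;         // null on the legacy path
    final List<String> projections; // projection names only, for brevity

    private SketchDataSchema(String dataSource, String baseTable, List<String> projections) {
        this.dataSource = dataSource;
        this.baseTable = baseTable;
        this.projections = projections;
    }

    static final class Builder {
        private String dataSource;
        private String baseTable;
        private List<String> projections;

        Builder withDataSource(String value) { this.dataSource = value; return this; }
        Builder withBaseTable(String value) { this.baseTable = value; return this; }
        Builder withProjections(List<String> value) {
            this.projections = value == null ? null : new ArrayList<>(value);
            return this;
        }
        SketchDataSchema build() { return new SketchDataSchema(dataSource, baseTable, projections); }
    }

    static SketchDataSchema forDestination(String dataSource, String baseTable,
                                           List<String> projections) {
        // Shared by both paths: set projections before branching on baseTable.
        Builder builder = new Builder().withDataSource(dataSource).withProjections(projections);
        if (baseTable != null) {
            builder.withBaseTable(baseTable);
        }
        return builder.build();
    }

    /** A mismatch between configured and written projections would trigger another compaction. */
    static boolean needsRecompaction(List<String> configured, SketchDataSchema written) {
        return !Objects.equals(configured, written.projections);
    }
}
```

The `needsRecompaction` helper only mirrors the status comparison in spirit; it shows why a schema written with null projections would keep looking out of date.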

{
if (queryGranularity == null
|| Granularities.NONE.equals(queryGranularity)
|| Granularities.ALL.equals(queryGranularity)

Member

[P1] Do not erase ALL query granularity

Treating Granularities.ALL as a no-op makes baseTable compaction report an effective query granularity of NONE. That bypasses MSQCompactionRunner's existing ALL handling, which assigns rows to the interval start timestamp, so queryGranularity=ALL produces incorrectly timestamped clustered segments; the later status check also compares configured ALL to stored NONE and can recompact forever. Preserve or reject ALL explicitly instead of silently normalizing it away.

Contributor

I think rejecting ALL, with a demand to use some other granularity, makes more sense than going through the effort of trying to support it in the granularity virtual column (if that is even possible).
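A rough sketch of the "reject ALL explicitly" option discussed in this thread. The `Gran` enum and `QueryGranularityGuard` class are hypothetical stand-ins for Druid's granularity types, and the error wording is only an example.

```java
final class QueryGranularityGuard {
    /** Hypothetical stand-in for the granularity values relevant to this discussion. */
    enum Gran { NONE, ALL, HOUR, DAY }

    /**
     * Resolves the query granularity to use for baseTable compaction.
     * null and NONE both mean "no truncation"; ALL is rejected instead of
     * being silently normalized to NONE.
     */
    static Gran effectiveGranularity(Gran configured) {
        if (configured == Gran.ALL) {
            throw new IllegalArgumentException(
                "queryGranularity ALL is not supported with baseTable compaction; "
                + "choose a finite granularity such as HOUR or DAY");
        }
        if (configured == null || configured == Gran.NONE) {
            return Gran.NONE;
        }
        return configured;
    }
}
```

Because ALL never reaches the stored compaction state, the configured and stored granularities cannot disagree in the way the finding describes.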

@capistrant capistrant self-requested a review June 18, 2026 14:04
@capistrant capistrant dismissed their stale review June 18, 2026 14:05

Frank's comments seem legit. Pulling approval until further discussion, to avoid an early merge by someone who comes along, sees an approved green PR, and merges it for some reason.
