feat: clustered segments + MSQ compaction#19597
Conversation
changes: * `CompactionTask` now can specify `baseTable` spec to create clustered segments * `DataSourceMSQDestination` can now specify a `baseTable` so MSQ can generate clustered segments (or any other future baseTable spec) * adds `baseTable` to 'inline' and reindexing template compaction configs to feed to compaction task for auto-compaction * adds `baseTable`, `segmentGranularitySpec` to `CompactionState`, `CompactionStatus` is `baseTable` aware for checks * guards to prevent `baseTable` from working with 'native' compaction and direct towards MSQ compaction
capistrant
left a comment
There was a problem hiding this comment.
looks good. I re-read some of the paths before hitting submit here and I think it is a silent ignore of the other top level configs that base table now owns. which seems kind of sad but not destructive at least? I think it is worth forcing an operator to acknowledge what they are doing by cleaning up their config (or rules)
| } | ||
| return CompactionConfigValidationResult.success(); | ||
| } else { | ||
| return compactionConfigSupportedByMSQEngine(newConfig); |
There was a problem hiding this comment.
what about adding a baseTable != null block to this MSQ validation method that blows up if the config is trying to set the top level data schema configs that are now handled by the base table. fuzzy to me on if that situation would be handled okay (base table wins or blow up later on), but blowing up eagerly feels safer and will prompt operator to consciously modify the config (or rules?) to make it play nice
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 1 |
| P3 | 0 |
| Total | 3 |
Reviewed 55 of 55 changed files.
Findings that could not be attached inline:
- server/src/main/java/org/apache/druid/client/indexing/ClientCompactionRunnerInfo.java:127 - [P2] Validate range partitions against baseTable columns. The affected call is existing code made incomplete by the new baseTable config field. compactionConfigSupportedByMSQEngine still passes dimensionSchemas from newConfig.getDimensionsSpec(), so baseTable configs with no legacy dimensionsSpec skip range-partition column validation at config time. The same config is later validated against dataSchema.getDimensionsSpec() inside MSQCompactionRunner and can fail only after the coordinator submits tasks. Use newConfig.getBaseTable().getDimensionsSpec()/virtual columns when baseTable is present.
This is an automated review by Codex GPT-5.5
| .withDataSource(destination.getDataSource()) | ||
| .withTimestamp(new TimestampSpec(ColumnHolder.TIME_COLUMN_NAME, "millis", null)) | ||
| .withTransform(transformSpec) | ||
| .withBaseTable(baseTable) |
There was a problem hiding this comment.
[P1] Carry projections into base-table segment generation
The baseTable branch builds the worker DataSchema without withProjections(destination.getProjections()), unlike the legacy branch. MSQCompactionRunner passes projections through DataSourceMSQDestination, but they are dropped before segment generation, so baseTable compaction configured with projections writes segments without them and stores lastCompactionState.projections as null; CompactionStatus then keeps scheduling the same compaction. Add the projections to this builder path and cover baseTable+projections.
There was a problem hiding this comment.
oh this is a good catch.
| { | ||
| if (queryGranularity == null | ||
| || Granularities.NONE.equals(queryGranularity) | ||
| || Granularities.ALL.equals(queryGranularity) |
There was a problem hiding this comment.
[P1] Do not erase ALL query granularity
Treating Granularities.ALL as a no-op makes baseTable compaction report an effective query granularity of NONE. That bypasses MSQCompactionRunner's existing ALL handling, which assigns rows to the interval start timestamp, so queryGranularity=ALL produces incorrectly timestamped clustered segments; the later status check also compares configured ALL to stored NONE and can recompact forever. Preserve or reject ALL explicitly instead of silently normalizing it away.
There was a problem hiding this comment.
I think rejecting ALL with demand to use some other granularity makes more sense than going through effort to try and support it in granularity virtual column (if that is even possible).
Frank's comments seem legit. pulling approval until further discussion to avoid early merge by someone who comes along and sees approved green pr and merges for some reason
Description
Follow-up to #19579, this PR adds MSQ compaction support for clustered segments when using 'inline' or reindexing template based compaction.
changes:
CompactionTasknow can specifybaseTablespec to create clustered segmentsDataSourceMSQDestinationcan now specify abaseTableso MSQ can generate clustered segments (or any other future baseTable spec)baseTableto 'inline' and reindexing template compaction configs to feed to compaction task for auto-compactionbaseTable,segmentGranularitySpectoCompactionState,CompactionStatusisbaseTableaware for checksbaseTablefrom working with 'native' compaction and direct towards MSQ compaction