[SPARK-55951][SQL] Add ChangelogTable schema validation and INVALID_CHANGELOG_SCHEMA error class by SanJSp · Pull Request #55507 · apache/spark

SanJSp · 2026-04-23T11:17:13Z

What changes were proposed in this pull request?

This is PR 1 of a split of #55426 (see the split suggestion for the full plan). The two PRs can merge in any order, but merging PR 1 (#55507) before PR 2 (#55508) is preferable. For more context, see the discussion posted to dev@spark.apache.org and the linked SPIP.

Validates the CDC metadata columns and row-identity presence returned by a Changelog connector at relation construction time, and introduces a dedicated error class to report the failure at analysis time rather than later at execution time with a less helpful error.

ChangelogTable.validateSchema: fail-fast checks that the connector schema contains the required metadata columns (_change_type as StringType, _commit_version of connector-defined type, _commit_timestamp as TimestampType), and that rowId() returns a non-empty array when a capability requires row identity. rowVersion() is invoked when a capability requires it and surfaces the default UnsupportedOperationException directly if the connector has not overridden it. References can be top-level or nested (e.g. Delta's _metadata.row_commit_version). Invoked from the ChangelogTable constructor.
New error class INVALID_CHANGELOG_SCHEMA with sub-classes MISSING_COLUMN, INVALID_COLUMN_TYPE, MISSING_ROW_ID.
Matching QueryCompilationErrors helpers for each sub-class.
rowVersion nullability is enforced at runtime by the carry-over filter in #55508 ([SPARK-55952][SPARK-55953][SQL] Add ResolveChangelogTable analyzer rule for batch CDC post-processing) via count(rowVersion) = 2 (see the #55426 NULL-safety thread for rationale). rowId nullability is not enforced; it is covered by the Changelog.rowId() Javadoc contract.
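To make the shape of these checks concrete, here is a minimal, Spark-free sketch. Every name in it (`DType`, `Field`, `SchemaError`, `ChangelogSchemaSketch`) is hypothetical and stands in for the real Spark types; it is not the code in this PR. How an unresolved rowId reference is reported is also an assumption of the sketch.

```scala
// Hypothetical model of the validation; none of these names are Spark APIs.
sealed trait DType
case object StringT extends DType
case object TimestampT extends DType
case object LongT extends DType
final case class StructT(fields: Seq[Field]) extends DType
final case class Field(name: String, dtype: DType)

sealed trait SchemaError
final case class MissingColumn(name: String) extends SchemaError
final case class InvalidColumnType(name: String, expected: DType, actual: DType)
  extends SchemaError
case object MissingRowId extends SchemaError

object ChangelogSchemaSketch {
  // Resolve a top-level or nested reference such as Seq("_metadata", "row_id").
  def resolve(fields: Seq[Field], path: Seq[String]): Option[Field] = path match {
    case Seq() => None
    case Seq(last) => fields.find(_.name == last)
    case head +: tail =>
      fields.find(_.name == head).flatMap {
        case Field(_, StructT(children)) => resolve(children, tail)
        case _ => None
      }
  }

  def validate(
      schema: Seq[Field],
      needsRowId: Boolean,
      rowId: Seq[Seq[String]]): Seq[SchemaError] = {
    // _commit_version has a connector-defined type, so only presence is checked.
    val required: Seq[(String, Option[DType])] = Seq(
      "_change_type" -> Some(StringT),
      "_commit_version" -> None,
      "_commit_timestamp" -> Some(TimestampT))
    val columnErrors: Seq[SchemaError] = required.flatMap { case (name, expected) =>
      schema.find(_.name == name) match {
        case None => Seq(MissingColumn(name))
        case Some(f) =>
          expected.filter(_ != f.dtype).map(e => InvalidColumnType(name, e, f.dtype)).toSeq
      }
    }
    // Row identity is only demanded when a capability requires it.
    val rowIdErrors: Seq[SchemaError] =
      if (!needsRowId) Seq.empty
      else if (rowId.isEmpty) Seq(MissingRowId)
      // Assumption: an unresolved reference is reported as a missing column.
      else rowId.filter(resolve(schema, _).isEmpty).map(p => MissingColumn(p.mkString(".")))
    columnErrors ++ rowIdErrors
  }
}
```

The sketch collects all errors into a list so they are easy to inspect; the PR describes fail-fast checks that throw from the `ChangelogTable` constructor.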

Why are the changes needed?

Gives connector implementors a clear analysis-time error message for misshapen CDC schemas instead of an opaque execution-time failure. For background, see the original PR (#55426) and its discussion thread.

Does this PR introduce any user-facing change?

Yes, for connector implementors. A connector that returns an invalid changelog schema (missing or wrong-typed metadata column, or advertising a capability requiring row identity without declaring rowId()) now fails at analysis time with INVALID_CHANGELOG_SCHEMA.*. A connector that advertises a capability requiring rowId() or rowVersion() without implementing the method surfaces the default UnsupportedOperationException at analysis time.
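As a rough illustration of the second case, the sketch below uses a hypothetical `ChangelogLike` trait (not the real `Changelog` interface) whose `rowVersion()` default throws. Which capability triggers the call is an assumption here; `containsCarryoverRows` is used only as an example.

```scala
// Hypothetical stand-in for the connector interface; not the real Spark API.
trait ChangelogLike {
  def containsCarryoverRows: Boolean = false
  // Default behavior: a connector that does not override this is unsupported.
  def rowVersion(): Seq[String] =
    throw new UnsupportedOperationException("rowVersion is not supported")
}

object DefaultMethodSketch {
  // Only call rowVersion() when a capability requires it, so connectors that
  // never need it are never asked for it. The exception is not caught or
  // wrapped, which is what "surfaces directly" means in this sketch.
  def checkRowVersion(cl: ChangelogLike): Option[Seq[String]] =
    if (cl.containsCarryoverRows) Some(cl.rowVersion()) else None
}
```

A connector with no such capability passes untouched, while one that advertises the capability without overriding the method fails as soon as validation runs.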

How was this patch tested?

Added schema-validation cases to ChangelogResolutionSuite covering:

Missing metadata column: _change_type, _commit_version, _commit_timestamp.
Wrong data type: _change_type non-String, _commit_timestamp non-Timestamp.
Connector-defined _commit_version type accepted (Integer, Long, String).
Valid schema with data columns passes.
Nested rowId and rowVersion references (Delta-style _metadata.row_id / _metadata.row_commit_version) pass.
MISSING_ROW_ID triggered by representsUpdateAsDeleteAndInsert = true.
MISSING_ROW_ID triggered by containsIntermediateChanges = true.
Default UnsupportedOperationException on rowId() surfaces when a capability requires it.
Default UnsupportedOperationException on rowVersion() surfaces when a capability requires it.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

…HANGELOG_SCHEMA error class

Validate the CDC metadata columns, row identity and row versioning returned by a `Changelog` connector at relation construction time, and introduce a dedicated error class to report the failure at analysis time rather than later at execution time with a less helpful error.

- `ChangelogTable.validateSchema`: fail-fast checks that the connector schema contains the required metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`), and that, when the connector advertises a capability requiring it, `rowId()` and `rowVersion()` are declared and the row version column is a non-nullable top-level column. Invoked from the `ChangelogTable` constructor.
- New error class `INVALID_CHANGELOG_SCHEMA` with sub-classes `MISSING_COLUMN`, `INVALID_COLUMN_TYPE`, `MISSING_ROW_ID`, `MISSING_ROW_VERSION`, `NESTED_ROW_VERSION`, `NULLABLE_ROW_VERSION`.
- `QueryCompilationErrors` helpers for each sub-class.
- Tests: `ChangelogResolutionSuite` schema-validation cases using a `TestChangelog` fixture that returns hand-crafted schemas.

johanl-db

I don't have concerns about this change; some minor improvements suggested.

gengliangwang

Summary

Clean, contained change: a connector-side Changelog that returns a misshapen CDC schema now gets a sharp INVALID_CHANGELOG_SCHEMA.* at analysis time, replacing the earlier opaque execution-time failure. The validator runs eagerly in ChangelogTable's constructor, which is the right boundary — everything downstream (resolution, planning, scans) then sees a schema it can trust.

A handful of things worth addressing before merging, in priority order:

rowId non-nullability is not validated, even though the Changelog.rowId() Javadoc says "Each referenced column must be non-nullable" and the existing peer SupportsDelta.rowId() path (resolveRowIdAttrs → NULLABLE_ROW_ID_ATTRIBUTES) has been doing this check for years. This PR is asymmetric: rowVersion gets nullability + top-level-ness, rowId gets presence only.
The new NESTED_ROW_VERSION constraint ("rowVersion must be a top-level column") is not documented on Changelog.rowVersion(). Right now a connector author can follow the Javadoc exactly ("non-nullable") and still trip this error. Either add the requirement to the Javadoc or drop the check.
PR description overstates test coverage. The "How was this patch tested?" section lists tests for "row-identity-required capabilities without rowId/rowVersion" and "nested rowVersion", but the suite only exercises metadata presence/types, nullable rowVersion, and valid schemas — there is no MISSING_ROW_ID, MISSING_ROW_VERSION, or NESTED_ROW_VERSION case, and capability triggers other than containsCarryoverRows=true are unexercised.

Remaining inline comments are smaller (scoping, error-message specificity, a comment typo).

For the error-text wording around MISSING_ROW_ID / MISSING_ROW_VERSION I'll defer to @johanl-db's existing comments rather than duplicate.

gengliangwang · 2026-04-23T17:44:08Z

+      cl.containsIntermediateChanges()
+    if (needsRowId && (rowIds == null || rowIds.isEmpty)) {
+      throw QueryCompilationErrors.changelogMissingRowIdError(cl.name)
+    }


rowId columns are not checked for non-nullability, even though (a) the Changelog.rowId() Javadoc requires "Each referenced column must be non-nullable", and (b) the peer row-level-operations path validates this via RewriteRowLevelCommand.resolveRowIdAttrs with NULLABLE_ROW_ID_ATTRIBUTES. Consider adding a parallel NULLABLE_ROW_ID sub-class (or at least stating explicitly that rowId column validation is deferred to a later PR). As written, rowVersion gets nullability + top-level-ness but rowId gets presence only.

Done, using your option 2 from the NULL-safety thread on #55426. Added count(rowVersion) to the carry-over Window as a third aggregate alongside min and max (no extra Window operator, no additional shuffle). The filter now requires _rv_cnt = 2 AND _min_rv = _max_rv. A NULL rowVersion on either side fails the count check and the pair falls through as raw delete+insert instead of being silently dropped. Nesting-agnostic. Implementation and regression test ("NULL rowVersion on one side is NOT silently dropped as carry-over") in #55508.
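The reasoning behind the count guard can be sketched without Spark. SQL `min` and `max` skip NULLs, so a pair with one NULL rowVersion would trivially satisfy `_min_rv = _max_rv`; requiring a count of 2 rules that out. The model below is hypothetical (`Change`, `CarryOverSketch` are illustration names) and simplifies the Window partitioning to a group-by on rowId only.

```scala
// Hypothetical row model; Option stands in for a nullable rowVersion column.
final case class Change(rowId: Long, changeType: String, rowVersion: Option[Long])

object CarryOverSketch {
  // A delete+insert pair counts as a carry-over only if both sides have a
  // non-NULL rowVersion (count = 2) and the versions are equal (min = max).
  def isCarryOver(group: Seq[Change]): Boolean = {
    val versions = group.flatMap(_.rowVersion) // like count(): NULLs are skipped
    versions.size == 2 && versions.min == versions.max
  }

  // Drop carry-over pairs and keep every other group as raw changes.
  def dropCarryOvers(changes: Seq[Change]): Seq[Change] =
    changes.groupBy(_.rowId).values.filterNot(isCarryOver).flatten.toSeq
}
```

With this rule, a pair whose rowVersion is NULL on one side is kept as a raw delete+insert rather than being dropped.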

On the rowId asymmetry: rowId nullability is not schema-checked. An analogous silent-drop path exists (multiple NULL-rowId rows collapse into one Window partition via SQL NULL-group semantics), but the trigger surface is narrower than for rowVersion and a count()=2-style runtime guard does not port cleanly.

A top-level-only schema check would cover id but miss, for example, Delta's nested _metadata.row_id. This asymmetric coverage feels worse than no coverage at all.
We can either:

do a full schema walk through metadata columns covering both top-level and nested (how deep do we go? All the way, I think),

or leave it unenforced and trust the Javadoc contract.

Currently, the latter is implemented. I'm open to the recursive column check but would need some input here. What do you think @gengliangwang?

For now I'd defer to a later PR. Is that fine?
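For reference, the first option (a full walk covering top-level and nested references) could look roughly like the sketch below. It is not part of this PR; `Col` and `NullabilityWalkSketch` are hypothetical names, and treating a nullable parent struct as making its children nullable is an assumption of the sketch.

```scala
// Hypothetical schema model; not Spark's StructType/StructField.
final case class Col(name: String, nullable: Boolean, children: Seq[Col] = Nil)

object NullabilityWalkSketch {
  // Some(true) if the referenced leaf can be NULL, Some(false) if it cannot,
  // None if the reference does not resolve against the schema.
  def canBeNull(cols: Seq[Col], path: List[String]): Option[Boolean] = path match {
    case Nil => None
    case name :: rest =>
      cols.find(_.name == name).flatMap { col =>
        if (rest.isEmpty) Some(col.nullable)
        // A nullable parent struct makes every nested field effectively nullable.
        else canBeNull(col.children, rest).map(_ || col.nullable)
      }
  }
}
```

The walk recurses to whatever depth the reference has, so a top-level `id` and a nested `_metadata.row_id` are handled by the same code path.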

gengliangwang · 2026-04-23T17:44:08Z

+    // delete+insert pair would be misclassified as a real update).
+    rowVersionRef.foreach { ref =>
+      val fieldNames = ref.fieldNames()
+      if (fieldNames.length != 1) {


The top-level requirement is new — the Changelog.rowVersion() Javadoc only says "non-nullable". A connector that reads the contract and returns a nested NamedReference will fail with NESTED_ROW_VERSION but get no hint from the API docs. Please either (a) update the Changelog.rowVersion() Javadoc to state that the reference must be a top-level column of columns(), or (b) remove this check and accept nested references. Same applies (by extension) to rowId if top-level-ness is also intended there.

Done (option b). Removed the top-level-only restriction. NESTED_ROW_VERSION error class, helper, and subclass are gone.

There is no new commit since the review yesterday.

SanJSp changed the title ~~[SPARK-55668][SQL] Add ChangelogTable schema validation and INVALID_CHANGELOG_SCHEMA error class~~ [SPARK-55951][SQL] Add ChangelogTable schema validation and INVALID_CHANGELOG_SCHEMA error class Apr 23, 2026

SanJSp force-pushed the SPARK-55668-PR1-changelog-schema-validation branch from 5f5279e to 753bea1 Compare April 23, 2026 12:05

johanl-db approved these changes Apr 23, 2026


Comment thread common/utils/src/main/resources/error/error-conditions.json Outdated

Comment thread common/utils/src/main/resources/error/error-conditions.json Outdated

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ChangelogTable.scala Outdated

gengliangwang reviewed Apr 23, 2026


SanJSp added 3 commits April 24, 2026 08:21

Re-enabled nested columns support

3a61e7d

PR feedback

cf69658

Allign PR feedback from other PR

909a4bf

SanJSp requested review from gengliangwang and johanl-db April 24, 2026 13:13

johanl-db approved these changes Apr 24, 2026


gengliangwang mentioned this pull request Apr 24, 2026

[SPARK-55952][SPARK-55953][SQL] Add ResolveChangelogTable analyzer rule for batch CDC post-processing #55508

Open

Added Changetype constants from apache#55508 PR feedback

cee3a81

SanJSp force-pushed the SPARK-55668-PR1-changelog-schema-validation branch from d737051 to cee3a81 Compare April 27, 2026 09:20

gengliangwang approved these changes Apr 27, 2026


gengliangwang mentioned this pull request Apr 27, 2026

[DO NOT MERGE][SPARK-55951][SQL] Add ChangelogTable schema validation and INVALID_CHANGELOG_SCHEMA error class #55567

Open

SanJSp wants to merge 5 commits into apache:master from SanJSp:SPARK-55668-PR1-changelog-schema-validation
