[DO NOT MERGE][SPARK-55951][SQL] Add ChangelogTable schema validation and INVALID_CHANGELOG_SCHEMA error class #55567
Open
gengliangwang wants to merge 6 commits into apache:master from
…HANGELOG_SCHEMA error class

Validate the CDC metadata columns, row identity and row versioning returned by a `Changelog` connector at relation construction time, and introduce a dedicated error class to report the failure at analysis time rather than later at execution time with a less helpful error.

- `ChangelogTable.validateSchema`: fail-fast checks that the connector schema contains the required metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`), and that, when the connector advertises a capability requiring it, `rowId()` and `rowVersion()` are declared and the row version column is a non-nullable top-level column. Invoked from the `ChangelogTable` constructor.
- New error class `INVALID_CHANGELOG_SCHEMA` with sub-classes `MISSING_COLUMN`, `INVALID_COLUMN_TYPE`, `MISSING_ROW_ID`, `MISSING_ROW_VERSION`, `NESTED_ROW_VERSION`, `NULLABLE_ROW_VERSION`.
- `QueryCompilationErrors` helpers for each sub-class.
- Tests: `ChangelogResolutionSuite` schema-validation cases using a `TestChangelog` fixture that returns hand-crafted schemas.
This PR is to help run CI for #55507.
I've resolved the merge conflicts on the main branch, feel free to test again 👍
What changes were proposed in this pull request?
This is PR 1 of a split of #55426 (see the split suggestion for the full plan). The PRs can merge in any order, but merging 1 (#55507) before 2 (#55508) would be preferable. For more context, see the discussion posted to dev@spark.apache.org and the linked SPIP.
Validates the CDC metadata columns and row-identity presence returned by a `Changelog` connector at relation construction time, and introduces a dedicated error class to report the failure at analysis time rather than later at execution time with a less helpful error.
- `ChangelogTable.validateSchema`: fail-fast checks that the connector schema contains the required metadata columns (`_change_type` as StringType, `_commit_version` of connector-defined type, `_commit_timestamp` as TimestampType), and that `rowId()` returns a non-empty array when a capability requires row identity. `rowVersion()` is invoked when a capability requires it and surfaces the default `UnsupportedOperationException` directly if the connector has not overridden it. References can be top-level or nested (e.g. Delta's `_metadata.row_commit_version`). Invoked from the `ChangelogTable` constructor.
- `INVALID_CHANGELOG_SCHEMA` with sub-classes `MISSING_COLUMN`, `INVALID_COLUMN_TYPE`, `MISSING_ROW_ID`.
- `QueryCompilationErrors` helpers for each sub-class.
- `count(rowVersion) = 2` (see the #55426 NULL-safety thread for rationale). rowId nullability is not enforced. It is covered by the `Changelog.rowId()` Javadoc contract.
Why are the changes needed?
Gives connector implementors a clear analysis-time error message for malformed CDC schemas instead of an opaque execution-time failure. For background, see the original PR and its discussion thread.
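As a rough illustration of the fail-fast checks described above, here is a minimal, self-contained sketch. It is not the code in this PR: the class `ChangelogSchemaSketch`, the exception `InvalidChangelogSchema`, the `requiresRowId` flag, and the use of a name-to-type-name map in place of a Spark schema are all assumptions made for illustration. Only the column names and the error sub-class names come from the description.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types are Spark APIs.
final class ChangelogSchemaSketch {

    /** Stand-in for the analysis-time error; condition mirrors INVALID_CHANGELOG_SCHEMA.<SUB_CLASS>. */
    static final class InvalidChangelogSchema extends RuntimeException {
        public final String condition;

        InvalidChangelogSchema(String subClass, String detail) {
            super("INVALID_CHANGELOG_SCHEMA." + subClass + ": " + detail);
            this.condition = "INVALID_CHANGELOG_SCHEMA." + subClass;
        }
    }

    /**
     * Validates a schema given as column name -> type name (assumed representation).
     * rowId is the list of row-identity references the connector declares;
     * requiresRowId stands in for "some advertised capability needs row identity".
     */
    public static void validate(Map<String, String> schema, List<String> rowId, boolean requiresRowId) {
        requireType(schema, "_change_type", "string");
        requireColumn(schema, "_commit_version"); // any connector-defined type is accepted
        requireType(schema, "_commit_timestamp", "timestamp");
        if (requiresRowId && rowId.isEmpty()) {
            throw new InvalidChangelogSchema("MISSING_ROW_ID", "a capability requires row identity");
        }
    }

    /** Returns the column's type name, or fails with MISSING_COLUMN if the column is absent. */
    private static String requireColumn(Map<String, String> schema, String name) {
        String type = schema.get(name);
        if (type == null) {
            throw new InvalidChangelogSchema("MISSING_COLUMN", name);
        }
        return type;
    }

    /** Fails with INVALID_COLUMN_TYPE if the column exists but has an unexpected type. */
    private static void requireType(Map<String, String> schema, String name, String expected) {
        String actual = requireColumn(schema, name);
        if (!actual.equals(expected)) {
            throw new InvalidChangelogSchema(
                "INVALID_COLUMN_TYPE", name + " expected " + expected + " but was " + actual);
        }
    }
}
```

The sketch runs every check up front and throws on the first violation, which is what lets a malformed schema be reported before any execution happens. It deliberately ignores nested references and the `rowVersion()` path, which the description handles separately.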
Does this PR introduce any user-facing change?
Yes, for connector implementors. A connector that returns an invalid changelog schema (missing or wrong-typed metadata column, or advertising a capability requiring row identity without declaring `rowId()`) now fails at analysis time with `INVALID_CHANGELOG_SCHEMA.*`. A connector that advertises a capability requiring `rowId()` or `rowVersion()` without implementing the method surfaces the default `UnsupportedOperationException` at analysis time.
How was this patch tested?
Added schema-validation cases to `ChangelogResolutionSuite` covering:
- Missing metadata columns: `_change_type`, `_commit_version`, `_commit_timestamp`.
- Wrong-typed metadata columns: `_change_type` non-String, `_commit_timestamp` non-Timestamp.
- Any `_commit_version` type accepted (Integer, Long, String).
- Nested references (`_metadata.row_id` / `_metadata.row_commit_version`) pass.
- `MISSING_ROW_ID` triggered by `representsUpdateAsDeleteAndInsert = true`.
- `MISSING_ROW_ID` triggered by `containsIntermediateChanges = true`.
- `UnsupportedOperationException` on `rowId()` surfaces when a capability requires it.
- `UnsupportedOperationException` on `rowVersion()` surfaces when a capability requires it.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7