
Conversation

@hongyunyan
Collaborator

@hongyunyan hongyunyan commented Dec 17, 2025

What problem does this PR solve?

Issue Number: close #3663

What is changed and how it works?

This pull request addresses a critical data consistency concern in TiCDC related to how table dispatchers are initialized when they are moved or recreated, particularly during in-flight DDL and syncpoint barrier operations. The changes ensure that dispatchers correctly determine their starting timestamp and whether to skip DML events, preventing data inconsistencies such as duplicate writes or missed DDLs. This is achieved by introducing a skipDMLAsStartTs flag, updating the dispatcher creation logic, and adding robust unit and integration tests to validate the behavior under complex distributed system conditions.

Highlights

  • Corrected Dispatcher Start Behavior during DDL Barriers: When a dispatcher is moved or recreated during an in-flight DDL barrier, it will now correctly start from blockTs-1 and set skipDMLAsStartTs to true. This ensures the DDL is replayed without duplicating DML events that might have already been written.
  • Improved Dispatcher Start Behavior during Syncpoint Barriers: For in-flight syncpoint barriers, a moved or recreated dispatcher will now start from blockTs (the syncpoint's commit timestamp) and set skipDMLAsStartTs to false. This prevents re-scanning and re-applying events that could corrupt snapshot consistency. A sketch of this start-ts decision for both barrier types follows this list.
  • Propagation of skipDMLAsStartTs: The skipDMLAsStartTs flag is now properly propagated through the DispatcherConfig protobuf message, dispatcherCreateInfo struct, and various dispatcher creation functions (NewEventDispatcher, NewRedoDispatcher) to ensure consistent behavior across the system.
  • New Integration Tests for Barrier Scenarios: Two new integration tests (ddl_move_table and syncpoint_move_table) have been added to thoroughly validate the correct behavior of dispatchers when moved during DDL and syncpoint barriers, respectively, ensuring end-to-end data consistency.
  • Enhanced Unit Test Coverage: Expanded unit tests for SpanReplication.NewAddDispatcherMessage and DispatcherManager to cover various scenarios involving blockState, syncpoint, and DDL barriers, ensuring the new logic functions as expected.
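
To make the first two highlights concrete, here is a minimal sketch of the maintainer-side decision when building the add-dispatcher request for a moved table. It is illustrative only: the function name is made up, the real logic lives in SpanReplication.NewAddDispatcherMessage and also considers the block stage, and the heartbeatpb.State fields are taken from the tests in this PR.

// Sketch only. resolveStartTsForMovedDispatcher is a hypothetical helper, not the real API.
func resolveStartTsForMovedDispatcher(checkpointTs uint64, block *heartbeatpb.State) (startTs uint64, skipDMLAsStartTs bool) {
	// No in-flight barrier: keep the normal checkpoint-based start ts.
	if block == nil || !block.IsBlocked {
		return checkpointTs, false
	}
	if block.IsSyncPoint {
		// In-flight syncpoint: start exactly at the syncpoint commit ts so the
		// snapshot already written downstream is not re-scanned.
		return block.BlockTs, false
	}
	// In-flight DDL: start one tick earlier so the DDL is replayed, and set
	// skipDMLAsStartTs so already-written DML is not applied twice.
	return block.BlockTs - 1, true
}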

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 17, 2025
@gemini-code-assist

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential data consistency issue in TiCDC by refining how table dispatchers are initialized when they are moved or recreated, particularly in the presence of in-flight syncpoints. The core change ensures that a new dispatcher starts processing events from the correct timestamp (the syncpoint's BlockTs) to avoid re-processing historical data and preserve snapshot consistency. This enhancement is backed by comprehensive unit tests and a new integration test that validates the fix under realistic distributed system conditions.

Highlights

  • Syncpoint Consistency Fix: Implemented a critical fix to ensure data consistency when a table dispatcher is moved or recreated during an in-flight syncpoint. The dispatcher will now correctly resume processing from the syncpoint's BlockTs (commit timestamp) to prevent re-scanning and re-applying events, thereby maintaining snapshot semantics.
  • Enhanced Unit Testing: Added new unit tests for NewAddDispatcherMessage to thoroughly validate the logic for calculating the dispatcher's starting timestamp under various syncpoint states, including in-flight, completed, and non-syncpoint block scenarios.
  • New Integration Test: Introduced a dedicated integration test (syncpoint_move_table) to simulate the scenario of moving a table dispatcher during an active syncpoint. This test verifies that the system correctly handles the dispatcher's StartTs and ensures end-to-end data consistency using sync_diff.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a potential race condition when a table is moved while a syncpoint is in-flight. The fix involves adjusting the startTs for the new dispatcher to prevent re-scanning of events, which could corrupt snapshot consistency. The change is well-commented and supported by new unit tests and a comprehensive integration test. My review includes a suggestion to enhance the unit tests for better coverage.

Comment on lines 59 to 73
func TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint(t *testing.T) {
	t.Parallel()

	replicaSet := NewSpanReplication(common.NewChangeFeedIDWithName("test", common.DefaultKeyspaceNamme), common.NewDispatcherID(), 1, getTableSpanByID(4), 9, common.DefaultMode, false)
	replicaSet.UpdateBlockState(heartbeatpb.State{
		IsBlocked:   true,
		BlockTs:     10,
		IsSyncPoint: true,
		Stage:       heartbeatpb.BlockStage_WAITING,
	})

	msg := replicaSet.NewAddDispatcherMessage("node1")
	req := msg.Message[0].(*heartbeatpb.ScheduleDispatcherRequest)
	require.Equal(t, uint64(10), req.Config.StartTs)
}


medium

To improve test coverage, it's good practice to use a table-driven test here to cover both BlockStage_WAITING and BlockStage_WRITING stages for an in-flight sync point. This ensures the logic is robust for all relevant states.

func TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint(t *testing.T) {
	t.Parallel()

	testCases := []struct {
		stage heartbeatpb.BlockStage
		name  string
	}{
		{stage: heartbeatpb.BlockStage_WAITING, name: "waiting"},
		{stage: heartbeatpb.BlockStage_WRITING, name: "writing"},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel()
			replicaSet := NewSpanReplication(common.NewChangeFeedIDWithName("test", common.DefaultKeyspaceNamme), common.NewDispatcherID(), 1, getTableSpanByID(4), 9, common.DefaultMode, false)
			replicaSet.UpdateBlockState(heartbeatpb.State{
				IsBlocked:   true,
				BlockTs:     10,
				IsSyncPoint: true,
				Stage:       tc.stage,
			})

			msg := replicaSet.NewAddDispatcherMessage("node1")
			req := msg.Message[0].(*heartbeatpb.ScheduleDispatcherRequest)
			require.Equal(t, uint64(10), req.Config.StartTs)
		})
	}
}

@ti-chi-bot ti-chi-bot bot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) and removed the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Dec 18, 2025
@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/gemini summary

@hongyunyan
Collaborator Author

/gemini review

@gemini-code-assist

Summary of Changes

This pull request addresses a critical data consistency concern in TiCDC related to how table dispatchers are initialized when they are moved or recreated, particularly during in-flight DDL and syncpoint barrier operations. The changes ensure that dispatchers correctly determine their starting timestamp and whether to skip DML events, preventing data inconsistencies such as duplicate writes or missed DDLs. This is achieved by introducing a skipDMLAsStartTs flag, updating the dispatcher creation logic, and adding robust unit and integration tests to validate the behavior under complex distributed system conditions.

Highlights

  • Corrected Dispatcher Start Behavior during DDL Barriers: When a dispatcher is moved or recreated during an in-flight DDL barrier, it will now correctly start from blockTs-1 and set skipDMLAsStartTs to true. This ensures the DDL is replayed without duplicating DML events that might have already been written.
  • Improved Dispatcher Start Behavior during Syncpoint Barriers: For in-flight syncpoint barriers, a moved or recreated dispatcher will now start from blockTs (the syncpoint's commit timestamp) and set skipDMLAsStartTs to false. This prevents re-scanning and re-applying events that could corrupt snapshot consistency.
  • Propagation of skipDMLAsStartTs: The skipDMLAsStartTs flag is now properly propagated through the DispatcherConfig protobuf message, dispatcherCreateInfo struct, and various dispatcher creation functions (NewEventDispatcher, NewRedoDispatcher) to ensure consistent behavior across the system.
  • New Integration Tests for Barrier Scenarios: Two new integration tests (ddl_move_table and syncpoint_move_table) have been added to thoroughly validate the correct behavior of dispatchers when moved during DDL and syncpoint barriers, respectively, ensuring end-to-end data consistency.
  • Enhanced Unit Test Coverage: Expanded unit tests for SpanReplication.NewAddDispatcherMessage and DispatcherManager to cover various scenarios involving blockState, syncpoint, and DDL barriers, ensuring the new logic functions as expected.


Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Updated comments to clarify the scenarios where skipDMLAsStartTs is set to true, adding a case for maintainer-initiated move operations during DDL events.
    • Modified a log message for clarity when skipping DML events.
  • downstreamadapter/dispatcher/redo_dispatcher.go
    • Added skipDMLAsStartTs as a parameter to NewRedoDispatcher and passed it to the underlying basic dispatcher creation.
  • downstreamadapter/dispatcher/redo_dispatcher_test.go
    • Updated newRedoDispatcherForTest to include the new skipDMLAsStartTs parameter, defaulting it to false.
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
    • Modified newEventDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Implemented logic to combine or use the skipDMLAsStartTs flag based on whether the startTs has changed, ensuring correct propagation.
    • Added skipDMLAsStartTs to the log output when creating new dispatchers.
  • downstreamadapter/dispatchermanager/dispatcher_manager_helper.go
    • Modified prepareCreateDispatcher to return an additional slice of booleans, skipDMLAsStartTsList, to carry the SkipDMLAsStartTs information.
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
    • Added a new field SkipDMLAsStartTs to the dispatcherCreateInfo struct, along with a detailed comment explaining its purpose.
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
    • Modified newRedoDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Updated the call to getTableRecoveryInfoFromMysqlSink to also return skipDMLAsStartTsList.
    • Implemented logic to combine or use the skipDMLAsStartTs flag for redo dispatchers.
    • Added skipDMLAsStartTs to the log output when creating new redo dispatchers.
    • Updated mergeRedoDispatcher to pass false for the new skipDMLAsStartTs parameter when creating a new redo dispatcher.
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
    • Added TestNewEventDispatchers_PropagateSkipDMLAsStartTs to verify that the skipDMLAsStartTs flag is correctly propagated to event dispatchers.
    • Added TestNewRedoDispatchers_PropagateSkipDMLAsStartTs to verify that the skipDMLAsStartTs flag is correctly propagated to redo dispatchers.
  • heartbeatpb/heartbeat.pb.go
    • Added SkipDMLAsStartTs field to the DispatcherConfig protobuf message.
    • Updated MarshalToSizedBuffer, Size, and Unmarshal methods to handle the new SkipDMLAsStartTs field.
  • heartbeatpb/heartbeat.proto
    • Added bool skipDMLAsStartTs = 6; to the DispatcherConfig message, including comments explaining its use for DDL barriers.
  • maintainer/replica/replication_span.go
    • Modified NewAddDispatcherMessage to dynamically calculate StartTs and SkipDMLAsStartTs based on the replication span's blockState.
    • Implemented logic to set StartTs to blockTs-1 and SkipDMLAsStartTs to true for in-flight DDL barriers.
    • Implemented logic to set StartTs to blockTs and SkipDMLAsStartTs to false for in-flight syncpoint barriers.
  • maintainer/replica/replication_span_test.go
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint to test dispatcher start behavior during an in-flight syncpoint.
    • Added TestSpanReplication_NewAddDispatcherMessage_DontUseBlockTsAfterSyncPointDone to test dispatcher start behavior after a syncpoint is completed.
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsMinusOneForDDLInFlight to test dispatcher start behavior during an in-flight DDL barrier, verifying skipDMLAsStartTs.
  • tests/integration_tests/ddl_move_table/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight multi-table DDL barrier correctly handles the DDL and skipDMLAsStartTs.
  • tests/integration_tests/run_heavy_it_in_ci.sh
    • Updated the mysql_groups array to include the new ddl_move_table and syncpoint_move_table integration tests.
  • tests/integration_tests/syncpoint_move_table/conf/changefeed.toml
    • Added a new configuration file for the syncpoint_move_table integration test, enabling sync points.
  • tests/integration_tests/syncpoint_move_table/conf/diff_config_part1.toml
    • Added a new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/syncpoint_move_table/conf/diff_config_part2.toml
    • Added another new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/syncpoint_move_table/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight syncpoint does not restart it from syncpoint_ts - 1, preserving snapshot consistency.
Activity
  • ti-chi-bot[bot] noted that the PR is NOT APPROVED and listed required approvers.
  • gemini-code-assist[bot] provided an initial summary of changes and a review comment suggesting a table-driven test for TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint.
  • hongyunyan ran /test all.
  • ti-chi-bot[bot] reported that the pull-cdc-mysql-integration-heavy test failed.
  • hongyunyan ran /test all again.
  • hongyunyan requested a summary using /gemini summary.
  • hongyunyan requested a review using /gemini review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a potential DDL loss issue when a table is moved while a DDL operation is in-flight. The fix involves introducing a skipDMLAsStartTs flag and adjusting the startTs for newly created dispatchers under specific conditions (in-flight DDL or syncpoint). The changes are propagated from the maintainer to the dispatcher manager and finally to the dispatcher. The logic appears sound and is well-supported by new unit and integration tests that cover both DDL and syncpoint scenarios.

My review includes a couple of suggestions to improve maintainability: one to fix a typo in a key comment, and another to refactor duplicated logic into a helper function. Overall, this is a solid contribution that improves the robustness of table migration.

// This flag is set to true ONLY when is_syncpoint=false AND finished=0 in ddl-ts table (non-syncpoint DDL not finished).
// In this case, we return startTs = ddlTs-1 to replay the DDL, and skip the already-written DML at ddlTs
// to avoid duplicate writes while ensuring the DDL is replayed.
// This flag is set to true in two secnaios:


medium

There's a typo in this comment. secnaios should be scenarios. Correcting this will improve the clarity of this important comment.

Suggested change
// This flag is set to true in two secnaios:
// This flag is set to true in two scenarios:

Comment on lines 422 to 429
// if the newStartTs equals to the original startTs, we need to combine the skipDMLAsStartTs flag
// otherwise, we just use the skipDMLAsStartTs flag from mysql sink
var skipDMLAsStartTs bool
if newStartTsList[idx] == startTsList[idx] {
skipDMLAsStartTs = scheduleSkipDMLAsStartTsList[idx] || skipDMLAsStartTsList[idx]
} else {
skipDMLAsStartTs = skipDMLAsStartTsList[idx]
}


medium

This logic for determining the skipDMLAsStartTs flag is duplicated in dispatcher_manager_redo.go (lines 131-138). To improve maintainability and reduce code duplication, consider extracting this logic into a shared helper function, potentially in dispatcher_manager_helper.go.

For example:

func determineSkipDMLFlag(newStartTs, originalStartTs int64, scheduleSkip, sinkSkip bool) bool {
    if newStartTs == originalStartTs {
        return scheduleSkip || sinkSkip
    }
    return sinkSkip
}

You could then call this helper function from both newEventDispatchers and newRedoDispatchers.
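
If that helper is adopted, the call site in newEventDispatchers could collapse to something like the following sketch (slice names are taken from the quoted hunk; the helper itself is only the reviewer's proposal, not existing code):

// Sketch of the call site after extracting the suggested helper.
skipDMLAsStartTs := determineSkipDMLFlag(
	newStartTsList[idx],
	startTsList[idx],
	scheduleSkipDMLAsStartTsList[idx],
	skipDMLAsStartTsList[idx],
)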

@hongyunyan
Collaborator Author

/test all

@hongyunyan hongyunyan changed the title from "wip" to "ddl: Fix corner case when move dispatcher during an in-flight DDL or Syncpoint" on Dec 19, 2025
@hongyunyan hongyunyan changed the title from "ddl: Fix corner case when move dispatcher during an in-flight DDL or Syncpoint" to "ddl: Fix potential data inconsistency when move dispatcher during an in-flight DDL or Syncpoint" on Dec 19, 2025
@hongyunyan
Collaborator Author

/test all

SchemaID: r.schemaID,
Span: r.Span,
StartTs: startTs,
SkipDMLAsStartTs: skipDMLAsStartTs,
@wk989898
Collaborator

Does the merge operator need to carry SkipDMLAsStartTs?

}

trap 'stop_tidb_cluster; collect_logs $WORK_DIR' EXIT
run $*
@wk989898
Collaborator

Please add more tests for the redo mode

@ti-chi-bot

ti-chi-bot bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongyunyan
Collaborator Author

/gemini review

@hongyunyan
Collaborator Author

/gemini summary


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical data consistency issue that can occur when a dispatcher is moved during in-flight DDL or syncpoint operations. The changes are comprehensive, touching the dispatcher, dispatcher manager, and maintainer components to correctly calculate the dispatcher's start timestamp and a new skipDMLAsStartTs flag. The logic for handling different barrier scenarios (DDL vs. syncpoint) and for merge operations appears sound. The addition of extensive integration tests for these specific scenarios is excellent and provides high confidence in the fix. My review identifies a couple of opportunities to improve maintainability by refactoring duplicated code and a long, complex function. Overall, this is a high-quality contribution that solves a complex problem.

Comment on lines 153 to 160
// if the newStartTs equals to the original startTs, we need to combine the skipDMLAsStartTs flag
// otherwise, we just use the skipDMLAsStartTs flag from mysql sink
var skipDMLAsStartTs bool
if newStartTsList[idx] == startTsList[idx] {
skipDMLAsStartTs = scheduleSkipDMLAsStartTsList[idx] || skipDMLAsStartTsList[idx]
} else {
skipDMLAsStartTs = skipDMLAsStartTsList[idx]
}


medium

This block of logic for determining the skipDMLAsStartTs flag is duplicated from newEventDispatchers in dispatcher_manager.go. To improve maintainability and avoid potential inconsistencies in the future, consider extracting this logic into a shared helper function. This would ensure that any future changes to this logic are applied consistently for both event dispatchers and redo dispatchers.

Comment on lines 311 to 422
// resolveMergedDispatcherStartTs returns the effective startTs and skip flags for the merged dispatcher.
//
// Inputs:
// - minCheckpointTs: min checkpointTs among all source dispatchers, collected after they are closed.
// - pendingStates: per-source block state from GetBlockEventStatus() captured at close time.
//
// Algorithm:
// 1. Build a merge candidate from minCheckpointTs.
// If all source dispatchers have a non-nil pending block state and they refer to the same (commitTs, isSyncPoint),
// adjust the merge candidate so the merged dispatcher can replay that block event safely:
// - DDL: startTs = commitTs - 1, skipDMLAsStartTs = true.
// - SyncPoint: startTs = commitTs.
// The merge candidate always uses skipSyncpointAtStartTs = false.
// 2. If the sink is MySQL, query downstream ddl_ts recovery info using the merge candidate startTs and merge the results:
// - If recoveryStartTs > mergeStartTsCandidate: use recoveryStartTs and its skip flags.
// - If recoveryStartTs == mergeStartTsCandidate: OR the skip flags.
// - If recoveryStartTs < mergeStartTsCandidate: keep the merge candidate.
// If the query fails, the error is reported via mergedDispatcher.HandleError and the merge candidate is returned.
//
// For non-MySQL and redo, the merge candidate is the final result.
func resolveMergedDispatcherStartTs(t *MergeCheckTask, minCheckpointTs uint64, pendingStates []*heartbeatpb.State) (uint64, bool, bool) {
	mergeStartTsCandidate := minCheckpointTs
	mergeSkipSyncpointAtStartTsCandidate := false
	mergeSkipDMLAsStartTsCandidate := false

	// If all source dispatchers have a pending block event and they are the same one,
	// adjust the startTs to ensure the merged dispatcher can replay it safely.
	allSamePending := true
	var pendingCommitTs uint64
	var pendingIsSyncPoint bool
	for idx, state := range pendingStates {
		if state == nil {
			allSamePending = false
			break
		}
		if idx == 0 {
			pendingCommitTs = state.BlockTs
			pendingIsSyncPoint = state.IsSyncPoint
			continue
		}
		if state.BlockTs != pendingCommitTs || state.IsSyncPoint != pendingIsSyncPoint {
			allSamePending = false
			break
		}
	}
	if allSamePending {
		if pendingIsSyncPoint {
			mergeStartTsCandidate = pendingCommitTs
		} else if pendingCommitTs > 0 {
			mergeStartTsCandidate = pendingCommitTs - 1
			mergeSkipDMLAsStartTsCandidate = true
		} else {
			log.Warn("pending ddl has zero commit ts, fallback to min checkpoint ts",
				zap.Stringer("changefeedID", t.manager.changefeedID),
				zap.Uint64("minCheckpointTs", minCheckpointTs),
				zap.Any("mergedDispatcher", t.mergedDispatcher.GetId()))
		}
		log.Info("merge dispatcher uses pending block event to calculate start ts",
			zap.Stringer("changefeedID", t.manager.changefeedID),
			zap.Any("mergedDispatcher", t.mergedDispatcher.GetId()),
			zap.Uint64("pendingCommitTs", pendingCommitTs),
			zap.Bool("pendingIsSyncPoint", pendingIsSyncPoint),
			zap.Uint64("startTs", mergeStartTsCandidate),
			zap.Bool("skipSyncpointAtStartTs", mergeSkipSyncpointAtStartTsCandidate),
			zap.Bool("skipDMLAsStartTs", mergeSkipDMLAsStartTsCandidate))
	}

	finalStartTs := mergeStartTsCandidate
	finalSkipSyncpointAtStartTs := mergeSkipSyncpointAtStartTsCandidate
	finalSkipDMLAsStartTs := mergeSkipDMLAsStartTsCandidate

	if common.IsDefaultMode(t.mergedDispatcher.GetMode()) && t.manager.sink.SinkType() == common.MysqlSinkType {
		newStartTsList, skipSyncpointAtStartTsList, skipDMLAsStartTsList, err := t.manager.sink.(*mysql.Sink).GetTableRecoveryInfo(
			[]int64{t.mergedDispatcher.GetTableSpan().TableID},
			[]int64{int64(mergeStartTsCandidate)},
			false,
		)
		if err != nil {
			log.Error("get table recovery info for merge dispatcher failed",
				zap.Stringer("dispatcherID", t.mergedDispatcher.GetId()),
				zap.Stringer("changefeedID", t.manager.changefeedID),
				zap.Error(err),
			)
			t.mergedDispatcher.HandleError(err)
			return finalStartTs, finalSkipSyncpointAtStartTs, finalSkipDMLAsStartTs
		}
		recoveryStartTs := uint64(newStartTsList[0])
		recoverySkipSyncpointAtStartTs := skipSyncpointAtStartTsList[0]
		recoverySkipDMLAsStartTs := skipDMLAsStartTsList[0]
		if recoveryStartTs > mergeStartTsCandidate {
			finalStartTs = recoveryStartTs
			finalSkipSyncpointAtStartTs = recoverySkipSyncpointAtStartTs
			finalSkipDMLAsStartTs = recoverySkipDMLAsStartTs
		} else if recoveryStartTs == mergeStartTsCandidate {
			finalSkipSyncpointAtStartTs = mergeSkipSyncpointAtStartTsCandidate || recoverySkipSyncpointAtStartTs
			finalSkipDMLAsStartTs = mergeSkipDMLAsStartTsCandidate || recoverySkipDMLAsStartTs
		}

		log.Info("get table recovery info for merge dispatcher",
			zap.Stringer("changefeedID", t.manager.changefeedID),
			zap.Uint64("mergeStartTsCandidate", mergeStartTsCandidate),
			zap.Any("recoveryStartTs", newStartTsList),
			zap.Any("recoverySkipSyncpointAtStartTsList", skipSyncpointAtStartTsList),
			zap.Any("recoverySkipDMLAsStartTsList", skipDMLAsStartTsList),
			zap.Uint64("finalStartTs", finalStartTs),
			zap.Bool("finalSkipSyncpointAtStartTs", finalSkipSyncpointAtStartTs),
			zap.Bool("finalSkipDMLAsStartTs", finalSkipDMLAsStartTs),
		)
	}

	return finalStartTs, finalSkipSyncpointAtStartTs, finalSkipDMLAsStartTs
}


medium

This new function resolveMergedDispatcherStartTs is quite long and complex, spanning over 100 lines. While it is well-commented, its length makes it difficult to digest at a glance. To improve readability, testability, and maintainability, consider refactoring it into smaller, more focused helper functions. For example:

  • A function to determine the merge candidate startTs and flags based on the pendingStates of source dispatchers.
  • A function to apply the MySQL-specific recovery logic, taking the candidate values and refining them.

Breaking down this logic would make the merge process easier to understand and reason about.
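
One possible shape for that decomposition, sketched with hypothetical names and stub bodies (the actual refactor, if done, may slice the logic differently):

// Hypothetical split of resolveMergedDispatcherStartTs along the two steps of its algorithm.
// Signatures and names are illustrative only, not existing code.

// mergeCandidateFromPendingStates implements step 1: derive the candidate startTs and
// skip flags from the source dispatchers' pending block states and the min checkpoint ts.
func mergeCandidateFromPendingStates(minCheckpointTs uint64, pendingStates []*heartbeatpb.State) (uint64, bool, bool) {
	// ... same logic as the first half of resolveMergedDispatcherStartTs ...
	return minCheckpointTs, false, false
}

// refineWithMysqlRecoveryInfo implements step 2: for MySQL sinks, query the downstream
// ddl_ts recovery info and combine it with the candidate values.
func refineWithMysqlRecoveryInfo(t *MergeCheckTask, startTs uint64, skipSyncpoint, skipDML bool) (uint64, bool, bool) {
	// ... same logic as the second half of resolveMergedDispatcherStartTs ...
	return startTs, skipSyncpoint, skipDML
}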

@gemini-code-assist

Summary of Changes

This pull request resolves a critical data consistency issue in TiCDC by enhancing how table dispatchers are initialized when they are moved or recreated, especially during in-flight DDL and syncpoint barrier operations. The core changes ensure that dispatchers accurately determine their starting timestamp and whether to skip DML events, effectively preventing data inconsistencies such as duplicate writes or missed DDLs. This is achieved through refined dispatcher creation logic and comprehensive testing under complex distributed system conditions.

Highlights

  • Corrected Dispatcher Start Behavior during DDL Barriers: When a dispatcher is moved or recreated during an in-flight DDL barrier, it will now correctly start from blockTs-1 and set skipDMLAsStartTs to true. This ensures the DDL is replayed without duplicating DML events that might have already been written.
  • Improved Dispatcher Start Behavior during Syncpoint Barriers: For in-flight syncpoint barriers, a moved or recreated dispatcher will now start from blockTs (the syncpoint's commit timestamp) and set skipDMLAsStartTs to false. This prevents re-scanning and re-applying events that could corrupt snapshot consistency.
  • Propagation of skipDMLAsStartTs: The skipDMLAsStartTs flag is now properly propagated through the DispatcherConfig protobuf message, dispatcherCreateInfo struct, and various dispatcher creation functions (NewEventDispatcher, NewRedoDispatcher) to ensure consistent behavior across the system.
  • New Integration Tests for Barrier Scenarios: Two new integration tests (ddl_move_table and syncpoint_move_table) have been added to thoroughly validate the correct behavior of dispatchers when moved during DDL and syncpoint barriers, respectively, ensuring end-to-end data consistency.
  • Enhanced Unit Test Coverage: Expanded unit tests for SpanReplication.NewAddDispatcherMessage and DispatcherManager to cover various scenarios involving blockState, syncpoint, and DDL barriers, ensuring the new logic functions as expected.


Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Added GetBlockEventStatus() to the Dispatcher interface.
    • Updated comments for skipDMLAsStartTs to clarify scenarios, specifically including maintainer-initiated move operations during DDL events.
    • Modified a log message for clarity when skipping DML events.
  • downstreamadapter/dispatcher/redo_dispatcher.go
    • Added skipDMLAsStartTs as a parameter to NewRedoDispatcher and passed it to the underlying basic dispatcher creation.
  • downstreamadapter/dispatcher/redo_dispatcher_test.go
    • Updated newRedoDispatcherForTest to include the new skipDMLAsStartTs parameter, defaulting it to false.
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
    • Modified newEventDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Implemented logic to combine or use the skipDMLAsStartTs flag based on whether the startTs has changed, ensuring correct propagation.
    • Added skipDMLAsStartTs to the log output when creating new dispatchers.
  • downstreamadapter/dispatchermanager/dispatcher_manager_helper.go
    • Modified prepareCreateDispatcher to return an additional slice of booleans, skipDMLAsStartTsList, to carry the SkipDMLAsStartTs information.
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
    • Added a new field SkipDMLAsStartTs to the dispatcherCreateInfo struct, with comments explaining its purpose.
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
    • Modified newRedoDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Updated the call to getTableRecoveryInfoFromMysqlSink to also return skipDMLAsStartTsList.
    • Implemented logic to combine or use the skipDMLAsStartTs flag for redo dispatchers.
    • Added skipDMLAsStartTs to the log output when creating new redo dispatchers.
    • Updated mergeRedoDispatcher to pass false for the new skipDMLAsStartTs parameter when creating a new redo dispatcher.
  • downstreamadapter/dispatchermanager/task.go
    • Added pendingStates to doMerge to record block event status from source dispatchers.
    • Introduced a new function resolveMergedDispatcherStartTs to determine the effective startTs and skip flags for merged dispatchers, considering minCheckpointTs, pendingStates, and MySQL sink recovery information.
  • heartbeatpb/heartbeat.pb.go
    • Added SkipDMLAsStartTs field to the DispatcherConfig protobuf message.
    • Updated MarshalToSizedBuffer, Size, and Unmarshal methods to handle the new SkipDMLAsStartTs field.
  • heartbeatpb/heartbeat.proto
    • Added bool skipDMLAsStartTs = 6; to the DispatcherConfig message, with comments explaining its use for DDL barriers.
  • maintainer/replica/replication_span.go
    • Modified NewAddDispatcherMessage to dynamically calculate StartTs and SkipDMLAsStartTs based on the replication span's blockState.
    • Implemented logic to set StartTs to blockTs-1 and SkipDMLAsStartTs to true for in-flight DDL barriers.
    • Implemented logic to set StartTs to blockTs and SkipDMLAsStartTs to false for in-flight syncpoint barriers.
  • maintainer/replica/replication_span_test.go
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint to test dispatcher start behavior during an in-flight syncpoint.
    • Added TestSpanReplication_NewAddDispatcherMessage_DontUseBlockTsAfterSyncPointDone to test dispatcher start behavior after a syncpoint is completed.
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsMinusOneForDDLInFlight to test dispatcher start behavior during an in-flight DDL barrier, verifying skipDMLAsStartTs.
  • tests/integration_tests/ddl_move_table/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight multi-table DDL barrier correctly handles the DDL and skipDMLAsStartTs.
  • tests/integration_tests/run_heavy_it_in_ci.sh
    • Updated the mysql_groups array to include the new ddl_move_table and syncpoint_move_table integration tests.
  • tests/integration_tests/syncpoint_move_table/conf/changefeed.toml
    • Added a new configuration file for the syncpoint_move_table integration test, enabling sync points.
  • tests/integration_tests/syncpoint_move_table/conf/diff_config_part1.toml
    • Added a new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/syncpoint_move_table/conf/diff_config_part2.toml
    • Added another new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/syncpoint_move_table/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight syncpoint does not restart it from syncpoint_ts - 1, preserving snapshot consistency.
Activity
  • 2025-12-17 14:41:10 UTC: gemini-code-assist[bot] provided an initial summary and noted the PR is NOT APPROVED.
  • 2025-12-17 14:42:33 UTC: gemini-code-assist[bot] suggested improving test coverage for TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint using a table-driven test.
  • 2025-12-18 08:01:33 UTC: hongyunyan ran /test all.
  • 2025-12-19 02:20:14 UTC: hongyunyan ran /test all again.
  • 2025-12-19 02:33:47 UTC: hongyunyan requested a summary using /gemini summary.
  • 2025-12-19 02:33:53 UTC: hongyunyan requested a review using /gemini review.
  • 2025-12-19 02:34:19 UTC: gemini-code-assist[bot] provided a new summary.
  • 2025-12-19 02:35:57 UTC: gemini-code-assist[bot] pointed out a typo in a comment (secnaios to scenarios).
  • 2025-12-19 02:35:57 UTC: gemini-code-assist[bot] suggested extracting duplicated logic for skipDMLAsStartTs into a shared helper function.
  • 2025-12-22 03:24:10 UTC: hongyunyan ran /test all.
  • 2025-12-22 03:51:57 UTC: wk989898 asked if the merge operator needs to carry SkipDMLAsStartTs.
  • 2025-12-22 03:54:04 UTC: wk989898 requested more tests for redo mode.
  • 2025-12-22 04:20:22 UTC: ti-chi-bot[bot] reported that the PR was APPROVED by wk989898.
  • 2025-12-22 04:20:23 UTC: ti-chi-bot[bot] provided LGTM timeline.
  • 2025-12-22 04:24:21 UTC: ti-chi-bot[bot] reported pull-cdc-mysql-integration-heavy test failed.
  • 2025-12-23 07:15:56 UTC: hongyunyan requested a review using /gemini review.
  • 2025-12-23 07:16:04 UTC: hongyunyan requested a summary using /gemini summary.
  • 2025-12-23 07:19:23 UTC: gemini-code-assist[bot] reiterated the suggestion to extract duplicated logic for skipDMLAsStartTs.
  • 2025-12-23 07:19:23 UTC: gemini-code-assist[bot] suggested refactoring resolveMergedDispatcherStartTs into smaller helper functions due to its complexity.

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/test all

1 similar comment
@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/retest

1 similar comment
@hongyunyan
Collaborator Author

/retest

@hongyunyan
Collaborator Author

/gemini review

@hongyunyan
Collaborator Author

/gemini summary


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a crucial fix for potential data inconsistencies that can occur when a dispatcher is moved or merged during in-flight DDL and syncpoint operations. The changes correctly adjust the dispatcher's starting timestamp and introduce a skipDMLAsStartTs flag to prevent duplicate writes or missed DDLs, which is a solid approach. The logic is well-supported by new unit and integration tests. My feedback focuses on improving code structure for better long-term maintainability by refactoring a long function and extracting duplicated logic into a helper.

Comment on lines 428 to 435
// if the newStartTs equals to the original startTs, we need to combine the skipDMLAsStartTs flag
// otherwise, we just use the skipDMLAsStartTs flag from mysql sink
var skipDMLAsStartTs bool
if newStartTsList[idx] == startTsList[idx] {
skipDMLAsStartTs = scheduleSkipDMLAsStartTsList[idx] || skipDMLAsStartTsList[idx]
} else {
skipDMLAsStartTs = skipDMLAsStartTsList[idx]
}


medium

This logic for determining skipDMLAsStartTs is duplicated in newEventDispatchers and newRedoDispatchers (in dispatcher_manager_redo.go). To improve maintainability and reduce code duplication, consider extracting this logic into a small helper function.

For example, you could create a helper in dispatcher_manager_helper.go:

func resolveSkipDMLAsStartTs(newStartTs, originalStartTs int64, scheduleSkip, sinkSkip bool) bool {
	// if the newStartTs equals to the original startTs, we need to combine the skipDMLAsStartTs flag
	// otherwise, we just use the skipDMLAsStartTs flag from mysql sink
	if newStartTs == originalStartTs {
		return scheduleSkip || sinkSkip
	}
	return sinkSkip
}

Then call it from both places.

Comment on lines 311 to 422
// resolveMergedDispatcherStartTs returns the effective startTs and skip flags for the merged dispatcher.
//
// Inputs:
// - minCheckpointTs: min checkpointTs among all source dispatchers, collected after they are closed.
// - pendingStates: per-source block state from GetBlockEventStatus() captured at close time.
//
// Algorithm:
// 1. Build a merge candidate from minCheckpointTs.
// If all source dispatchers have a non-nil pending block state and they refer to the same (commitTs, isSyncPoint),
// adjust the merge candidate so the merged dispatcher can replay that block event safely:
// - DDL: startTs = commitTs - 1, skipDMLAsStartTs = true.
// - SyncPoint: startTs = commitTs.
// The merge candidate always uses skipSyncpointAtStartTs = false.
// 2. If the sink is MySQL, query downstream ddl_ts recovery info using the merge candidate startTs and merge the results:
// - If recoveryStartTs > mergeStartTsCandidate: use recoveryStartTs and its skip flags.
// - If recoveryStartTs == mergeStartTsCandidate: OR the skip flags.
// - If recoveryStartTs < mergeStartTsCandidate: keep the merge candidate.
// If the query fails, the error is reported via mergedDispatcher.HandleError and the merge candidate is returned.
//
// For non-MySQL and redo, the merge candidate is the final result.
func resolveMergedDispatcherStartTs(t *MergeCheckTask, minCheckpointTs uint64, pendingStates []*heartbeatpb.State) (uint64, bool, bool) {
	mergeStartTsCandidate := minCheckpointTs
	mergeSkipSyncpointAtStartTsCandidate := false
	mergeSkipDMLAsStartTsCandidate := false

	// If all source dispatchers have a pending block event and they are the same one,
	// adjust the startTs to ensure the merged dispatcher can replay it safely.
	allSamePending := true
	var pendingCommitTs uint64
	var pendingIsSyncPoint bool
	for idx, state := range pendingStates {
		if state == nil {
			allSamePending = false
			break
		}
		if idx == 0 {
			pendingCommitTs = state.BlockTs
			pendingIsSyncPoint = state.IsSyncPoint
			continue
		}
		if state.BlockTs != pendingCommitTs || state.IsSyncPoint != pendingIsSyncPoint {
			allSamePending = false
			break
		}
	}
	if allSamePending {
		if pendingIsSyncPoint {
			mergeStartTsCandidate = pendingCommitTs
		} else if pendingCommitTs > 0 {
			mergeStartTsCandidate = pendingCommitTs - 1
			mergeSkipDMLAsStartTsCandidate = true
		} else {
			log.Warn("pending ddl has zero commit ts, fallback to min checkpoint ts",
				zap.Stringer("changefeedID", t.manager.changefeedID),
				zap.Uint64("minCheckpointTs", minCheckpointTs),
				zap.Any("mergedDispatcher", t.mergedDispatcher.GetId()))
		}
		log.Info("merge dispatcher uses pending block event to calculate start ts",
			zap.Stringer("changefeedID", t.manager.changefeedID),
			zap.Any("mergedDispatcher", t.mergedDispatcher.GetId()),
			zap.Uint64("pendingCommitTs", pendingCommitTs),
			zap.Bool("pendingIsSyncPoint", pendingIsSyncPoint),
			zap.Uint64("startTs", mergeStartTsCandidate),
			zap.Bool("skipSyncpointAtStartTs", mergeSkipSyncpointAtStartTsCandidate),
			zap.Bool("skipDMLAsStartTs", mergeSkipDMLAsStartTsCandidate))
	}

	finalStartTs := mergeStartTsCandidate
	finalSkipSyncpointAtStartTs := mergeSkipSyncpointAtStartTsCandidate
	finalSkipDMLAsStartTs := mergeSkipDMLAsStartTsCandidate

	if common.IsDefaultMode(t.mergedDispatcher.GetMode()) && t.manager.sink.SinkType() == common.MysqlSinkType {
		newStartTsList, skipSyncpointAtStartTsList, skipDMLAsStartTsList, err := t.manager.sink.(*mysql.Sink).GetTableRecoveryInfo(
			[]int64{t.mergedDispatcher.GetTableSpan().TableID},
			[]int64{int64(mergeStartTsCandidate)},
			false,
		)
		if err != nil {
			log.Error("get table recovery info for merge dispatcher failed",
				zap.Stringer("dispatcherID", t.mergedDispatcher.GetId()),
				zap.Stringer("changefeedID", t.manager.changefeedID),
				zap.Error(err),
			)
			t.mergedDispatcher.HandleError(err)
			return finalStartTs, finalSkipSyncpointAtStartTs, finalSkipDMLAsStartTs
		}
		recoveryStartTs := uint64(newStartTsList[0])
		recoverySkipSyncpointAtStartTs := skipSyncpointAtStartTsList[0]
		recoverySkipDMLAsStartTs := skipDMLAsStartTsList[0]
		if recoveryStartTs > mergeStartTsCandidate {
			finalStartTs = recoveryStartTs
			finalSkipSyncpointAtStartTs = recoverySkipSyncpointAtStartTs
			finalSkipDMLAsStartTs = recoverySkipDMLAsStartTs
		} else if recoveryStartTs == mergeStartTsCandidate {
			finalSkipSyncpointAtStartTs = mergeSkipSyncpointAtStartTsCandidate || recoverySkipSyncpointAtStartTs
			finalSkipDMLAsStartTs = mergeSkipDMLAsStartTsCandidate || recoverySkipDMLAsStartTs
		}

		log.Info("get table recovery info for merge dispatcher",
			zap.Stringer("changefeedID", t.manager.changefeedID),
			zap.Uint64("mergeStartTsCandidate", mergeStartTsCandidate),
			zap.Any("recoveryStartTs", newStartTsList),
			zap.Any("recoverySkipSyncpointAtStartTsList", skipSyncpointAtStartTsList),
			zap.Any("recoverySkipDMLAsStartTsList", skipDMLAsStartTsList),
			zap.Uint64("finalStartTs", finalStartTs),
			zap.Bool("finalSkipSyncpointAtStartTs", finalSkipSyncpointAtStartTs),
			zap.Bool("finalSkipDMLAsStartTs", finalSkipDMLAsStartTs),
		)
	}

	return finalStartTs, finalSkipSyncpointAtStartTs, finalSkipDMLAsStartTs
}


medium

The function resolveMergedDispatcherStartTs is quite long (over 100 lines) and handles multiple complex logical steps. To improve readability and maintainability, consider refactoring it into smaller, more focused helper functions.

For example, you could break it down into:

  1. A function to determine the candidate startTs and skip flags based on the pending states of the source dispatchers.
  2. A function that takes these candidate values and merges them with the recovery information from the MySQL sink.

This would make the overall logic easier to follow and test.

@gemini-code-assist

Summary of Changes

This pull request addresses a critical data consistency concern in TiCDC by refining how table dispatchers are initialized when they are moved or recreated, particularly during in-flight DDL and syncpoint barrier operations. The changes ensure that dispatchers correctly determine their starting timestamp and whether to skip DML events, preventing data inconsistencies such as duplicate writes or missed DDLs. This is achieved by introducing a skipDMLAsStartTs flag, updating the dispatcher creation logic, and adding robust unit and integration tests to validate the behavior under complex distributed system conditions.

Highlights

  • Corrected Dispatcher Start Behavior during DDL Barriers: When a dispatcher is moved or recreated during an in-flight DDL barrier, it will now correctly start from blockTs-1 and set skipDMLAsStartTs to true. This ensures the DDL is replayed without duplicating DML events that might have already been written.
  • Improved Dispatcher Start Behavior during Syncpoint Barriers: For in-flight syncpoint barriers, a moved or recreated dispatcher will now start from blockTs (the syncpoint's commit timestamp) and set skipDMLAsStartTs to false. This prevents re-scanning and re-applying events that could corrupt snapshot consistency.
  • Propagation of skipDMLAsStartTs: The skipDMLAsStartTs flag is now properly propagated through the DispatcherConfig protobuf message, dispatcherCreateInfo struct, and various dispatcher creation functions (NewEventDispatcher, NewRedoDispatcher) to ensure consistent behavior across the system.
  • New Integration Tests for Barrier Scenarios: Two new integration tests (ddl_move_table and syncpoint_move_table) have been added to thoroughly validate the correct behavior of dispatchers when moved during DDL and syncpoint barriers, respectively, ensuring end-to-end data consistency.
  • Enhanced Unit Test Coverage: Expanded unit tests for SpanReplication.NewAddDispatcherMessage and DispatcherManager to cover various scenarios involving blockState, syncpoint, and DDL barriers, ensuring the new logic functions as expected.


Changelog
  • downstreamadapter/dispatcher/basic_dispatcher.go
    • Added GetBlockEventStatus() to the Dispatcher interface.
    • Updated comments for skipDMLAsStartTs to clarify scenarios, specifically including maintainer-initiated move operations during DDL events.
    • Modified a log message for clarity when skipping DML events.
    • Extracted DDL execution logic into a new ExecutePendingDDL method.
  • downstreamadapter/dispatcher/redo_dispatcher.go
    • Added skipDMLAsStartTs as a parameter to NewRedoDispatcher and passed it to the underlying basic dispatcher creation.
  • downstreamadapter/dispatcher/redo_dispatcher_test.go
    • Updated newRedoDispatcherForTest to include the new skipDMLAsStartTs parameter, defaulting it to false.
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
    • Modified newEventDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Implemented logic to combine or use the skipDMLAsStartTs flag based on whether the startTs has changed, ensuring correct propagation.
    • Added skipDMLAsStartTs to the log output when creating new dispatchers.
  • downstreamadapter/dispatchermanager/dispatcher_manager_helper.go
    • Modified prepareCreateDispatcher to return an additional slice of booleans, skipDMLAsStartTsList, to carry the SkipDMLAsStartTs information.
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
    • Added a new field SkipDMLAsStartTs to the dispatcherCreateInfo struct, with comments explaining its purpose.
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
    • Modified newRedoDispatchers to retrieve scheduleSkipDMLAsStartTsList from prepareCreateDispatcher.
    • Updated the call to getTableRecoveryInfoFromMysqlSink to also return skipDMLAsStartTsList.
    • Implemented logic to combine or use the skipDMLAsStartTs flag for redo dispatchers.
    • Added skipDMLAsStartTs to the log output when creating new redo dispatchers.
    • Updated mergeRedoDispatcher to pass false for the new skipDMLAsStartTs parameter when creating a new redo dispatcher.
  • downstreamadapter/dispatchermanager/task.go
    • Added pendingStates to doMerge to record block event status from source dispatchers.
    • Introduced a new function resolveMergedDispatcherStartTs to determine the effective startTs and skip flags for merged dispatchers, considering minCheckpointTs, pendingStates, and MySQL sink recovery information.
  • heartbeatpb/heartbeat.pb.go
    • Added SkipDMLAsStartTs field to the DispatcherConfig protobuf message.
    • Updated MarshalToSizedBuffer, Size, and Unmarshal methods to handle the new SkipDMLAsStartTs field.
  • heartbeatpb/heartbeat.proto
    • Added bool skipDMLAsStartTs = 6; to the DispatcherConfig message, with comments explaining its use for DDL barriers.
  • maintainer/barrier_event.go
    • Added forwardBarrierEvent function to check if a barrier event can be forwarded for a given replication, considering checkpoint TS and block state.
    • Updated checkBlockedDispatchers to use forwardBarrierEvent.
  • maintainer/replica/replication_span.go
    • Added GetBlockState() method.
    • Modified NewAddDispatcherMessage to dynamically calculate StartTs and SkipDMLAsStartTs based on the replication span's blockState.
    • Implemented logic to set StartTs to blockTs-1 and SkipDMLAsStartTs to true for in-flight DDL barriers.
    • Implemented logic to set StartTs to blockTs and SkipDMLAsStartTs to false for in-flight syncpoint barriers.
  • maintainer/replica/replication_span_test.go
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint to test dispatcher start behavior during an in-flight syncpoint.
    • Added TestSpanReplication_NewAddDispatcherMessage_DontUseBlockTsAfterSyncPointDone to test dispatcher start behavior after a syncpoint is completed.
    • Added TestSpanReplication_NewAddDispatcherMessage_UseBlockTsMinusOneForDDLInFlight to test dispatcher start behavior during an in-flight DDL barrier, verifying skipDMLAsStartTs.
  • tests/integration_tests/_utils/query_dispatcher_count
    • Modified the query_dispatcher_count script to support "ge" (greater than or equal) comparison mode.
  • tests/integration_tests/in_flight_ddl_during_scheduling/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight multi-table DDL barrier correctly handles the DDL and skipDMLAsStartTs.
  • tests/integration_tests/in_flight_syncpoint_during_sheduling/conf/changefeed.toml
    • Added a new configuration file for the syncpoint_move_table integration test, enabling sync points.
  • tests/integration_tests/in_flight_syncpoint_during_sheduling/conf/diff_config_part1.toml
    • Added a new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/in_flight_syncpoint_during_sheduling/conf/diff_config_part2.toml
    • Added another new configuration file part for sync_diff in the syncpoint_move_table integration test.
  • tests/integration_tests/in_flight_syncpoint_during_sheduling/run.sh
    • Added a new integration test script to verify that moving a table dispatcher during an in-flight syncpoint does not restart it from syncpoint_ts - 1, preserving snapshot consistency.
  • tests/integration_tests/run_heavy_it_in_ci.sh
    • Updated the mysql_groups array to include the new in_flight_ddl_during_scheduling and in_flight_syncpoint_during_sheduling integration tests.
Activity
  • On Dec 17, gemini-code-assist[bot] provided an initial summary and noted the PR was not approved. It also suggested improving test coverage for TestSpanReplication_NewAddDispatcherMessage_UseBlockTsForInFlightSyncPoint using a table-driven test.
  • On Dec 18, hongyunyan ran /test all.
  • On Dec 19, hongyunyan ran /test all again.
  • On Dec 19, hongyunyan requested a summary and review from /gemini.
  • On Dec 19, gemini-code-assist[bot] provided a new summary. It also pointed out a typo ('secnaios' to 'scenarios') and suggested extracting duplicated skipDMLAsStartTs logic into a shared helper function.
  • On Dec 22, hongyunyan ran /test all.
  • On Dec 22, wk989898 asked if the merge operator needs to carry SkipDMLAsStartTs and requested more tests for redo mode.
  • On Dec 22, ti-chi-bot[bot] reported that the PR was APPROVED by wk989898 and provided the LGTM timeline.
  • On Dec 22, ti-chi-bot[bot] reported that the pull-cdc-mysql-integration-heavy test failed.
  • On Dec 23, hongyunyan requested a review and summary from /gemini.
  • On Dec 23, gemini-code-assist[bot] reiterated the suggestion to extract duplicated skipDMLAsStartTs logic and suggested refactoring resolveMergedDispatcherStartTs into smaller helper functions.
  • On Dec 23, hongyunyan ran /test all multiple times.
  • On Dec 24, hongyunyan ran /test all multiple times.
  • On Dec 24, hongyunyan ran /retest.
  • On Dec 24, ti-chi-bot[bot] reported that the pull-cdc-kafka-integration-light test failed.
  • On Dec 25, hongyunyan ran /retest.
  • On Dec 25, hongyunyan requested a review and summary from /gemini.

@hongyunyan
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a critical fix for data consistency issues that could occur when a dispatcher is moved or recreated during in-flight DDL and syncpoint operations. The changes correctly determine the dispatcher's start timestamp and whether to skip DML events by introducing and propagating a skipDMLAsStartTs flag. The logic for handling these barrier scenarios is now more robust, especially in the dispatcher creation and merge logic. The addition of comprehensive integration and unit tests is a great step towards ensuring the correctness of these complex distributed scenarios. Overall, the changes are well-implemented and significantly improve the reliability of TiCDC. I've found one critical issue regarding an undefined variable that would cause a compilation failure, and a minor typo in a test script.

t.mergedDispatcher.SetCurrentPDTs(t.manager.pdClock.CurrentTS())
t.mergedDispatcher.SetComponentStatus(heartbeatpb.ComponentState_Initializing)
appcontext.GetService[*eventcollector.EventCollector](appcontext.EventCollector).CommitAddDispatcher(t.mergedDispatcher, minCheckpointTs)
appcontext.GetService[*eventcollector.EventCollector](appcontext.EventCollector).CommitAddDispatcher(t.mergedDispatcher, finalStartTs)


critical

The variable finalStartTs is used here but it's not defined in the scope of the doMerge function. This will cause a compilation error. Based on the surrounding code and the refactoring, it seems you intended to use startTs, which is defined on line 276 and holds the correctly resolved start timestamp for the merged dispatcher. Please correct this to use startTs.

Suggested change
appcontext.GetService[*eventcollector.EventCollector](appcontext.EventCollector).CommitAddDispatcher(t.mergedDispatcher, finalStartTs)
appcontext.GetService[*eventcollector.EventCollector](appcontext.EventCollector).CommitAddDispatcher(t.mergedDispatcher, startTs)

'cdc move_table in_flight_ddl_during_scheduling checkpoint_race_ddl_crash'
# G04
'complex_transaction syncpoint syncpoint_check_ts random_drop_message'
'complex_transaction syncpoint in_flight_syncpoint_during_sheduling syncpoint_check_ts random_drop_message'


medium

There's a typo in the test case name. It should be in_flight_syncpoint_during_scheduling instead of in_flight_syncpoint_during_sheduling to match the directory name.

Suggested change
'complex_transaction syncpoint in_flight_syncpoint_during_sheduling syncpoint_check_ts random_drop_message'
'complex_transaction syncpoint in_flight_syncpoint_during_scheduling syncpoint_check_ts random_drop_message'

@hongyunyan
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a critical fix for data consistency issues that can occur when a dispatcher is moved, recreated, or merged during an in-flight DDL or syncpoint barrier. The core of the change is to make the dispatcher creation and merge logic aware of these in-flight barriers, correctly setting the startTs and a new skipDMLAsStartTs flag to ensure events are replayed correctly without data loss or duplication.

The changes are comprehensive, touching the dispatcher, dispatcher manager, maintainer, and protobuf definitions. The logic for determining the correct starting timestamp and skip flags is well-encapsulated in new helper functions, improving maintainability. The addition of two thorough integration tests (ddl_move_table and syncpoint_move_table) that use failpoints to simulate the exact race conditions is excellent and provides strong confidence in the fix. The code is well-documented with comments explaining the complex logic. Overall, this is a high-quality and important contribution to the stability of TiCDC.


Labels

  • approved
  • needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.)
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)


Development

Successfully merging this pull request may close these issues.

Under extreme conditions, temporary inconsistencies may occur between upstream and downstream in SYNCPOINT timestamps
