wip-checksum #3782

hongyunyan · 2025-12-24T06:51:39Z

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

ti-chi-bot · 2025-12-24T06:51:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2025-12-24T06:52:43Z

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a dispatcher set checksum mechanism to make watermark advancement more reliable. With this verification in place, the system prevents checkpoint progression when the expected and actual dispatcher sets on an individual capture differ. This protects data consistency, particularly in dynamic environments where dispatchers might be added, removed, or moved. The changes span the dispatcher and maintainer components and add new communication protocols, state management, and monitoring.

Highlights

- Dispatcher Set Checksum Mechanism: Introduced a new dispatcher set checksum mechanism to ensure the consistency of dispatcher sets across the system. This mechanism prevents watermark advancement if the local dispatcher set on a capture does not match the maintainer's expected set.
- Watermark Advancement Gating: Watermark advancement (both regular and redo) is now gated by the dispatcher set checksum state. If the checksum is 'UNINITIALIZED' or 'MISMATCH', watermark reporting is suppressed to prevent incorrect progress.
- Checksum State Management: New protobuf messages (ChecksumState, DispatcherSetChecksumFingerprint, DispatcherSetChecksumUpdate, DispatcherSetChecksumAck) and corresponding Go structs/logic have been added to manage and communicate checksum states between maintainers and dispatchers.
- Operator Integration: All dispatcher-related operators (Add, Move, Merge, Remove, Split) have been updated to interact with a DispatcherSetChecksumUpdater interface. This ensures that the maintainer's expected dispatcher set is correctly updated whenever dispatchers are added, removed, or modified by an operator.
- Monitoring and Logging: New Prometheus metrics (DispatcherManagerDispatcherSetChecksumNotOKGauge, DispatcherManagerDispatcherSetChecksumNotOKTotal) are introduced to monitor the dispatcher set checksum states. Detailed logging is also added for non-OK checksum states, including warnings for prolonged inconsistencies.
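
To make the gating highlight concrete, here is a minimal sketch of how a heartbeat could drop its watermark when the checksum state is not OK. The `Heartbeat` and `Watermark` types and the `applyChecksumState` helper are hypothetical stand-ins, not the types from this PR; only the three state names come from the summary.

```go
package main

import "fmt"

// ChecksumState mirrors the three states named in the summary.
// The Go identifiers and numeric values are illustrative assumptions.
type ChecksumState int

const (
	StateUninitialized ChecksumState = iota
	StateMismatch
	StateOK
)

// Watermark and Heartbeat are hypothetical stand-ins for the real
// heartbeat protobuf messages.
type Watermark struct {
	CheckpointTs uint64
	ResolvedTs   uint64
}

type Heartbeat struct {
	ChecksumState ChecksumState
	Watermark     *Watermark // nil means no progress is reported
}

// applyChecksumState records the state on the heartbeat and drops the
// watermark unless the local dispatcher set matched the expected one.
func applyChecksumState(hb *Heartbeat, state ChecksumState) {
	hb.ChecksumState = state
	if state != StateOK {
		hb.Watermark = nil
	}
}

func main() {
	hb := &Heartbeat{Watermark: &Watermark{CheckpointTs: 100, ResolvedTs: 120}}
	applyChecksumState(hb, StateMismatch)
	fmt.Println("watermark suppressed:", hb.Watermark == nil)
}
```

The heartbeat is still sent in this sketch, so the receiver can see the non-OK state; only the progress information is withheld.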

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a checksum mechanism to ensure the consistency of the dispatcher set between the maintainer and dispatcher managers. This is a significant improvement for data integrity and safety against scheduling anomalies. The implementation is thorough, covering initialization, incremental updates, and safety checks to halt checkpoint advancement on mismatches. The changes are well-integrated across various components. I have a few suggestions to refactor some parts of the new code to improve readability and reduce duplication.

gemini-code-assist · 2025-12-24T06:56:29Z

downstreamadapter/dispatchermanager/dispatcher_manager.go

+	for _, state := range []string{"mismatch", "uninitialized"} {
+		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
+		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
+	}
+	if e.RedoEnable {
+		for _, state := range []string{"mismatch", "uninitialized"} {
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
+		}
+	}


The logic for deleting metrics for 'default' and 'redo' modes is duplicated. This can be refactored to improve maintainability by iterating over a list of modes. For example:

modes := []string{"default"}
if e.RedoEnable {
	modes = append(modes, "redo")
}
for _, mode := range modes {
	for _, state := range []string{"mismatch", "uninitialized"} {
		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
	}
}

gemini-code-assist · 2025-12-24T06:56:29Z

downstreamadapter/dispatchermanager/dispatcher_set_checksum.go

+func (e *DispatcherManager) verifyDispatcherSetChecksum(mode int64, actual dispatcherSetFingerprint) heartbeatpb.ChecksumState {
+	now := time.Now()
+	capture := appcontext.GetID()
+	modeLabel := common.StringMode(mode)
+	keyspace := e.changefeedID.Keyspace()
+	changefeed := e.changefeedID.Name()
+
+	var (
+		state           heartbeatpb.ChecksumState
+		expectedSeq     uint64
+		expectedInit    bool
+		expectedFP      dispatcherSetFingerprint
+		oldState        heartbeatpb.ChecksumState
+		nonOKSince      time.Time
+		needGaugeUpdate bool
+		logRecovered    bool
+		logNotOKWarn    bool
+		logNotOKError   bool
+		recoveredFor    time.Duration
+		notOKFor        time.Duration
+	)
+
+	e.dispatcherSetChecksum.mu.Lock()
+	expected := &e.dispatcherSetChecksum.defaultExpected
+	runtime := &e.dispatcherSetChecksum.defaultRuntime
+	if common.IsRedoMode(mode) {
+		expected = &e.dispatcherSetChecksum.redoExpected
+		runtime = &e.dispatcherSetChecksum.redoRuntime
+	}
+
+	expectedSeq = expected.seq
+	expectedInit = expected.initialized
+	expectedFP = expected.fingerprint
+
+	if !expected.initialized {
+		state = heartbeatpb.ChecksumState_UNINITIALIZED
+	} else if !actual.equal(expected.fingerprint) {
+		state = heartbeatpb.ChecksumState_MISMATCH
+	} else {
+		state = heartbeatpb.ChecksumState_OK
+	}
+
+	oldState = runtime.state
+	nonOKSince = runtime.nonOKSince
+	needGaugeUpdate = !runtime.gaugeInitialized || oldState != state
+
+	const (
+		errorAfter    = 30 * time.Second
+		errorInterval = 30 * time.Second
+	)
+
+	if state == heartbeatpb.ChecksumState_OK {
+		if oldState != heartbeatpb.ChecksumState_OK && !runtime.nonOKSince.IsZero() {
+			logRecovered = true
+			recoveredFor = now.Sub(runtime.nonOKSince)
+		}
+		runtime.state = state
+		runtime.nonOKSince = time.Time{}
+		runtime.lastErrorLogTime = time.Time{}
+	} else {
+		needResetTimer := oldState == heartbeatpb.ChecksumState_OK || oldState != state
+		if needResetTimer || runtime.nonOKSince.IsZero() {
+			runtime.nonOKSince = now
+			runtime.lastErrorLogTime = time.Time{}
+			logNotOKWarn = true
+		} else {
+			notOKFor = now.Sub(runtime.nonOKSince)
+			if notOKFor >= errorAfter && now.Sub(runtime.lastErrorLogTime) >= errorInterval {
+				runtime.lastErrorLogTime = now
+				logNotOKError = true
+			}
+		}
+		runtime.state = state
+		nonOKSince = runtime.nonOKSince
+	}
+	runtime.gaugeInitialized = true
+	e.dispatcherSetChecksum.mu.Unlock()
+
+	if needGaugeUpdate {
+		setGauge := func(stateLabel string, value float64) {
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.WithLabelValues(
+				keyspace, changefeed, capture, modeLabel, stateLabel,
+			).Set(value)
+		}
+
+		setGauge("mismatch", 0)
+		setGauge("uninitialized", 0)
+
+		if state != heartbeatpb.ChecksumState_OK {
+			stateLabel := "mismatch"
+			if state == heartbeatpb.ChecksumState_UNINITIALIZED {
+				stateLabel = "uninitialized"
+			}
+			setGauge(stateLabel, 1)
+		}
+	}
+
+	if logRecovered {
+		log.Info("dispatcher set checksum recovered",
+			zap.Stringer("changefeedID", e.changefeedID),
+			zap.String("capture", capture),
+			zap.String("mode", modeLabel),
+			zap.Duration("duration", recoveredFor),
+			zap.Uint64("expectedSeq", expectedSeq),
+		)
+	}
+
+	if logNotOKWarn || logNotOKError {
+		level := "warn"
+		if logNotOKError {
+			level = "error"
+		}
+		stateStr := "mismatch"
+		if state == heartbeatpb.ChecksumState_UNINITIALIZED {
+			stateStr = "uninitialized"
+		}
+		notOKFor = now.Sub(nonOKSince)
+		fields := []zap.Field{
+			zap.Stringer("changefeedID", e.changefeedID),
+			zap.String("capture", capture),
+			zap.String("mode", modeLabel),
+			zap.String("state", stateStr),
+			zap.Duration("duration", notOKFor),
+			zap.Uint64("expectedSeq", expectedSeq),
+			zap.Bool("expectedInitialized", expectedInit),
+			zap.Uint64("actualCount", actual.count),
+			zap.Uint64("actualXorHigh", actual.xorHigh),
+			zap.Uint64("actualXorLow", actual.xorLow),
+			zap.Uint64("actualSumHigh", actual.sumHigh),
+			zap.Uint64("actualSumLow", actual.sumLow),
+			zap.Uint64("expectedCount", expectedFP.count),
+			zap.Uint64("expectedXorHigh", expectedFP.xorHigh),
+			zap.Uint64("expectedXorLow", expectedFP.xorLow),
+			zap.Uint64("expectedSumHigh", expectedFP.sumHigh),
+			zap.Uint64("expectedSumLow", expectedFP.sumLow),
+			zap.String("prevState", oldState.String()),
+		}
+		if level == "error" {
+			log.Error("dispatcher set checksum not ok, skip watermark reporting", fields...)
+		} else {
+			log.Warn("dispatcher set checksum not ok, skip watermark reporting", fields...)
+		}
+	}
+
+	return state
+}


The verifyDispatcherSetChecksum function is quite long and complex, which can make it difficult to understand and maintain. Consider refactoring it by extracting parts of the logic into smaller, well-named helper functions. For example, the metric update logic (lines 285-302) and the logging logic (lines 304-349) could be moved to separate functions.
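
One possible shape for such a helper, as a sketch only: the state transition and log throttling can be isolated into a method that returns which log line (if any) the caller should emit. The names `runtimeState`, `transition`, `logAction`, and `at` are hypothetical; the 30-second thresholds and the branching follow the diff above.

```go
package main

import (
	"fmt"
	"time"
)

type checksumState int

const (
	stateUninitialized checksumState = iota
	stateMismatch
	stateOK
)

// logAction tells the caller which log line to emit, if any.
type logAction int

const (
	logNone logAction = iota
	logRecovered
	logWarn
	logError
)

const (
	errorAfter    = 30 * time.Second
	errorInterval = 30 * time.Second
)

// runtimeState is a hypothetical, trimmed-down version of the per-mode
// runtime struct used in the diff.
type runtimeState struct {
	state            checksumState
	nonOKSince       time.Time
	lastErrorLogTime time.Time
}

// transition updates the runtime state and reports what should be logged.
// It is meant to be called with the caller's mutex held.
func (r *runtimeState) transition(next checksumState, now time.Time) logAction {
	prev := r.state
	r.state = next
	if next == stateOK {
		recovered := prev != stateOK && !r.nonOKSince.IsZero()
		r.nonOKSince = time.Time{}
		r.lastErrorLogTime = time.Time{}
		if recovered {
			return logRecovered
		}
		return logNone
	}
	// A new non-OK episode (or a change of non-OK kind) restarts the timer.
	if prev != next || r.nonOKSince.IsZero() {
		r.nonOKSince = now
		r.lastErrorLogTime = time.Time{}
		return logWarn
	}
	if now.Sub(r.nonOKSince) >= errorAfter && now.Sub(r.lastErrorLogTime) >= errorInterval {
		r.lastErrorLogTime = now
		return logError
	}
	return logNone
}

// at returns a fixed base time plus sec seconds; it exists only for the demo.
func at(sec int) time.Time {
	base := time.Date(2025, 12, 24, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

func main() {
	var r runtimeState // zero value: uninitialized, no timers running
	steps := []struct {
		sec  int
		next checksumState
	}{{0, stateMismatch}, {10, stateMismatch}, {31, stateMismatch}, {40, stateOK}}
	for _, s := range steps {
		fmt.Println(s.sec, r.transition(s.next, at(s.sec)))
	}
}
```

With the decision isolated this way, the metric update and the actual logging calls can live in their own functions that run after the lock is released.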

gemini-code-assist · 2025-12-24T06:56:29Z

maintainer/dispatcher_set_checksum.go

+	for capture, ids := range defaultExpected {
+		state, ok := m.defaultNodes[capture]
+		if !ok {
+			continue
+		}
+		for _, id := range ids {
+			oldCapture, exists := m.defaultDispatcherToNode[id]
+			if exists {
+				if oldCapture == capture {
+					log.Warn("dispatcher already exists in expected set, ignore it",
+						zap.Stringer("changefeedID", m.changefeedID),
+						zap.String("dispatcherID", id.String()),
+						zap.String("capture", capture.String()),
+						zap.String("mode", common.StringMode(common.DefaultMode)),
+					)
+					continue
+				}
+				log.Warn("dispatcher exists in another capture, override expected node",
+					zap.Stringer("changefeedID", m.changefeedID),
+					zap.String("dispatcherID", id.String()),
+					zap.String("oldCapture", oldCapture.String()),
+					zap.String("newCapture", capture.String()),
+					zap.String("mode", common.StringMode(common.DefaultMode)),
+				)
+				if oldState, ok := m.defaultNodes[oldCapture]; ok {
+					oldState.fingerprint.remove(id)
+				}
+			}
+			m.defaultDispatcherToNode[id] = capture
+			state.fingerprint.add(id)
+		}
+	}
+
+	if m.redoEnabled {
+		for capture, ids := range redoExpected {
+			state, ok := m.redoNodes[capture]
+			if !ok {
+				continue
+			}
+			for _, id := range ids {
+				oldCapture, exists := m.redoDispatcherToNode[id]
+				if exists {
+					if oldCapture == capture {
+						log.Warn("dispatcher already exists in expected set, ignore it",
+							zap.Stringer("changefeedID", m.changefeedID),
+							zap.String("dispatcherID", id.String()),
+							zap.String("capture", capture.String()),
+							zap.String("mode", common.StringMode(common.RedoMode)),
+						)
+						continue
+					}
+					log.Warn("dispatcher exists in another capture, override expected node",
+						zap.Stringer("changefeedID", m.changefeedID),
+						zap.String("dispatcherID", id.String()),
+						zap.String("oldCapture", oldCapture.String()),
+						zap.String("newCapture", capture.String()),
+						zap.String("mode", common.StringMode(common.RedoMode)),
+					)
+					if oldState, ok := m.redoNodes[oldCapture]; ok {
+						oldState.fingerprint.remove(id)
+					}
+				}
+				m.redoDispatcherToNode[id] = capture
+				state.fingerprint.add(id)
+			}
+		}
+	}


The logic for building the expected fingerprints for default and redo modes is duplicated. This could be extracted into a helper function to improve code clarity and reduce duplication. The helper function could take the mode-specific maps as arguments and handle the fingerprint calculation.
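
A rough sketch of what that helper might look like, called once per mode with the mode-specific maps. String IDs and the toy `fingerprint` type are simplifications made for illustration, and the warning logs from the diff are omitted; the control flow follows the duplicated loops above.

```go
package main

import "fmt"

// fingerprint is a toy stand-in that only tracks set membership.
type fingerprint struct{ ids map[string]struct{} }

func newFingerprint() *fingerprint { return &fingerprint{ids: map[string]struct{}{}} }

func (f *fingerprint) add(id string)    { f.ids[id] = struct{}{} }
func (f *fingerprint) remove(id string) { delete(f.ids, id) }
func (f *fingerprint) has(id string) bool {
	_, ok := f.ids[id]
	return ok
}

type nodeState struct{ fingerprint *fingerprint }

// assignExpected is a hypothetical helper: it applies one mode's expected
// dispatcher lists to that mode's node and dispatcher-to-node maps.
func assignExpected(
	nodes map[string]*nodeState,
	dispatcherToNode map[string]string,
	expected map[string][]string,
) {
	for capture, ids := range expected {
		state, ok := nodes[capture]
		if !ok {
			continue // unknown capture: skip, as the original loop does
		}
		for _, id := range ids {
			if old, exists := dispatcherToNode[id]; exists {
				if old == capture {
					continue // already expected on this capture
				}
				// Expected elsewhere: move it to the new capture.
				if oldState, ok := nodes[old]; ok {
					oldState.fingerprint.remove(id)
				}
			}
			dispatcherToNode[id] = capture
			state.fingerprint.add(id)
		}
	}
}

func main() {
	nodes := map[string]*nodeState{
		"a": {fingerprint: newFingerprint()},
		"b": {fingerprint: newFingerprint()},
	}
	dispatcherToNode := map[string]string{}
	assignExpected(nodes, dispatcherToNode, map[string][]string{"a": {"d1"}, "b": {"d2"}})
	fmt.Println(dispatcherToNode["d1"], dispatcherToNode["d2"])
}
```

The default and redo branches would then reduce to two calls that differ only in the maps passed in (plus the mode label for logging).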

gemini-code-assist · 2025-12-24T06:56:29Z

maintainer/maintainer.go

+		defaultExpected := make(map[node.ID][]common.DispatcherID, len(nodes))
+		defaultExpected[m.selfNode.ID] = append(defaultExpected[m.selfNode.ID], m.ddlSpan.ID)
+		for _, span := range m.controller.spanController.GetReplicating() {
+			capture := span.GetNodeID()
+			defaultExpected[capture] = append(defaultExpected[capture], span.ID)
+		}
+
+		var redoExpected map[node.ID][]common.DispatcherID
+		if m.enableRedo {
+			redoExpected = make(map[node.ID][]common.DispatcherID, len(nodes))
+			redoExpected[m.selfNode.ID] = append(redoExpected[m.selfNode.ID], m.redoDDLSpan.ID)
+			for _, span := range m.controller.redoSpanController.GetReplicating() {
+				capture := span.GetNodeID()
+				redoExpected[capture] = append(redoExpected[capture], span.ID)
+			}
+		}


The logic for building defaultExpected and redoExpected is very similar. This could be refactored into a helper function to avoid code duplication and improve readability. The function could take the mode and the corresponding span controller as arguments and return the expected dispatcher map.
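
For illustration, such a helper could look roughly like the sketch below. The `span` struct and `buildExpected` function are hypothetical and use plain strings in place of `node.ID` and `common.DispatcherID`.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a replicating span: its dispatcher ID
// and the capture it currently runs on.
type span struct {
	ID     string
	NodeID string
}

// buildExpected groups dispatcher IDs by capture. The DDL span is always
// expected on the maintainer's own node, as in the diff above.
func buildExpected(selfNode, ddlSpanID string, replicating []span) map[string][]string {
	expected := make(map[string][]string)
	expected[selfNode] = append(expected[selfNode], ddlSpanID)
	for _, s := range replicating {
		expected[s.NodeID] = append(expected[s.NodeID], s.ID)
	}
	return expected
}

func main() {
	spans := []span{{ID: "t1", NodeID: "n1"}, {ID: "t2", NodeID: "n2"}, {ID: "t3", NodeID: "n1"}}
	fmt.Println(buildExpected("n1", "ddl", spans))
}
```

The redo map would come from a second call with the redo DDL span and the redo span controller's replicating set, guarded by the redo flag.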

hongyunyan · 2025-12-26T06:21:03Z

/gemini review

hongyunyan · 2025-12-26T06:21:08Z

/gemini summary

gemini-code-assist

Code Review

This pull request introduces a checksum mechanism to verify the consistency of the dispatcher set between the maintainer and dispatcher managers. This is a crucial feature for ensuring data integrity and preventing incorrect checkpoint advancement, especially during dispatcher lifecycle changes. The implementation is comprehensive, covering both the maintainer and dispatcher manager sides, with new protocol messages, metrics, and tests. The logic for incremental updates, acknowledgements, and resends appears robust. Operators are correctly updated to maintain the expected dispatcher set.

One minor suggestion for code improvement is to refactor a small piece of duplicated code in the cleanMetrics function to improve maintainability.

gemini-code-assist · 2025-12-26T06:23:51Z

downstreamadapter/dispatchermanager/dispatcher_manager.go

+	for _, state := range []string{"mismatch", "uninitialized"} {
+		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
+		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
+	}
+	if e.RedoEnable {
+		for _, state := range []string{"mismatch", "uninitialized"} {
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
+		}
+	}


The logic for deleting checksum-related metrics for default and redo modes is duplicated. This can be refactored into a loop over the modes to improve readability and maintainability.

Suggested change

-	for _, state := range []string{"mismatch", "uninitialized"} {
-		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
-		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
-	}
-	if e.RedoEnable {
-		for _, state := range []string{"mismatch", "uninitialized"} {
-			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
-			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
-		}
-	}
+	modes := []string{"default"}
+	if e.RedoEnable {
+		modes = append(modes, "redo")
+	}
+	for _, mode := range modes {
+		for _, state := range []string{"mismatch", "uninitialized"} {
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
+			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
+		}
+	}

hongyunyan · 2025-12-26T06:24:45Z

/test all

gemini-code-assist · 2025-12-26T06:25:35Z

Summary of Changes

This pull request introduces a robust dispatcher set checksum mechanism to enhance the reliability of watermark advancement in the system. By verifying the consistency of dispatcher configurations on individual captures against the maintainer's expectations, it prevents checkpoint progression in the event of discrepancies. This ensures data integrity and system stability, especially in dynamic environments where dispatcher assignments change. The changes involve new communication protocols, state management, and comprehensive monitoring across both dispatcher and maintainer components.
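
On the maintainer side, the gating described here can be pictured with a small sketch. The `canAdvanceCheckpoint` helper and the plain map are hypothetical, and treating a capture with no reported state as not OK is an assumption, not something the summary states.

```go
package main

import "fmt"

// ChecksumState mirrors the three states from the PR; the Go identifiers
// and values are illustrative.
type ChecksumState int

const (
	StateUninitialized ChecksumState = iota
	StateMismatch
	StateOK
)

// canAdvanceCheckpoint reports whether every listed capture last reported
// an OK checksum state. A capture without a recorded state blocks progress
// (an assumption made for this sketch).
func canAdvanceCheckpoint(states map[string]ChecksumState, captures []string) bool {
	for _, c := range captures {
		s, ok := states[c]
		if !ok || s != StateOK {
			return false
		}
	}
	return true
}

func main() {
	states := map[string]ChecksumState{"n1": StateOK, "n2": StateMismatch}
	fmt.Println(canAdvanceCheckpoint(states, []string{"n1", "n2"}))
	states["n2"] = StateOK
	fmt.Println(canAdvanceCheckpoint(states, []string{"n1", "n2"}))
}
```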

Highlights

- Dispatcher Set Checksum Mechanism: A new dispatcher set checksum mechanism has been introduced to ensure the consistency of dispatcher sets across the system. This mechanism prevents watermark advancement if the local dispatcher set on a capture does not match the maintainer's expected set.
- Watermark Advancement Gating: Watermark advancement (both regular and redo) is now gated by the dispatcher set checksum state. If the checksum is 'UNINITIALIZED' or 'MISMATCH', watermark reporting is suppressed to prevent incorrect progress.
- Checksum State Management: New protobuf messages (ChecksumState, DispatcherSetChecksum, DispatcherSetChecksumUpdate, DispatcherSetChecksumAck) and corresponding Go structs/logic have been added to manage and communicate checksum states between maintainers and dispatchers.
- Operator Integration: All dispatcher-related operators (Add, Move, Merge, Remove, Split) have been updated to interact with a DispatcherSetChecksumUpdater interface. This ensures that the maintainer's expected dispatcher set is correctly updated whenever dispatchers are added, removed, or modified by an operator.
- Monitoring and Logging: New Prometheus metrics (DispatcherManagerDispatcherSetChecksumNotOKGauge, DispatcherManagerDispatcherSetChecksumNotOKTotal) are introduced to monitor the dispatcher set checksum states. Detailed logging is also added for non-OK checksum states, including warnings for prolonged inconsistencies.
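
The update and ack messages carry an epoch and a sequence number. As a sketch of why that matters, the receiver can use them to ignore stale or reordered updates. The policy below (reject a different epoch, reject any sequence number that is not newer) is an assumption for illustration, and `update`, `expectedSet`, and `apply` are hypothetical names; the PR's actual rules may differ.

```go
package main

import "fmt"

// update is a trimmed, hypothetical version of a checksum update message.
// The checksum is reduced to a single integer for brevity.
type update struct {
	epoch    uint64
	seq      uint64
	checksum uint64
}

// expectedSet holds the latest accepted update for one mode.
type expectedSet struct {
	epoch       uint64
	seq         uint64
	checksum    uint64
	initialized bool
}

// apply accepts an update only if it belongs to the current epoch and is
// newer than what has already been applied. It reports whether the update
// changed the expected set.
func (e *expectedSet) apply(u update) bool {
	if u.epoch != e.epoch {
		return false // from a different maintainer epoch
	}
	if e.initialized && u.seq <= e.seq {
		return false // duplicate or reordered update
	}
	e.seq, e.checksum, e.initialized = u.seq, u.checksum, true
	return true
}

func main() {
	e := &expectedSet{epoch: 1}
	fmt.Println(e.apply(update{epoch: 1, seq: 2, checksum: 0xB}))
	fmt.Println(e.apply(update{epoch: 1, seq: 1, checksum: 0xA})) // arrives late
	fmt.Println(e.seq, e.checksum)
}
```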

Changelog

downstreamadapter/dispatchermanager/dispatcher_manager.go
- Imported pkg/set_checksum for checksum functionality.
- Added dispatcherSetChecksum field to DispatcherManager to store expected checksum states.
- Modified collectComponentStatusWhenChanged to compute and apply checksums to heartbeat messages.
- Modified aggregateDispatcherHeartbeats to compute checksums for default and redo modes and apply them to heartbeat messages.
- Added logic to cleanMetrics to delete new checksum-related Prometheus metrics upon cleanup.
downstreamadapter/dispatchermanager/dispatcher_manager_info.go
- Imported github.com/pingcap/log and go.uber.org/zap for logging.
- Replaced SetMaintainerID with UpdateMaintainer to also track the maintainer epoch and reset dispatcher set checksums on changes.
downstreamadapter/dispatchermanager/dispatcher_set_checksum.go
- Added new file to define structs (dispatcherSetChecksumExpected, dispatcherSetChecksumRuntime, dispatcherSetChecksumState) for tracking and managing checksum states.
- Implemented ApplyDispatcherSetChecksumUpdate to process checksum updates from the Maintainer, handling reordering.
- Implemented ResetDispatcherSetChecksum to clear all checksum states.
- Implemented shouldIncludeDispatcherInChecksum to filter dispatcher states for checksum calculation.
- Implemented computeDispatcherSetChecksum to calculate the checksum of active dispatchers.
- Implemented applyChecksumStateToHeartbeat to verify checksums and suppress watermark reporting if states are not OK.
- Implemented incDispatcherSetChecksumNotOKTotal to increment a counter for suppressed watermarks.
- Implemented verifyDispatcherSetChecksum to compare actual and expected checksums, update runtime state, and emit throttled logs.
- Included helper functions for managing checksum states and metrics.
downstreamadapter/dispatchermanager/dispatcher_set_checksum_test.go
- Added new file with unit tests for DispatcherSetChecksumWatermarkSuppression, covering uninitialized, mismatch, and OK states.
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
- Added handling for DispatcherSetChecksumUpdate messages in handleMessages.
- Modified handleBootstrapRequest to call manager.UpdateMaintainer.
- Added handleDispatcherSetChecksumUpdate to process checksum updates from the Maintainer, including validation and sending DispatcherSetChecksumAck.
heartbeatpb/heartbeat.pb.go
- Added ChecksumState enum (UNINITIALIZED, MISMATCH, OK).
- Added ChecksumState and RedoChecksumState fields to HeartBeatRequest.
- Added DispatcherSetChecksum message (count, xorHigh, xorLow, sumHigh, sumLow).
- Added DispatcherSetChecksumUpdate message (changefeedID, epoch, mode, seq, checksum).
- Added DispatcherSetChecksumAck message (changefeedID, epoch, mode, seq).
- Updated MarshalToSizedBuffer, Size, and Unmarshal methods to include new fields and messages.
- Updated init() function to register new enums and types.
heartbeatpb/heartbeat.proto
- Added ChecksumState enum.
- Added checksumState and redoChecksumState fields to HeartBeatRequest.
- Added DispatcherSetChecksum, DispatcherSetChecksumUpdate, and DispatcherSetChecksumAck messages.
maintainer/barrier_event_test.go
- Updated operator.NewOperatorController calls to include nil for the new checksumUpdater argument in test setups.
maintainer/barrier_test.go
- Updated operator.NewOperatorController calls to include nil for the new checksumUpdater argument in test setups.
maintainer/capture_set_checksum_manager.go
- Added new file defining captureSetChecksumState and captureChecksumState structs for managing per-capture checksums.
- Implemented captureSetChecksumManager to maintain maintainer-side expected dispatcher IDs for a single mode.
- Provided methods for initializing, updating, flushing, acknowledging, resending, and observing checksum states.
maintainer/capture_set_checksum_test.go
- Added new file with unit tests for captureSetChecksumManager, including tests for checkpoint gating and resend/ack logic.
- Included a recordingMessageCenter mock for testing message sending.
maintainer/maintainer.go
- Added defaultChecksumManager and redoChecksumManager fields to Maintainer struct.
- Added checksumStateByCapture and redoChecksumStateByCapture to store observed checksum states.
- Initialized defaultChecksumManager and redoChecksumManager in NewMaintainer.
- Modified onMessage to handle TypeDispatcherSetChecksumAck messages.
- Updated onNodeChanged to delete checksum states for removed nodes and call RemoveNodes on checksum managers.
- Modified advanceRedoMetaTsOnce to check redoChecksumStateByCapture before advancing redo checkpoint.
- Modified calculateNewCheckpointTs to check checksumStateByCapture before advancing checkpoint.
- Updated onHeartbeatRequest to record ChecksumState and RedoChecksumState from heartbeats and call ObserveHeartbeat on checksum managers.
- Modified onBootstrapResponses to receive and send checksum messages from controller.FinishBootstrap.
- Updated handleResendMessage to flush and resend pending checksum updates from checksum managers.
maintainer/maintainer_controller.go
- Added defaultChecksumManager and redoChecksumManager fields to Controller struct.
- Modified NewController to accept DispatcherSetChecksumUpdater interfaces for checksum managers and pass them to operator.NewOperatorController.
- Updated initializeComponents to return checksum messages and calls resetAndBuildDispatcherSetChecksumMessages.
- Added resetAndBuildDispatcherSetChecksumMessages and resetAndBuildChecksumMessages to initialize and build checksum update messages.
maintainer/maintainer_controller_bootstrap.go
- Imported pkg/messaging.
- Modified FinishBootstrap to return a slice of *messaging.TargetMessage for checksum updates.
- Updated calls to initializeComponents to handle the returned checksum messages.
maintainer/maintainer_controller_helper.go
- Updated operator.NewSplitDispatcherOperator call to include operatorController.GetChecksumUpdater().
maintainer/maintainer_controller_test.go
- Updated NewController calls to include testChecksumUpdater{} for the new checksumUpdater arguments in test setups.
- Updated s.FinishBootstrap calls to handle the new checksumMsgs return value.
maintainer/maintainer_helper.go
- Added ChecksumStateCaptureMap struct and associated methods (newChecksumStateCaptureMap, Get, Set, Delete) to manage heartbeatpb.ChecksumState per node.
maintainer/maintainer_manager.go
- Added a case in recvMessages to handle TypeDispatcherSetChecksumAck messages.
maintainer/maintainer_test.go
- Added a case in mockDispatcherManager.handleMessage to handle TypeDispatcherSetChecksumUpdate messages.
- Added onDispatcherSetChecksumUpdate method to mockDispatcherManager to send DispatcherSetChecksumAck.
- Updated mockDispatcherManager.recvMessages to include TypeDispatcherSetChecksumUpdate.
- Modified mockDispatcherManager.onDispatchRequest and sendHeartbeat to include ChecksumState and RedoChecksumState in HeartBeatRequest.
maintainer/operator/checksum_updater.go
- Added new file defining DispatcherSetChecksumUpdater interface with an ApplyDelta method for updating expected dispatcher sets.
maintainer/operator/operator_add.go
- Added checksumUpdater field to AddDispatcherOperator struct.
- Modified NewAddDispatcherOperator to accept DispatcherSetChecksumUpdater.
- Updated PostFinish to call m.checksumUpdater.ApplyDelta when a span is marked replicating.
maintainer/operator/operator_add_test.go
- Updated NewAddDispatcherOperator calls to include testChecksumUpdater{}.
maintainer/operator/operator_controller.go
- Added checksumUpdater field to Controller struct.
- Modified NewOperatorController to accept DispatcherSetChecksumUpdater.
- Added GetChecksumUpdater method.
- Updated RemoveTasksBySchemaID and RemoveTasksByTableIDs to pass oc.GetChecksumUpdater() to newRemoveDispatcherOperator.
- Modified NewMoveOperator to use NewMoveDispatcherOperator with oc.GetChecksumUpdater().
- Modified AddMergeOperator to use NewMergeDispatcherOperator with oc.GetChecksumUpdater().
maintainer/operator/operator_controller_test.go
- Updated NewOperatorController calls to include nil or testChecksumUpdater{}.
- Updated newRemoveDispatcherOperator call to include testChecksumUpdater{}.
maintainer/operator/operator_merge.go
- Added checksumUpdater field to MergeDispatcherOperator struct.
- Modified NewMergeDispatcherOperator to accept DispatcherSetChecksumUpdater.
- Updated PostFinish to call m.checksumUpdater.ApplyDelta for removed and new dispatchers.
maintainer/operator/operator_merge_test.go
- Updated NewMergeDispatcherOperator calls to include testChecksumUpdater{}.
maintainer/operator/operator_move.go
- Added checksumUpdater field to MoveDispatcherOperator struct.
- Modified NewMoveDispatcherOperator to accept DispatcherSetChecksumUpdater.
- Updated finishAsAbsent and PostFinish to call m.checksumUpdater.ApplyDelta.
maintainer/operator/operator_move_test.go
- Updated NewMoveDispatcherOperator calls to include testChecksumUpdater{}.
maintainer/operator/operator_remove.go
- Added checksumUpdater field to removeDispatcherOperator struct.
- Modified newRemoveDispatcherOperator to accept DispatcherSetChecksumUpdater.
- Updated PostFinish to call m.checksumUpdater.ApplyDelta.
maintainer/operator/operator_remove_test.go
- Updated newRemoveDispatcherOperator calls to include testChecksumUpdater{}.
maintainer/operator/operator_split.go
- Added checksumUpdater field to SplitDispatcherOperator struct.
- Modified NewSplitDispatcherOperator to accept DispatcherSetChecksumUpdater.
- Updated PostFinish to call m.checksumUpdater.ApplyDelta for the original dispatcher.
maintainer/operator/operator_split_test.go
- Updated NewSplitDispatcherOperator calls to include testChecksumUpdater{}.
maintainer/operator/test_checksum_updater_test.go
- Added new file defining testChecksumUpdater struct implementing DispatcherSetChecksumUpdater for testing purposes.
maintainer/scheduler/balance.go
- Updated operator.NewSplitDispatcherOperator and operator.NewMoveDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
maintainer/scheduler/balance_splits.go
- Updated operator.NewSplitDispatcherOperator and operator.NewMoveDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
maintainer/scheduler/basic.go
- Updated operator.NewAddDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
maintainer/test_checksum_updater_test.go
- Added new file defining testChecksumUpdater struct implementing DispatcherSetChecksumUpdater for testing purposes.
pkg/messaging/message.go
- Added TypeDispatcherSetChecksumUpdate and TypeDispatcherSetChecksumAck to IOType enum.
- Updated String() method for new types.
- Updated decodeIOType and NewSingleTargetMessage to handle new message types.
pkg/metrics/dispatcher.go
- Added DispatcherManagerDispatcherSetChecksumNotOKGauge and DispatcherManagerDispatcherSetChecksumNotOKTotal Prometheus metrics.
- Updated initDispatcherMetrics to register new metrics.
pkg/set_checksum/set_checksum.go
- Added new file defining Checksum struct for an order-independent, incrementally updatable checksum.
- Implemented Add, Remove, Equal methods for checksum manipulation.
- Implemented FromPB and ToPB methods for conversion to/from protobuf messages.
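
For readers unfamiliar with the idea behind pkg/set_checksum, the following is a minimal sketch of an order-independent, incrementally updatable checksum with Add, Remove, and Equal. It is not the PR's implementation: the field layout (count, XOR, wrapping sum), the FNV-1a hash, the hashID helper, and the use of plain string IDs are all assumptions made for illustration, and the FromPB/ToPB conversions are omitted.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Checksum is a hypothetical set checksum. Every field is updated with a
// commutative, invertible operation, so the result does not depend on the
// order in which members are added or removed.
type Checksum struct {
	Count uint64 // number of members
	XOR   uint64 // XOR of all member hashes
	Sum   uint64 // wrapping sum of all member hashes
}

// hashID maps an ID to a 64-bit value with FNV-1a (an assumed choice).
func hashID(id string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(id))
	return h.Sum64()
}

// Add folds one member into the checksum in O(1).
func (c *Checksum) Add(id string) {
	h := hashID(id)
	c.Count++
	c.XOR ^= h
	c.Sum += h
}

// Remove undoes a previous Add of the same member in O(1).
func (c *Checksum) Remove(id string) {
	h := hashID(id)
	c.Count--
	c.XOR ^= h
	c.Sum -= h
}

// Equal reports whether two checksums match field by field.
func (c Checksum) Equal(o Checksum) bool {
	return c == o
}

func main() {
	var expected, actual Checksum

	// The same members, added in a different order.
	expected.Add("dispatcher-1")
	expected.Add("dispatcher-2")
	actual.Add("dispatcher-2")
	actual.Add("dispatcher-1")
	fmt.Println(expected.Equal(actual)) // true

	// One side drops a member; the member counts now differ.
	actual.Remove("dispatcher-1")
	fmt.Println(expected.Equal(actual)) // false
}
```

In this sketch a delta (a batch of added and removed IDs) is applied by calling Add and Remove for each ID, so the expected checksum can be maintained without rescanning the whole set. A matching checksum is strong evidence that two sets agree, not a proof, since hash collisions are possible.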

Activity

A bot commented about the PR needing approval and a linked issue.
Gemini Code Assist provided an initial summary and highlights of the changes.
Gemini Code Assist provided several review comments suggesting refactoring for code duplication and complexity.
Automated tests failed, including pull-unit-test-next-gen, pull-unit-test, and pull-error-log-review.
The author requested another review and summary from Gemini Code Assist.

hongyunyan · 2025-12-26T14:44:39Z

/test pull-cdc-mysql-integration-light

hongyunyan · 2025-12-26T14:53:50Z

/test pull-cdc-mysql-integration-light

ti-chi-bot · 2025-12-26T15:04:36Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

hongyunyan · 2025-12-27T13:40:28Z

/test all

hongyunyan · 2025-12-28T04:03:49Z

/retest

ti-chi-bot · 2025-12-28T04:04:39Z

@hongyunyan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-error-log-review | `14c6058` | link | true | `/test pull-error-log-review` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

update

67c5b01

ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note, and do-not-merge/work-in-progress labels Dec 24, 2025

ti-chi-bot bot added the size/XXL label (a PR that changes 1000+ lines, ignoring generated files) Dec 24, 2025

gemini-code-assist bot reviewed Dec 24, 2025

hongyunyan added 5 commits December 25, 2025 21:28

update

31dadb8

update

b9b4a96

update

bfd266b

update

f007405

update

20e301c

gemini-code-assist bot reviewed Dec 26, 2025

update

3de398d

update

14c6058
