
Conversation

@hongyunyan
Collaborator

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

@ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note (denotes a PR that will be considered when it comes time to generate release notes), and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels on Dec 24, 2025
@ti-chi-bot

ti-chi-bot bot commented Dec 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 24, 2025
@gemini-code-assist

Summary of Changes

Hello @hongyunyan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a critical dispatcher set checksum mechanism to enhance the reliability of watermark advancement. By introducing a robust verification process, the system can now prevent checkpoint progression when there are discrepancies between the expected and actual dispatcher configurations on individual captures. This ensures data consistency and stability, particularly in dynamic environments where dispatchers might be added, removed, or moved. The changes span the dispatcher and maintainer components, incorporating new communication protocols, state management, and comprehensive monitoring capabilities.

Highlights

  • Dispatcher Set Checksum Mechanism: Introduced a new dispatcher set checksum mechanism to ensure the consistency of dispatcher sets across the system. This mechanism prevents watermark advancement if the local dispatcher set on a capture does not match the maintainer's expected set.
  • Watermark Advancement Gating: Watermark advancement (both regular and redo) is now gated by the dispatcher set checksum state. If the checksum is 'UNINITIALIZED' or 'MISMATCH', watermark reporting is suppressed to prevent incorrect progress (see the sketch after this list).
  • Checksum State Management: New protobuf messages (ChecksumState, DispatcherSetChecksumFingerprint, DispatcherSetChecksumUpdate, DispatcherSetChecksumAck) and corresponding Go structs/logic have been added to manage and communicate checksum states between maintainers and dispatchers.
  • Operator Integration: All dispatcher-related operators (Add, Move, Merge, Remove, Split) have been updated to interact with a DispatcherSetChecksumUpdater interface. This ensures that the maintainer's expected dispatcher set is correctly updated whenever dispatchers are added, removed, or modified by an operator.
  • Monitoring and Logging: New Prometheus metrics (DispatcherManagerDispatcherSetChecksumNotOKGauge, DispatcherManagerDispatcherSetChecksumNotOKTotal) are introduced to monitor the dispatcher set checksum states. Detailed logging is also added for non-OK checksum states, including warnings for prolonged inconsistencies.
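
To make the gating rule in the second highlight concrete, here is a small, self-contained Go sketch. All names in it are illustrative stand-ins (the PR's actual logic lives in applyChecksumStateToHeartbeat and uses the heartbeatpb types), and only the single watermark field relevant to gating is modeled.

package main

import "fmt"

// ChecksumState mirrors the three states named in the highlights.
type ChecksumState int

const (
	Uninitialized ChecksumState = iota
	Mismatch
	OK
)

// heartbeat is a stand-in for the real HeartBeatRequest; only the piece
// relevant to gating is modeled.
type heartbeat struct {
	watermarkTs uint64 // 0 means "no watermark reported"
}

// applyChecksumGate suppresses watermark reporting unless the local
// dispatcher set matches the maintainer's expectation.
func applyChecksumGate(state ChecksumState, hb *heartbeat) {
	if state != OK {
		hb.watermarkTs = 0 // the maintainer will not advance the checkpoint
	}
}

func main() {
	hb := &heartbeat{watermarkTs: 42}
	applyChecksumGate(Mismatch, hb)
	fmt.Println(hb.watermarkTs) // 0: reporting suppressed
}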


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a checksum mechanism to ensure the consistency of the dispatcher set between the maintainer and dispatcher managers. This is a significant improvement for data integrity and safety against scheduling anomalies. The implementation is thorough, covering initialization, incremental updates, and safety checks to halt checkpoint advancement on mismatches. The changes are well-integrated across various components. I have a few suggestions to refactor some parts of the new code to improve readability and reduce duplication.

Comment on lines 990 to 999
for _, state := range []string{"mismatch", "uninitialized"} {
	metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
	metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
}
if e.RedoEnable {
	for _, state := range []string{"mismatch", "uninitialized"} {
		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
	}
}


medium

The logic for deleting metrics for 'default' and 'redo' modes is duplicated. This can be refactored to improve maintainability by iterating over a list of modes. For example:

	modes := []string{"default"}
	if e.RedoEnable {
		modes = append(modes, "redo")
	}
	for _, mode := range modes {
		for _, state := range []string{"mismatch", "uninitialized"} {
			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
		}
	}

Comment on lines 207 to 352
func (e *DispatcherManager) verifyDispatcherSetChecksum(mode int64, actual dispatcherSetFingerprint) heartbeatpb.ChecksumState {
	now := time.Now()
	capture := appcontext.GetID()
	modeLabel := common.StringMode(mode)
	keyspace := e.changefeedID.Keyspace()
	changefeed := e.changefeedID.Name()

	var (
		state           heartbeatpb.ChecksumState
		expectedSeq     uint64
		expectedInit    bool
		expectedFP      dispatcherSetFingerprint
		oldState        heartbeatpb.ChecksumState
		nonOKSince      time.Time
		needGaugeUpdate bool
		logRecovered    bool
		logNotOKWarn    bool
		logNotOKError   bool
		recoveredFor    time.Duration
		notOKFor        time.Duration
	)

	e.dispatcherSetChecksum.mu.Lock()
	expected := &e.dispatcherSetChecksum.defaultExpected
	runtime := &e.dispatcherSetChecksum.defaultRuntime
	if common.IsRedoMode(mode) {
		expected = &e.dispatcherSetChecksum.redoExpected
		runtime = &e.dispatcherSetChecksum.redoRuntime
	}

	expectedSeq = expected.seq
	expectedInit = expected.initialized
	expectedFP = expected.fingerprint

	if !expected.initialized {
		state = heartbeatpb.ChecksumState_UNINITIALIZED
	} else if !actual.equal(expected.fingerprint) {
		state = heartbeatpb.ChecksumState_MISMATCH
	} else {
		state = heartbeatpb.ChecksumState_OK
	}

	oldState = runtime.state
	nonOKSince = runtime.nonOKSince
	needGaugeUpdate = !runtime.gaugeInitialized || oldState != state

	const (
		errorAfter    = 30 * time.Second
		errorInterval = 30 * time.Second
	)

	if state == heartbeatpb.ChecksumState_OK {
		if oldState != heartbeatpb.ChecksumState_OK && !runtime.nonOKSince.IsZero() {
			logRecovered = true
			recoveredFor = now.Sub(runtime.nonOKSince)
		}
		runtime.state = state
		runtime.nonOKSince = time.Time{}
		runtime.lastErrorLogTime = time.Time{}
	} else {
		needResetTimer := oldState == heartbeatpb.ChecksumState_OK || oldState != state
		if needResetTimer || runtime.nonOKSince.IsZero() {
			runtime.nonOKSince = now
			runtime.lastErrorLogTime = time.Time{}
			logNotOKWarn = true
		} else {
			notOKFor = now.Sub(runtime.nonOKSince)
			if notOKFor >= errorAfter && now.Sub(runtime.lastErrorLogTime) >= errorInterval {
				runtime.lastErrorLogTime = now
				logNotOKError = true
			}
		}
		runtime.state = state
		nonOKSince = runtime.nonOKSince
	}
	runtime.gaugeInitialized = true
	e.dispatcherSetChecksum.mu.Unlock()

	if needGaugeUpdate {
		setGauge := func(stateLabel string, value float64) {
			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.WithLabelValues(
				keyspace, changefeed, capture, modeLabel, stateLabel,
			).Set(value)
		}

		setGauge("mismatch", 0)
		setGauge("uninitialized", 0)

		if state != heartbeatpb.ChecksumState_OK {
			stateLabel := "mismatch"
			if state == heartbeatpb.ChecksumState_UNINITIALIZED {
				stateLabel = "uninitialized"
			}
			setGauge(stateLabel, 1)
		}
	}

	if logRecovered {
		log.Info("dispatcher set checksum recovered",
			zap.Stringer("changefeedID", e.changefeedID),
			zap.String("capture", capture),
			zap.String("mode", modeLabel),
			zap.Duration("duration", recoveredFor),
			zap.Uint64("expectedSeq", expectedSeq),
		)
	}

	if logNotOKWarn || logNotOKError {
		level := "warn"
		if logNotOKError {
			level = "error"
		}
		stateStr := "mismatch"
		if state == heartbeatpb.ChecksumState_UNINITIALIZED {
			stateStr = "uninitialized"
		}
		notOKFor = now.Sub(nonOKSince)
		fields := []zap.Field{
			zap.Stringer("changefeedID", e.changefeedID),
			zap.String("capture", capture),
			zap.String("mode", modeLabel),
			zap.String("state", stateStr),
			zap.Duration("duration", notOKFor),
			zap.Uint64("expectedSeq", expectedSeq),
			zap.Bool("expectedInitialized", expectedInit),
			zap.Uint64("actualCount", actual.count),
			zap.Uint64("actualXorHigh", actual.xorHigh),
			zap.Uint64("actualXorLow", actual.xorLow),
			zap.Uint64("actualSumHigh", actual.sumHigh),
			zap.Uint64("actualSumLow", actual.sumLow),
			zap.Uint64("expectedCount", expectedFP.count),
			zap.Uint64("expectedXorHigh", expectedFP.xorHigh),
			zap.Uint64("expectedXorLow", expectedFP.xorLow),
			zap.Uint64("expectedSumHigh", expectedFP.sumHigh),
			zap.Uint64("expectedSumLow", expectedFP.sumLow),
			zap.String("prevState", oldState.String()),
		}
		if level == "error" {
			log.Error("dispatcher set checksum not ok, skip watermark reporting", fields...)
		} else {
			log.Warn("dispatcher set checksum not ok, skip watermark reporting", fields...)
		}
	}

	return state
}


medium

The verifyDispatcherSetChecksum function is quite long and complex, which can make it difficult to understand and maintain. Consider refactoring it by extracting parts of the logic into smaller, well-named helper functions. For example, the metric update logic (lines 285-302) and the logging logic (lines 304-349) could be moved to separate functions.
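
As one illustration of that extraction, the gauge bookkeeping could be isolated along the following lines. This is a hypothetical sketch with stand-in types so it compiles on its own, not the PR's actual code:

package main

import "fmt"

type checksumState int

const (
	stateUninitialized checksumState = iota
	stateMismatch
	stateOK
)

func (s checksumState) label() string {
	switch s {
	case stateUninitialized:
		return "uninitialized"
	case stateMismatch:
		return "mismatch"
	default:
		return "ok"
	}
}

// updateChecksumGauge isolates the gauge bookkeeping: both not-OK labels
// are cleared first, then the label matching the current state is raised.
func updateChecksumGauge(set func(stateLabel string, value float64), state checksumState) {
	set("mismatch", 0)
	set("uninitialized", 0)
	if state != stateOK {
		set(state.label(), 1)
	}
}

func main() {
	// In the real code, set would wrap the Prometheus gauge's
	// WithLabelValues(...).Set(...) call shown above.
	updateChecksumGauge(func(label string, v float64) {
		fmt.Printf("gauge[%s] = %v\n", label, v)
	}, stateMismatch)
}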

Comment on lines 142 to 208
for capture, ids := range defaultExpected {
	state, ok := m.defaultNodes[capture]
	if !ok {
		continue
	}
	for _, id := range ids {
		oldCapture, exists := m.defaultDispatcherToNode[id]
		if exists {
			if oldCapture == capture {
				log.Warn("dispatcher already exists in expected set, ignore it",
					zap.Stringer("changefeedID", m.changefeedID),
					zap.String("dispatcherID", id.String()),
					zap.String("capture", capture.String()),
					zap.String("mode", common.StringMode(common.DefaultMode)),
				)
				continue
			}
			log.Warn("dispatcher exists in another capture, override expected node",
				zap.Stringer("changefeedID", m.changefeedID),
				zap.String("dispatcherID", id.String()),
				zap.String("oldCapture", oldCapture.String()),
				zap.String("newCapture", capture.String()),
				zap.String("mode", common.StringMode(common.DefaultMode)),
			)
			if oldState, ok := m.defaultNodes[oldCapture]; ok {
				oldState.fingerprint.remove(id)
			}
		}
		m.defaultDispatcherToNode[id] = capture
		state.fingerprint.add(id)
	}
}

if m.redoEnabled {
	for capture, ids := range redoExpected {
		state, ok := m.redoNodes[capture]
		if !ok {
			continue
		}
		for _, id := range ids {
			oldCapture, exists := m.redoDispatcherToNode[id]
			if exists {
				if oldCapture == capture {
					log.Warn("dispatcher already exists in expected set, ignore it",
						zap.Stringer("changefeedID", m.changefeedID),
						zap.String("dispatcherID", id.String()),
						zap.String("capture", capture.String()),
						zap.String("mode", common.StringMode(common.RedoMode)),
					)
					continue
				}
				log.Warn("dispatcher exists in another capture, override expected node",
					zap.Stringer("changefeedID", m.changefeedID),
					zap.String("dispatcherID", id.String()),
					zap.String("oldCapture", oldCapture.String()),
					zap.String("newCapture", capture.String()),
					zap.String("mode", common.StringMode(common.RedoMode)),
				)
				if oldState, ok := m.redoNodes[oldCapture]; ok {
					oldState.fingerprint.remove(id)
				}
			}
			m.redoDispatcherToNode[id] = capture
			state.fingerprint.add(id)
		}
	}
}


medium

The logic for building the expected fingerprints for default and redo modes is duplicated. This could be extracted into a helper function to improve code clarity and reduce duplication. The helper function could take the mode-specific maps as arguments and handle the fingerprint calculation.
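
One possible shape for that helper, sketched with deliberately simplified stand-in types (the real fingerprint is an incrementally updatable checksum rather than a set, and the real code emits the zap warnings shown above on each override):

package main

import "fmt"

type dispatcherID string
type captureID string

// fingerprint is a stand-in for the real checksum-based fingerprint.
type fingerprint map[dispatcherID]struct{}

func (f fingerprint) add(id dispatcherID)    { f[id] = struct{}{} }
func (f fingerprint) remove(id dispatcherID) { delete(f, id) }

// buildExpectedFingerprints applies one mode's expected assignment to the
// per-capture fingerprints, overriding stale ownership when a dispatcher
// shows up on a different capture.
func buildExpectedFingerprints(
	expected map[captureID][]dispatcherID,
	fingerprints map[captureID]fingerprint,
	dispatcherToNode map[dispatcherID]captureID,
) {
	for capture, ids := range expected {
		fp, ok := fingerprints[capture]
		if !ok {
			continue // capture unknown to this maintainer, skip
		}
		for _, id := range ids {
			if old, exists := dispatcherToNode[id]; exists {
				if old == capture {
					continue // already recorded on this capture
				}
				if oldFP, ok := fingerprints[old]; ok {
					oldFP.remove(id) // drop the stale ownership
				}
			}
			dispatcherToNode[id] = capture
			fp.add(id)
		}
	}
}

func main() {
	fps := map[captureID]fingerprint{"c1": {}, "c2": {}}
	owner := map[dispatcherID]captureID{}
	buildExpectedFingerprints(map[captureID][]dispatcherID{"c1": {"d1"}}, fps, owner)
	fmt.Println(owner["d1"]) // c1
}

The caller would then invoke the helper once with the default-mode maps and, when redo is enabled, once more with the redo-mode maps.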

Comment on lines 1003 to 1018
defaultExpected := make(map[node.ID][]common.DispatcherID, len(nodes))
defaultExpected[m.selfNode.ID] = append(defaultExpected[m.selfNode.ID], m.ddlSpan.ID)
for _, span := range m.controller.spanController.GetReplicating() {
	capture := span.GetNodeID()
	defaultExpected[capture] = append(defaultExpected[capture], span.ID)
}

var redoExpected map[node.ID][]common.DispatcherID
if m.enableRedo {
	redoExpected = make(map[node.ID][]common.DispatcherID, len(nodes))
	redoExpected[m.selfNode.ID] = append(redoExpected[m.selfNode.ID], m.redoDDLSpan.ID)
	for _, span := range m.controller.redoSpanController.GetReplicating() {
		capture := span.GetNodeID()
		redoExpected[capture] = append(redoExpected[capture], span.ID)
	}
}


medium

The logic for building defaultExpected and redoExpected is very similar. This could be refactored into a helper function to avoid code duplication and improve readability. The function could take the mode and the corresponding span controller as arguments and return the expected dispatcher map.
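
A sketch of what that helper might look like, assuming a minimal interface standing in for the span controller; every name here is illustrative:

package main

import "fmt"

type dispatcherID string
type nodeID string

type span struct {
	id   dispatcherID
	node nodeID
}

// replicatingLister abstracts the one method the helper needs from a
// span controller.
type replicatingLister interface {
	GetReplicating() []span
}

type fakeController []span

func (f fakeController) GetReplicating() []span { return f }

// buildExpected collects the DDL span plus every replicating span, keyed
// by the capture that owns it.
func buildExpected(self nodeID, ddlSpan dispatcherID, ctrl replicatingLister) map[nodeID][]dispatcherID {
	expected := map[nodeID][]dispatcherID{self: {ddlSpan}}
	for _, s := range ctrl.GetReplicating() {
		expected[s.node] = append(expected[s.node], s.id)
	}
	return expected
}

func main() {
	// The caller would build defaultExpected from the default span
	// controller and, when redo is enabled, redoExpected from the redo one.
	ctrl := fakeController{{id: "d1", node: "n2"}}
	fmt.Println(buildExpected("n1", "ddl", ctrl))
}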

@hongyunyan
Collaborator Author

/gemini review

@hongyunyan
Collaborator Author

/gemini summary


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a checksum mechanism to verify the consistency of the dispatcher set between the maintainer and dispatcher managers. This is a crucial feature for ensuring data integrity and preventing incorrect checkpoint advancement, especially during dispatcher lifecycle changes. The implementation is comprehensive, covering both the maintainer and dispatcher manager sides, with new protocol messages, metrics, and tests. The logic for incremental updates, acknowledgements, and resends appears robust. Operators are correctly updated to maintain the expected dispatcher set.

One minor suggestion for code improvement is to refactor a small piece of duplicated code in the cleanMetrics function to improve maintainability.

Comment on lines 989 to 998
for _, state := range []string{"mismatch", "uninitialized"} {
metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
}
if e.RedoEnable {
for _, state := range []string{"mismatch", "uninitialized"} {
metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
}
}


medium

The logic for deleting checksum-related metrics for default and redo modes is duplicated. This can be refactored into a loop over the modes to improve readability and maintainability.

Suggested change
Before:

	for _, state := range []string{"mismatch", "uninitialized"} {
		metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
		metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "default", state)
	}
	if e.RedoEnable {
		for _, state := range []string{"mismatch", "uninitialized"} {
			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, "redo", state)
		}
	}

After:

	modes := []string{"default"}
	if e.RedoEnable {
		modes = append(modes, "redo")
	}
	for _, mode := range modes {
		for _, state := range []string{"mismatch", "uninitialized"} {
			metrics.DispatcherManagerDispatcherSetChecksumNotOKGauge.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
			metrics.DispatcherManagerDispatcherSetChecksumNotOKTotal.DeleteLabelValues(keyspace, changefeed, capture, mode, state)
		}
	}

@hongyunyan
Collaborator Author

/test all

@gemini-code-assist

Summary of Changes

This pull request introduces a robust dispatcher set checksum mechanism to enhance the reliability of watermark advancement in the system. By verifying the consistency of dispatcher configurations on individual captures against the maintainer's expectations, it prevents checkpoint progression in the event of discrepancies. This ensures data integrity and system stability, especially in dynamic environments where dispatcher assignments change. The changes involve new communication protocols, state management, and comprehensive monitoring across both dispatcher and maintainer components. A minimal sketch of the maintainer-side checkpoint gate appears after the highlights below.

Highlights

  • Dispatcher Set Checksum Mechanism: A new dispatcher set checksum mechanism has been introduced to ensure the consistency of dispatcher sets across the system. This mechanism prevents watermark advancement if the local dispatcher set on a capture does not match the maintainer's expected set.
  • Watermark Advancement Gating: Watermark advancement (both regular and redo) is now gated by the dispatcher set checksum state. If the checksum is 'UNINITIALIZED' or 'MISMATCH', watermark reporting is suppressed to prevent incorrect progress.
  • Checksum State Management: New protobuf messages (ChecksumState, DispatcherSetChecksum, DispatcherSetChecksumUpdate, DispatcherSetChecksumAck) and corresponding Go structs/logic have been added to manage and communicate checksum states between maintainers and dispatchers.
  • Operator Integration: All dispatcher-related operators (Add, Move, Merge, Remove, Split) have been updated to interact with a DispatcherSetChecksumUpdater interface. This ensures that the maintainer's expected dispatcher set is correctly updated whenever dispatchers are added, removed, or modified by an operator.
  • Monitoring and Logging: New Prometheus metrics (DispatcherManagerDispatcherSetChecksumNotOKGauge, DispatcherManagerDispatcherSetChecksumNotOKTotal) are introduced to monitor the dispatcher set checksum states. Detailed logging is also added for non-OK checksum states, including warnings for prolonged inconsistencies.
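
To make the maintainer-side gate concrete, here is a minimal, self-contained sketch. It assumes a per-capture state map in the spirit of checksumStateByCapture from the changelog below; the function name and types are illustrative, not the PR's code:

package main

import "fmt"

type ChecksumState int

const (
	Uninitialized ChecksumState = iota
	Mismatch
	OK
)

// canAdvanceCheckpoint reports whether every capture has confirmed an OK
// checksum; a single uninitialized or mismatched capture holds the
// checkpoint back.
func canAdvanceCheckpoint(stateByCapture map[string]ChecksumState) bool {
	for _, state := range stateByCapture {
		if state != OK {
			return false
		}
	}
	return true
}

func main() {
	states := map[string]ChecksumState{"capture-1": OK, "capture-2": Mismatch}
	fmt.Println(canAdvanceCheckpoint(states)) // false: checkpoint held back
}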


Changelog
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
    • Imported pkg/set_checksum for checksum functionality.
    • Added dispatcherSetChecksum field to DispatcherManager to store expected checksum states.
    • Modified collectComponentStatusWhenChanged to compute and apply checksums to heartbeat messages.
    • Modified aggregateDispatcherHeartbeats to compute checksums for default and redo modes and apply them to heartbeat messages.
    • Added logic to cleanMetrics to delete new checksum-related Prometheus metrics upon cleanup.
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
    • Imported github.com/pingcap/log and go.uber.org/zap for logging.
    • Replaced SetMaintainerID with UpdateMaintainer to also track the maintainer epoch and reset dispatcher set checksums on changes.
  • downstreamadapter/dispatchermanager/dispatcher_set_checksum.go
    • Added new file to define structs (dispatcherSetChecksumExpected, dispatcherSetChecksumRuntime, dispatcherSetChecksumState) for tracking and managing checksum states.
    • Implemented ApplyDispatcherSetChecksumUpdate to process checksum updates from the Maintainer, handling reordering.
    • Implemented ResetDispatcherSetChecksum to clear all checksum states.
    • Implemented shouldIncludeDispatcherInChecksum to filter dispatcher states for checksum calculation.
    • Implemented computeDispatcherSetChecksum to calculate the checksum of active dispatchers.
    • Implemented applyChecksumStateToHeartbeat to verify checksums and suppress watermark reporting if states are not OK.
    • Implemented incDispatcherSetChecksumNotOKTotal to increment a counter for suppressed watermarks.
    • Implemented verifyDispatcherSetChecksum to compare actual and expected checksums, update runtime state, and emit throttled logs.
    • Included helper functions for managing checksum states and metrics.
  • downstreamadapter/dispatchermanager/dispatcher_set_checksum_test.go
    • Added new file with unit tests for DispatcherSetChecksumWatermarkSuppression, covering uninitialized, mismatch, and OK states.
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
    • Added handling for DispatcherSetChecksumUpdate messages in handleMessages.
    • Modified handleBootstrapRequest to call manager.UpdateMaintainer.
    • Added handleDispatcherSetChecksumUpdate to process checksum updates from the Maintainer, including validation and sending DispatcherSetChecksumAck.
  • heartbeatpb/heartbeat.pb.go
    • Added ChecksumState enum (UNINITIALIZED, MISMATCH, OK).
    • Added ChecksumState and RedoChecksumState fields to HeartBeatRequest.
    • Added DispatcherSetChecksum message (count, xorHigh, xorLow, sumHigh, sumLow).
    • Added DispatcherSetChecksumUpdate message (changefeedID, epoch, mode, seq, checksum).
    • Added DispatcherSetChecksumAck message (changefeedID, epoch, mode, seq).
    • Updated MarshalToSizedBuffer, Size, and Unmarshal methods to include new fields and messages.
    • Updated init() function to register new enums and types.
  • heartbeatpb/heartbeat.proto
    • Added ChecksumState enum.
    • Added checksumState and redoChecksumState fields to HeartBeatRequest.
    • Added DispatcherSetChecksum, DispatcherSetChecksumUpdate, and DispatcherSetChecksumAck messages.
  • maintainer/barrier_event_test.go
    • Updated operator.NewOperatorController calls to include nil for the new checksumUpdater argument in test setups.
  • maintainer/barrier_test.go
    • Updated operator.NewOperatorController calls to include nil for the new checksumUpdater argument in test setups.
  • maintainer/capture_set_checksum_manager.go
    • Added new file defining captureSetChecksumState and captureChecksumState structs for managing per-capture checksums.
    • Implemented captureSetChecksumManager to maintain maintainer-side expected dispatcher IDs for a single mode.
    • Provided methods for initializing, updating, flushing, acknowledging, resending, and observing checksum states.
  • maintainer/capture_set_checksum_test.go
    • Added new file with unit tests for captureSetChecksumManager, including tests for checkpoint gating and resend/ack logic.
    • Included a recordingMessageCenter mock for testing message sending.
  • maintainer/maintainer.go
    • Added defaultChecksumManager and redoChecksumManager fields to Maintainer struct.
    • Added checksumStateByCapture and redoChecksumStateByCapture to store observed checksum states.
    • Initialized defaultChecksumManager and redoChecksumManager in NewMaintainer.
    • Modified onMessage to handle TypeDispatcherSetChecksumAck messages.
    • Updated onNodeChanged to delete checksum states for removed nodes and call RemoveNodes on checksum managers.
    • Modified advanceRedoMetaTsOnce to check redoChecksumStateByCapture before advancing redo checkpoint.
    • Modified calculateNewCheckpointTs to check checksumStateByCapture before advancing checkpoint.
    • Updated onHeartbeatRequest to record ChecksumState and RedoChecksumState from heartbeats and call ObserveHeartbeat on checksum managers.
    • Modified onBootstrapResponses to receive and send checksum messages from controller.FinishBootstrap.
    • Updated handleResendMessage to flush and resend pending checksum updates from checksum managers.
  • maintainer/maintainer_controller.go
    • Added defaultChecksumManager and redoChecksumManager fields to Controller struct.
    • Modified NewController to accept DispatcherSetChecksumUpdater interfaces for checksum managers and pass them to operator.NewOperatorController.
    • Updated initializeComponents to return checksum messages and calls resetAndBuildDispatcherSetChecksumMessages.
    • Added resetAndBuildDispatcherSetChecksumMessages and resetAndBuildChecksumMessages to initialize and build checksum update messages.
  • maintainer/maintainer_controller_bootstrap.go
    • Imported pkg/messaging.
    • Modified FinishBootstrap to return a slice of *messaging.TargetMessage for checksum updates.
    • Updated calls to initializeComponents to handle the returned checksum messages.
  • maintainer/maintainer_controller_helper.go
    • Updated operator.NewSplitDispatcherOperator call to include operatorController.GetChecksumUpdater().
  • maintainer/maintainer_controller_test.go
    • Updated NewController calls to include testChecksumUpdater{} for the new checksumUpdater arguments in test setups.
    • Updated s.FinishBootstrap calls to handle the new checksumMsgs return value.
  • maintainer/maintainer_helper.go
    • Added ChecksumStateCaptureMap struct and associated methods (newChecksumStateCaptureMap, Get, Set, Delete) to manage heartbeatpb.ChecksumState per node.
  • maintainer/maintainer_manager.go
    • Added a case in recvMessages to handle TypeDispatcherSetChecksumAck messages.
  • maintainer/maintainer_test.go
    • Added a case in mockDispatcherManager.handleMessage to handle TypeDispatcherSetChecksumUpdate messages.
    • Added onDispatcherSetChecksumUpdate method to mockDispatcherManager to send DispatcherSetChecksumAck.
    • Updated mockDispatcherManager.recvMessages to include TypeDispatcherSetChecksumUpdate.
    • Modified mockDispatcherManager.onDispatchRequest and sendHeartbeat to include ChecksumState and RedoChecksumState in HeartBeatRequest.
  • maintainer/operator/checksum_updater.go
    • Added new file defining DispatcherSetChecksumUpdater interface with an ApplyDelta method for updating expected dispatcher sets.
  • maintainer/operator/operator_add.go
    • Added checksumUpdater field to AddDispatcherOperator struct.
    • Modified NewAddDispatcherOperator to accept DispatcherSetChecksumUpdater.
    • Updated PostFinish to call m.checksumUpdater.ApplyDelta when a span is marked replicating.
  • maintainer/operator/operator_add_test.go
    • Updated NewAddDispatcherOperator calls to include testChecksumUpdater{}.
  • maintainer/operator/operator_controller.go
    • Added checksumUpdater field to Controller struct.
    • Modified NewOperatorController to accept DispatcherSetChecksumUpdater.
    • Added GetChecksumUpdater method.
    • Updated RemoveTasksBySchemaID and RemoveTasksByTableIDs to pass oc.GetChecksumUpdater() to newRemoveDispatcherOperator.
    • Modified NewMoveOperator to use NewMoveDispatcherOperator with oc.GetChecksumUpdater().
    • Modified AddMergeOperator to use NewMergeDispatcherOperator with oc.GetChecksumUpdater().
  • maintainer/operator/operator_controller_test.go
    • Updated NewOperatorController calls to include nil or testChecksumUpdater{}.
    • Updated newRemoveDispatcherOperator call to include testChecksumUpdater{}.
  • maintainer/operator/operator_merge.go
    • Added checksumUpdater field to MergeDispatcherOperator struct.
    • Modified NewMergeDispatcherOperator to accept DispatcherSetChecksumUpdater.
    • Updated PostFinish to call m.checksumUpdater.ApplyDelta for removed and new dispatchers.
  • maintainer/operator/operator_merge_test.go
    • Updated NewMergeDispatcherOperator calls to include testChecksumUpdater{}.
  • maintainer/operator/operator_move.go
    • Added checksumUpdater field to MoveDispatcherOperator struct.
    • Modified NewMoveDispatcherOperator to accept DispatcherSetChecksumUpdater.
    • Updated finishAsAbsent and PostFinish to call m.checksumUpdater.ApplyDelta.
  • maintainer/operator/operator_move_test.go
    • Updated NewMoveDispatcherOperator calls to include testChecksumUpdater{}.
  • maintainer/operator/operator_remove.go
    • Added checksumUpdater field to removeDispatcherOperator struct.
    • Modified newRemoveDispatcherOperator to accept DispatcherSetChecksumUpdater.
    • Updated PostFinish to call m.checksumUpdater.ApplyDelta.
  • maintainer/operator/operator_remove_test.go
    • Updated newRemoveDispatcherOperator calls to include testChecksumUpdater{}.
  • maintainer/operator/operator_split.go
    • Added checksumUpdater field to SplitDispatcherOperator struct.
    • Modified NewSplitDispatcherOperator to accept DispatcherSetChecksumUpdater.
    • Updated PostFinish to call m.checksumUpdater.ApplyDelta for the original dispatcher.
  • maintainer/operator/operator_split_test.go
    • Updated NewSplitDispatcherOperator calls to include testChecksumUpdater{}.
  • maintainer/operator/test_checksum_updater_test.go
    • Added new file defining testChecksumUpdater struct implementing DispatcherSetChecksumUpdater for testing purposes.
  • maintainer/scheduler/balance.go
    • Updated operator.NewSplitDispatcherOperator and operator.NewMoveDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
  • maintainer/scheduler/balance_splits.go
    • Updated operator.NewSplitDispatcherOperator and operator.NewMoveDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
  • maintainer/scheduler/basic.go
    • Updated operator.NewAddDispatcherOperator calls to include s.operatorController.GetChecksumUpdater().
  • maintainer/test_checksum_updater_test.go
    • Added new file defining testChecksumUpdater struct implementing DispatcherSetChecksumUpdater for testing purposes.
  • pkg/messaging/message.go
    • Added TypeDispatcherSetChecksumUpdate and TypeDispatcherSetChecksumAck to IOType enum.
    • Updated String() method for new types.
    • Updated decodeIOType and NewSingleTargetMessage to handle new message types.
  • pkg/metrics/dispatcher.go
    • Added DispatcherManagerDispatcherSetChecksumNotOKGauge and DispatcherManagerDispatcherSetChecksumNotOKTotal Prometheus metrics.
    • Updated initDispatcherMetrics to register new metrics.
  • pkg/set_checksum/set_checksum.go
    • Added new file defining Checksum struct for an order-independent, incrementally updatable checksum.
    • Implemented Add, Remove, Equal methods for checksum manipulation (a hedged sketch of this structure follows the changelog).
    • Implemented FromPB and ToPB methods for conversion to/from protobuf messages.
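
A hedged sketch of such an order-independent checksum follows. The five components match the fields listed for DispatcherSetChecksum above (count, xorHigh, xorLow, sumHigh, sumLow); splitting a dispatcher ID into two uint64 halves is an assumption about the real pkg/set_checksum:

package main

import "fmt"

// Checksum summarizes a set of 128-bit IDs. XOR is self-inverse and
// wrapping sums cancel exactly, so the result does not depend on the
// order of Add and Remove calls.
type Checksum struct {
	Count   uint64
	XorHigh uint64
	XorLow  uint64
	SumHigh uint64
	SumLow  uint64
}

// Add folds one member into the checksum.
func (c *Checksum) Add(high, low uint64) {
	c.Count++
	c.XorHigh ^= high
	c.XorLow ^= low
	c.SumHigh += high // wrapping add: overflow is intentional
	c.SumLow += low
}

// Remove cancels a prior Add of the same member.
func (c *Checksum) Remove(high, low uint64) {
	c.Count--
	c.XorHigh ^= high
	c.XorLow ^= low
	c.SumHigh -= high // wrapping subtract mirrors the add
	c.SumLow -= low
}

// Equal reports whether two checksums agree on all five components.
func (c Checksum) Equal(other Checksum) bool { return c == other }

func main() {
	var a, b Checksum
	a.Add(1, 2)
	a.Add(3, 4)
	b.Add(3, 4) // same members, different order
	b.Add(1, 2)
	fmt.Println(a.Equal(b)) // true
}
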
Activity
  • A bot commented about the PR needing approval and a linked issue.
  • Gemini Code Assist provided an initial summary and highlights of the changes.
  • Gemini Code Assist provided several review comments suggesting refactoring for code duplication and complexity.
  • Automated tests failed, including pull-unit-test-next-gen, pull-unit-test, and pull-error-log-review.
  • The author requested another review and summary from Gemini Code Assist.

@hongyunyan
Collaborator Author

/test pull-cdc-mysql-integration-light


@ti-chi-bot

ti-chi-bot bot commented Dec 26, 2025

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@hongyunyan
Collaborator Author

/test all

@hongyunyan
Collaborator Author

/retest

@ti-chi-bot

ti-chi-bot bot commented Dec 28, 2025

@hongyunyan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: pull-error-log-review
  • Commit: 14c6058
  • Details: link
  • Required: true
  • Rerun command: /test pull-error-log-review

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
