base: master
fix(*): merge operator inconsistent after maintainer move #3769
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request strengthens the robustness and consistency of dispatcher operations, particularly in distributed environments where maintainer nodes may restart or fail over. By explicitly tracking the different operator types and adding recovery logic for in-flight operations during bootstrap, the system can keep an accurate state and resume complex tasks such as merging dispatchers even after a disruption. This addresses inconsistencies that could previously arise from untracked operations, making the system more reliable and fault-tolerant.
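To make the tracking concrete: the idea is that the dispatcher side records every operation it has accepted but not yet finished, so a newly elected maintainer can learn about those operations from the bootstrap response instead of losing them. The sketch below is illustrative only; `Tracker`, `inflightOp`, and the method names are assumptions, not the PR's actual identifiers.

```go
package tracker

import "sync"

// OperatorType mirrors the protobuf enum this PR introduces
// (O_Add, O_Remove, O_Move, O_Split, O_Merge); the values here
// are illustrative.
type OperatorType int32

const (
	OpAdd OperatorType = iota
	OpRemove
	OpMove
	OpSplit
	OpMerge
)

// inflightOp is a hypothetical record of an operation that has been
// dispatched but not yet acknowledged as finished.
type inflightOp struct {
	DispatcherID string
	Type         OperatorType
}

// Tracker keeps the in-flight operators so they can be reported in
// the bootstrap response after a maintainer restart.
type Tracker struct {
	ops sync.Map // operator key -> inflightOp
}

func (t *Tracker) Start(key string, op inflightOp) { t.ops.Store(key, op) }

func (t *Tracker) Finish(key string) { t.ops.Delete(key) }

// Snapshot collects the in-flight operators for a bootstrap response.
func (t *Tracker) Snapshot() []inflightOp {
	var out []inflightOp
	t.ops.Range(func(_, v any) bool {
		out = append(out, v.(inflightOp))
		return true
	})
	return out
}
```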
/test pull-cdc-mysql-integration-heavy

/test pull-cdc-mysql-integration-light
Code Review
This pull request introduces a mechanism to track and restore in-flight dispatcher operations (add, remove, move, split, merge) during maintainer failover and bootstrap. Key changes include adding currentOperatorMap, redoCurrentOperatorMap, and mergeOperatorMap to DispatcherManager to store ongoing operations, and updating the protobuf definitions to include OperatorType and lists of in-flight operators in the MaintainerBootstrapResponse. The HeartBeatCollector now tracks merge operators, and the SchedulerDispatcherRequestHandler prevents concurrent operations on the same span by checking these new operator maps. During bootstrap, the maintainer now restores these in-flight operators. Review comments highlight that the OperatorType should be correctly propagated and not hardcoded, especially for move and split operations, and suggest simplifying the concurrent operator check logic by potentially unifying the currentOperatorMap and redoCurrentOperatorMap.
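For orientation, the bootstrap-response extension the summary describes might look roughly like the following hand-written stand-ins for the generated protobuf types; the actual field names in heartbeatpb may differ.

```go
package heartbeat

// OperatorType distinguishes the kind of in-flight operation.
// Illustrative stand-in for the protobuf enum added by this PR.
type OperatorType int32

const (
	OperatorTypeAdd OperatorType = iota
	OperatorTypeRemove
	OperatorTypeMove
	OperatorTypeSplit
	OperatorTypeMerge
)

// OperatorInfo describes one unfinished operation on a dispatcher.
type OperatorInfo struct {
	DispatcherID []byte
	Type         OperatorType
}

// MaintainerBootstrapResponse carries, besides the usual span
// statuses, the operations that were still in flight when the
// previous maintainer went away, so the new maintainer can restore
// them rather than lose them.
type MaintainerBootstrapResponse struct {
	ChangefeedID         string
	CurrentOperators     []OperatorInfo
	RedoCurrentOperators []OperatorInfo
	MergeOperators       []OperatorInfo
}
```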
```go
case heartbeatpb.ScheduleAction_Create:
	switch req.OperatorType {
	case heartbeatpb.OperatorType_O_Add, heartbeatpb.OperatorType_O_Move, heartbeatpb.OperatorType_O_Split:
		op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
```
When restoring an add operator, the original operator type from the request (req.OperatorType) should be preserved. Hardcoding OperatorType_O_Add here will cause move and split operators to be incorrectly restored as simple add operators, breaking the operator restoration logic.
Suggested change:

```diff
-op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
+op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, req.OperatorType)
```
```diff
 	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
 case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
```
The add and remove parts of a move operation should both be typed as O_Move. Using O_Add and O_Remove is incorrect and will break operator restoration logic on maintainer failover, as the new maintainer will not recognize these as parts of a single move operation.
Suggested change:

```diff
-	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
+	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Move)
 case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Move)
```
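To make the rationale concrete, here is a minimal sketch of a move-style operator whose two phases both report the same operator type, so a restarted maintainer can recognize them as halves of a single move. Every name below is an illustrative assumption, not the PR's actual code.

```go
package operator

// moveState enumerates the phases of a move; illustrative stand-ins
// for the PR's actual state constants.
type moveState int

const (
	moveStateAddDest moveState = iota
	moveStateRemoveOrigin
)

// OperatorType mirrors the protobuf enum; only O_Move matters here.
type OperatorType int32

const OperatorTypeMove OperatorType = 2

// Message is a placeholder for the scheduling message sent to a node.
type Message struct {
	Target string
	Action string
	Type   OperatorType
}

// moveOperator drives "create on dest, then remove on origin". Both
// messages carry OperatorTypeMove, so a new maintainer that finds
// either half in a bootstrap response can restore the whole move.
type moveOperator struct {
	state        moveState
	origin, dest string
}

func (m *moveOperator) nextMessage() Message {
	switch m.state {
	case moveStateAddDest:
		return Message{Target: m.dest, Action: "create", Type: OperatorTypeMove}
	case moveStateRemoveOrigin:
		return Message{Target: m.origin, Action: "remove", Type: OperatorTypeMove}
	}
	return Message{}
}
```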
```go
_, exists := dispatcherManager.currentOperatorMap.Load(operatorKey)
if exists {
	log.Warn("operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
_, redoExists := dispatcherManager.redoCurrentOperatorMap.Load(operatorKey)
if redoExists {
	log.Warn("redo operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
```
The logic to prevent concurrent operators on the same span or dispatcher is split between checking currentOperatorMap and redoCurrentOperatorMap. This could be simplified by using a single map for both, with a composite key or value to distinguish between redo and normal modes. This would reduce code duplication and make the logic easier to follow.
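A rough sketch of the unification this comment proposes, using the map value to carry the mode so a single lookup covers both cases; the names are hypothetical, not code from this PR:

```go
package handler

import "sync"

// opMode distinguishes normal and redo operations once both share a
// single tracking map.
type opMode int

const (
	modeNormal opMode = iota
	modeRedo
)

// operatorMap stands in for a merged currentOperatorMap and
// redoCurrentOperatorMap, keyed by the operator key.
type operatorMap struct {
	ops sync.Map // operator key -> opMode
}

// tryClaim registers the key if it is free. One LoadOrStore replaces
// the two separate existence checks in the handler above.
func (m *operatorMap) tryClaim(key string, mode opMode) bool {
	_, loaded := m.ops.LoadOrStore(key, mode)
	return !loaded // false: some operator (normal or redo) already holds the key
}

func (m *operatorMap) release(key string) { m.ops.Delete(key) }
```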
[FORMAT CHECKER NOTIFICATION]

📖 For more info, you can check the "Contribute Code" section in the development guide.
/test pull-cdc-mysql-integration-heavy

/test pull-cdc-mysql-integration-light
@wlwilliamx: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
None
Do you need to update user documentation, design documentation or monitoring documentation?
None
Release note