base: master
fix(*): merge operator inconsistent after maintainer move #3769
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Summary of Changes

Hello @wlwilliamx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request strengthens the robustness and consistency of dispatcher operations, particularly in distributed environments where maintainer nodes may restart or fail over. By explicitly tracking the different operator types and adding recovery logic for in-flight operations during bootstrap, the system can keep an accurate state and resume complex tasks such as merging dispatchers even after a disruption. This addresses inconsistencies that could previously arise from untracked operations, making the system more reliable and fault-tolerant.
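To make the tracking concrete: the idea is that the dispatcher side records every operation it has accepted but not yet finished, so a newly elected maintainer can learn about those operations from the bootstrap response instead of losing them. The sketch below is illustrative only; `Tracker`, `inflightOp`, and the method names are assumptions, not the PR's actual identifiers.

```go
package tracker

import "sync"

// OperatorType mirrors the protobuf enum this PR introduces
// (O_Add, O_Remove, O_Move, O_Split, O_Merge); the values here
// are illustrative.
type OperatorType int32

const (
	OpAdd OperatorType = iota
	OpRemove
	OpMove
	OpSplit
	OpMerge
)

// inflightOp is a hypothetical record of an operation that has been
// dispatched but not yet acknowledged as finished.
type inflightOp struct {
	DispatcherID string
	Type         OperatorType
}

// Tracker keeps the in-flight operators so they can be reported in
// the bootstrap response after a maintainer restart.
type Tracker struct {
	ops sync.Map // operator key -> inflightOp
}

func (t *Tracker) Start(key string, op inflightOp) { t.ops.Store(key, op) }

func (t *Tracker) Finish(key string) { t.ops.Delete(key) }

// Snapshot collects the in-flight operators for a bootstrap response.
func (t *Tracker) Snapshot() []inflightOp {
	var out []inflightOp
	t.ops.Range(func(_, v any) bool {
		out = append(out, v.(inflightOp))
		return true
	})
	return out
}
```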
/test pull-cdc-mysql-integration-heavy

/test pull-cdc-mysql-integration-light
Code Review
This pull request introduces a mechanism to track and restore in-flight dispatcher operations (add, remove, move, split, merge) during maintainer failover and bootstrap. Key changes include adding currentOperatorMap, redoCurrentOperatorMap, and mergeOperatorMap to DispatcherManager to store ongoing operations, and updating the protobuf definitions to include OperatorType and lists of in-flight operators in the MaintainerBootstrapResponse. The HeartBeatCollector now tracks merge operators, and the SchedulerDispatcherRequestHandler prevents concurrent operations on the same span by checking these new operator maps. During bootstrap, the maintainer now restores these in-flight operators. Review comments highlight that the OperatorType should be correctly propagated and not hardcoded, especially for move and split operations, and suggest simplifying the concurrent operator check logic by potentially unifying the currentOperatorMap and redoCurrentOperatorMap.
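For orientation, the bootstrap-response extension the summary describes might look roughly like the following hand-written stand-ins for the generated protobuf types; the actual field names in heartbeatpb may differ.

```go
package heartbeat

// OperatorType distinguishes the kind of in-flight operation.
// Illustrative stand-in for the protobuf enum added by this PR.
type OperatorType int32

const (
	OperatorTypeAdd OperatorType = iota
	OperatorTypeRemove
	OperatorTypeMove
	OperatorTypeSplit
	OperatorTypeMerge
)

// OperatorInfo describes one unfinished operation on a dispatcher.
type OperatorInfo struct {
	DispatcherID []byte
	Type         OperatorType
}

// MaintainerBootstrapResponse carries, besides the usual span
// statuses, the operations that were still in flight when the
// previous maintainer went away, so the new maintainer can restore
// them rather than lose them.
type MaintainerBootstrapResponse struct {
	ChangefeedID         string
	CurrentOperators     []OperatorInfo
	RedoCurrentOperators []OperatorInfo
	MergeOperators       []OperatorInfo
}
```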
```go
case heartbeatpb.ScheduleAction_Create:
	switch req.OperatorType {
	case heartbeatpb.OperatorType_O_Add, heartbeatpb.OperatorType_O_Move, heartbeatpb.OperatorType_O_Split:
		op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
```
When restoring an add operator, the original operator type from the request (req.OperatorType) should be preserved. Hardcoding OperatorType_O_Add here will cause move and split operators to be incorrectly restored as simple add operators, breaking the operator restoration logic.
Suggested change:

```diff
-op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, heartbeatpb.OperatorType_O_Add)
+op := operator.NewAddDispatcherOperator(spanController, replicaSet, node, req.OperatorType)
```
```diff
 	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
 case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
```
The add and remove parts of a move operation should both be typed as O_Move. Using O_Add and O_Remove is incorrect and will break operator restoration logic on maintainer failover, as the new maintainer will not recognize these as parts of a single move operation.
Suggested change:

```diff
-	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Add)
+	return m.replicaSet.NewAddDispatcherMessage(m.dest, heartbeatpb.OperatorType_O_Move)
 case moveStateRemoveOrigin, moveStateAbortRemoveOrigin:
-	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Remove)
+	return m.replicaSet.NewRemoveDispatcherMessage(m.origin, heartbeatpb.OperatorType_O_Move)
```
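To make the rationale concrete, here is a minimal sketch of a move-style operator whose two phases both report the same operator type, so a restarted maintainer can recognize them as halves of a single move. Every name below is an illustrative assumption, not the PR's actual code.

```go
package operator

// moveState enumerates the phases of a move; illustrative stand-ins
// for the PR's actual state constants.
type moveState int

const (
	moveStateAddDest moveState = iota
	moveStateRemoveOrigin
)

// OperatorType mirrors the protobuf enum; only O_Move matters here.
type OperatorType int32

const OperatorTypeMove OperatorType = 2

// Message is a placeholder for the scheduling message sent to a node.
type Message struct {
	Target string
	Action string
	Type   OperatorType
}

// moveOperator drives "create on dest, then remove on origin". Both
// messages carry OperatorTypeMove, so a new maintainer that finds
// either half in a bootstrap response can restore the whole move.
type moveOperator struct {
	state        moveState
	origin, dest string
}

func (m *moveOperator) nextMessage() Message {
	switch m.state {
	case moveStateAddDest:
		return Message{Target: m.dest, Action: "create", Type: OperatorTypeMove}
	case moveStateRemoveOrigin:
		return Message{Target: m.origin, Action: "remove", Type: OperatorTypeMove}
	}
	return Message{}
}
```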
```go
_, exists := dispatcherManager.currentOperatorMap.Load(operatorKey)
if exists {
	log.Warn("operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
_, redoExists := dispatcherManager.redoCurrentOperatorMap.Load(operatorKey)
if redoExists {
	log.Warn("redo operator key exists, skip this request",
		zap.String("changefeedID", req.ChangefeedID.String()),
		zap.String("dispatcherID", common.NewDispatcherIDFromPB(req.Config.DispatcherID).String()),
		zap.String("operatorKey", operatorKey),
		zap.Any("operator", req),
	)
	continue
}
```
The logic to prevent concurrent operators on the same span or dispatcher is split between checking currentOperatorMap and redoCurrentOperatorMap. This could be simplified by using a single map for both, with a composite key or value to distinguish between redo and normal modes. This would reduce code duplication and make the logic easier to follow.
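A rough sketch of the unification this comment proposes, using the map value to carry the mode so a single lookup covers both cases; the names are hypothetical, not code from this PR:

```go
package handler

import "sync"

// opMode distinguishes normal and redo operations once both share a
// single tracking map.
type opMode int

const (
	modeNormal opMode = iota
	modeRedo
)

// operatorMap stands in for a merged currentOperatorMap and
// redoCurrentOperatorMap, keyed by the operator key.
type operatorMap struct {
	ops sync.Map // operator key -> opMode
}

// tryClaim registers the key if it is free. One LoadOrStore replaces
// the two separate existence checks in the handler above.
func (m *operatorMap) tryClaim(key string, mode opMode) bool {
	_, loaded := m.ops.LoadOrStore(key, mode)
	return !loaded // false: some operator (normal or redo) already holds the key
}

func (m *operatorMap) release(key string) { m.ops.Delete(key) }
```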
[FORMAT CHECKER NOTIFICATION]

📖 For more info, you can check the "Contribute Code" section in the development guide.
/test pull-cdc-mysql-integration-heavy

/test pull-cdc-mysql-integration-light
@wlwilliamx: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
None
Do you need to update user documentation, design documentation or monitoring documentation?
None
Release note