eventservice,schemastore: avoid stalls after dispatcher reset #5554

Draft
asddongmen wants to merge 1 commit into pingcap:master from asddongmen:codex/fix-changefeed-stall-after-failover

@asddongmen

Collaborator

What problem does this PR solve?

Issue Number: close #5553

This PR addresses two stalls seen in the field:

  • Schema store bootstrap can be blocked by a stale schema-store GC keeper service left behind by a failed CDC process with the same advertise address.
  • A reset dispatcher can wait for the next eventstore notify before sending its handshake.

What is changed and how it works?

The schema store now closes any stale keeper service before reading the initial GC safe point and installing its fresh barrier.
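
A minimal sketch of that ordering, under stated assumptions: `keeperRegistry`, `closeService`, `bootstrap`, and the service ID string are hypothetical names for an in-memory toy, not the schema store or PD client API. It only illustrates the sequence "close stale service, read safe point, install fresh barrier".

```go
package main

import "fmt"

// keeperRegistry is a hypothetical in-memory stand-in for wherever GC keeper
// services are registered. It is not the real schema store or PD API.
type keeperRegistry struct {
	barriers    map[string]uint64 // service ID -> barrier timestamp
	gcSafePoint uint64            // current GC safe point in this toy model
}

// closeService removes the barrier for a service ID and reports whether
// a leftover entry existed.
func (r *keeperRegistry) closeService(id string) bool {
	_, ok := r.barriers[id]
	delete(r.barriers, id)
	return ok
}

// bootstrap follows the order described in the PR text.
func bootstrap(r *keeperRegistry, serviceID string) (safePoint uint64, removedStale bool) {
	removedStale = r.closeService(serviceID) // 1. drop any leftover service
	safePoint = r.gcSafePoint                // 2. read the initial GC safe point
	r.barriers[serviceID] = safePoint        // 3. install the fresh barrier
	return safePoint, removedStale
}

func main() {
	// A failed process with the same (hypothetical) service ID left a barrier at 100.
	r := &keeperRegistry{
		barriers:    map[string]uint64{"schema-store-host-a": 100},
		gcSafePoint: 500,
	}
	sp, removed := bootstrap(r, "schema-store-host-a")
	fmt.Println(sp, removed, r.barriers["schema-store-host-a"])
}
```

Because the cleanup is keyed by the service ID of the current advertise address, entries registered under other IDs are left untouched in this model.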

After eventBroker.resetDispatcher replaces the dispatcher state, it immediately checks scan readiness and pushes one scan task so ready/handshake/resolved messages do not depend on a later notify.
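
The reset path can be illustrated with a toy model. This is a sketch, not the real `eventBroker`: the `dispatcherStat` fields, the `kick` parameter, and the non-blocking `pushTask` here are assumptions made for illustration. Only the names `resetDispatcher`, `scanReady`, and `pushTask` come from the PR.

```go
package main

import "fmt"

// dispatcherStat is a hypothetical stand-in for the broker's per-dispatcher state.
type dispatcherStat struct {
	id         string
	epoch      uint64
	hasPending bool // hypothetical readiness flag
}

// broker is a toy model, not the real eventBroker.
type broker struct {
	stats    map[string]*dispatcherStat
	taskChan chan *dispatcherStat
}

func (b *broker) scanReady(s *dispatcherStat) bool { return s.hasPending }

// pushTask queues a scan task without blocking and reports whether it was queued.
func (b *broker) pushTask(s *dispatcherStat) bool {
	select {
	case b.taskChan <- s:
		return true
	default:
		return false
	}
}

// resetDispatcher replaces the state. With kick set, it schedules a scan
// right away instead of waiting for a later notify.
func (b *broker) resetDispatcher(id string, kick bool) *dispatcherStat {
	old := b.stats[id]
	newStat := &dispatcherStat{id: id, epoch: old.epoch + 1, hasPending: true}
	b.stats[id] = newStat
	if kick && b.scanReady(newStat) {
		b.pushTask(newStat)
	}
	return newStat
}

func main() {
	b := &broker{
		stats:    map[string]*dispatcherStat{"d1": {id: "d1", epoch: 1}},
		taskChan: make(chan *dispatcherStat, 4),
	}
	b.resetDispatcher("d1", false)
	fmt.Println("queued without kick:", len(b.taskChan)) // 0: waits for a notify
	b.resetDispatcher("d1", true)
	fmt.Println("queued with kick:", len(b.taskChan)) // 1: scan scheduled on reset
}
```

In this model, the reset without the kick leaves the task queue empty until something else notifies the broker, which is the stall the PR describes when the upstream is quiet.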

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?

No. The cleanup is limited to the schema-store keeper service for the current advertise address, and reset only schedules the same scan path earlier.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix changefeed stalls after CDC failover or dispatcher reset in the new architecture.

@ti-chi-bot

ti-chi-bot Bot commented Jul 1, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added the following labels on Jul 1, 2026:

  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • do-not-merge/needs-triage-completed
@ti-chi-bot

ti-chi-bot Bot commented Jul 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3aceshowhand for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ece1c9fc-8e8d-4c56-bdaa-c023e7c37cd9

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@ti-chi-bot ti-chi-bot Bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Jul 1, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request cleans up stale schema-store GC services during initialization to prevent start-ts safety check failures, and triggers a scan task upon resetting a dispatcher in the event broker. Feedback on the event broker changes highlights a potential stall risk where calling c.pushTask(newStat, false) synchronously could drop the task if the channel is full; running this in a goroutine with force = true is recommended to guarantee execution without blocking.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1232 to +1234
if c.scanReady(newStat) {
	c.pushTask(newStat, false)
}

high

Potential Stall / Deadlock Risk: Calling c.pushTask(newStat, false) synchronously on the message-processing goroutine uses force = false to avoid blocking. However, if taskChan is full, the task will be dropped after a 10ms timeout.

If the task is dropped and there are no subsequent notifications from the event store (e.g., because the upstream is quiet), the dispatcher will stall permanently because the initial scan is never triggered.

Solution: Run c.pushTask in a lightweight goroutine with force = true (i.e., go c.pushTask(newStat, true)). This avoids blocking the main message-processing loop (preventing deadlocks) while guaranteeing that the scan task is eventually queued and executed, completely eliminating the stall risk.

Suggested change
- if c.scanReady(newStat) {
- 	c.pushTask(newStat, false)
- }
+ if c.scanReady(newStat) {
+ 	go c.pushTask(newStat, true)
+ }
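
The difference between the two calls can be sketched with a toy queue. This assumes, following the review comment, that a non-forced push gives up after 10ms when the channel is full and that a forced push blocks until there is room. The `task` and `broker` types and the `bool` return value are hypothetical and exist only to make the outcome observable; the real `pushTask` may differ.

```go
package main

import (
	"fmt"
	"time"
)

type task struct{ id int }

// broker is a toy model with a bounded task queue.
type broker struct{ taskChan chan task }

// pushTask blocks until queued when force is true; otherwise it gives up
// after 10ms and reports false (the task is dropped).
func (b *broker) pushTask(t task, force bool) bool {
	if force {
		b.taskChan <- t
		return true
	}
	timer := time.NewTimer(10 * time.Millisecond)
	defer timer.Stop()
	select {
	case b.taskChan <- t:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	b := &broker{taskChan: make(chan task, 1)}
	b.taskChan <- task{id: 0} // the queue is now full

	// Synchronous, non-forced push: dropped once the timeout expires.
	fmt.Println("non-forced push queued:", b.pushTask(task{id: 1}, false))

	// Forced push in its own goroutine: the caller is not blocked, and the
	// task lands as soon as a worker frees a slot.
	done := make(chan bool, 1)
	go func() { done <- b.pushTask(task{id: 2}, true) }()
	fmt.Println("drained task:", (<-b.taskChan).id)
	fmt.Println("forced push queued:", <-done)
	fmt.Println("next task:", (<-b.taskChan).id)
}
```

In this model the forced variant never drops the task, but each call parks a goroutine until the queue drains, so it relies on workers continuing to consume from the channel.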

Labels

  • do-not-merge/needs-triage-completed
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Changefeed can stall after CDC failover or dispatcher reset

1 participant