
downstreamadapter: shard dispatcher orchestrator queue#5052

Draft
hongyunyan wants to merge 8 commits into
pingcap:master from
hongyunyan:codex/dispatcher-orchestrator-shards

Conversation

@hongyunyan
Collaborator

What problem does this PR solve?

During TiCDC bootstrap recovery, DispatcherOrchestrator currently processes all changefeed control messages through one global pending queue and one worker. When a bootstrap request blocks on sink initialization, unrelated changefeeds on the same node queue behind it and recovery lag grows quickly.

This change keeps the existing bootstrap and retry semantics intact, but removes the global head-of-line blocking at the orchestrator queue layer.

Issue Number: ref #0

What is changed and how it works?

  • Split the single global pending queue into 8 fixed shards keyed by changefeedID.Id.Hash(...).
  • Keep one worker and the existing FIFO plus (changefeedID, msgType) de-duplication semantics inside each shard.
  • Keep the existing handler implementations and shutdown ordering, so the behavioral change is limited to cross-shard concurrency.
  • Add unit tests for shard shutdown behavior and for routing different changefeeds to different shards concurrently.
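
The routing and per-shard semantics above can be sketched roughly as follows. This is a simplified, illustrative model rather than the actual PR code: the real orchestrator hashes with changefeedID.Id.Hash(...), while here a plain FNV-1a hash over a string ID stands in, and the shard is reduced to an in-memory FIFO with (changefeedID, msgType) de-duplication.

```go
package main

import "fmt"

// numShards mirrors the 8 fixed shards described in the PR.
const numShards = 8

// msgKey is the de-duplication key: at most one pending message
// per (changefeedID, msgType) pair inside a shard.
type msgKey struct {
	changefeedID string
	msgType      int
}

// shard keeps FIFO order plus per-key de-duplication, independently
// of the other shards.
type shard struct {
	fifo    []msgKey
	pending map[msgKey]bool
}

// fnv1a is a stand-in for changefeedID.Id.Hash(...).
func fnv1a(s string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

// shardFor routes a changefeed to a fixed shard, so all messages of
// one changefeed stay ordered while different changefeeds can proceed
// on different shards.
func shardFor(changefeedID string) int {
	return int(fnv1a(changefeedID) % numShards)
}

func (s *shard) enqueue(k msgKey) {
	if s.pending[k] {
		return // duplicate (changefeedID, msgType): drop, keep original position
	}
	s.pending[k] = true
	s.fifo = append(s.fifo, k)
}

func main() {
	shards := make([]*shard, numShards)
	for i := range shards {
		shards[i] = &shard{pending: map[msgKey]bool{}}
	}
	for _, k := range []msgKey{
		{"cf-a", 1}, {"cf-a", 1}, {"cf-b", 1}, // second cf-a/1 is de-duplicated
	} {
		shards[shardFor(k.changefeedID)].enqueue(k)
	}
	total := 0
	for _, s := range shards {
		total += len(s.fifo)
	}
	fmt.Println(total) // 2 distinct (changefeed, msgType) pairs remain queued
}
```

Because the shard index is a pure function of the changefeed ID, a changefeed's messages never migrate between shards, which is what preserves the existing per-changefeed FIFO guarantee while removing cross-changefeed head-of-line blocking.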

Check List

Tests

  • Unit test
    • SDKROOT=$(xcrun --show-sdk-path) CC=$(xcrun --find clang) CXX=$(xcrun --find clang++) CGO_CFLAGS="-isysroot $(xcrun --show-sdk-path)" CGO_LDFLAGS="-isysroot $(xcrun --show-sdk-path) -L$(xcrun --show-sdk-path)/usr/lib -F$(xcrun --show-sdk-path)/System/Library/Frameworks" go test ./downstreamadapter/dispatcherorchestrator

Questions

Will it cause performance regression or break compatibility?

No compatibility change is expected. The change only increases concurrency across shards; per-shard processing order and message semantics stay the same.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Reduce TiCDC dispatcher orchestrator bootstrap head-of-line blocking by sharding its control-message queue.

@ti-chi-bot

ti-chi-bot Bot commented May 15, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 15, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f3adae9-2e58-4b55-8096-4e799c5eed64

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ti-chi-bot

ti-chi-bot Bot commented May 15, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lidezhu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements sharding within the DispatcherOrchestrator to allow concurrent processing of changefeed control messages, partitioned by changefeed ID. It also enhances the MySQL sink's metadata management by introducing robust error handling and recovery logic for the ddl_ts_v1 table. Feedback suggests optimizing the shutdown process by closing all shard queues before waiting for workers to terminate and ensuring the shard Run method is idempotent to avoid starting multiple worker goroutines.

Comment on lines +414 to +416
for _, shard := range m.shards {
shard.Close()
}


medium

Closing shards sequentially in a loop can lead to a cumulative shutdown delay if multiple shards are processing long-running bootstrap or sink initialization requests. Consider closing all shard queues first to signal termination, and then waiting for all shard workers to finish in parallel.

Suggested change

Original:
    for _, shard := range m.shards {
        shard.Close()
    }

Replacement:
    for _, shard := range m.shards {
        shard.queue.Close()
    }
    for _, shard := range m.shards {
        shard.wg.Wait()
    }

}

// Run starts the shard worker loop.
func (s *orchestratorShard) Run() {

medium

The Run method is not idempotent. If called multiple times, it will start multiple worker goroutines for the same shard and incorrectly increment the WaitGroup counter. Consider adding a guard (e.g., using sync.Once or an atomic.Bool) to ensure the worker loop starts only once.
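
The suggested guard could look like the sketch below, which uses sync.Once so that repeated Run calls start the worker goroutine at most once. Type and field names here are illustrative, not the PR's actual definitions, and the worker body is reduced to draining a channel.

```go
package main

import (
	"fmt"
	"sync"
)

type orchestratorShard struct {
	runOnce sync.Once
	wg      sync.WaitGroup
	queue   chan struct{}
}

// Run is idempotent: only the first call increments the WaitGroup
// and launches the worker goroutine; later calls are no-ops.
func (s *orchestratorShard) Run() {
	s.runOnce.Do(func() {
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			for range s.queue { // drain until the queue is closed
			}
		}()
	})
}

func main() {
	s := &orchestratorShard{queue: make(chan struct{})}
	s.Run()
	s.Run() // second call does not start another worker
	close(s.queue)
	s.wg.Wait() // returns: exactly one Done matches the one Add
	fmt.Println("worker exited once")
}
```

An atomic.Bool with CompareAndSwap would work equally well; sync.Once is simply the most idiomatic way to express "run this initialization exactly once" in Go.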

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 15, 2026
@hongyunyan hongyunyan changed the title [codex] downstreamadapter: shard dispatcher orchestrator queue downstreamadapter: shard dispatcher orchestrator queue May 15, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 15, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

