eventservice: split large scanned transactions by asddongmen · Pull Request #5511 · pingcap/ticdc

asddongmen · 2026-06-26T10:56:30Z

What problem does this PR solve?

Issue Number: close #xxx

EventBroker currently scans a whole large upstream transaction into memory before it can send DML to the event collector. Very large transactions can therefore cause OOM, especially when UK-changing updates cache deferred insert rows in memory.

What is changed and how it works?

This PR adds row-level eventstore scan resume tokens and uses them in EventBroker scanning. With transaction-atomicity=none, EventBroker can then emit bounded fragments of one large transaction without sending the resolved-ts for that commit-ts early.
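To make the resume-token idea concrete, here is a minimal sketch of bounded-fragment scanning. Every name in it (`row`, `scanPosition`, `fragment`, `scanFragment`) is hypothetical and not taken from this PR. The real token is opaque, while this sketch assumes a plain slice index purely for illustration.

```go
package main

// row is a hypothetical stand-in for one DML row read from the event store.
type row struct {
	CommitTs uint64
	Key      string
}

// scanPosition is a hypothetical resume token. Here it is just the index of
// the next unread row; the token in the PR is opaque and may look different.
type scanPosition struct {
	Next int
}

// fragment is one bounded batch of rows plus the state needed to continue.
type fragment struct {
	Rows        []row
	Resume      scanPosition
	TxnFinished bool // false while rows with the same commit-ts remain unread
}

// scanFragment returns at most limit rows starting at pos. It assumes rows
// are ordered by commit-ts and that pos came from an earlier call.
func scanFragment(rows []row, pos scanPosition, limit int) fragment {
	if limit < 1 {
		limit = 1
	}
	if pos.Next >= len(rows) {
		return fragment{Resume: pos, TxnFinished: true}
	}
	end := pos.Next + limit
	if end > len(rows) {
		end = len(rows)
	}
	f := fragment{Rows: rows[pos.Next:end], Resume: scanPosition{Next: end}}
	last := f.Rows[len(f.Rows)-1].CommitTs
	// The transaction is complete only if no later row shares its commit-ts.
	f.TxnFinished = end == len(rows) || rows[end].CommitTs != last
	return f
}
```

In this sketch a caller would pass `Resume` back into the next call and would only advance resolved-ts to the fragment's commit-ts once `TxnFinished` is true.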

It also spills deferred insert halves of UK-changing updates to local disk, drains them after the original delete phase, and cleans spill state on reset/remove. EventCollector reset semantics remain checkpoint-ts based.

Check List

Tests

Unit test

Questions

Will it cause performance regression or break compatibility?

No compatibility break is expected. The split path is only enabled when transaction atomicity allows splitting. It reduces EventBroker peak memory for large transactions, with local spill I/O for UK-changing update insert halves.

Do you need to update user documentation, design documentation or monitoring documentation?

No user documentation update is required for this internal scan behavior change.

Release note

Support splitting large transactions during EventBroker DML scanning to reduce OOM risk.

ti-chi-bot · 2026-06-26T10:56:33Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-26T10:56:38Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ed31e2ac-5e8d-409b-9c09-08fe272715a1


gemini-code-assist

Code Review

This pull request introduces support for row-level resume tokens and large transaction spilling in the event service. It adds a new EventIteratorWithScanPosition interface, updates the event store iterator to track and return opaque scan positions, and implements disk-based spilling for large transactions via largeTxnInsertSpill and largeTxnScanState. Review feedback identifies two key issues in large_txn_spill.go: a potential resource leak where a failed Close() in Cleanup() skips deleting the temporary file, and a loose record length check that could lead to OOM panics if the spill file is corrupted.


gemini-code-assist · 2026-06-26T10:59:49Z

+	if err := s.Close(); err != nil {
+		return err
+	}
+	if s.cleaned {
+		return nil
+	}
+	s.cleaned = true
+	if s.path == "" {
+		return nil
+	}
+
+	err := os.Remove(s.path)
+	if err != nil && !os.IsNotExist(err) {
+		return errors.Trace(err)
+	}
+	return nil


If s.Close() returns an error, Cleanup() will return early and skip the execution of os.Remove(s.path). This will cause the temporary spill file to be leaked on disk. We should ensure that os.Remove is always attempted even if s.Close() fails, and that s.cleaned is set to true to prevent future cleanup attempts.

	closeErr := s.Close()
	if s.cleaned {
		if closeErr != nil {
			return errors.Trace(closeErr)
		}
		return nil
	}
	s.cleaned = true
	if s.path == "" {
		if closeErr != nil {
			return errors.Trace(closeErr)
		}
		return nil
	}
	removeErr := os.Remove(s.path)
	if removeErr != nil && !os.IsNotExist(removeErr) {
		if closeErr != nil {
			return errors.Trace(closeErr)
		}
		return errors.Trace(removeErr)
	}
	if closeErr != nil {
		return errors.Trace(closeErr)
	}
	return nil

gemini-code-assist · 2026-06-26T10:59:49Z

+	if recordLen > uint64(int(^uint(0)>>1)) {
+		return nil, errors.Errorf("large txn spill record is too large: %d", recordLen)
+	}


Enforcing a limit of math.MaxInt is extremely loose and can still lead to out-of-memory (OOM) panics when allocating memory for data (e.g., make([]byte, int(recordLen))) if the spill file is corrupted or contains invalid data. Since a single KV entry size in TiKV is typically limited to a few megabytes (and at most 128MB), we should enforce a much safer and more reasonable upper bound (e.g., 128MB) to prevent OOM vulnerabilities.

Suggested change

-	if recordLen > uint64(int(^uint(0)>>1)) {
-		return nil, errors.Errorf("large txn spill record is too large: %d", recordLen)
-	}
+	const maxRecordLen = 128 * 1024 * 1024 // 128MB
+	if recordLen > maxRecordLen {
+		return nil, errors.Errorf("large txn spill record is too large: %d", recordLen)
+	}
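As a self-contained illustration of why the bound matters, the sketch below validates a length prefix before allocating. The framing (an 8-byte big-endian length followed by the payload) and the helper names `encodeRecord`, `readRecord`, and `decodeAll` are assumptions made for this example; the spill file format in the PR may differ.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxRecordLen is the 128MB cap proposed in the review comment.
const maxRecordLen = 128 * 1024 * 1024

// encodeRecord frames payload as an 8-byte big-endian length plus the bytes.
func encodeRecord(payload []byte) []byte {
	out := make([]byte, 8, 8+len(payload))
	binary.BigEndian.PutUint64(out, uint64(len(payload)))
	return append(out, payload...)
}

// readRecord reads one framed record and rejects an oversized length before
// allocating, so a corrupted header cannot trigger a huge allocation.
func readRecord(r io.Reader) ([]byte, error) {
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	recordLen := binary.BigEndian.Uint64(hdr[:])
	if recordLen > maxRecordLen {
		return nil, fmt.Errorf("large txn spill record is too large: %d", recordLen)
	}
	data := make([]byte, recordLen)
	if _, err := io.ReadFull(r, data); err != nil {
		return nil, err
	}
	return data, nil
}

// decodeAll reads records from buf until a clean end of input.
func decodeAll(buf []byte) ([][]byte, error) {
	r := bytes.NewReader(buf)
	var records [][]byte
	for {
		rec, err := readRecord(r)
		if err == io.EOF {
			return records, nil
		}
		if err != nil {
			return nil, err
		}
		records = append(records, rec)
	}
}
```

With this check, a header of eight `0xff` bytes yields an error instead of an attempt to allocate roughly 16 EiB.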

ti-chi-bot · 2026-06-29T02:30:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lidezhu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/config/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-07-01T02:21:56Z

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

asddongmen added 2 commits June 26, 2026 17:14

eventstore: support row-level scan resume

14e1fb5

eventservice: split large scanned transactions

a0779ff

ti-chi-bot Bot added the do-not-merge/needs-linked-issue, release-note (denotes a PR that will be considered when it comes time to generate release notes), and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels Jun 26, 2026

ti-chi-bot Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Jun 26, 2026

gemini-code-assist Bot reviewed Jun 26, 2026


asddongmen added 2 commits June 27, 2026 16:50

eventservice: address large txn spill review comments

9605655

eventservice: gate txn split by large txn threshold

f7771ad

asddongmen added 8 commits June 29, 2026 10:33

tests: add large txn split integration case

e159706

tests: avoid go dependency in large txn split case

8c0c29c

tests: isolate large txn split cdc port

51a44b9

tests: reduce large txn split workload logs

07088cb

tests: wait for sync before checking split log

f5c2ee9

tests: fix large txn split diff config

205a7c1

eventcollector,eventservice: add split txn reset coverage

7652d40

eventstore: test row scan position across sub switch

8d6bf1f

asddongmen wants to merge 12 commits into pingcap:master from asddongmen:0626-split-big-txn
