eventservice: optimize scan window #4950
Conversation
Signed-off-by: dongmen <414110582@qq.com>
Skipping CI for Draft Pull Request.

> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Replaces the prior trend-based scan-interval logic with an adaptive scan-window controller: changefeed status now holds a controller and band state, memory-usage updates are normalized and reported to the controller, the controller computes EMAs/pressure scores and decides interval changes (with cooldowns, emergency braking, and floor/recovery rules), and new Prometheus metrics and broker metric lifecycle helpers were added alongside expanded tests.

Changes: Adaptive Scan Window Controller
🎯 4 (Complex) | ⏱️ ~45 minutes

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as UpdateMemoryUsage
    participant Controller as AdaptiveScanWindowController
    participant Metrics as EventServiceMetrics
    participant Status as ChangefeedStatus
    Client->>Controller: OnCongestionReport(normalizedUsage, releaseCount)
    Controller-->>Controller: update EMAs, update usage window, compute pressure score
    Controller->>Metrics: emit usage/EMA/pressure/target-band/adjust metrics
    Controller->>Status: decision(newScanInterval, reason)
    Status->>Status: persist/apply scanInterval if changed
```
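As a rough illustration of the EMA-and-pressure flow in the diagram above, here is a minimal Go sketch; all names, alpha values, and thresholds are illustrative assumptions, not the controller's actual identifiers or tuned constants:

```go
package main

import "fmt"

// emaUpdate folds a new sample into an exponential moving average.
// An alpha close to 1 reacts quickly (fast EMA); a small alpha smooths (slow EMA).
func emaUpdate(prev, sample, alpha float64) float64 {
	return alpha*sample + (1-alpha)*prev
}

type controller struct {
	fastEMA, slowEMA float64 // normalized memory usage in [0, 1]
	pressureScore    float64
}

// onReport mirrors the shape of the congestion handling above:
// update both EMAs, then move the pressure score up under load or
// decay it toward zero when usage is low.
func (c *controller) onReport(usage float64) {
	c.fastEMA = emaUpdate(c.fastEMA, usage, 0.5) // fast: reacts within a few reports
	c.slowEMA = emaUpdate(c.slowEMA, usage, 0.1) // slow: long-term trend
	switch {
	case c.fastEMA > 0.8: // high pressure: score rises, capped at a ceiling
		c.pressureScore = min(c.pressureScore+2, 10)
	case c.fastEMA < 0.3: // low pressure: score decays toward zero
		c.pressureScore = max(0, c.pressureScore-1)
	}
}

func main() {
	c := &controller{}
	for i := 0; i < 5; i++ {
		c.onReport(0.9) // sustained high usage
	}
	fmt.Printf("fast=%.2f slow=%.2f pressure=%.1f\n", c.fastEMA, c.slowEMA, c.pressureScore)
}
```

Note how the fast EMA crosses the high-pressure threshold only after a few sustained reports, so a single spike cannot immediately drive the score up.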
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request replaces the existing scan interval adjustment logic with a new adaptiveScanWindowController that utilizes Exponential Moving Averages (EMAs) and a pressure score for more stable memory pressure management. The update includes comprehensive simulation tests and enhanced Prometheus metrics for monitoring controller decisions. Review feedback highlights a non-monotonic discontinuity in the emergency brake calculation, potential over-throttling caused by latching peak usage values, the presence of magic numbers, and the use of a redundant maxFloat64 helper that should be replaced by the built-in max function.
Signed-off-by: dongmen <414110582@qq.com>
🧹 Nitpick comments (4)
pkg/eventservice/scan_window.go (4)
425-549: 💤 Low value: Consider factoring the repeated `scanWindowDecision{...}` literal.

`OnCongestionReport` returns the same `scanWindowDecision` shape from six branches, each repeating `usage`, `fastUsageEMA`, `slowUsageEMA`, `pressureScore`. A small builder method on the controller would let the function body focus on policy and reduce the surface for accidental field drift if a new metric/field is added later.

♻️ Sketch

```go
func (c *adaptiveScanWindowController) makeDecisionLocked(
	newInterval, maxInterval time.Duration,
	reason scanWindowDecisionReason,
	usage memoryUsageStats,
) scanWindowDecision {
	return scanWindowDecision{
		newInterval:   newInterval,
		maxInterval:   maxInterval,
		reason:        reason,
		usage:         usage,
		fastUsageEMA:  c.fastUsageEMA,
		slowUsageEMA:  c.slowUsageEMA,
		pressureScore: c.pressureScore,
	}
}
```

Each branch then becomes `return c.makeDecisionLocked(newInterval, maxInterval, reason, usage)`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window.go` around lines 425 - 549, The OnCongestionReport function repeats identical scanWindowDecision literals across multiple return sites; add a helper method on adaptiveScanWindowController (e.g., makeDecisionLocked(newInterval time.Duration, maxInterval time.Duration, reason scanWindowDecisionReason, usage memoryUsageStats) scanWindowDecision) that constructs and returns the scanWindowDecision using c.fastUsageEMA, c.slowUsageEMA, c.pressureScore and the supplied usage, newInterval, maxInterval, reason; then replace each repeated literal in OnCongestionReport with calls to c.makeDecisionLocked(...) (keep current names: OnCongestionReport, scanWindowDecision, usage, fastUsageEMA, slowUsageEMA, pressureScore).
388-406: ⚡ Quick win: Race on band state can double-count target-band crossings.

`observeScanWindowTargetBandMetrics` reads `state.Load()`, compares to `currentState`, then `state.Store(...)`. Because `handleCongestionControl` can be invoked concurrently from different `from` nodes for the same changefeed (each call iterates `changefeedMap.Range`), two goroutines can both observe the same `previousState`, both increment `EventServiceScanWindowTargetBandCrossCount`, and both store, over-counting transitions for the same actual crossing.

Use `Swap` to read-and-replace atomically so only one caller observes each prior state:

🔒 Proposed diff

```diff
-	previousState := scanWindowBandState(state.Load())
-	if previousState != scanWindowBandUnknown && previousState != currentState {
-		metrics.EventServiceScanWindowTargetBandCrossCount.WithLabelValues(changefeed, metricType).Inc()
-	}
-	state.Store(int32(currentState))
+	previousState := scanWindowBandState(state.Swap(int32(currentState)))
+	if previousState != scanWindowBandUnknown && previousState != currentState {
+		metrics.EventServiceScanWindowTargetBandCrossCount.WithLabelValues(changefeed, metricType).Inc()
+	}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window.go` around lines 388 - 406, The observeScanWindowTargetBandMetrics function currently uses state.Load() and state.Store(), which allows concurrent callers (e.g., handleCongestionControl) to both see the same previous state and double-count transitions; replace the Load/Store pair with an atomic swap so the read-and-replace is atomic: call state.Swap(int32(currentState)) (convert the returned int32 to scanWindowBandState) to obtain the previousState, then, if previousState != scanWindowBandUnknown && previousState != currentState, increment EventServiceScanWindowTargetBandCrossCount; keep the existing gauge Set(1)/Set(0) behavior and only change how previousState is read/stored.
321-362: 💤 Low value: Minor: `deleteScanWindowMetrics` also clears `EventServiceAvailableMemoryQuotaGaugeVec`.

That metric is not part of the scan-window family (registered separately, set in `handleCongestionControl`), but its lifecycle is naturally tied to the changefeed. Functionally fine; just be aware the helper name slightly under-promises what it deletes. If you keep this coupling, consider a short comment explaining why the available-memory-quota label is wiped here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window.go` around lines 321 - 362, deleteScanWindowMetrics currently deletes EventServiceAvailableMemoryQuotaGaugeVec even though that metric is not part of the scan-window family; either move that deletion to the metric's owner (e.g. where handleCongestionControl manages the metric) or keep it here but add a short explanatory comment. Update the function deleteScanWindowMetrics to either remove the line metrics.EventServiceAvailableMemoryQuotaGaugeVec.DeleteLabelValues(changefeed) and place it in the lifecycle code that sets/clears available memory quota, or add a one-line comment above that DeleteLabelValues call explaining why the available-memory-quota label is cleared here (tie to changefeed lifecycle), referencing deleteScanWindowMetrics and handleCongestionControl so future readers can find the rationale.
796-801: ⚡ Quick win: Drop `maxFloat64` in favor of Go's built-in `max`.

This file already uses the Go 1.21+ built-in `min`/`max` for `float64` elsewhere (e.g., `min(c.pressureScore+2, scanWindowPressureScoreCeiling)` at line 673), so `maxFloat64` is inconsistent and unnecessary. Replacing all three call sites with the builtin removes one ad-hoc helper and matches the surrounding style.

♻️ Proposed diff

```diff
-func maxFloat64(a float64, b float64) float64 {
-	if a > b {
-		return a
-	}
-	return b
-}
```

Then at the call sites (lines 679, 681, 687):

```diff
-	c.pressureScore = maxFloat64(0, c.pressureScore-1.5)
+	c.pressureScore = max(0, c.pressureScore-1.5)
 	...
-	c.pressureScore = maxFloat64(0, c.pressureScore-0.5)
+	c.pressureScore = max(0, c.pressureScore-0.5)
 	...
-	c.pressureScore = maxFloat64(0, c.pressureScore-relief)
+	c.pressureScore = max(0, c.pressureScore-relief)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window.go` around lines 796 - 801, Remove the ad-hoc helper maxFloat64 and replace its call sites with the Go 1.21 built-in max: find all uses of maxFloat64(a, b) and change them to max(a, b), then delete the maxFloat64 function definition; no extra imports are needed—just remove the function maxFloat64 and update callers to use max directly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4190f76c-a953-4e4d-887a-59bfaeef4a9c
📒 Files selected for processing (6)
- pkg/eventservice/dispatcher_stat.go
- pkg/eventservice/event_broker.go
- pkg/eventservice/event_broker_test.go
- pkg/eventservice/scan_window.go
- pkg/eventservice/scan_window_test.go
- pkg/metrics/event_service.go
/test all
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/eventservice/scan_window_test.go (1)
179-181: ⚡ Quick win: Replace the hardcoded `30` with a duration-derived bound.

Line 179 bakes in a window-size assumption. Deriving the loop count from `memoryUsageWindowDuration` makes this test resilient to future constant tuning.

Suggested change

```diff
-	for i := 0; i <= 30; i++ {
+	for i := 0; i <= int(memoryUsageWindowDuration/time.Second); i++ {
 		status.updateMemoryUsage(start.Add(time.Duration(i)*time.Second), 1, 0)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window_test.go` around lines 179 - 181, The test hardcodes 30 iterations when calling status.updateMemoryUsage, which couples it to a specific window size; change the loop bound to derive from memoryUsageWindowDuration (e.g., compute n := int(memoryUsageWindowDuration / time.Second) and use i := 0; i <= n; i++) so the test scales with the actual memoryUsageWindowDuration constant; update the loop surrounding status.updateMemoryUsage to use that computed n instead of the literal 30.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/eventservice/scan_window_test.go`:
- Around line 74-75: Replace hardcoded changefeed IDs created by
newChangefeedStatus(common.NewChangefeedID4Test("default", "test"), ...) with a
unique per-test ID using t.Name() (e.g.,
newChangefeedStatus(common.NewChangefeedID4Test("default", t.Name()), ...)) for
all occurrences (including the instances around lines 89-90, 134-135, 157-158,
166-167, 175-176, 189-190) so updateMemoryUsage emits metrics keyed to a
test-unique label; also make the tests deterministic by using testify/require
assertions where applicable instead of non-deterministic checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: de3ec83f-f0f9-438c-87dc-c06631609797
📒 Files selected for processing (2)
- pkg/eventservice/scan_window.go
- pkg/eventservice/scan_window_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/eventservice/scan_window.go
♻️ Duplicate comments (1)
pkg/eventservice/scan_window_test.go (1)
40-40: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win: Use unique changefeed IDs in parallel tests to avoid shared metric state.

This issue was raised in a previous review and remains unresolved for most tests. While the metric tests at lines 209, 233, and 250 correctly use `t.Name()` for isolation, tests at lines 40, 55, 70, 82, 91, 100, 112, 121, 144, 153, 162, 171, 185, 200, 272, 291, 306, 321, 334, 394, and 412 still use hardcoded `"default"`, `"test"`. Since `updateMemoryUsage` emits global metrics keyed by changefeed label, this creates cross-test metric pollution when tests run in parallel.

Suggested change for all affected tests

```diff
-	status := newChangefeedStatus(common.NewChangefeedID4Test("default", "test"), ...)
+	status := newChangefeedStatus(common.NewChangefeedID4Test("default", t.Name()), ...)
```

Apply this pattern consistently across all parallel tests that call `updateMemoryUsage`. As per coding guidelines, "Prefer focused deterministic tests"; the metric tests at lines 208-267 demonstrate the correct pattern with `t.Name()` and cleanup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/eventservice/scan_window_test.go` at line 40, Replace hardcoded changefeed IDs created via common.NewChangefeedID4Test("default","test") with unique IDs using the test name (e.g., common.NewChangefeedID4Test("default", t.Name())) wherever newChangefeedStatus is used in these tests so metrics emitted by updateMemoryUsage are isolated; update each test (including the ones at lines noted) to construct the changefeed ID from t.Name() and, where other metric tests do, ensure any metric cleanup/reset used in the correct tests is applied consistently after the test to avoid cross-test pollution.
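A minimal sketch of why hardcoded labels pollute shared metric state, using a toy map in place of a real Prometheus `GaugeVec` (all names below are hypothetical):

```go
package main

import "fmt"

// fakeGauge stands in for a Prometheus GaugeVec keyed by a changefeed label.
type fakeGauge struct{ values map[string]float64 }

func (g *fakeGauge) set(label string, v float64) { g.values[label] = v }

func main() {
	g := &fakeGauge{values: map[string]float64{}}

	// Hardcoded label: two different "tests" overwrite each other's series,
	// so whichever test's assertion runs last sees the other test's value.
	g.set("default/test", 1)
	g.set("default/test", 7)

	// Per-test labels (as t.Name() would provide): independent series.
	g.set("default/TestRecovery", 1)
	g.set("default/TestEmergencyBrake", 7)

	fmt.Println(len(g.values)) // 3: the hardcoded label collapsed into one series
}
```

In a real test, `common.NewChangefeedID4Test("default", t.Name())` plays the role of the per-test label, and a `t.Cleanup` that deletes the label values keeps the registry clean between runs.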
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 031c4e03-61d0-4370-88ad-5715e9a1a006
📒 Files selected for processing (3)
- pkg/eventservice/event_broker_test.go
- pkg/eventservice/scan_window.go
- pkg/eventservice/scan_window_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/eventservice/scan_window.go
- pkg/eventservice/event_broker_test.go
/test all
Code review: Found 3 issues.

🤖 Generated with Claude Code
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing

[LGTM Timeline notifier] Timeline:
What problem does this PR solve?
Issue Number: close #5041
The old scan window controller overreacted to memory feedback. A release pulse could reset the window, then later reports would shrink it again. Because it is a per-changefeed commit-ts span cap rather than a timer, this created sawtooth batching. Stale dispatchers could also pin the base ts and delay tables blocked by pending DDL.
What is changed and how it works?
This PR replaces the old policy with an adaptive controller while keeping scans event-driven. It combines a sliding usage window, EMAs, a pressure score, and cooldowns. High or critical pressure reduces the window in bounded steps. Low pressure recovers gradually and can leave the default floor faster. Release signals now relieve pressure instead of resetting the interval. The broker also skips stale dispatchers in the `minSentTs` refresh and allows local advance for pending DDL when the global window is pinned.

This smooths scan progress, reduces reset storms, and preserves DDL forward progress.
Before

After

Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No compatibility break is expected. This changes only scan window control behavior and should stabilize throughput and memory pressure.
Do you need to update user documentation, design documentation or monitoring documentation?
Release note