tests: add weekly random DDL integration coverage by asddongmen · Pull Request #5542 · pingcap/ticdc

asddongmen · 2026-06-29T09:48:29Z

What problem does this PR solve?

Issue Number: close #5543

This PR adds weekly random DDL+DML integration coverage and fixes issues found while running it. The runner also needs to avoid false failures when random DML payload strings contain words such as panic or fatal.

What is changed and how it works?

Add a standalone random DDL test runner and weekly random DDL integration cases.
Tune the weekly single case DML pressure and extend converge timeout for slow catch-up.
Advance the event service scan window when pending syncpoints are present, similar to DDL handling.
Tighten runner and shell log scans to match real failure signals such as [PANIC], [FATAL], panic:, fatal error:, and WARNING: DATA RACE, while keeping custom substring patterns supported.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)

Manual checks run:

PATH=/usr/local/go/bin:$PATH GOTOOLCHAIN=auto go test ./tests/utils/random_ddl_test_runner -count=1
bash -n for weekly random DDL scripts and _utils/test_prepare
Re-scanned the failed /tmp/tidb_cdc_test/weekly_rand_single logs with the new shell rule.
Ran make integration_test_weekly_rand_ddl_mysql; workload/converge/syncpoint diff passed, then old log scan rule failed on random payload false positives.

Questions

Will it cause performance regression or break compatibility?

No. Runtime code only expands event-service scan progress when pending syncpoints exist; the remaining changes are test tooling and integration cases.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

Summary by CodeRabbit

New Features
- Added new weekly random DDL/DML integration test runners for single, multi, failover, and lossy-DDL scenarios.
- Expanded test tooling to support log scanning, sync-point checks, failover handling, and automated diff verification.
- Introduced improved workload generation with DDL/DML auto-tuning, conflict traffic, and large transaction coverage.
Bug Fixes
- Improved scan-range handling so progress continues correctly when sync-point barriers are present.
Tests
- Added unit coverage for tuning, DDL generation, selector behavior, log scanning, and scan-range edge cases.

…dl-test

…into 0115-ddl-test

Signed-off-by: root <lidezhu@pingcap.com>

# Conflicts: # Makefile

ti-chi-bot · 2026-06-29T09:48:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-29T09:48:46Z

📝 Walkthrough

Walkthrough

Adds a new random_ddl_test_runner Go binary with config, schema model, concurrent DML/DDL workers, auto-tune, motif DDL sequencing, log scanning, capture failover, and syncpoint diff logic. Four new weekly integration test scenarios are introduced alongside Makefile wiring. A separate fix extends getScanTaskDataRange to advance the local scan window when pending sync-point events are present.

Changes

Event broker sync-point scan fix

Layer / File(s)	Summary
getScanTaskDataRange sync-point logic and test `pkg/eventservice/event_broker.go`, `pkg/eventservice/event_broker_test.go`	Loads `nextSyncPoint` timestamp alongside pending-DDL detection and expands the local-scan advancement condition to cover pending sync-point events; adds a unit test asserting `CommitTsEnd` advances to `nextSyncPointTs+1` when global capping would otherwise produce an empty range.

Random DDL test runner binary and integration test scenarios

Layer / File(s)	Summary
Runner config, DB helpers, and entry point `tests/utils/random_ddl_test_runner/config.go`, `tests/utils/random_ddl_test_runner/db.go`, `tests/utils/random_ddl_test_runner/runner.go`, `tests/utils/random_ddl_test_runner/logger.go`, `tests/utils/random_ddl_test_runner/main.go`	Defines all JSON-mapped config structs with profile-dependent defaults and validation; MySQL connection open/pool helpers; runner struct constructor; and the `main()` CLI entry point dispatching `bootstrap` or `workload` phases.
In-memory cluster model `tests/utils/random_ddl_test_runner/model.go`	Defines domain constants, column/index/schema SQL helpers, deterministic table family catalog, mutable table state, `clusterModel` construction, and hot/cold/domain table selection helpers plus deterministic value generators.
Bootstrap phase `tests/utils/random_ddl_test_runner/bootstrap.go`	Implements `runner.bootstrap()` to drop/create databases and tables on both upstream and downstream, then inserts deterministic initial rows in batches using parameterized INSERTs.
DDL catalog, selector, and worker `tests/utils/random_ddl_test_runner/ddl.go`, `tests/utils/random_ddl_test_runner/ddl_worker.go`, `tests/utils/random_ddl_test_runner/selector.go`, `tests/utils/random_ddl_test_runner/ddl_test.go`, `tests/utils/random_ddl_test_runner/selector_test.go`	Defines weighted `ddlKind` catalog and all DDL generators; inverse-weight `ddlSelector` with sliding-window record; `ddlWorker` submission loop with tracing and model-apply on success; unit tests for drop-column PK safety, partition guard, and selector coverage-debt behavior.
DML worker, motif DDL sequence, and extra workers `tests/utils/random_ddl_test_runner/dml.go`, `tests/utils/random_ddl_test_runner/motif.go`, `tests/utils/random_ddl_test_runner/extra_workers.go`, `tests/utils/random_ddl_test_runner/dml_test.go`	Implements `dmlWorker`, generic/motif DML builders, `buildRandomValue`; three-step primary motif DDL sequence (add `site_code`, unify default, composite PK); `bigTxnWorker`, `conflictWriter`; unit tests for ASCII generation and motif insert site_code omission.
Auto-tune concurrency controller `tests/utils/random_ddl_test_runner/autotune.go`, `tests/utils/random_ddl_test_runner/autotune_test.go`	Implements `autoTuneStep` scaling DML/DDL worker counts based on checkpoint stall and DML success rate with hard-fail on prolonged stall; four unit tests covering all scaling branches.
Health monitor and capture failover `tests/utils/random_ddl_test_runner/health.go`, `tests/utils/random_ddl_test_runner/failover.go`	Implements `getChangefeedStatus` HTTP client with TSO parsing; `failoverLoop` killing/restarting captures with optional checkpoint-advance gating; `getCapturePID` and `execCommand` helpers.
Log scanning utility `tests/utils/random_ddl_test_runner/logscan.go`, `tests/utils/random_ddl_test_runner/logscan_test.go`	Implements `scanLogsForPatterns` with buffered line scanning, special-case multi-token detection for panic/fatal/DATA RACE, and file globbing; seven unit tests.
Syncpoint diff loop and inspector integration `tests/utils/random_ddl_test_runner/syncpoint_diff.go`	Implements periodic `syncpointDiffLoop`, DDL-window skipping, TOML config writing, `runSyncDiffInspectorWithSnapshotGuard` (external-ts-read toggle), retries, and tail capture for failure classification.
Workload orchestration `tests/utils/random_ddl_test_runner/workload.go`	Implements `workload()` starting all concurrent goroutines with cancellation, convergence via finish-mark polling, DML stats writing, final `diff_config.toml` generation, and log scanning.
Integration test scripts, configs, and CI wiring `Makefile`, `tests/integration_tests/_utils/test_prepare`, `tests/integration_tests/run_weekly_rand_ddl_it_in_ci.sh`, `tests/integration_tests/weekly_rand_single/...`, `tests/integration_tests/weekly_rand_multi/...`, `tests/integration_tests/weekly_rand_multi_failover/...`, `tests/integration_tests/weekly_rand_slow_lossy_ddl/...`	Adds four integration test `run.sh` scripts with matching TOML configs enabling sync-point and scheduler settings; CI entry-point script dispatching by sink type; `scan_logs_for_failures` in shared test_prepare; new `integration_test_weekly_rand_ddl_mysql` Makefile target with reduced binary checks.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

lgtm, approved

Suggested reviewers

hongyunyan
lidezhu
wk989898

Poem

🐇 Hopping through schemas and DDL galore,
With sync-points and motifs and diff-checks in store,
The workload runs weekly, the logs get a scan,
No panic or fatal escapes the grand plan.
The rabbit approves — let the test suite begin! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 1.65% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main change: weekly random DDL integration coverage.
Description check	✅ Passed	The description includes the required issue link, change summary, tests, questions, and release note sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces a comprehensive random DDL and DML weekly integration testing suite, including a deterministic workload generator and verifier utility (random_ddl_test_runner) and several test scenarios (single capture, multi-capture, failover, and slow lossy DDL). It also updates the event broker to handle pending syncpoint events when advancing scan windows. The review feedback identifies multiple high-severity data races in ddl_worker.go and extra_workers.go due to concurrent unsynchronized reads of table names via tbl.fqName(), and suggests a minor improvement in bootstrap.go to use the standard library's math.MaxInt32 instead of a custom bitwise function.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-29T09:50:45Z

+		start := time.Now()
+		_, err := db.ExecContext(ctx, sqlText)
+		if err == nil {
+			apply()
+			selector.record(kind.name)
+		}
+		if logger != nil {
+			logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",
+				workerID, kind.name, tbl.fqName(), time.Since(start), err)
+		}
+		if trace != nil {
+			trace.record(kind.name, tbl.fqName(), sqlText, err)
+		}


The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() before executing the DDL, and use the cached string for logging and tracing.

Suggested change

start := time.Now()

_, err := db.ExecContext(ctx, sqlText)

if err == nil {

apply()

selector.record(kind.name)

}

if logger != nil {

logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",

workerID, kind.name, tbl.fqName(), time.Since(start), err)

}

if trace != nil {

trace.record(kind.name, tbl.fqName(), sqlText, err)

}

start := time.Now()

tbl.mu.Lock()

fqName := tbl.fqName()

tbl.mu.Unlock()

_, err := db.ExecContext(ctx, sqlText)

if err == nil {

apply()

selector.record(kind.name)

}

if logger != nil {

logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",

workerID, kind.name, fqName, time.Since(start), err)

}

if trace != nil {

trace.record(kind.name, fqName, sqlText, err)

}

gemini-code-assist · 2026-06-29T09:50:45Z

+func runBigInsertTxn(ctx context.Context, db *sql.DB, tbl *table, rows int) error {
+	// Build a single multi-row INSERT inside a transaction to create a "big txn" workload.
+	tbl.mu.Lock()
+	if !tbl.exists {
+		tbl.mu.Unlock()
+		return nil
+	}
+	schema := tbl.schema.clone()
+	startID := tbl.nextID
+	tbl.nextID += int64(rows)
+	tbl.mu.Unlock()
+
+	var cols []column
+	for _, c := range schema.columns {
+		if c.generated != "" {
+			continue
+		}
+		cols = append(cols, c)
+	}
+	if len(cols) == 0 {
+		return nil
+	}
+
+	colNames := make([]string, 0, len(cols))
+	for _, c := range cols {
+		colNames = append(colNames, c.name)
+	}
+
+	var valuesSQL strings.Builder
+	var args []any
+	for i := 0; i < rows; i++ {
+		if i > 0 {
+			valuesSQL.WriteString(",")
+		}
+		valuesSQL.WriteString("(")
+		for j := range cols {
+			if j > 0 {
+				valuesSQL.WriteString(",")
+			}
+			valuesSQL.WriteString("?")
+		}
+		valuesSQL.WriteString(")")
+		rowID := startID + int64(i)
+		for _, c := range cols {
+			args = append(args, buildRandomValue(rand.New(rand.NewSource(rowID)), tbl, c, rowID))
+		}
+	}
+
+	stmt := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
+		tbl.fqName(),
+		backtickJoin(colNames),
+		valuesSQL.String(),
+	)
+
+	tx, err := db.BeginTx(ctx, nil)
+	if err != nil {
+		return err
+	}
+	if _, err := tx.ExecContext(ctx, stmt, args...); err != nil {
+		_ = tx.Rollback()
+		return err
+	}
+	return tx.Commit()
+}
+


The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() at the beginning of runBigInsertTxn, and use the cached string to construct the SQL statement.

func runBigInsertTxn(ctx context.Context, db *sql.DB, tbl *table, rows int) error { // Build a single multi-row INSERT inside a transaction to create a "big txn" workload. tbl.mu.Lock() if !tbl.exists { tbl.mu.Unlock() return nil } fqName := tbl.fqName() schema := tbl.schema.clone() startID := tbl.nextID tbl.nextID += int64(rows) tbl.mu.Unlock() var cols []column for _, c := range schema.columns { if c.generated != "" { continue } cols = append(cols, c) } if len(cols) == 0 { return nil } colNames := make([]string, 0, len(cols)) for _, c := range cols { colNames = append(colNames, c.name) } var valuesSQL strings.Builder var args []any for i := 0; i < rows; i++ { if i > 0 { valuesSQL.WriteString(",") } valuesSQL.WriteString("(") for j := range cols { if j > 0 { valuesSQL.WriteString(",") } valuesSQL.WriteString("?") } valuesSQL.WriteString(")") rowID := startID + int64(i) for _, c := range cols { args = append(args, buildRandomValue(rand.New(rand.NewSource(rowID)), tbl, c, rowID)) } } stmt := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s", fqName, backtickJoin(colNames), valuesSQL.String(), ) tx, err := db.BeginTx(ctx, nil) if err != nil { return err } if _, err := tx.ExecContext(ctx, stmt, args...); err != nil { _ = tx.Rollback() return err } return tx.Commit() }

gemini-code-assist · 2026-06-29T09:50:45Z

+func conflictWriter(
+	ctx context.Context,
+	db *sql.DB,
+	model *clusterModel,
+	seed int64,
+	cfg dmlConfig,
+	counters *dmlCounters,
+) {
+	// conflictWriter continuously upserts into a small key space to create write conflicts.
+	// This targets row-level contention and duplicate key paths.
+	if !cfg.KeyConflictEnabled || cfg.KeyConflictKeyspace <= 0 {
+		return
+	}
+	rng := rand.New(rand.NewSource(seed))
+	targetTables := collectConflictTables(model)
+	if len(targetTables) == 0 {
+		return
+	}
+
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		default:
+		}
+
+		tbl := targetTables[rng.Intn(len(targetTables))]
+		tbl.mu.Lock()
+		exists := tbl.exists
+		tbl.mu.Unlock()
+		if !exists {
+			_ = sleepWithContext(ctx, 200*time.Millisecond)
+			continue
+		}
+
+		key := rng.Intn(cfg.KeyConflictKeyspace) + 1
+		stmt := fmt.Sprintf("INSERT INTO %s (`id`,`a`,`b`,`c`,`d`,`e`,`bin`) VALUES (?,?,?,?,?,?,?) "+
+			"ON DUPLICATE KEY UPDATE `a`=VALUES(`a`),`b`=VALUES(`b`),`c`=VALUES(`c`)",
+			tbl.fqName(),
+		)
+		args := []any{
+			int64(key),
+			int32(rng.Intn(1_000_000)),
+			randASCII(rng, 16),
+			deterministicDecimal(int64(key)),
+			deterministicTime(int64(key)),
+			fmt.Sprintf("{\"k\":%d}", key),
+			[]byte(fmt.Sprintf("%064x", key)),
+		}
+		_, err := db.ExecContext(ctx, stmt, args...)
+		counters.record(err)
+		_ = sleepWithContext(ctx, 20*time.Millisecond)
+	}
+}


The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() inside conflictWriter, and use the cached string to construct the SQL statement.

func conflictWriter( ctx context.Context, db *sql.DB, model *clusterModel, seed int64, cfg dmlConfig, counters *dmlCounters, ) { // conflictWriter continuously upserts into a small key space to create write conflicts. // This targets row-level contention and duplicate key paths. if !cfg.KeyConflictEnabled || cfg.KeyConflictKeyspace <= 0 { return } rng := rand.New(rand.NewSource(seed)) targetTables := collectConflictTables(model) if len(targetTables) == 0 { return } for { select { case <-ctx.Done(): return default: } tbl := targetTables[rng.Intn(len(targetTables))] tbl.mu.Lock() exists := tbl.exists fqName := tbl.fqName() tbl.mu.Unlock() if !exists { _ = sleepWithContext(ctx, 200*time.Millisecond) continue } key := rng.Intn(cfg.KeyConflictKeyspace) + 1 stmt := fmt.Sprintf("INSERT INTO %s (`id`,`a`,`b`,`c`,`d`,`e`,`bin`) VALUES (?,?,?,?,?,?,?) "+ "ON DUPLICATE KEY UPDATE `a`=VALUES(`a`),`b`=VALUES(`b`),`c`=VALUES(`c`)", fqName, ) args := []any{ int64(key), int32(rng.Intn(1_000_000)), randASCII(rng, 16), deterministicDecimal(int64(key)), deterministicTime(int64(key)), fmt.Sprintf("{\"k\":%d}", key), []byte(fmt.Sprintf("%064x", key)), } _, err := db.ExecContext(ctx, stmt, args...) counters.record(err) _ = sleepWithContext(ctx, 20*time.Millisecond) } }

gemini-code-assist · 2026-06-29T09:50:45Z

+func mathMaxInt32() int64 {
+	return int64(^uint32(0) >> 1)
+}


Instead of implementing a custom bitwise function mathMaxInt32() to compute the maximum value of a signed 32-bit integer, use math.MaxInt32 from the standard math package. This is cleaner, more idiomatic, and improves code maintainability.

coderabbitai

Actionable comments posted: 20

🧹 Nitpick comments (2)

tests/utils/random_ddl_test_runner/ddl_test.go (1)
9-47: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Rename these tests to camelCase.

Lines 9 and 33 declare test functions with underscores, which violates the repo Go naming rule for functions. As per coding guidelines, "Functions should use camelCase naming and do not include underscores (e.g., getPartitionNum, not get_partition_num)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/ddl_test.go` around lines 9 - 47, Rename
the test functions in ddl_test.go to camelCase to match the Go naming rule and
repo guidelines: update TestGenDropColumn_DoesNotDropPrimaryKey and
TestGenAddPartition_RequiresRangePartition to use camelCase without underscores,
while keeping their behavior and references to genDropColumn and genAddPartition
unchanged.
Source: Coding guidelines
tests/utils/random_ddl_test_runner/selector_test.go (1)
8-30: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Rename this test to camelCase.

Line 8 uses underscores in the test function name, which violates the repo Go naming rule for functions. As per coding guidelines, "Functions should use camelCase naming and do not include underscores (e.g., getPartitionNum, not get_partition_num)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/selector_test.go` around lines 8 - 30, The
test function name in the selector test uses underscores and should be renamed
to camelCase to match the repo’s Go naming rule. Update the
`TestDDLSelector_CoverageDebtLikeBehavior` symbol in `selector_test.go` to a
camelCase form with no underscores, keeping the same test logic and coverage
behavior unchanged.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration_tests/weekly_rand_multi_failover/run.sh`:
- Line 149: The workdir cleanup in the script is using an unguarded WORK_DIR
value, which can make the reset dangerous if the variable is empty or malformed.
Update the reset logic around the rm -rf in the run.sh flow to validate WORK_DIR
before deletion, and only proceed when it is a non-empty expected path; use the
WORK_DIR handling near the rm -rf/mkdir sequence as the place to add the guard.

In `@tests/integration_tests/weekly_rand_multi/run.sh`:
- Line 143: The cleanup command in the weekly_rand_multi run script is unsafe
because rm -rf uses WORK_DIR unquoted and without validating it first. Update
the WORK_DIR cleanup logic to check that WORK_DIR is set and non-empty before
deleting, and quote the variable everywhere it is used in the rm/mkdir sequence
to avoid accidental expansion or deletion of the wrong directory tree.
- Around line 118-120: The MySQL sink fallback in the run script is hardcoded to
localhost, which bypasses the configured downstream endpoint. Update the sink
URI construction in the case branch that sets SINK_URI so it uses the same
downstream host and port variables already used elsewhere in the script,
matching the runner config and SQL target. Keep the change localized to the
SINK_URI assignment and use the existing DOWN_TIDB_HOST and DOWN_TIDB_PORT
symbols for consistency.

In `@tests/integration_tests/weekly_rand_single/run.sh`:
- Around line 113-115: The MySQL sink URI is hard-coded to localhost in the sink
selection branch, which makes it diverge from the configured downstream used
elsewhere in run.sh. Update the SINK_URI assignment in the MySQL case to use the
same downstream host and port variables already referenced by the runner config
and run_sql checks, so the changefeed replicates into the intended target. Keep
the fix localized to the sink URI setup logic in the shell script.
- Line 138: The cleanup step in the weekly_rand_single run script uses an
unquoted recursive delete on WORK_DIR, which can target the wrong path if the
variable is empty or malformed. Update the rm -rf invocation to guard against
empty WORK_DIR and quote the variable expansion, and keep the mkdir -p step tied
to the same validated WORK_DIR value so the safe path is only removed and
recreated when it is set correctly.

In `@tests/integration_tests/weekly_rand_slow_lossy_ddl/run.sh`:
- Line 95: Guard the workdir reset in the weekly_rand_slow_lossy_ddl run.sh
script before the rm -rf call, since WORK_DIR can be blank or malformed and
cause deletion outside the test sandbox. Update the startup cleanup around the
WORK_DIR handling so the recursive delete only runs when WORK_DIR is verified as
a safe, non-empty path, keeping the existing mkdir -p flow intact.
- Around line 123-128: The weekly_rand_slow_lossy_ddl run script exits right
after check_sync_diff, so late PANIC/FATAL/DATA RACE messages in captured logs
can be missed. Add the same post-diff shell-side log scan used by the sibling
weekly failover script immediately after check_sync_diff in run.sh, using the
existing scan_logs_for_failures helper and the current WORK_DIR/TEST_NAME
context to detect failures before printing success.

In `@tests/utils/random_ddl_test_runner/bootstrap.go`:
- Around line 17-18: The bootstrap flow is using an unbounded context for
DDL/DML execution, which can let schema creation or bulk insert calls hang
indefinitely. Update `bootstrap()` in `random_ddl_test_runner/bootstrap.go` to
create a timeout/deadline-aware context for the `ExecContext` work, and thread
that same bounded context through the related bootstrap phases and helpers at
the referenced call sites so the TiDB/MySQL operations fail fast instead of
wedging the weekly job.

In `@tests/utils/random_ddl_test_runner/config.go`:
- Around line 210-221: The config defaults in the DML initialization logic are
overwriting explicit user-supplied false/0 values, so update the
parsing/defaulting path to distinguish “unset” from “set to zero/false.” In the
code that sets c.DML.HotspotRatio, c.DML.HotTableRatio, c.DML.BigTxnInterval,
c.DML.BigTxnEnabled, and c.DML.KeyConflictEnabled (and the related
log_scan_enabled/fail_on_panic_match handling noted in the review), only apply
defaults when the field was omitted, not when it was explicitly provided as 0 or
false. Use presence-aware fields or separate “was set” checks in the config
loader so runner_config.json values are preserved.

In `@tests/utils/random_ddl_test_runner/ddl_worker.go`:
- Around line 132-145: The pickMissingTable helper is using only 10 random
probes, which can miss an existing dropped table and return nil too often.
Update pickMissingTable to deterministically scan the candidates slice for a
table whose exists flag is false, using the existing table.mu lock, and return
the first missing table found; keep the rand.Rand parameter only if it is still
needed elsewhere in ddl_worker.go.

In `@tests/utils/random_ddl_test_runner/dml.go`:
- Around line 342-346: The VARCHAR generator in randASCII handling should
respect the actual c.typ.varcharN limit instead of forcing a minimum of 8, since
that can exceed narrowed columns and trigger Data too long errors. Update the
logic in the VARCHAR case to cap the requested length by varcharN and use 1 as
the lower bound, while preserving the special pad column behavior in the same
switch.

In `@tests/utils/random_ddl_test_runner/extra_workers.go`:
- Around line 155-170: The conflictWriter upsert in extra_workers.go is
hard-coded to a fixed column list, so it keeps failing after random DDL changes
to tbl’s schema. Update the logic around the conflict insert path to either
derive the INSERT/UPDATE columns from the current table schema while holding
tbl.mu, or switch conflictWriter to a table family that is not touched by
column-altering DDL. Keep the reference points in conflictWriter and
collectConflictTables in mind so the statement always matches the live schema
before calling db.ExecContext.

In `@tests/utils/random_ddl_test_runner/failover.go`:
- Around line 121-133: The status probe in the capture helper is only limited by
the workload context, so a slow `/status` request can hang failover handling.
Update the request flow in the capture status check to use a short per-request
timeout, ideally by creating a derived context or client timeout around the
existing http.DefaultClient.Do calls in this helper, and keep the fallback
basic-auth retry path in the same timeout-bounded logic.

In `@tests/utils/random_ddl_test_runner/health.go`:
- Around line 54-64: The checkpoint decode in the health response parser is
losing TSO precision because `json.Decoder` is unmarshaling numeric fields into
`float64` before `parseUint64` sees them. Update the decode path in `health.go`
to call `dec.UseNumber()` and make the `changefeedStatus` parsing logic handle
`json.Number` for `checkpoint_tso` and `checkpoint_ts` so the exact integer
value is preserved during comparisons.

In `@tests/utils/random_ddl_test_runner/main.go`:
- Around line 77-84: `main()` currently dispatches to `runner.workload()`
without validating workload-only required config, so bad `cdc.changefeed_id`
values fail later with weaker errors. Add an upfront check in the workload
branch before calling `runner.workload()`, using the existing config/runner
setup that reads `cdc.changefeed_id`, and return a clear error immediately when
it is missing or empty. Keep the bootstrap path unchanged and use the `main()`
phase switch plus `runner.workload()` as the entry points for the fix.

In `@tests/utils/random_ddl_test_runner/model.go`:
- Around line 113-118: `tableSchema.clone()` is only doing a shallow copy of
`indexes`, so each `index.columns` slice still aliases the original backing
array. Update the cloning logic in `tableSchema.clone()` to deep-copy every
`index` entry, including its `columns` field, while preserving the existing
copies of `columns` and `primaryKey`. Use the `tableSchema.clone()` and `index`
symbols to locate the fix.

In `@tests/utils/random_ddl_test_runner/motif.go`:
- Around line 38-53: Stop the motif sequence as soon as any phase fails in the
motif runner. In the function that drives the phases around motifAddSiteCode,
motifUnifySiteCodeDefault, and motifAddCompositePK, check each call result and
return immediately on error instead of continuing to later sleeps and steps.
Only advance motifStep after each phase succeeds so DML workers never switch to
the post-PK pattern before the default-unification phase completes.

In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go`:
- Around line 161-164: Carry the syncpoint cursor forward into post-workload
catch-up instead of restarting from zero. In ensureSyncpointDiffAfterWorkload,
preserve and reuse the lastPrimary value from the periodic syncpoint scan so the
follow-up checks continue from the next candidate rather than rechecking earlier
ones. Update the call flow between workload.go and runSyncpointDiffChecks to
pass the current cursor along with the remaining count, using the same
lastPrimary tracking already present in the syncpoint_diff.go loop.
- Around line 225-226: fetchDDLWindows currently calls http.DefaultClient.Do
with the long-lived workload context, which can block syncpoint diffing on a
stalled /ddl/history request. Update the request path in fetchDDLWindows to use
a bounded per-request timeout by creating a request-scoped context or dedicated
http.Client with timeout before the Do call, so the periodic loop can continue
even if TiDB hangs. Keep the change localized to the fetchDDLWindows HTTP
request handling.

In `@tests/utils/random_ddl_test_runner/workload.go`:
- Around line 260-266: In the health tracking logic around lastCheckpoint and
lastAdvance, avoid treating repeated zero checkpoints as progress. Update the
progress timestamp only when st.Checkpoint is non-zero and differs from the
previous checkpoint, so the initial zero state does not keep resetting
lastAdvance and prevent NoAdvanceHard from firing.

---

Nitpick comments:
In `@tests/utils/random_ddl_test_runner/ddl_test.go`:
- Around line 9-47: Rename the test functions in ddl_test.go to camelCase to
match the Go naming rule and repo guidelines: update
TestGenDropColumn_DoesNotDropPrimaryKey and
TestGenAddPartition_RequiresRangePartition to use camelCase without underscores,
while keeping their behavior and references to genDropColumn and genAddPartition
unchanged.

In `@tests/utils/random_ddl_test_runner/selector_test.go`:
- Around line 8-30: The test function name in the selector test uses underscores
and should be renamed to camelCase to match the repo’s Go naming rule. Update
the `TestDDLSelector_CoverageDebtLikeBehavior` symbol in `selector_test.go` to a
camelCase form with no underscores, keeping the same test logic and coverage
behavior unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0746e1ef-db06-4c9f-9650-30fdb250b2c9

📥 Commits

Reviewing files that changed from the base of the PR and between 657ec82 and 00e40aa.

📒 Files selected for processing (42)

Makefile
pkg/eventservice/event_broker.go
pkg/eventservice/event_broker_test.go
tests/integration_tests/_utils/test_prepare
tests/integration_tests/run_weekly_rand_ddl_it_in_ci.sh
tests/integration_tests/weekly_rand_multi/conf/changefeed.toml
tests/integration_tests/weekly_rand_multi/conf/changefeed_mysql.toml
tests/integration_tests/weekly_rand_multi/conf/consumer.toml
tests/integration_tests/weekly_rand_multi/run.sh
tests/integration_tests/weekly_rand_multi_failover/conf/changefeed.toml
tests/integration_tests/weekly_rand_multi_failover/conf/changefeed_mysql.toml
tests/integration_tests/weekly_rand_multi_failover/conf/consumer.toml
tests/integration_tests/weekly_rand_multi_failover/run.sh
tests/integration_tests/weekly_rand_single/conf/changefeed_mysql.toml
tests/integration_tests/weekly_rand_single/conf/consumer.toml
tests/integration_tests/weekly_rand_single/run.sh
tests/integration_tests/weekly_rand_slow_lossy_ddl/conf/changefeed_mysql.toml
tests/integration_tests/weekly_rand_slow_lossy_ddl/run.sh
tests/utils/random_ddl_test_runner/autotune.go
tests/utils/random_ddl_test_runner/autotune_test.go
tests/utils/random_ddl_test_runner/bootstrap.go
tests/utils/random_ddl_test_runner/config.go
tests/utils/random_ddl_test_runner/db.go
tests/utils/random_ddl_test_runner/ddl.go
tests/utils/random_ddl_test_runner/ddl_test.go
tests/utils/random_ddl_test_runner/ddl_worker.go
tests/utils/random_ddl_test_runner/dml.go
tests/utils/random_ddl_test_runner/dml_test.go
tests/utils/random_ddl_test_runner/extra_workers.go
tests/utils/random_ddl_test_runner/failover.go
tests/utils/random_ddl_test_runner/health.go
tests/utils/random_ddl_test_runner/logger.go
tests/utils/random_ddl_test_runner/logscan.go
tests/utils/random_ddl_test_runner/logscan_test.go
tests/utils/random_ddl_test_runner/main.go
tests/utils/random_ddl_test_runner/model.go
tests/utils/random_ddl_test_runner/motif.go
tests/utils/random_ddl_test_runner/runner.go
tests/utils/random_ddl_test_runner/selector.go
tests/utils/random_ddl_test_runner/selector_test.go
tests/utils/random_ddl_test_runner/syncpoint_diff.go
tests/utils/random_ddl_test_runner/workload.go

coderabbitai · 2026-06-29T10:07:19Z

+
+trap 'cleanup' EXIT
+
+rm -rf $WORK_DIR && mkdir -p $WORK_DIR


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the workdir reset before rm -rf.

Line 149 deletes an unguarded variable path. If WORK_DIR is empty or malformed, this cleanup can wipe the wrong directory and make the job fail destructively.

Suggested fix

-rm -rf $WORK_DIR && mkdir -p $WORK_DIR +rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p -- "$WORK_DIR"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p -- "$WORK_DIR"

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 149-149: Double quote to prevent globbing and word splitting.

(SC2086)

[info] 149-149: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration_tests/weekly_rand_multi_failover/run.sh` at line 149, The workdir cleanup in the script is using an unguarded WORK_DIR value, which can make the reset dangerous if the variable is empty or malformed. Update the reset logic around the rm -rf in the run.sh flow to validate WORK_DIR before deletion, and only proceed when it is a non-empty expected path; use the WORK_DIR handling near the rm -rf/mkdir sequence as the place to add the guard.

Source: Linters/SAST tools

coderabbitai · 2026-06-29T10:07:19Z

+	*)
+		SINK_URI="mysql://root@127.0.0.1:3306/"
+		;;


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use the configured downstream address for the MySQL sink.

This script writes runner config and downstream SQL against ${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}, but the mysql sink still points at 127.0.0.1:3306. That breaks the test whenever the downstream endpoint is overridden.

Suggested fix

*) - SINK_URI="mysql://root@127.0.0.1:3306/" + SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/" ;; esac }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

*)

SINK_URI="mysql://root@127.0.0.1:3306/"

;;

*)

SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"

;;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration_tests/weekly_rand_multi/run.sh` around lines 118 - 120, The MySQL sink fallback in the run script is hardcoded to localhost, which bypasses the configured downstream endpoint. Update the sink URI construction in the case branch that sets SINK_URI so it uses the same downstream host and port variables already used elsewhere in the script, matching the runner config and SQL target. Keep the change localized to the SINK_URI assignment and use the existing DOWN_TIDB_HOST and DOWN_TIDB_PORT symbols for consistency.

coderabbitai · 2026-06-29T10:07:19Z

+
+trap 'cleanup' EXIT
+
+rm -rf $WORK_DIR && mkdir -p $WORK_DIR


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the recursive delete with quotes and a non-empty check.

If WORK_DIR resolves unexpectedly, this can delete the wrong directory tree.

Suggested fix

-rm -rf $WORK_DIR && mkdir -p $WORK_DIR +rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 143-143: Double quote to prevent globbing and word splitting.

(SC2086)

[info] 143-143: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration_tests/weekly_rand_multi/run.sh` at line 143, The cleanup command in the weekly_rand_multi run script is unsafe because rm -rf uses WORK_DIR unquoted and without validating it first. Update the WORK_DIR cleanup logic to check that WORK_DIR is set and non-empty before deleting, and quote the variable everywhere it is used in the rm/mkdir sequence to avoid accidental expansion or deletion of the wrong directory tree.

Source: Linters/SAST tools

coderabbitai · 2026-06-29T10:07:19Z

+	*)
+		SINK_URI="mysql://root@127.0.0.1:3306/"
+		;;


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use the configured downstream address for the MySQL sink.

The runner config, run_sql calls, and final diff all target ${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}, but the changefeed sink is hard-coded to 127.0.0.1:3306. In any non-default topology this replicates into one downstream and validates another.

Suggested fix

*) - SINK_URI="mysql://root@127.0.0.1:3306/" + SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/" ;; esac }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

*)

SINK_URI="mysql://root@127.0.0.1:3306/"

;;

*)

SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"

;;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration_tests/weekly_rand_single/run.sh` around lines 113 - 115, The MySQL sink URI is hard-coded to localhost in the sink selection branch, which makes it diverge from the configured downstream used elsewhere in run.sh. Update the SINK_URI assignment in the MySQL case to use the same downstream host and port variables already referenced by the runner config and run_sql checks, so the changefeed replicates into the intended target. Keep the fix localized to the sink URI setup logic in the shell script.

coderabbitai · 2026-06-29T10:07:19Z

+
+trap 'cleanup' EXIT
+
+rm -rf $WORK_DIR && mkdir -p $WORK_DIR


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the recursive delete with quotes and a non-empty check.

If WORK_DIR is empty or malformed, this can remove an unintended path.

Suggested fix

-rm -rf $WORK_DIR && mkdir -p $WORK_DIR +rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 138-138: Double quote to prevent globbing and word splitting.

(SC2086)

[info] 138-138: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration_tests/weekly_rand_single/run.sh` at line 138, The cleanup step in the weekly_rand_single run script uses an unquoted recursive delete on WORK_DIR, which can target the wrong path if the variable is empty or malformed. Update the rm -rf invocation to guard against empty WORK_DIR and quote the variable expansion, and keep the mkdir -p step tied to the same validated WORK_DIR value so the safe path is only removed and recreated when it is set correctly.

Source: Linters/SAST tools

coderabbitai · 2026-06-29T10:07:20Z

+func (s tableSchema) clone() tableSchema {
+	cp := s
+	cp.columns = append([]column(nil), s.columns...)
+	cp.primaryKey = append([]string(nil), s.primaryKey...)
+	cp.indexes = append([]index(nil), s.indexes...)
+	return cp


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Deep-copy index.columns in tableSchema.clone().

cp.indexes = append([]index(nil), s.indexes...) only copies the outer slice. Each cloned index.columns still shares the same backing array, so later DDL mutations can leak across schema, initialSchema, and sibling tables cloned from the same family.

Possible fix

func (s tableSchema) clone() tableSchema { cp := s cp.columns = append([]column(nil), s.columns...) cp.primaryKey = append([]string(nil), s.primaryKey...) - cp.indexes = append([]index(nil), s.indexes...) + cp.indexes = make([]index, len(s.indexes)) + for i, idx := range s.indexes { + cp.indexes[i] = idx + cp.indexes[i].columns = append([]string(nil), idx.columns...) + } return cp }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (s tableSchema) clone() tableSchema {

cp := s

cp.columns = append([]column(nil), s.columns...)

cp.primaryKey = append([]string(nil), s.primaryKey...)

cp.indexes = append([]index(nil), s.indexes...)

return cp

func (s tableSchema) clone() tableSchema {

cp := s

cp.columns = append([]column(nil), s.columns...)

cp.primaryKey = append([]string(nil), s.primaryKey...)

cp.indexes = make([]index, len(s.indexes))

for i, idx := range s.indexes {

cp.indexes[i] = idx

cp.indexes[i].columns = append([]string(nil), idx.columns...)

}

return cp

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/random_ddl_test_runner/model.go` around lines 113 - 118, `tableSchema.clone()` is only doing a shallow copy of `indexes`, so each `index.columns` slice still aliases the original backing array. Update the cloning logic in `tableSchema.clone()` to deep-copy every `index` entry, including its `columns` field, while preserving the existing copies of `columns` and `primaryKey`. Use the `tableSchema.clone()` and `index` symbols to locate the fix.

coderabbitai · 2026-06-29T10:07:20Z

+	if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil {
+		atomic.StoreInt32(motifStep, 1)
+	}
+
+	if err := sleepWithContext(ctx, step2At-step1At); err != nil {
+		return
+	}
+	if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil {
+		atomic.StoreInt32(motifStep, 2)
+	}
+
+	if err := sleepWithContext(ctx, step3At-step2At); err != nil {
+		return
+	}
+	if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil {
+		atomic.StoreInt32(motifStep, 3)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop the motif sequence on the first failed phase.

Step 2 and step 3 still run after earlier failures. That can set motifStep=3 without ever finishing the default-unification phase, so DML workers switch to the post-PK pattern against rows that can never match site_code=''.

Suggested fix

if err := sleepWithContext(ctx, step1At); err != nil { return } - if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil { - atomic.StoreInt32(motifStep, 1) + if err := motifAddSiteCode(ctx, db, model, trace, logger); err != nil { + return } + atomic.StoreInt32(motifStep, 1) if err := sleepWithContext(ctx, step2At-step1At); err != nil { return } - if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil { - atomic.StoreInt32(motifStep, 2) + if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err != nil { + return } + atomic.StoreInt32(motifStep, 2) if err := sleepWithContext(ctx, step3At-step2At); err != nil { return } - if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil { - atomic.StoreInt32(motifStep, 3) + if err := motifAddCompositePK(ctx, db, model, trace, logger); err != nil { + return } + atomic.StoreInt32(motifStep, 3)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil {

atomic.StoreInt32(motifStep, 1)

}

if err := sleepWithContext(ctx, step2At-step1At); err != nil {

return

}

if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil {

atomic.StoreInt32(motifStep, 2)

}

if err := sleepWithContext(ctx, step3At-step2At); err != nil {

return

}

if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil {

atomic.StoreInt32(motifStep, 3)

if err := motifAddSiteCode(ctx, db, model, trace, logger); err != nil {

return

}

atomic.StoreInt32(motifStep, 1)

if err := sleepWithContext(ctx, step2At-step1At); err != nil {

return

}

if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err != nil {

return

}

atomic.StoreInt32(motifStep, 2)

if err := sleepWithContext(ctx, step3At-step2At); err != nil {

return

}

if err := motifAddCompositePK(ctx, db, model, trace, logger); err != nil {

return

}

atomic.StoreInt32(motifStep, 3)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/random_ddl_test_runner/motif.go` around lines 38 - 53, Stop the motif sequence as soon as any phase fails in the motif runner. In the function that drives the phases around motifAddSiteCode, motifUnifySiteCodeDefault, and motifAddCompositePK, check each call result and return immediately on error instead of continuing to later sleeps and steps. Only advance motifStep after each phase succeeds so DML workers never switch to the post-PK pattern before the default-unification phase completes.

coderabbitai · 2026-06-29T10:07:20Z

+	var lastPrimary uint64
+	checked := 0
+	for checked < required {
+		n, err := r.runSyncpointDiffChecks(ctx, down, model, required-checked, &lastPrimary, true)


🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Carry the syncpoint cursor into post-workload catch-up.

ensureSyncpointDiffAfterWorkload restarts from lastPrimary = 0, while workload.go only passes the remaining count. If the periodic loop already checked syncpoints, catch-up can recheck the earliest candidates and still report completion without adding the intended new coverage.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go` around lines 161 - 164, Carry the syncpoint cursor forward into post-workload catch-up instead of restarting from zero. In ensureSyncpointDiffAfterWorkload, preserve and reuse the lastPrimary value from the periodic syncpoint scan so the follow-up checks continue from the next candidate rather than rechecking earlier ones. Update the call flow between workload.go and runSyncpointDiffChecks to pass the current cursor along with the remaining count, using the same lastPrimary tracking already present in the syncpoint_diff.go loop.

coderabbitai · 2026-06-29T10:07:20Z

+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the relevant file and nearby context wc -l tests/utils/random_ddl_test_runner/syncpoint_diff.go sed -n '180,270p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 2938

🏁 Script executed:

#!/bin/bash set -euo pipefail set -euo pipefail rg -n "fetchDDLWindows\\(|pickNextSyncpointCandidate\\(|context.WithTimeout\\(|WithDeadline\\(" tests/utils/random_ddl_test_runner/syncpoint_diff.go tests/utils/random_ddl_test_runner -g'*.go' sed -n '1,180p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 6779

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "fetchDDLWindows\\(|pickNextSyncpointCandidate\\(|context.WithTimeout\\(|WithDeadline\\(" tests/utils/random_ddl_test_runner/syncpoint_diff.go tests/utils/random_ddl_test_runner -g'*.go' sed -n '1,180p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 6779

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect where syncpoint diff loop is started and how its context is scoped. rg -n "syncpointDiffLoop\\(|ensureSyncpointDiffAfterWorkload\\(|runSyncpointDiffChecks\\(" tests/utils/random_ddl_test_runner -g'*.go' sed -n '160,260p' tests/utils/random_ddl_test_runner/workload.go sed -n '260,360p' tests/utils/random_ddl_test_runner/workload.go

Repository: pingcap/ticdc

Length of output: 7455

Use a bounded timeout for /ddl/history. fetchDDLWindows uses http.DefaultClient with the long-lived workload context, so a stalled TiDB handler can freeze periodic syncpoint diffing until the test ends. A per-request timeout keeps the loop moving.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go` around lines 225 - 226, fetchDDLWindows currently calls http.DefaultClient.Do with the long-lived workload context, which can block syncpoint diffing on a stalled /ddl/history request. Update the request path in fetchDDLWindows to use a bounded per-request timeout by creating a request-scoped context or dedicated http.Client with timeout before the Do call, so the periodic loop can continue even if TiDB hangs. Keep the change localized to the fetchDDLWindows HTTP request handling.

coderabbitai · 2026-06-29T10:07:20Z

+		if lastCheckpoint == 0 {
+			lastCheckpoint = st.Checkpoint
+			lastAdvance = now
+		} else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
+			lastCheckpoint = st.Checkpoint
+			lastAdvance = now
+		}


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t reset checkpoint progress while checkpoint is zero.

When st.Checkpoint remains 0, lastCheckpoint == 0 stays true and lastAdvance is refreshed every health tick, so NoAdvanceHard never triggers. Only update progress when a non-zero checkpoint changes.

Proposed fix

- if lastCheckpoint == 0 { - lastCheckpoint = st.Checkpoint - lastAdvance = now - } else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint { + if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint { lastCheckpoint = st.Checkpoint lastAdvance = now }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if lastCheckpoint == 0 {

lastCheckpoint = st.Checkpoint

lastAdvance = now

} else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {

lastCheckpoint = st.Checkpoint

lastAdvance = now

}

if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {

lastCheckpoint = st.Checkpoint

lastAdvance = now

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/random_ddl_test_runner/workload.go` around lines 260 - 266, In the health tracking logic around lastCheckpoint and lastAdvance, avoid treating repeated zero checkpoints as progress. Update the progress timestamp only when st.Checkpoint is non-zero and differs from the previous checkpoint, so the initial zero state does not keep resetting lastAdvance and prevent NoAdvanceHard from firing.

hongyunyan and others added 16 commits January 16, 2026 10:32

update

4b73c7f

update

4f27424

tests: add standalone random ddl ci entry

001298f

Merge branch 'master' of https://github.com/pingcap/ticdc into 0115-d…

2c69462

…dl-test

Merge branch '0115-ddl-test' of https://github.com/hongyunyan/tigate …

40a0cf7

…into 0115-ddl-test

update

423c658

temp-update

fe967e2

Merge branch '0115-ddl-test' of https://github.com/hongyunyan/tigate …

e9ef21a

…into 0115-ddl-test

update

47fbbf4

Signed-off-by: root <lidezhu@pingcap.com>

Merge remote-tracking branch 'upstream/master' into 0115-ddl-test

0ffe03f

# Conflicts: # Makefile

Merge remote-tracking branch 'upstream/master' into 0115-ddl-test

df21892

test: extend weekly random DDL converge timeout

6303059

test: reduce weekly random single DML pressure

629180b

debug: disable event service scan window

a0c6c73

eventservice: advance scan window for pending syncpoints

66abece

tests: avoid random DML false positives in log scan

00e40aa

ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. labels Jun 29, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 29, 2026

ti-chi-bot Bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Jun 29, 2026

gemini-code-assist Bot reviewed Jun 29, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

asddongmen mentioned this pull request Jun 30, 2026

eventservice: scan window may stuck wheren there are pending syncpoints #5546

Open

asddongmen marked this pull request as draft June 30, 2026 02:23

asddongmen mentioned this pull request Jun 30, 2026

eventservice: advance scan window for pending syncpoints #5547

Open

ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2026

	rm -rf $WORK_DIR && mkdir -p $WORK_DIR
	rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p -- "$WORK_DIR"

Uh oh!

Conversation

asddongmen commented Jun 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

asddongmen commented Jun 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 29, 2026 •

edited

Loading