Skip to content

tests: add weekly random DDL integration coverage#5542

Draft
asddongmen wants to merge 16 commits into
pingcap:masterfrom
asddongmen:codex/weekly-rand-ddl-debug-20260628
Draft

tests: add weekly random DDL integration coverage#5542
asddongmen wants to merge 16 commits into
pingcap:masterfrom
asddongmen:codex/weekly-rand-ddl-debug-20260628

Conversation

@asddongmen

@asddongmen asddongmen commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

What problem does this PR solve?

Issue Number: close #5543

This PR adds weekly random DDL+DML integration coverage and fixes issues found while running it. The runner also needs to avoid false failures when random DML payload strings contain words such as panic or fatal.

What is changed and how it works?

  • Add a standalone random DDL test runner and weekly random DDL integration cases.
  • Tune the weekly single case DML pressure and extend converge timeout for slow catch-up.
  • Advance the event service scan window when pending syncpoints are present, similar to DDL handling.
  • Tighten runner and shell log scans to match real failure signals such as [PANIC], [FATAL], panic:, fatal error:, and WARNING: DATA RACE, while keeping custom substring patterns supported.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Manual checks run:

  • PATH=/usr/local/go/bin:$PATH GOTOOLCHAIN=auto go test ./tests/utils/random_ddl_test_runner -count=1
  • bash -n for weekly random DDL scripts and _utils/test_prepare
  • Re-scanned the failed /tmp/tidb_cdc_test/weekly_rand_single logs with the new shell rule.
  • Ran make integration_test_weekly_rand_ddl_mysql; workload/converge/syncpoint diff passed, then old log scan rule failed on random payload false positives.

Questions

Will it cause performance regression or break compatibility?

No. Runtime code only expands event-service scan progress when pending syncpoints exist; the remaining changes are test tooling and integration cases.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None

Summary by CodeRabbit

  • New Features

    • Added new weekly random DDL/DML integration test runners for single, multi, failover, and lossy-DDL scenarios.
    • Expanded test tooling to support log scanning, sync-point checks, failover handling, and automated diff verification.
    • Introduced improved workload generation with DDL/DML auto-tuning, conflict traffic, and large transaction coverage.
  • Bug Fixes

    • Improved scan-range handling so progress continues correctly when sync-point barriers are present.
  • Tests

    • Added unit coverage for tuning, DDL generation, selector behavior, log scanning, and scan-range edge cases.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. labels Jun 29, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdojjy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new random_ddl_test_runner Go binary with config, schema model, concurrent DML/DDL workers, auto-tune, motif DDL sequencing, log scanning, capture failover, and syncpoint diff logic. Four new weekly integration test scenarios are introduced alongside Makefile wiring. A separate fix extends getScanTaskDataRange to advance the local scan window when pending sync-point events are present.

Changes

Event broker sync-point scan fix

Layer / File(s) Summary
getScanTaskDataRange sync-point logic and test
pkg/eventservice/event_broker.go, pkg/eventservice/event_broker_test.go
Loads nextSyncPoint timestamp alongside pending-DDL detection and expands the local-scan advancement condition to cover pending sync-point events; adds a unit test asserting CommitTsEnd advances to nextSyncPointTs+1 when global capping would otherwise produce an empty range.

Random DDL test runner binary and integration test scenarios

Layer / File(s) Summary
Runner config, DB helpers, and entry point
tests/utils/random_ddl_test_runner/config.go, tests/utils/random_ddl_test_runner/db.go, tests/utils/random_ddl_test_runner/runner.go, tests/utils/random_ddl_test_runner/logger.go, tests/utils/random_ddl_test_runner/main.go
Defines all JSON-mapped config structs with profile-dependent defaults and validation; MySQL connection open/pool helpers; runner struct constructor; and the main() CLI entry point dispatching bootstrap or workload phases.
In-memory cluster model
tests/utils/random_ddl_test_runner/model.go
Defines domain constants, column/index/schema SQL helpers, deterministic table family catalog, mutable table state, clusterModel construction, and hot/cold/domain table selection helpers plus deterministic value generators.
Bootstrap phase
tests/utils/random_ddl_test_runner/bootstrap.go
Implements runner.bootstrap() to drop/create databases and tables on both upstream and downstream, then inserts deterministic initial rows in batches using parameterized INSERTs.
DDL catalog, selector, and worker
tests/utils/random_ddl_test_runner/ddl.go, tests/utils/random_ddl_test_runner/ddl_worker.go, tests/utils/random_ddl_test_runner/selector.go, tests/utils/random_ddl_test_runner/ddl_test.go, tests/utils/random_ddl_test_runner/selector_test.go
Defines weighted ddlKind catalog and all DDL generators; inverse-weight ddlSelector with sliding-window record; ddlWorker submission loop with tracing and model-apply on success; unit tests for drop-column PK safety, partition guard, and selector coverage-debt behavior.
DML worker, motif DDL sequence, and extra workers
tests/utils/random_ddl_test_runner/dml.go, tests/utils/random_ddl_test_runner/motif.go, tests/utils/random_ddl_test_runner/extra_workers.go, tests/utils/random_ddl_test_runner/dml_test.go
Implements dmlWorker, generic/motif DML builders, buildRandomValue; three-step primary motif DDL sequence (add site_code, unify default, composite PK); bigTxnWorker, conflictWriter; unit tests for ASCII generation and motif insert site_code omission.
Auto-tune concurrency controller
tests/utils/random_ddl_test_runner/autotune.go, tests/utils/random_ddl_test_runner/autotune_test.go
Implements autoTuneStep scaling DML/DDL worker counts based on checkpoint stall and DML success rate with hard-fail on prolonged stall; four unit tests covering all scaling branches.
Health monitor and capture failover
tests/utils/random_ddl_test_runner/health.go, tests/utils/random_ddl_test_runner/failover.go
Implements getChangefeedStatus HTTP client with TSO parsing; failoverLoop killing/restarting captures with optional checkpoint-advance gating; getCapturePID and execCommand helpers.
Log scanning utility
tests/utils/random_ddl_test_runner/logscan.go, tests/utils/random_ddl_test_runner/logscan_test.go
Implements scanLogsForPatterns with buffered line scanning, special-case multi-token detection for panic/fatal/DATA RACE, and file globbing; seven unit tests.
Syncpoint diff loop and inspector integration
tests/utils/random_ddl_test_runner/syncpoint_diff.go
Implements periodic syncpointDiffLoop, DDL-window skipping, TOML config writing, runSyncDiffInspectorWithSnapshotGuard (external-ts-read toggle), retries, and tail capture for failure classification.
Workload orchestration
tests/utils/random_ddl_test_runner/workload.go
Implements workload() starting all concurrent goroutines with cancellation, convergence via finish-mark polling, DML stats writing, final diff_config.toml generation, and log scanning.
Integration test scripts, configs, and CI wiring
Makefile, tests/integration_tests/_utils/test_prepare, tests/integration_tests/run_weekly_rand_ddl_it_in_ci.sh, tests/integration_tests/weekly_rand_single/..., tests/integration_tests/weekly_rand_multi/..., tests/integration_tests/weekly_rand_multi_failover/..., tests/integration_tests/weekly_rand_slow_lossy_ddl/...
Adds four integration test run.sh scripts with matching TOML configs enabling sync-point and scheduler settings; CI entry-point script dispatching by sink type; scan_logs_for_failures in shared test_prepare; new integration_test_weekly_rand_ddl_mysql Makefile target with reduced binary checks.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • hongyunyan
  • lidezhu
  • wk989898

Poem

🐇 Hopping through schemas and DDL galore,
With sync-points and motifs and diff-checks in store,
The workload runs weekly, the logs get a scan,
No panic or fatal escapes the grand plan.
The rabbit approves — let the test suite begin! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 1.65% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: weekly random DDL integration coverage.
Description check ✅ Passed The description includes the required issue link, change summary, tests, questions, and release note sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive random DDL and DML weekly integration testing suite, including a deterministic workload generator and verifier utility (random_ddl_test_runner) and several test scenarios (single capture, multi-capture, failover, and slow lossy DDL). It also updates the event broker to handle pending syncpoint events when advancing scan windows. The review feedback identifies multiple high-severity data races in ddl_worker.go and extra_workers.go due to concurrent unsynchronized reads of table names via tbl.fqName(), and suggests a minor improvement in bootstrap.go to use the standard library's math.MaxInt32 instead of a custom bitwise function.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +113 to +125
start := time.Now()
_, err := db.ExecContext(ctx, sqlText)
if err == nil {
apply()
selector.record(kind.name)
}
if logger != nil {
logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",
workerID, kind.name, tbl.fqName(), time.Since(start), err)
}
if trace != nil {
trace.record(kind.name, tbl.fqName(), sqlText, err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() before executing the DDL, and use the cached string for logging and tracing.

Suggested change
start := time.Now()
_, err := db.ExecContext(ctx, sqlText)
if err == nil {
apply()
selector.record(kind.name)
}
if logger != nil {
logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",
workerID, kind.name, tbl.fqName(), time.Since(start), err)
}
if trace != nil {
trace.record(kind.name, tbl.fqName(), sqlText, err)
}
start := time.Now()
tbl.mu.Lock()
fqName := tbl.fqName()
tbl.mu.Unlock()
_, err := db.ExecContext(ctx, sqlText)
if err == nil {
apply()
selector.record(kind.name)
}
if logger != nil {
logger.Printf("ddl worker=%d kind=%s target=%s elapsed=%s err=%v",
workerID, kind.name, fqName, time.Since(start), err)
}
if trace != nil {
trace.record(kind.name, fqName, sqlText, err)
}

Comment on lines +55 to +119
func runBigInsertTxn(ctx context.Context, db *sql.DB, tbl *table, rows int) error {
// Build a single multi-row INSERT inside a transaction to create a "big txn" workload.
tbl.mu.Lock()
if !tbl.exists {
tbl.mu.Unlock()
return nil
}
schema := tbl.schema.clone()
startID := tbl.nextID
tbl.nextID += int64(rows)
tbl.mu.Unlock()

var cols []column
for _, c := range schema.columns {
if c.generated != "" {
continue
}
cols = append(cols, c)
}
if len(cols) == 0 {
return nil
}

colNames := make([]string, 0, len(cols))
for _, c := range cols {
colNames = append(colNames, c.name)
}

var valuesSQL strings.Builder
var args []any
for i := 0; i < rows; i++ {
if i > 0 {
valuesSQL.WriteString(",")
}
valuesSQL.WriteString("(")
for j := range cols {
if j > 0 {
valuesSQL.WriteString(",")
}
valuesSQL.WriteString("?")
}
valuesSQL.WriteString(")")
rowID := startID + int64(i)
for _, c := range cols {
args = append(args, buildRandomValue(rand.New(rand.NewSource(rowID)), tbl, c, rowID))
}
}

stmt := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
tbl.fqName(),
backtickJoin(colNames),
valuesSQL.String(),
)

tx, err := db.BeginTx(ctx, nil)
if err != nil {
return err
}
if _, err := tx.ExecContext(ctx, stmt, args...); err != nil {
_ = tx.Rollback()
return err
}
return tx.Commit()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() at the beginning of runBigInsertTxn, and use the cached string to construct the SQL statement.

func runBigInsertTxn(ctx context.Context, db *sql.DB, tbl *table, rows int) error {
	// Build a single multi-row INSERT inside a transaction to create a "big txn" workload.
	tbl.mu.Lock()
	if !tbl.exists {
		tbl.mu.Unlock()
		return nil
	}
	fqName := tbl.fqName()
	schema := tbl.schema.clone()
	startID := tbl.nextID
	tbl.nextID += int64(rows)
	tbl.mu.Unlock()

	var cols []column
	for _, c := range schema.columns {
		if c.generated != "" {
			continue
		}
		cols = append(cols, c)
	}
	if len(cols) == 0 {
		return nil
	}

	colNames := make([]string, 0, len(cols))
	for _, c := range cols {
		colNames = append(colNames, c.name)
	}

	var valuesSQL strings.Builder
	var args []any
	for i := 0; i < rows; i++ {
		if i > 0 {
			valuesSQL.WriteString(",")
		}
		valuesSQL.WriteString("(")
		for j := range cols {
			if j > 0 {
				valuesSQL.WriteString(",")
			}
			valuesSQL.WriteString("?")
		}
		valuesSQL.WriteString(")")
		rowID := startID + int64(i)
		for _, c := range cols {
			args = append(args, buildRandomValue(rand.New(rand.NewSource(rowID)), tbl, c, rowID))
		}
	}

	stmt := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		fqName,
		backtickJoin(colNames),
		valuesSQL.String(),
	)

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, stmt, args...); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}

Comment on lines +120 to +173
func conflictWriter(
ctx context.Context,
db *sql.DB,
model *clusterModel,
seed int64,
cfg dmlConfig,
counters *dmlCounters,
) {
// conflictWriter continuously upserts into a small key space to create write conflicts.
// This targets row-level contention and duplicate key paths.
if !cfg.KeyConflictEnabled || cfg.KeyConflictKeyspace <= 0 {
return
}
rng := rand.New(rand.NewSource(seed))
targetTables := collectConflictTables(model)
if len(targetTables) == 0 {
return
}

for {
select {
case <-ctx.Done():
return
default:
}

tbl := targetTables[rng.Intn(len(targetTables))]
tbl.mu.Lock()
exists := tbl.exists
tbl.mu.Unlock()
if !exists {
_ = sleepWithContext(ctx, 200*time.Millisecond)
continue
}

key := rng.Intn(cfg.KeyConflictKeyspace) + 1
stmt := fmt.Sprintf("INSERT INTO %s (`id`,`a`,`b`,`c`,`d`,`e`,`bin`) VALUES (?,?,?,?,?,?,?) "+
"ON DUPLICATE KEY UPDATE `a`=VALUES(`a`),`b`=VALUES(`b`),`c`=VALUES(`c`)",
tbl.fqName(),
)
args := []any{
int64(key),
int32(rng.Intn(1_000_000)),
randASCII(rng, 16),
deterministicDecimal(int64(key)),
deterministicTime(int64(key)),
fmt.Sprintf("{\"k\":%d}", key),
[]byte(fmt.Sprintf("%064x", key)),
}
_, err := db.ExecContext(ctx, stmt, args...)
counters.record(err)
_ = sleepWithContext(ctx, 20*time.Millisecond)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The tbl.fqName() method reads tbl.name without holding tbl.mu. Since tbl.name can be mutated concurrently by the apply function of genRenameTable under tbl.mu.Lock(), calling tbl.fqName() concurrently without a lock causes a data race. To prevent this, retrieve the fully qualified name while holding tbl.mu.Lock() inside conflictWriter, and use the cached string to construct the SQL statement.

func conflictWriter(
	ctx context.Context,
	db *sql.DB,
	model *clusterModel,
	seed int64,
	cfg dmlConfig,
	counters *dmlCounters,
) {
	// conflictWriter continuously upserts into a small key space to create write conflicts.
	// This targets row-level contention and duplicate key paths.
	if !cfg.KeyConflictEnabled || cfg.KeyConflictKeyspace <= 0 {
		return
	}
	rng := rand.New(rand.NewSource(seed))
	targetTables := collectConflictTables(model)
	if len(targetTables) == 0 {
		return
	}

	for {
		select {
		case <-ctx.Done():
			return
		default:
		}

		tbl := targetTables[rng.Intn(len(targetTables))]
		tbl.mu.Lock()
		exists := tbl.exists
		fqName := tbl.fqName()
		tbl.mu.Unlock()
		if !exists {
			_ = sleepWithContext(ctx, 200*time.Millisecond)
			continue
		}

		key := rng.Intn(cfg.KeyConflictKeyspace) + 1
		stmt := fmt.Sprintf("INSERT INTO %s (`id`,`a`,`b`,`c`,`d`,`e`,`bin`) VALUES (?,?,?,?,?,?,?) "+
			"ON DUPLICATE KEY UPDATE `a`=VALUES(`a`),`b`=VALUES(`b`),`c`=VALUES(`c`)",
			fqName,
		)
		args := []any{
			int64(key),
			int32(rng.Intn(1_000_000)),
			randASCII(rng, 16),
			deterministicDecimal(int64(key)),
			deterministicTime(int64(key)),
			fmt.Sprintf("{\"k\":%d}", key),
			[]byte(fmt.Sprintf("%064x", key)),
		}
		_, err := db.ExecContext(ctx, stmt, args...)
		counters.record(err)
		_ = sleepWithContext(ctx, 20*time.Millisecond)
	}
}

Comment on lines +197 to +199
func mathMaxInt32() int64 {
return int64(^uint32(0) >> 1)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of implementing a custom bitwise function mathMaxInt32() to compute the maximum value of a signed 32-bit integer, use math.MaxInt32 from the standard math package. This is cleaner, more idiomatic, and improves code maintainability.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 20

🧹 Nitpick comments (2)
tests/utils/random_ddl_test_runner/ddl_test.go (1)

9-47: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Rename these tests to camelCase.

Lines 9 and 33 declare test functions with underscores, which violates the repo Go naming rule for functions. As per coding guidelines, "Functions should use camelCase naming and do not include underscores (e.g., getPartitionNum, not get_partition_num)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/ddl_test.go` around lines 9 - 47, Rename
the test functions in ddl_test.go to camelCase to match the Go naming rule and
repo guidelines: update TestGenDropColumn_DoesNotDropPrimaryKey and
TestGenAddPartition_RequiresRangePartition to use camelCase without underscores,
while keeping their behavior and references to genDropColumn and genAddPartition
unchanged.

Source: Coding guidelines

tests/utils/random_ddl_test_runner/selector_test.go (1)

8-30: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Rename this test to camelCase.

Line 8 uses underscores in the test function name, which violates the repo Go naming rule for functions. As per coding guidelines, "Functions should use camelCase naming and do not include underscores (e.g., getPartitionNum, not get_partition_num)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/selector_test.go` around lines 8 - 30, The
test function name in the selector test uses underscores and should be renamed
to camelCase to match the repo’s Go naming rule. Update the
`TestDDLSelector_CoverageDebtLikeBehavior` symbol in `selector_test.go` to a
camelCase form with no underscores, keeping the same test logic and coverage
behavior unchanged.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration_tests/weekly_rand_multi_failover/run.sh`:
- Line 149: The workdir cleanup in the script is using an unguarded WORK_DIR
value, which can make the reset dangerous if the variable is empty or malformed.
Update the reset logic around the rm -rf in the run.sh flow to validate WORK_DIR
before deletion, and only proceed when it is a non-empty expected path; use the
WORK_DIR handling near the rm -rf/mkdir sequence as the place to add the guard.

In `@tests/integration_tests/weekly_rand_multi/run.sh`:
- Line 143: The cleanup command in the weekly_rand_multi run script is unsafe
because rm -rf uses WORK_DIR unquoted and without validating it first. Update
the WORK_DIR cleanup logic to check that WORK_DIR is set and non-empty before
deleting, and quote the variable everywhere it is used in the rm/mkdir sequence
to avoid accidental expansion or deletion of the wrong directory tree.
- Around line 118-120: The MySQL sink fallback in the run script is hardcoded to
localhost, which bypasses the configured downstream endpoint. Update the sink
URI construction in the case branch that sets SINK_URI so it uses the same
downstream host and port variables already used elsewhere in the script,
matching the runner config and SQL target. Keep the change localized to the
SINK_URI assignment and use the existing DOWN_TIDB_HOST and DOWN_TIDB_PORT
symbols for consistency.

In `@tests/integration_tests/weekly_rand_single/run.sh`:
- Around line 113-115: The MySQL sink URI is hard-coded to localhost in the sink
selection branch, which makes it diverge from the configured downstream used
elsewhere in run.sh. Update the SINK_URI assignment in the MySQL case to use the
same downstream host and port variables already referenced by the runner config
and run_sql checks, so the changefeed replicates into the intended target. Keep
the fix localized to the sink URI setup logic in the shell script.
- Line 138: The cleanup step in the weekly_rand_single run script uses an
unquoted recursive delete on WORK_DIR, which can target the wrong path if the
variable is empty or malformed. Update the rm -rf invocation to guard against
empty WORK_DIR and quote the variable expansion, and keep the mkdir -p step tied
to the same validated WORK_DIR value so the safe path is only removed and
recreated when it is set correctly.

In `@tests/integration_tests/weekly_rand_slow_lossy_ddl/run.sh`:
- Line 95: Guard the workdir reset in the weekly_rand_slow_lossy_ddl run.sh
script before the rm -rf call, since WORK_DIR can be blank or malformed and
cause deletion outside the test sandbox. Update the startup cleanup around the
WORK_DIR handling so the recursive delete only runs when WORK_DIR is verified as
a safe, non-empty path, keeping the existing mkdir -p flow intact.
- Around line 123-128: The weekly_rand_slow_lossy_ddl run script exits right
after check_sync_diff, so late PANIC/FATAL/DATA RACE messages in captured logs
can be missed. Add the same post-diff shell-side log scan used by the sibling
weekly failover script immediately after check_sync_diff in run.sh, using the
existing scan_logs_for_failures helper and the current WORK_DIR/TEST_NAME
context to detect failures before printing success.

In `@tests/utils/random_ddl_test_runner/bootstrap.go`:
- Around line 17-18: The bootstrap flow is using an unbounded context for
DDL/DML execution, which can let schema creation or bulk insert calls hang
indefinitely. Update `bootstrap()` in `random_ddl_test_runner/bootstrap.go` to
create a timeout/deadline-aware context for the `ExecContext` work, and thread
that same bounded context through the related bootstrap phases and helpers at
the referenced call sites so the TiDB/MySQL operations fail fast instead of
wedging the weekly job.

In `@tests/utils/random_ddl_test_runner/config.go`:
- Around line 210-221: The config defaults in the DML initialization logic are
overwriting explicit user-supplied false/0 values, so update the
parsing/defaulting path to distinguish “unset” from “set to zero/false.” In the
code that sets c.DML.HotspotRatio, c.DML.HotTableRatio, c.DML.BigTxnInterval,
c.DML.BigTxnEnabled, and c.DML.KeyConflictEnabled (and the related
log_scan_enabled/fail_on_panic_match handling noted in the review), only apply
defaults when the field was omitted, not when it was explicitly provided as 0 or
false. Use presence-aware fields or separate “was set” checks in the config
loader so runner_config.json values are preserved.

In `@tests/utils/random_ddl_test_runner/ddl_worker.go`:
- Around line 132-145: The pickMissingTable helper is using only 10 random
probes, which can miss an existing dropped table and return nil too often.
Update pickMissingTable to deterministically scan the candidates slice for a
table whose exists flag is false, using the existing table.mu lock, and return
the first missing table found; keep the rand.Rand parameter only if it is still
needed elsewhere in ddl_worker.go.

In `@tests/utils/random_ddl_test_runner/dml.go`:
- Around line 342-346: The VARCHAR generator in randASCII handling should
respect the actual c.typ.varcharN limit instead of forcing a minimum of 8, since
that can exceed narrowed columns and trigger Data too long errors. Update the
logic in the VARCHAR case to cap the requested length by varcharN and use 1 as
the lower bound, while preserving the special pad column behavior in the same
switch.

In `@tests/utils/random_ddl_test_runner/extra_workers.go`:
- Around line 155-170: The conflictWriter upsert in extra_workers.go is
hard-coded to a fixed column list, so it keeps failing after random DDL changes
to tbl’s schema. Update the logic around the conflict insert path to either
derive the INSERT/UPDATE columns from the current table schema while holding
tbl.mu, or switch conflictWriter to a table family that is not touched by
column-altering DDL. Keep the reference points in conflictWriter and
collectConflictTables in mind so the statement always matches the live schema
before calling db.ExecContext.

In `@tests/utils/random_ddl_test_runner/failover.go`:
- Around line 121-133: The status probe in the capture helper is only limited by
the workload context, so a slow `/status` request can hang failover handling.
Update the request flow in the capture status check to use a short per-request
timeout, ideally by creating a derived context or client timeout around the
existing http.DefaultClient.Do calls in this helper, and keep the fallback
basic-auth retry path in the same timeout-bounded logic.

In `@tests/utils/random_ddl_test_runner/health.go`:
- Around line 54-64: The checkpoint decode in the health response parser is
losing TSO precision because `json.Decoder` is unmarshaling numeric fields into
`float64` before `parseUint64` sees them. Update the decode path in `health.go`
to call `dec.UseNumber()` and make the `changefeedStatus` parsing logic handle
`json.Number` for `checkpoint_tso` and `checkpoint_ts` so the exact integer
value is preserved during comparisons.

In `@tests/utils/random_ddl_test_runner/main.go`:
- Around line 77-84: `main()` currently dispatches to `runner.workload()`
without validating workload-only required config, so bad `cdc.changefeed_id`
values fail later with weaker errors. Add an upfront check in the workload
branch before calling `runner.workload()`, using the existing config/runner
setup that reads `cdc.changefeed_id`, and return a clear error immediately when
it is missing or empty. Keep the bootstrap path unchanged and use the `main()`
phase switch plus `runner.workload()` as the entry points for the fix.

In `@tests/utils/random_ddl_test_runner/model.go`:
- Around line 113-118: `tableSchema.clone()` is only doing a shallow copy of
`indexes`, so each `index.columns` slice still aliases the original backing
array. Update the cloning logic in `tableSchema.clone()` to deep-copy every
`index` entry, including its `columns` field, while preserving the existing
copies of `columns` and `primaryKey`. Use the `tableSchema.clone()` and `index`
symbols to locate the fix.

In `@tests/utils/random_ddl_test_runner/motif.go`:
- Around line 38-53: Stop the motif sequence as soon as any phase fails in the
motif runner. In the function that drives the phases around motifAddSiteCode,
motifUnifySiteCodeDefault, and motifAddCompositePK, check each call result and
return immediately on error instead of continuing to later sleeps and steps.
Only advance motifStep after each phase succeeds so DML workers never switch to
the post-PK pattern before the default-unification phase completes.

In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go`:
- Around line 161-164: Carry the syncpoint cursor forward into post-workload
catch-up instead of restarting from zero. In ensureSyncpointDiffAfterWorkload,
preserve and reuse the lastPrimary value from the periodic syncpoint scan so the
follow-up checks continue from the next candidate rather than rechecking earlier
ones. Update the call flow between workload.go and runSyncpointDiffChecks to
pass the current cursor along with the remaining count, using the same
lastPrimary tracking already present in the syncpoint_diff.go loop.
- Around line 225-226: fetchDDLWindows currently calls http.DefaultClient.Do
with the long-lived workload context, which can block syncpoint diffing on a
stalled /ddl/history request. Update the request path in fetchDDLWindows to use
a bounded per-request timeout by creating a request-scoped context or dedicated
http.Client with timeout before the Do call, so the periodic loop can continue
even if TiDB hangs. Keep the change localized to the fetchDDLWindows HTTP
request handling.

In `@tests/utils/random_ddl_test_runner/workload.go`:
- Around line 260-266: In the health tracking logic around lastCheckpoint and
lastAdvance, avoid treating repeated zero checkpoints as progress. Update the
progress timestamp only when st.Checkpoint is non-zero and differs from the
previous checkpoint, so the initial zero state does not keep resetting
lastAdvance and prevent NoAdvanceHard from firing.

---

Nitpick comments:
In `@tests/utils/random_ddl_test_runner/ddl_test.go`:
- Around line 9-47: Rename the test functions in ddl_test.go to camelCase to
match the Go naming rule and repo guidelines: update
TestGenDropColumn_DoesNotDropPrimaryKey and
TestGenAddPartition_RequiresRangePartition to use camelCase without underscores,
while keeping their behavior and references to genDropColumn and genAddPartition
unchanged.

In `@tests/utils/random_ddl_test_runner/selector_test.go`:
- Around line 8-30: The test function name in the selector test uses underscores
and should be renamed to camelCase to match the repo’s Go naming rule. Update
the `TestDDLSelector_CoverageDebtLikeBehavior` symbol in `selector_test.go` to a
camelCase form with no underscores, keeping the same test logic and coverage
behavior unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0746e1ef-db06-4c9f-9650-30fdb250b2c9

📥 Commits

Reviewing files that changed from the base of the PR and between 657ec82 and 00e40aa.

📒 Files selected for processing (42)
  • Makefile
  • pkg/eventservice/event_broker.go
  • pkg/eventservice/event_broker_test.go
  • tests/integration_tests/_utils/test_prepare
  • tests/integration_tests/run_weekly_rand_ddl_it_in_ci.sh
  • tests/integration_tests/weekly_rand_multi/conf/changefeed.toml
  • tests/integration_tests/weekly_rand_multi/conf/changefeed_mysql.toml
  • tests/integration_tests/weekly_rand_multi/conf/consumer.toml
  • tests/integration_tests/weekly_rand_multi/run.sh
  • tests/integration_tests/weekly_rand_multi_failover/conf/changefeed.toml
  • tests/integration_tests/weekly_rand_multi_failover/conf/changefeed_mysql.toml
  • tests/integration_tests/weekly_rand_multi_failover/conf/consumer.toml
  • tests/integration_tests/weekly_rand_multi_failover/run.sh
  • tests/integration_tests/weekly_rand_single/conf/changefeed_mysql.toml
  • tests/integration_tests/weekly_rand_single/conf/consumer.toml
  • tests/integration_tests/weekly_rand_single/run.sh
  • tests/integration_tests/weekly_rand_slow_lossy_ddl/conf/changefeed_mysql.toml
  • tests/integration_tests/weekly_rand_slow_lossy_ddl/run.sh
  • tests/utils/random_ddl_test_runner/autotune.go
  • tests/utils/random_ddl_test_runner/autotune_test.go
  • tests/utils/random_ddl_test_runner/bootstrap.go
  • tests/utils/random_ddl_test_runner/config.go
  • tests/utils/random_ddl_test_runner/db.go
  • tests/utils/random_ddl_test_runner/ddl.go
  • tests/utils/random_ddl_test_runner/ddl_test.go
  • tests/utils/random_ddl_test_runner/ddl_worker.go
  • tests/utils/random_ddl_test_runner/dml.go
  • tests/utils/random_ddl_test_runner/dml_test.go
  • tests/utils/random_ddl_test_runner/extra_workers.go
  • tests/utils/random_ddl_test_runner/failover.go
  • tests/utils/random_ddl_test_runner/health.go
  • tests/utils/random_ddl_test_runner/logger.go
  • tests/utils/random_ddl_test_runner/logscan.go
  • tests/utils/random_ddl_test_runner/logscan_test.go
  • tests/utils/random_ddl_test_runner/main.go
  • tests/utils/random_ddl_test_runner/model.go
  • tests/utils/random_ddl_test_runner/motif.go
  • tests/utils/random_ddl_test_runner/runner.go
  • tests/utils/random_ddl_test_runner/selector.go
  • tests/utils/random_ddl_test_runner/selector_test.go
  • tests/utils/random_ddl_test_runner/syncpoint_diff.go
  • tests/utils/random_ddl_test_runner/workload.go


trap 'cleanup' EXIT

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the workdir reset before rm -rf.

Line 149 deletes an unguarded variable path. If WORK_DIR is empty or malformed, this cleanup can wipe the wrong directory and make the job fail destructively.

Suggested fix
-rm -rf $WORK_DIR && mkdir -p $WORK_DIR
+rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p -- "$WORK_DIR"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf $WORK_DIR && mkdir -p $WORK_DIR
rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p -- "$WORK_DIR"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 149-149: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 149-149: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/weekly_rand_multi_failover/run.sh` at line 149, The
workdir cleanup in the script is using an unguarded WORK_DIR value, which can
make the reset dangerous if the variable is empty or malformed. Update the reset
logic around the rm -rf in the run.sh flow to validate WORK_DIR before deletion,
and only proceed when it is a non-empty expected path; use the WORK_DIR handling
near the rm -rf/mkdir sequence as the place to add the guard.

Source: Linters/SAST tools

Comment on lines +118 to +120
*)
SINK_URI="mysql://root@127.0.0.1:3306/"
;;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use the configured downstream address for the MySQL sink.

This script writes runner config and downstream SQL against ${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}, but the mysql sink still points at 127.0.0.1:3306. That breaks the test whenever the downstream endpoint is overridden.

Suggested fix
 	*)
-		SINK_URI="mysql://root@127.0.0.1:3306/"
+		SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"
 		;;
 	esac
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
*)
SINK_URI="mysql://root@127.0.0.1:3306/"
;;
*)
SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"
;;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/weekly_rand_multi/run.sh` around lines 118 - 120, The
MySQL sink fallback in the run script is hardcoded to localhost, which bypasses
the configured downstream endpoint. Update the sink URI construction in the case
branch that sets SINK_URI so it uses the same downstream host and port variables
already used elsewhere in the script, matching the runner config and SQL target.
Keep the change localized to the SINK_URI assignment and use the existing
DOWN_TIDB_HOST and DOWN_TIDB_PORT symbols for consistency.


trap 'cleanup' EXIT

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the recursive delete with quotes and a non-empty check.

If WORK_DIR resolves unexpectedly, this can delete the wrong directory tree.

Suggested fix
-rm -rf $WORK_DIR && mkdir -p $WORK_DIR
+rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf $WORK_DIR && mkdir -p $WORK_DIR
rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 143-143: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 143-143: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/weekly_rand_multi/run.sh` at line 143, The cleanup
command in the weekly_rand_multi run script is unsafe because rm -rf uses
WORK_DIR unquoted and without validating it first. Update the WORK_DIR cleanup
logic to check that WORK_DIR is set and non-empty before deleting, and quote the
variable everywhere it is used in the rm/mkdir sequence to avoid accidental
expansion or deletion of the wrong directory tree.

Source: Linters/SAST tools

Comment on lines +113 to +115
*)
SINK_URI="mysql://root@127.0.0.1:3306/"
;;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Use the configured downstream address for the MySQL sink.

The runner config, run_sql calls, and final diff all target ${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}, but the changefeed sink is hard-coded to 127.0.0.1:3306. In any non-default topology this replicates into one downstream and validates another.

Suggested fix
 	*)
-		SINK_URI="mysql://root@127.0.0.1:3306/"
+		SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"
 		;;
 	esac
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
*)
SINK_URI="mysql://root@127.0.0.1:3306/"
;;
*)
SINK_URI="mysql://root@${DOWN_TIDB_HOST}:${DOWN_TIDB_PORT}/"
;;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/weekly_rand_single/run.sh` around lines 113 - 115,
The MySQL sink URI is hard-coded to localhost in the sink selection branch,
which makes it diverge from the configured downstream used elsewhere in run.sh.
Update the SINK_URI assignment in the MySQL case to use the same downstream host
and port variables already referenced by the runner config and run_sql checks,
so the changefeed replicates into the intended target. Keep the fix localized to
the sink URI setup logic in the shell script.


trap 'cleanup' EXIT

rm -rf $WORK_DIR && mkdir -p $WORK_DIR

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Guard the recursive delete with quotes and a non-empty check.

If WORK_DIR is empty or malformed, this can remove an unintended path.

Suggested fix
-rm -rf $WORK_DIR && mkdir -p $WORK_DIR
+rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rm -rf $WORK_DIR && mkdir -p $WORK_DIR
rm -rf -- "${WORK_DIR:?WORK_DIR is unset}" && mkdir -p "$WORK_DIR"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 138-138: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 138-138: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/weekly_rand_single/run.sh` at line 138, The cleanup
step in the weekly_rand_single run script uses an unquoted recursive delete on
WORK_DIR, which can target the wrong path if the variable is empty or malformed.
Update the rm -rf invocation to guard against empty WORK_DIR and quote the
variable expansion, and keep the mkdir -p step tied to the same validated
WORK_DIR value so the safe path is only removed and recreated when it is set
correctly.

Source: Linters/SAST tools

Comment on lines +113 to +118
func (s tableSchema) clone() tableSchema {
cp := s
cp.columns = append([]column(nil), s.columns...)
cp.primaryKey = append([]string(nil), s.primaryKey...)
cp.indexes = append([]index(nil), s.indexes...)
return cp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Deep-copy index.columns in tableSchema.clone().

cp.indexes = append([]index(nil), s.indexes...) only copies the outer slice. Each cloned index.columns still shares the same backing array, so later DDL mutations can leak across schema, initialSchema, and sibling tables cloned from the same family.

Possible fix
 func (s tableSchema) clone() tableSchema {
 	cp := s
 	cp.columns = append([]column(nil), s.columns...)
 	cp.primaryKey = append([]string(nil), s.primaryKey...)
-	cp.indexes = append([]index(nil), s.indexes...)
+	cp.indexes = make([]index, len(s.indexes))
+	for i, idx := range s.indexes {
+		cp.indexes[i] = idx
+		cp.indexes[i].columns = append([]string(nil), idx.columns...)
+	}
 	return cp
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (s tableSchema) clone() tableSchema {
cp := s
cp.columns = append([]column(nil), s.columns...)
cp.primaryKey = append([]string(nil), s.primaryKey...)
cp.indexes = append([]index(nil), s.indexes...)
return cp
func (s tableSchema) clone() tableSchema {
cp := s
cp.columns = append([]column(nil), s.columns...)
cp.primaryKey = append([]string(nil), s.primaryKey...)
cp.indexes = make([]index, len(s.indexes))
for i, idx := range s.indexes {
cp.indexes[i] = idx
cp.indexes[i].columns = append([]string(nil), idx.columns...)
}
return cp
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/model.go` around lines 113 - 118,
`tableSchema.clone()` is only doing a shallow copy of `indexes`, so each
`index.columns` slice still aliases the original backing array. Update the
cloning logic in `tableSchema.clone()` to deep-copy every `index` entry,
including its `columns` field, while preserving the existing copies of `columns`
and `primaryKey`. Use the `tableSchema.clone()` and `index` symbols to locate
the fix.

Comment on lines +38 to +53
if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 1)
}

if err := sleepWithContext(ctx, step2At-step1At); err != nil {
return
}
if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 2)
}

if err := sleepWithContext(ctx, step3At-step2At); err != nil {
return
}
if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 3)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stop the motif sequence on the first failed phase.

Step 2 and step 3 still run after earlier failures. That can set motifStep=3 without ever finishing the default-unification phase, so DML workers switch to the post-PK pattern against rows that can never match site_code=''.

Suggested fix
 	if err := sleepWithContext(ctx, step1At); err != nil {
 		return
 	}
-	if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil {
-		atomic.StoreInt32(motifStep, 1)
+	if err := motifAddSiteCode(ctx, db, model, trace, logger); err != nil {
+		return
 	}
+	atomic.StoreInt32(motifStep, 1)
 
 	if err := sleepWithContext(ctx, step2At-step1At); err != nil {
 		return
 	}
-	if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil {
-		atomic.StoreInt32(motifStep, 2)
+	if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err != nil {
+		return
 	}
+	atomic.StoreInt32(motifStep, 2)
 
 	if err := sleepWithContext(ctx, step3At-step2At); err != nil {
 		return
 	}
-	if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil {
-		atomic.StoreInt32(motifStep, 3)
+	if err := motifAddCompositePK(ctx, db, model, trace, logger); err != nil {
+		return
 	}
+	atomic.StoreInt32(motifStep, 3)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err := motifAddSiteCode(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 1)
}
if err := sleepWithContext(ctx, step2At-step1At); err != nil {
return
}
if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 2)
}
if err := sleepWithContext(ctx, step3At-step2At); err != nil {
return
}
if err := motifAddCompositePK(ctx, db, model, trace, logger); err == nil {
atomic.StoreInt32(motifStep, 3)
if err := motifAddSiteCode(ctx, db, model, trace, logger); err != nil {
return
}
atomic.StoreInt32(motifStep, 1)
if err := sleepWithContext(ctx, step2At-step1At); err != nil {
return
}
if err := motifUnifySiteCodeDefault(ctx, db, model, trace, logger); err != nil {
return
}
atomic.StoreInt32(motifStep, 2)
if err := sleepWithContext(ctx, step3At-step2At); err != nil {
return
}
if err := motifAddCompositePK(ctx, db, model, trace, logger); err != nil {
return
}
atomic.StoreInt32(motifStep, 3)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/motif.go` around lines 38 - 53, Stop the
motif sequence as soon as any phase fails in the motif runner. In the function
that drives the phases around motifAddSiteCode, motifUnifySiteCodeDefault, and
motifAddCompositePK, check each call result and return immediately on error
instead of continuing to later sleeps and steps. Only advance motifStep after
each phase succeeds so DML workers never switch to the post-PK pattern before
the default-unification phase completes.

Comment on lines +161 to +164
var lastPrimary uint64
checked := 0
for checked < required {
n, err := r.runSyncpointDiffChecks(ctx, down, model, required-checked, &lastPrimary, true)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Carry the syncpoint cursor into post-workload catch-up.

ensureSyncpointDiffAfterWorkload restarts from lastPrimary = 0, while workload.go only passes the remaining count. If the periodic loop already checked syncpoints, catch-up can recheck the earliest candidates and still report completion without adding the intended new coverage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go` around lines 161 - 164,
Carry the syncpoint cursor forward into post-workload catch-up instead of
restarting from zero. In ensureSyncpointDiffAfterWorkload, preserve and reuse
the lastPrimary value from the periodic syncpoint scan so the follow-up checks
continue from the next candidate rather than rechecking earlier ones. Update the
call flow between workload.go and runSyncpointDiffChecks to pass the current
cursor along with the remaining count, using the same lastPrimary tracking
already present in the syncpoint_diff.go loop.

Comment on lines +225 to +226
resp, err := http.DefaultClient.Do(req)
if err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the relevant file and nearby context
wc -l tests/utils/random_ddl_test_runner/syncpoint_diff.go
sed -n '180,270p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 2938


🏁 Script executed:

#!/bin/bash
set -euo pipefail

set -euo pipefail
rg -n "fetchDDLWindows\\(|pickNextSyncpointCandidate\\(|context.WithTimeout\\(|WithDeadline\\(" tests/utils/random_ddl_test_runner/syncpoint_diff.go tests/utils/random_ddl_test_runner -g'*.go'
sed -n '1,180p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 6779


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "fetchDDLWindows\\(|pickNextSyncpointCandidate\\(|context.WithTimeout\\(|WithDeadline\\(" tests/utils/random_ddl_test_runner/syncpoint_diff.go tests/utils/random_ddl_test_runner -g'*.go'
sed -n '1,180p' tests/utils/random_ddl_test_runner/syncpoint_diff.go

Repository: pingcap/ticdc

Length of output: 6779


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect where syncpoint diff loop is started and how its context is scoped.
rg -n "syncpointDiffLoop\\(|ensureSyncpointDiffAfterWorkload\\(|runSyncpointDiffChecks\\(" tests/utils/random_ddl_test_runner -g'*.go'
sed -n '160,260p' tests/utils/random_ddl_test_runner/workload.go
sed -n '260,360p' tests/utils/random_ddl_test_runner/workload.go

Repository: pingcap/ticdc

Length of output: 7455


Use a bounded timeout for /ddl/history. fetchDDLWindows uses http.DefaultClient with the long-lived workload context, so a stalled TiDB handler can freeze periodic syncpoint diffing until the test ends. A per-request timeout keeps the loop moving.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/syncpoint_diff.go` around lines 225 - 226,
fetchDDLWindows currently calls http.DefaultClient.Do with the long-lived
workload context, which can block syncpoint diffing on a stalled /ddl/history
request. Update the request path in fetchDDLWindows to use a bounded per-request
timeout by creating a request-scoped context or dedicated http.Client with
timeout before the Do call, so the periodic loop can continue even if TiDB
hangs. Keep the change localized to the fetchDDLWindows HTTP request handling.

Comment on lines +260 to +266
if lastCheckpoint == 0 {
lastCheckpoint = st.Checkpoint
lastAdvance = now
} else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
lastCheckpoint = st.Checkpoint
lastAdvance = now
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don’t reset checkpoint progress while checkpoint is zero.

When st.Checkpoint remains 0, lastCheckpoint == 0 stays true and lastAdvance is refreshed every health tick, so NoAdvanceHard never triggers. Only update progress when a non-zero checkpoint changes.

Proposed fix
-		if lastCheckpoint == 0 {
-			lastCheckpoint = st.Checkpoint
-			lastAdvance = now
-		} else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
+		if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
 			lastCheckpoint = st.Checkpoint
 			lastAdvance = now
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if lastCheckpoint == 0 {
lastCheckpoint = st.Checkpoint
lastAdvance = now
} else if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
lastCheckpoint = st.Checkpoint
lastAdvance = now
}
if st.Checkpoint != 0 && st.Checkpoint != lastCheckpoint {
lastCheckpoint = st.Checkpoint
lastAdvance = now
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/random_ddl_test_runner/workload.go` around lines 260 - 266, In
the health tracking logic around lastCheckpoint and lastAdvance, avoid treating
repeated zero checkpoints as progress. Update the progress timestamp only when
st.Checkpoint is non-zero and differs from the previous checkpoint, so the
initial zero state does not keep resetting lastAdvance and prevent NoAdvanceHard
from firing.

@asddongmen asddongmen marked this pull request as draft June 30, 2026 02:23
@ti-chi-bot ti-chi-bot Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/needs-triage-completed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Weekly random DDL test can stall or fail on false log scan matches

3 participants