Skip to content

fix(planner): correct retry wall-clock formula + lock table contract#212

Merged
bdchatham merged 1 commit intomainfrom
fix/planner-retry-comment-and-test
May 8, 2026
Merged

fix(planner): correct retry wall-clock formula + lock table contract#212
bdchatham merged 1 commit intomainfrom
fix/planner-retry-comment-and-test

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented May 8, 2026

Follow-up to #209. Coral platform-engineer review caught two items post-merge.

1. Wall-clock formula in taskMaxRetries godoc was wrong

The executor at internal/planner/executor.go:189-196 increments t.RetryCount before calling retryBackoff, so the smallest argument ever passed is 1, not 0:

if t.MaxRetries > 0 && t.RetryCount < t.MaxRetries {
    t.RetryCount++  // increments to 1+ first
    ...
    return ctrl.Result{RequeueAfter: retryBackoff(t.RetryCount)}
}

retryBackoff(1) = 10s, retryBackoff(2) = 20s, retryBackoff(N≥3) caps at 30s. Real wall-clock for N retries:

N=1: 10s
N≥2: 10s + 20s + 30s*(N-2)

The prior comment claimed 5s + 10s + 20s + 30s*(N-3) for N>=3 — under-counts by one term and starts at the wrong value. Operators sizing budgets from the godoc would compute the wrong wall-clock — e.g. discoverPeersMaxRetries=20 was documented as ~9 min but the real value is ~9.5 min.

Bigger drift on the larger budgets: genesisConfigureMaxRetries=180 is documented in bootstrap.go:151-153 as ~30 min but actual wall-clock is ~89.5 min. Fixing bootstrap.go's comment is out of scope here (separate pre-existing bug, file as follow-up if it bites).

2. Lock the lookup table directly

The existing TestArchivePlanner_WithPeers (archive_test.go:79-83) catches the MaxRetries != discoverPeersMaxRetries regression but doesn't pin the lookup itself — a refactor that swaps discoverPeersMaxRetries for genesisConfigureMaxRetries in the switch would compile fine and the assertion would still find a non-zero value. Adding a direct table test:

func TestTaskMaxRetries(t *testing.T) {
    cases := map[string]int{
        TaskConfigureGenesis: genesisConfigureMaxRetries,
        TaskAssembleGenesis:  groupAssemblyMaxRetries,
        TaskDiscoverPeers:    discoverPeersMaxRetries,
        "unknown-task-type":  0,
        "":                   0,
    }
    ...
}

Locks the contract. Adding a new retry-aware task type adds one switch entry + one map entry — no per-call-site test changes needed.

Test plan

  • go test ./... green
  • CI builds new controller image (no behavior change in the binary, just a comment + a test)

Note

Low Risk
Low risk: no runtime behavior changes, only a comment correction and a small unit test asserting the retry-budget mapping for task types.

Overview
Corrects the taskMaxRetries godoc to reflect the executor’s actual retry/backoff sequence (since RetryCount is incremented before retryBackoff is called).

Adds TestTaskMaxRetries to directly assert the task-type→max-retries mapping (including unknown/empty task types returning 0), helping prevent accidental regressions during refactors.

Reviewed by Cursor Bugbot for commit 252dc11. Bugbot is set up for automated code reviews on this repo. Configure here.

Follow-up to #209. Coral platform-engineer review caught two items
post-merge:

1. The wall-clock formula in taskMaxRetries' godoc was wrong. The
   executor increments t.RetryCount BEFORE calling retryBackoff
   (executor.go:189-196), so the smallest argument ever passed is 1,
   not 0. retryBackoff(1)=10s, retryBackoff(2)=20s, retryBackoff(N>=3)
   caps at 30s. Real wall-clock: 10s for N=1; 10s + 20s + 30s*(N-2)
   for N>=2.

   Operators sizing budgets from the prior comment would compute the
   wrong wall-clock — e.g. discoverPeersMaxRetries=20 was documented
   as ~9 min but the real value is ~9.5 min.

2. The existing test in archive_test.go catches the regression
   (0 != 20) but doesn't pin the lookup table itself. A direct
   TestTaskMaxRetries makes a table swap fail loud — e.g. swapping
   discoverPeersMaxRetries for genesisConfigureMaxRetries would
   compile fine but the table test would catch it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit 2609c17 into main May 8, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant