Skip to content

fix(proton): classify server errors as permanent in batch retry loops#70

Merged
yokofly merged 4 commits intomainfrom
fix/proton-permanent-error-retry
Apr 7, 2026
Merged

fix(proton): classify server errors as permanent in batch retry loops#70
yokofly merged 4 commits intomainfrom
fix/proton-permanent-error-retry

Conversation

@yokofly
Copy link
Copy Markdown
Collaborator

@yokofly yokofly commented Apr 4, 2026

Summary

  • Proton server errors (deterministic rejects like duplicate columns, syntax errors) now fail fast via backoff.Permanent instead of retrying for up to 5 minutes
  • Transient errors (network timeout, EOF, connection reset) continue to retry with exponential backoff
  • Applied at all retry sites: processBatch, BulkImportStreamColumnar, ExecContext, and outer import retry boundary in task_run.go
  • Added unit tests for permanent/transient error classification and retry behavior

Context

Previously, all errors in batch retry loops were retried equally. A server rejection like "Column id specified more than once" would retry for the full 5-minute backoff budget, causing apparent deadlocks in CI. With this change, server errors are detected by matching the proto.Exception string format ("code: NN, message: ...") and wrapped with backoff.Permanent to stop immediately.

Test plan

  • TestPermanentServerError — 8 subtests verifying classification of server vs transient errors
  • TestTransientErrorRetries — proves permanent error = 1 attempt, transient error = retries until success
  • Smoke test suite: 15 passed, 3 failed (pre-existing timezone issue, not related)

🤖 Generated with Claude Code

yokofly and others added 4 commits April 4, 2026 02:10
Add retry classification so Proton server errors (deterministic rejects
like duplicate columns, syntax errors) fail fast via backoff.Permanent
instead of retrying for 5 minutes. Transient errors (network timeout,
EOF, connection reset) continue to retry with exponential backoff.

- isProtonPermanentError: string-pattern match on proto.Exception format
- PermanentIfServerError: wraps server errors with backoff.Permanent
- Applied at all retry sites: processBatch, BulkImportStreamColumnar,
  ExecContext, and outer import retry boundary in task_run.go
- Add unit tests for permanent/transient classification and retry behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fication

Replace string-only matching with three-level detection: typed
*proton.Exception first, g.ErrType.OrigErr recursion second, tightened
string fallback (requiring digits after "code: ") last. This eliminates
dependence on rendered error strings as the primary classification contract.

Also distinguishes permanent server rejection from retry-budget-exhausted
in the proton-to-proton import error message, and aligns the outer
retryWithBackoff InitialInterval with the database helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
backoff.RetryNotify unwraps *backoff.PermanentError before returning,
so errors.As could never find it. Use a closure-captured flag set when
PermanentIfServerError wraps the error instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TestServerRejectionFlag verifies the pattern used in runProtonToProton:
- permanent server error sets the closure flag (backoff unwraps
  PermanentError before returning, so errors.As cannot detect it)
- transient error exhausting retry budget does NOT set the flag

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yokofly yokofly marked this pull request as ready for review April 7, 2026 03:18
@yokofly yokofly merged commit c8f123b into main Apr 7, 2026
@yokofly yokofly deleted the fix/proton-permanent-error-retry branch April 7, 2026 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant