fix: add retry with backoff to _update_submission and stop swallowing…#2416

Open
dconstancy wants to merge 1 commit into codalab:develop from dconstancy:develop

dconstancy commented Jun 16, 2026

Title: fix: add retry with backoff to _update_submission() and stop swallowing terminal exceptions


Fixes #2415

Problem

_update_submission() performs a single PATCH request with no retry. On any transient failure (HTTP 502, timeout, connection reset), the submission result is permanently lost. The caller, _update_status(), silently swallows all exceptions, so the compute worker moves on while Django never receives the final status. The submission stays stuck in Scoring or Running indefinitely, with no user-facing error.

Changes

compute_worker/compute_worker.py:

  • _update_submission(): add retry loop with exponential backoff (5 attempts, 2^n seconds + jitter). Catches both HTTP error responses and network-level RequestException. Only raises after all retries are exhausted.
  • _update_status(): re-raise exceptions for terminal statuses (Finished, Failed) so Celery marks the task as failed and the submission transitions to Failed in the UI instead of hanging silently. Intermediate statuses (Preparing, Running, Scoring) remain silently caught, because losing an intermediate status update is not critical.
  • HTTP session adapter: increase total retries from 3 to 5, add status_forcelist=[502, 503, 504] and allowed_methods=["PATCH", "GET", "PUT", "POST"] to automatically retry on server errors at the transport level.
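
For reference, the transport-level settings in the last bullet could be wired up roughly as follows. This is a minimal sketch, not the code from the diff: `build_session` is a hypothetical helper name, and it assumes the worker uses `requests` with urllib3 1.26 or newer (where `allowed_methods` is available).

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    """Return a session that retries 502/503/504 responses at the transport level.

    Hypothetical helper: the real worker may build its session differently.
    """
    retry = Retry(
        total=5,                                          # raised from 3 to 5 in this PR
        status_forcelist=[502, 503, 504],                 # server errors worth retrying
        allowed_methods=["PATCH", "GET", "PUT", "POST"],  # methods eligible for retry
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

The adapter retries happen inside a single `session.patch(...)` call, so they sit underneath the application-level retry loop in _update_submission() rather than replacing it.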

Retry behavior observed in logs (post-fix):

INFO  | Updating submission (attempt 1/5) with data = {status: Finished}
WARN  | Submission patch failed (attempt 1/5) with status = 502
INFO  | Retrying in 2.7s...
INFO  | Updating submission (attempt 2/5) with data = {status: Finished}
INFO  | Submission updated successfully!
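
A loop that would produce log output of this shape might look like the sketch below. It is an illustration under stated assumptions, not the code in the diff: `patch_with_retry`, `backoff_delay`, and `UpdateFailed` are hypothetical names, the jitter is assumed to be uniform in [0, 1) seconds (consistent with the 2.7s delay after attempt 1), and `OSError` stands in for `requests.RequestException`, which subclasses it. The HTTP call and the sleep function are injected so the sketch runs without a network.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


class UpdateFailed(Exception):
    """Raised once every attempt has failed (hypothetical name)."""


def backoff_delay(attempt, jitter=random.random):
    """Return 2^attempt seconds plus jitter (assumed to be below one second)."""
    return 2 ** attempt + jitter()


def patch_with_retry(send, data, max_attempts=5, sleep=time.sleep):
    """Call send(data) until it returns a 2xx status code or attempts run out.

    send is any callable that returns an HTTP status code, or raises OSError
    on a network-level failure.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        logger.info("Updating submission (attempt %d/%d) with data = %s",
                    attempt, max_attempts, data)
        try:
            status = send(data)
            if 200 <= status < 300:
                logger.info("Submission updated successfully!")
                return status
            last_error = f"status = {status}"
        except OSError as exc:  # stand-in for requests.RequestException
            last_error = repr(exc)
        logger.warning("Submission patch failed (attempt %d/%d) with %s",
                       attempt, max_attempts, last_error)
        if attempt < max_attempts:
            delay = backoff_delay(attempt)
            logger.info("Retrying in %.1fs...", delay)
            sleep(delay)
    # Only raise after all retries are exhausted.
    raise UpdateFailed(f"gave up after {max_attempts} attempts: {last_error}")
```

With five attempts the loop sleeps at most four times, so under these assumptions the worst-case added wait is 2 + 4 + 8 + 16 seconds plus jitter before the exception is raised.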

Backward compatibility

  • No API changes, no migration needed
  • Retry is transparent to Django: the same PATCH payload is sent again only if an attempt fails
  • PATCH to /api/submissions/{id}/ is idempotent (sets status to a fixed value), so retries are safe
  • Intermediate status exceptions remain silently caught (no behavior change for non-terminal updates)
  • Only terminal status failures (Finished, Failed) now propagate, which is strictly better than the previous silent loss
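
The split between terminal and intermediate statuses described in this list can be sketched as below. The names `update_status` and `TERMINAL_STATUSES` are hypothetical and the update function is passed in as a parameter for illustration; the actual method in compute_worker.py may be structured differently.

```python
import logging

logger = logging.getLogger(__name__)

# Statuses whose loss would leave the submission stuck in the UI.
TERMINAL_STATUSES = {"Finished", "Failed"}


def update_status(status, update_submission):
    """Send a status update; propagate failures only for terminal statuses.

    update_submission is any callable taking the PATCH payload (assumption:
    in the worker this would be the retrying _update_submission()).
    """
    try:
        update_submission({"status": status})
    except Exception:
        if status in TERMINAL_STATUSES:
            # Re-raise so the task runner (Celery, per the PR) marks the task failed.
            logger.error("Could not deliver terminal status %s", status)
            raise
        # Intermediate updates are best effort: drop the error and carry on.
        logger.debug("Ignoring failed update for intermediate status %s", status)
```

For example, a failing update for "Running" returns normally, while the same failure for "Finished" raises to the caller.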

Development

Successfully merging this pull request may close these issues.

_update_submission() has no retry and _update_status() silently swallows exceptions, causing permanent submission loss
