Skip to content

fix(deploy): make verify step tolerate transient curl errors#23

Merged
JohnRDOrazio merged 1 commit intomainfrom
fix/verify-step-resilience
Apr 29, 2026
Merged

fix(deploy): make verify step tolerate transient curl errors#23
JohnRDOrazio merged 1 commit intomainfrom
fix/verify-step-resilience

Conversation

@JohnRDOrazio
Copy link
Copy Markdown
Member

Summary

The verify step from #22 exited with code 56 (CURLE_RECV_ERROR) on its
second polling attempt during a workflow_dispatch run today, killing
the workflow on a single transient network blip. Three robustness
fixes:

  • set -e safety net — append || echo '{}' to each curl call
    so the script keeps running through transient failures. The empty
    JSON makes jq return empty strings, which keeps the page in
    PENDING for the next attempt (correct: "not yet translated").
  • Curl-layer retry--max-time 15 --retry 3 --retry-delay 5 --retry-all-errors for short network blips at the transport layer.
  • Bigger windowMAX_ATTEMPTS=60 (30 min, was 10). On
    workflow_dispatch (DEPLOY_ALL=true) we enqueue up to 50
    translations and observed throughput is ~1 every 5-8s; the old
    10-minute cap couldn't fit a full bulk recovery.

Also: progress messages now show "X/N done" for at-a-glance status.

Why now

Today's workflow_dispatch recovery run (run 25120538762) enqueued
50 translations, and the verify step failed at attempt 2 with curl
exit 56 — leaving the run reported as failed even though 30+
translations did succeed in the background. This patch lets the
verify step ride out brief network blips and gives enough time for
a 50-job bulk recovery to actually finish.

Test plan

  • Re-trigger the deploy workflow once OpenAI's HTTP 502s on
    large-content translations clear; confirm the verify step
    polls cleanly through the full window
  • Confirm the new "X/N done" progress messages appear in the
    run log

🤖 Generated with Claude Code

The verify step exited with code 56 (CURLE_RECV_ERROR) during the
first big-batch poll, killing the entire workflow run on a single
transient blip. Three robustness fixes:

- Append `|| echo '{}'` to each curl call so `set -e` doesn't kill
  the script when curl fails for any reason — the empty JSON makes
  jq return empty strings, which keeps the page in PENDING for the
  next attempt (correct semantics: "not yet translated").
- Add `--max-time 15 --retry 3 --retry-delay 5 --retry-all-errors`
  so curl handles short network blips at the transport layer.
- Append `2>/dev/null || echo ""` to jq calls for the same reason.

Also bump MAX_ATTEMPTS from 20 to 60 (10 min → 30 min). The current
deploy enqueues up to 50 translations at a time (10 docs × 5 langs
on workflow_dispatch / first deploy), and observed throughput is
~1 translation every 5–8 seconds — 50 jobs needs ~5 minutes minimum,
plus retry budget for OpenAI capacity issues.

Progress messages now show "X/N done" so the run status is readable
at a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnRDOrazio JohnRDOrazio merged commit 52760d6 into main Apr 29, 2026
1 check passed
@JohnRDOrazio JohnRDOrazio deleted the fix/verify-step-resilience branch April 29, 2026 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant