fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration #1069
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
- Coverage   86.63%   82.62%   -4.01%
==========================================
  Files          84      146      +62
  Lines        4473    14842   +10369
==========================================
+ Hits         3875    12263    +8388
- Misses        598     2579    +1981
You can ignore the codecov build failures, but yes pls fix the rest 😄
also pls fix DCO 🙏
e58655d to 7f9f3a3
Pull request overview
This PR updates the durabletask workflow gRPC client in ext/dapr-ext-workflow to make wait_for_orchestration_start and wait_for_orchestration_completion resilient to transient sidecar/runtime unavailability (e.g., immediately after a daprd restart), by retrying specific gRPC status codes with exponential backoff.
Changes:
- Adds a shared _call_with_transient_retry helper that retries FAILED_PRECONDITION and UNAVAILABLE with capped exponential backoff.
- Routes both wait_for_orchestration_start and wait_for_orchestration_completion through the new retry helper and maps deadline/budget exhaustion to TimeoutError.
- Introduces a private _TransientTimeout sentinel exception to preserve the public timeout behavior.
7f9f3a3 to dc5ee8c
…ion_*
wait_for_orchestration_start and wait_for_orchestration_completion call
the workflow runtime through the local Dapr sidecar. Immediately after a
sidecar restart (placement re-dissemination not yet applied, actor
registration still propagating, etc.), the sidecar can return
FAILED_PRECONDITION or UNAVAILABLE for an instance whose persistent
state is intact. The previous implementation surfaced these as a hard
error to the caller, so a client polling a long-running workflow would
fail permanently even though the workflow itself was recoverable.
Apply the same fix to both the sync and async clients:
- TaskHubGrpcClient (sync) and AsyncTaskHubGrpcClient (async) both
route their wait methods through a _call_with_transient_retry
helper. The async variant uses asyncio.sleep; otherwise identical.
- Retry FAILED_PRECONDITION and UNAVAILABLE with capped exponential
backoff (0.5s, doubling, cap 5s).
- Respect the caller's timeout. timeout in (0, None) means unbounded.
The first call passes the user's timeout verbatim so behavior on a
healthy runtime is unchanged. On retry, both the sleep and the
per-call gRPC deadline are clamped to the remaining budget against
a monotonic deadline anchored to the start of the loop — neither
one can overshoot the user-provided timeout.
- DEADLINE_EXCEEDED and budget exhaustion both surface as the public
TimeoutError (preserved through a private _TransientTimeout
sentinel; moved below the import block to satisfy E402).
- Non-transient RpcErrors propagate immediately, unchanged.
Behavior on a healthy runtime is unchanged: the first call succeeds and
no retry loop runs.
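The retry rules listed in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual helper: `StatusCode` and `FakeRpcError` are hypothetical stand-ins for `grpc.StatusCode` and `grpc.RpcError` so the sketch runs without grpcio, the function name has no leading underscore to mark it as illustrative, and the injectable `sleep` and `clock` parameters exist only to make the behavior easy to test. Unbounded mode retries without any cap here, which is an assumption of the sketch.

```python
import time
from enum import Enum


class StatusCode(Enum):
    # Hypothetical stand-in for grpc.StatusCode.
    FAILED_PRECONDITION = "failed_precondition"
    UNAVAILABLE = "unavailable"
    DEADLINE_EXCEEDED = "deadline_exceeded"
    NOT_FOUND = "not_found"


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError."""

    def __init__(self, code):
        super().__init__(code.name)
        self._code = code

    def code(self):
        return self._code


TRANSIENT_CODES = {StatusCode.FAILED_PRECONDITION, StatusCode.UNAVAILABLE}


def call_with_transient_retry(call, timeout, *, sleep=time.sleep, clock=time.monotonic):
    """Invoke call(grpc_timeout), retrying transient codes within the caller's budget."""
    unbounded = timeout in (0, None)
    # Monotonic deadline anchored to the start of the loop.
    deadline = None if unbounded else clock() + timeout
    # The first call uses the caller's timeout (None when unbounded).
    grpc_timeout = None if unbounded else timeout
    backoff = 0.5
    while True:
        try:
            return call(grpc_timeout)
        except FakeRpcError as err:
            if err.code() is StatusCode.DEADLINE_EXCEEDED:
                raise TimeoutError("timed out waiting for orchestration") from err
            if err.code() not in TRANSIENT_CODES:
                raise  # non-transient errors propagate unchanged
            sleep_for = min(backoff, 5.0)
            if deadline is not None:
                remaining = deadline - clock()
                if remaining <= 0:
                    raise TimeoutError("retry budget exhausted") from err
                # Never sleep past the caller's deadline.
                sleep_for = min(sleep_for, remaining)
            sleep(sleep_for)
            backoff = min(backoff * 2, 5.0)  # 0.5s, doubling, capped at 5s
            if deadline is not None:
                # Clamp the next per-call deadline to what is left of the budget.
                grpc_timeout = deadline - clock()
                if grpc_timeout <= 0:
                    raise TimeoutError("retry budget exhausted") from err
```

On a healthy runtime the first `call` returns and the `except` branch never runs, which matches the "unchanged behavior" claim above.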
Adds tests covering the retry behaviors: retry-then-succeed for both
transient codes, exhaustion surfacing as TimeoutError, and
non-transient codes propagating without retry.
Signed-off-by: Javier Aliaga <javier@diagrid.io>
dc5ee8c to 09f45af
async def wait_for_orchestration_start(
    self, instance_id: str, *, fetch_payloads: bool = False, timeout: int = 0
) -> Optional[WorkflowState]:
Cap continuous transient-error retries in unbounded mode (timeout=0/None)
at 30s via _MAX_TRANSIENT_RETRY_SECONDS, then re-raise the original
RpcError. This preserves the pre-retry contract: timeout=0 still waits
indefinitely for a healthy workflow and never raises TimeoutError, but a
permanently-unavailable sidecar now surfaces the original error instead
of retrying forever.
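A rough sketch of the unbounded-mode cap described above, under stated assumptions: `TransientError` is a hypothetical stand-in for an RpcError carrying FAILED_PRECONDITION or UNAVAILABLE, `wait_unbounded` is an illustrative name, and the 30s window is assumed to start at the first transient failure (the commit message does not say exactly where it is anchored).

```python
import time

# Mirrors the _MAX_TRANSIENT_RETRY_SECONDS constant named in the commit message.
MAX_TRANSIENT_RETRY_SECONDS = 30.0


class TransientError(Exception):
    """Hypothetical stand-in for an RpcError with a transient status code."""


def wait_unbounded(call, *, sleep=time.sleep, clock=time.monotonic):
    """Wait with no caller timeout, but stop retrying transient errors after 30s."""
    transient_deadline = None  # set lazily on the first transient failure (assumption)
    backoff = 0.5
    while True:
        try:
            # A healthy call may block indefinitely; no TimeoutError is ever raised here.
            return call()
        except TransientError:
            now = clock()
            if transient_deadline is None:
                transient_deadline = now + MAX_TRANSIENT_RETRY_SECONDS
            if now >= transient_deadline:
                raise  # surface the original error, not TimeoutError
            sleep(min(backoff, 5.0, transient_deadline - now))
            backoff = min(backoff * 2, 5.0)
```

The bare `raise` is the point of the design: callers that passed timeout=0 or None keep seeing the original error type rather than a new TimeoutError.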
Also address review feedback:
- Type wait_for_orchestration_* timeout as Optional[int] (None is a
supported, tested input meaning unbounded).
- Fix sync "up to Nones" log message to treat None as indefinite,
matching the async client.
- Correct the retry-helper docstring: the first call passes grpc_timeout
(None when unbounded), not the timeout value verbatim.
Add a test covering unbounded-mode transient exhaustion surfacing as the
original RpcError (not TimeoutError, not a hang).
Signed-off-by: Javier Aliaga <javier@diagrid.io>
sleep_for = min(backoff, 5.0)
if remaining is not None:
    sleep_for = min(sleep_for, remaining)
if transient_deadline is not None:
    sleep_for = min(sleep_for, transient_deadline - now)
Intentional, keeping as-is. These transient codes (FAILED_PRECONDITION/UNAVAILABLE) return immediately rather than long-polling, so skipping the backoff near the deadline would turn the final window into a tight retry loop hammering the sidecar — up to ~5s of rapid-fire calls, since backoff caps at 5s. The clamp-then-stop behavior deliberately backs off instead. The cost is at most one skipped final attempt right at the deadline, which is an acceptable trade for not flooding a struggling sidecar.
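To make the trade-off in this reply concrete, here is a small sketch of the sleep schedule that a clamp-then-stop policy produces. It assumes every attempt fails instantly with a transient code and that the loop stops once the budget is used up; `sleep_schedule` is a hypothetical helper for illustration and is not part of the client.

```python
def sleep_schedule(budget, *, initial=0.5, cap=5.0):
    """Return the sleeps taken before `budget` seconds are used up.

    Assumes each attempt fails immediately, so elapsed time is the sum of sleeps.
    """
    elapsed = 0.0
    backoff = initial
    sleeps = []
    while True:
        remaining = budget - elapsed
        if remaining <= 0:
            break  # stop instead of making one more attempt at the deadline
        # Clamp the backoff to the cap and to the remaining budget.
        sleep_for = min(backoff, cap, remaining)
        sleeps.append(sleep_for)
        elapsed += sleep_for
        backoff = min(backoff * 2, cap)
    return sleeps


print(sleep_schedule(7.0))  # [0.5, 1.0, 2.0, 3.5]
```

With a 7s budget this makes four attempts. The last sleep is clamped from 4s to 3.5s and lands exactly on the deadline, so the loop ends there rather than issuing rapid-fire calls in the final window.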
sleep_for = min(backoff, 5.0)
if remaining is not None:
    sleep_for = min(sleep_for, remaining)
if transient_deadline is not None:
    sleep_for = min(sleep_for, transient_deadline - now)
Same as the sync client — keeping as-is. Skipping the clamped backoff would busy-loop against the sidecar because these transient codes return immediately; the clamp-then-stop behavior is intentional. See the explanation on the sync client.py thread.
What
Retry transient gRPC errors in wait_for_orchestration_start/wait_for_orchestration_completion (sync and async clients) instead of failing hard.
Why
Immediately after a Dapr sidecar restart, the sidecar can briefly return FAILED_PRECONDITION/UNAVAILABLE for a workflow whose state is fully intact (e.g. placement re-dissemination still propagating). Previously a client polling a long-running workflow would fail permanently even though the workflow was recoverable.
Behavior
Checklist