[Backport release-1.18] fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration #1075
Merged
…ion (#1069)

* fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration_*

wait_for_orchestration_start and wait_for_orchestration_completion call the workflow runtime through the local Dapr sidecar. Immediately after a sidecar restart (placement re-dissemination not yet applied, actor registration still propagating, etc.), the sidecar can return FAILED_PRECONDITION or UNAVAILABLE for an instance whose persistent state is intact. The previous implementation surfaced these as a hard error to the caller, so a client polling a long-running workflow would fail permanently even though the workflow itself was recoverable.

Apply the same fix to both the sync and async clients:

- TaskHubGrpcClient (sync) and AsyncTaskHubGrpcClient (async) both route their wait methods through a _call_with_transient_retry helper. The async variant uses asyncio.sleep; otherwise identical.
- Retry FAILED_PRECONDITION and UNAVAILABLE with capped exponential backoff (0.5s, doubling, cap 5s).
- Respect the caller's timeout. timeout in (0, None) means unbounded. The first call passes the user's timeout verbatim so behavior on a healthy runtime is unchanged. On retry, both the sleep and the per-call gRPC deadline are clamped to the remaining budget against a monotonic deadline anchored to the start of the loop; neither one can overshoot the user-provided timeout.
- DEADLINE_EXCEEDED and budget exhaustion both surface as the public TimeoutError (preserved through a private _TransientTimeout sentinel; moved below the import block to satisfy E402).
- Non-transient RpcErrors propagate immediately, unchanged.

Behavior on a healthy runtime is unchanged: the first call succeeds and no retry loop runs.

Adds tests covering the retry behaviors: retry-then-succeed for both transient codes, exhaustion surfacing as TimeoutError, and non-transient codes propagating without retry.

Signed-off-by: Javier Aliaga <javier@diagrid.io>

* fix(ext-workflow): bound transient retries and address review feedback

Cap continuous transient-error retries in unbounded mode (timeout=0/None) at 30s via _MAX_TRANSIENT_RETRY_SECONDS, then re-raise the original RpcError. This preserves the pre-retry contract: timeout=0 still waits indefinitely for a healthy workflow and never raises TimeoutError, but a permanently-unavailable sidecar now surfaces the original error instead of retrying forever.

Also address review feedback:

- Type wait_for_orchestration_* timeout as Optional[int] (None is a supported, tested input meaning unbounded).
- Fix sync "up to Nones" log message to treat None as indefinite, matching the async client.
- Correct the retry-helper docstring: the first call passes grpc_timeout (None when unbounded), not the timeout value verbatim.

Add a test covering unbounded-mode transient exhaustion surfacing as the original RpcError (not TimeoutError, not a hang).

Signed-off-by: Javier Aliaga <javier@diagrid.io>

---------

Signed-off-by: Javier Aliaga <javier@diagrid.io>
Co-authored-by: Sam <sam@diagrid.io>
(cherry picked from commit 71b26be)
Signed-off-by: dapr-bot <dapr-bot@users.noreply.github.com>
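For readers who want the retry rules from the commit message in one place, here is a minimal sketch of how such a helper could look. It is not the merged `_call_with_transient_retry` code. `FakeRpcError`, `call_with_transient_retry`, the string status codes, and the injectable `sleep` and `clock` parameters are all hypothetical stand-ins, chosen so the block runs without `grpc` installed. Measuring the 30s unbounded cap from the first transient error is an assumption based on the phrase "continuous transient-error retries".

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Values taken from the commit message; the names are illustrative.
TRANSIENT_CODES = frozenset({"FAILED_PRECONDITION", "UNAVAILABLE"})
INITIAL_BACKOFF_SECONDS = 0.5
MAX_BACKOFF_SECONDS = 5.0
MAX_TRANSIENT_RETRY_SECONDS = 30.0


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError; carries only a status-code name."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self._code = code

    def code(self) -> str:
        return self._code


def call_with_transient_retry(
    call: Callable[[Optional[float]], T],
    timeout: Optional[float],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Invoke call(grpc_timeout), retrying transient codes with capped backoff."""
    unbounded = not timeout  # 0 and None both mean "no caller deadline"
    deadline = None if unbounded else clock() + timeout
    grpc_timeout = None if unbounded else timeout  # first call is unclamped
    backoff = INITIAL_BACKOFF_SECONDS
    streak_start = None  # time of the first transient error (unbounded mode)

    while True:
        try:
            return call(grpc_timeout)
        except FakeRpcError as err:
            code = err.code()
            if code == "DEADLINE_EXCEEDED" and not unbounded:
                raise TimeoutError("timed out waiting for orchestration") from err
            if code not in TRANSIENT_CODES:
                raise  # non-transient errors propagate unchanged
            now = clock()
            if unbounded:
                if streak_start is None:
                    streak_start = now
                remaining = streak_start + MAX_TRANSIENT_RETRY_SECONDS - now
                if remaining <= 0:
                    raise  # surface the original error, never TimeoutError
            else:
                remaining = deadline - now
                if remaining <= 0:
                    raise TimeoutError("timed out waiting for orchestration") from err
            # Clamp the sleep so it cannot overshoot the remaining budget.
            sleep(min(backoff, remaining))
            backoff = min(backoff * 2, MAX_BACKOFF_SECONDS)
            if not unbounded:
                # Clamp the next per-call deadline to what is left.
                grpc_timeout = deadline - clock()
                if grpc_timeout <= 0:
                    raise TimeoutError("timed out waiting for orchestration") from err
```

A caller would wrap the actual wait RPC in a function that accepts the per-call timeout and pass it as `call`. Making `sleep` and `clock` injectable is only a convenience here so the backoff schedule can be checked without real delays.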
sicoyle
approved these changes
Jun 1, 2026
Backport 71b26be from #1069.