
[Backport release-1.18] fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration#1075

Merged
sicoyle merged 1 commit into release-1.18 from backport-1069-to-release-1.18 on Jun 1, 2026


@dapr-bot (Collaborator) commented Jun 1, 2026

Backport 71b26be from #1069.

…ion (#1069)

* fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration_*

wait_for_orchestration_start and wait_for_orchestration_completion call
the workflow runtime through the local Dapr sidecar. Immediately after a
sidecar restart (placement re-dissemination not yet applied, actor
registration still propagating, etc.), the sidecar can return
FAILED_PRECONDITION or UNAVAILABLE for an instance whose persistent
state is intact. The previous implementation surfaced these as a hard
error to the caller, so a client polling a long-running workflow would
fail permanently even though the workflow itself was recoverable.

Apply the same fix to both the sync and async clients:

  - TaskHubGrpcClient (sync) and AsyncTaskHubGrpcClient (async) both
    route their wait methods through a _call_with_transient_retry
    helper. The async variant uses asyncio.sleep; otherwise identical.
  - Retry FAILED_PRECONDITION and UNAVAILABLE with capped exponential
    backoff (0.5s, doubling, cap 5s).
  - Respect the caller's timeout. A timeout of 0 or None means unbounded.
    The first call passes the user's timeout verbatim so behavior on a
    healthy runtime is unchanged. On retry, both the sleep and the
    per-call gRPC deadline are clamped to the remaining budget against
    a monotonic deadline anchored to the start of the loop — neither
    one can overshoot the user-provided timeout.
  - DEADLINE_EXCEEDED and budget exhaustion both surface as the public
    TimeoutError (preserved through a private _TransientTimeout
    sentinel; moved below the import block to satisfy E402).
  - Non-transient RpcErrors propagate immediately, unchanged.

Behavior on a healthy runtime is unchanged: the first call succeeds and
no retry loop runs.
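
The retry rules listed above can be sketched as follows. This is a minimal illustration for the bounded case only (a positive timeout), not the SDK's actual `_call_with_transient_retry`. The names `call_with_retry`, `FakeRpcError`, and the local `StatusCode` enum are hypothetical stand-ins for `grpc.RpcError` and `grpc.StatusCode` so the block runs without grpcio; the injectable `sleep` and `clock` parameters are also assumptions, added to make the behavior easy to test.

```python
import enum
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class StatusCode(enum.Enum):
    """Hypothetical stand-in for grpc.StatusCode."""
    UNAVAILABLE = "unavailable"
    FAILED_PRECONDITION = "failed_precondition"
    DEADLINE_EXCEEDED = "deadline_exceeded"
    NOT_FOUND = "not_found"


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError; exposes code() like the real one."""

    def __init__(self, code: StatusCode):
        super().__init__(code.name)
        self._code = code

    def code(self) -> StatusCode:
        return self._code


TRANSIENT = {StatusCode.UNAVAILABLE, StatusCode.FAILED_PRECONDITION}
INITIAL_BACKOFF = 0.5  # seconds, per the commit message
MAX_BACKOFF = 5.0      # backoff cap, per the commit message


def call_with_retry(
    call: Callable[[Optional[float]], T],
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry transient errors inside a positive timeout budget."""
    deadline = clock() + timeout   # monotonic deadline anchored at loop start
    backoff = INITIAL_BACKOFF
    per_call_timeout = timeout     # first call gets the caller's timeout as-is
    while True:
        try:
            return call(per_call_timeout)
        except FakeRpcError as err:
            if err.code() == StatusCode.DEADLINE_EXCEEDED:
                raise TimeoutError("timed out waiting for orchestration") from err
            if err.code() not in TRANSIENT:
                raise  # non-transient errors propagate unchanged
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        sleep(min(backoff, remaining))           # sleep never overshoots the budget
        backoff = min(backoff * 2, MAX_BACKOFF)  # capped exponential backoff
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        per_call_timeout = remaining             # per-call deadline is clamped too
```

On a healthy runtime the first `call` returns and the loop body after the `try` never runs, which matches the "unchanged behavior" claim. An async variant would presumably differ only in awaiting the call and using `asyncio.sleep`.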

Adds tests covering the retry behaviors: retry-then-succeed for both
transient codes, exhaustion surfacing as TimeoutError, and
non-transient codes propagating without retry.

Signed-off-by: Javier Aliaga <javier@diagrid.io>

* fix(ext-workflow): bound transient retries and address review feedback

Cap continuous transient-error retries in unbounded mode (timeout=0/None)
at 30s via _MAX_TRANSIENT_RETRY_SECONDS, then re-raise the original
RpcError. This preserves the pre-retry contract: timeout=0 still waits
indefinitely for a healthy workflow and never raises TimeoutError, but a
permanently-unavailable sidecar now surfaces the original error instead
of retrying forever.
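
A rough sketch of the unbounded-mode cap described above, under stated assumptions: `wait_unbounded` and `TransientError` are hypothetical names (the latter stands in for an RpcError carrying UNAVAILABLE or FAILED_PRECONDITION), and measuring the 30s window from the first failure is a guess at how the cap is anchored. Only the constant's value and the re-raise behavior come from the commit message.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_TRANSIENT_RETRY_SECONDS = 30.0  # value named in the commit message


class TransientError(Exception):
    """Hypothetical stand-in for a transient RpcError."""


def wait_unbounded(
    call: Callable[[Optional[float]], T],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Wait with no deadline, but stop retrying a sidecar that stays down."""
    backoff = 0.5
    first_failure: Optional[float] = None
    while True:
        try:
            # None means no per-call deadline: a healthy workflow may run forever.
            return call(None)
        except TransientError:
            now = clock()
            if first_failure is None:
                first_failure = now
            if now - first_failure >= MAX_TRANSIENT_RETRY_SECONDS:
                raise  # surface the original error, never TimeoutError
        sleep(backoff)
        backoff = min(backoff * 2, 5.0)
```

The point of the design is that the caller still sees the original error type once the cap is hit, so code written against the pre-retry contract (no TimeoutError when timeout is 0 or None) keeps working.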

Also address review feedback:
  - Type wait_for_orchestration_* timeout as Optional[int] (None is a
    supported, tested input meaning unbounded).
  - Fix sync "up to Nones" log message to treat None as indefinite,
    matching the async client.
  - Correct the retry-helper docstring: the first call passes grpc_timeout
    (None when unbounded), not the timeout value verbatim.
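
The timeout handling touched by this feedback might look like the sketch below. The helper names `to_grpc_timeout` and `describe_wait` are hypothetical, as is the exact log wording; the sketch only illustrates why formatting a `None` timeout directly yields "up to Nones" and how mapping 0 and None to "no deadline" avoids it.

```python
from typing import Optional


def to_grpc_timeout(timeout: Optional[int]) -> Optional[int]:
    """Map the public timeout to a gRPC deadline; 0 and None both mean unbounded."""
    return None if timeout is None or timeout == 0 else timeout


def describe_wait(timeout: Optional[int]) -> str:
    """Build the wait description without interpolating None into the text."""
    grpc_timeout = to_grpc_timeout(timeout)
    return "indefinitely" if grpc_timeout is None else f"up to {grpc_timeout}s"
```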

Add a test covering unbounded-mode transient exhaustion surfacing as the
original RpcError (not TimeoutError, not a hang).

Signed-off-by: Javier Aliaga <javier@diagrid.io>

---------

Signed-off-by: Javier Aliaga <javier@diagrid.io>
Co-authored-by: Sam <sam@diagrid.io>
(cherry picked from commit 71b26be)
Signed-off-by: dapr-bot <dapr-bot@users.noreply.github.com>
@dapr-bot dapr-bot requested review from a team as code owners June 1, 2026 16:26
@sicoyle sicoyle merged commit 8889990 into release-1.18 Jun 1, 2026
21 of 22 checks passed
@sicoyle sicoyle deleted the backport-1069-to-release-1.18 branch June 1, 2026 21:45