Skip to content

fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration#1069

Merged
sicoyle merged 3 commits into
dapr:mainfrom
javier-aliaga:fix/workflow-client-transient-retry
Jun 1, 2026
Merged

fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration#1069
sicoyle merged 3 commits into
dapr:mainfrom
javier-aliaga:fix/workflow-client-transient-retry

Conversation

@javier-aliaga
Copy link
Copy Markdown

@javier-aliaga javier-aliaga commented Jun 1, 2026

What

Retry transient gRPC errors in wait_for_orchestration_start / wait_for_orchestration_completion (sync and async clients) instead of failing hard.

Why

Immediately after a Dapr sidecar restart, the sidecar can briefly return FAILED_PRECONDITION / UNAVAILABLE for a workflow whose state is fully intact (e.g. placement re-dissemination still propagating). Previously a client polling a long-running workflow would fail permanently even though the workflow was recoverable.

Behavior

  • Transient errors are retried with capped exponential backoff; non-transient errors propagate unchanged.
  • The caller's timeout is always respected, and a healthy runtime is unaffected — no retry loop runs.
  • With no timeout set, retries are bounded by a short grace window so a genuinely unavailable sidecar surfaces the original error instead of hanging forever.

Checklist

  • Code compiles correctly
  • Created/updated tests
  • Extended the documentation

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 69.49153% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.62%. Comparing base (bffb749) to head (3686e28).
⚠️ Report is 138 commits behind head on main.

Files with missing lines Patch % Lines
...kflow/dapr/ext/workflow/_durabletask/aio/client.py 9.09% 50 Missing ⚠️
...-workflow/dapr/ext/workflow/_durabletask/client.py 92.72% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
- Coverage   86.63%   82.62%   -4.01%     
==========================================
  Files          84      146      +62     
  Lines        4473    14842   +10369     
==========================================
+ Hits         3875    12263    +8388     
- Misses        598     2579    +1981     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sicoyle
Copy link
Copy Markdown
Contributor

sicoyle commented Jun 1, 2026

You can ignore the codecov build failures, but yes pls fix the rest 😄

@sicoyle
Copy link
Copy Markdown
Contributor

sicoyle commented Jun 1, 2026

also pls fix DCO 🙏

@javier-aliaga javier-aliaga force-pushed the fix/workflow-client-transient-retry branch 2 times, most recently from e58655d to 7f9f3a3 Compare June 1, 2026 14:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the durabletask workflow gRPC client in ext/dapr-ext-workflow to make wait_for_orchestration_start and wait_for_orchestration_completion resilient to transient sidecar/runtime unavailability (e.g., immediately after a daprd restart), by retrying specific gRPC status codes with exponential backoff.

Changes:

  • Adds a shared _call_with_transient_retry helper that retries FAILED_PRECONDITION and UNAVAILABLE with capped exponential backoff.
  • Routes both wait_for_orchestration_start and wait_for_orchestration_completion through the new retry helper and maps deadline/budget exhaustion to TimeoutError.
  • Introduces a private _TransientTimeout sentinel exception to preserve the public timeout behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py Outdated
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py Outdated
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py Outdated
…ion_*

wait_for_orchestration_start and wait_for_orchestration_completion call
the workflow runtime through the local Dapr sidecar. Immediately after a
sidecar restart (placement re-dissemination not yet applied, actor
registration still propagating, etc.), the sidecar can return
FAILED_PRECONDITION or UNAVAILABLE for an instance whose persistent
state is intact. The previous implementation surfaced these as a hard
error to the caller, so a client polling a long-running workflow would
fail permanently even though the workflow itself was recoverable.

Apply the same fix to both the sync and async clients:

  - TaskHubGrpcClient (sync) and AsyncTaskHubGrpcClient (async) both
    route their wait methods through a _call_with_transient_retry
    helper. The async variant uses asyncio.sleep; otherwise identical.
  - Retry FAILED_PRECONDITION and UNAVAILABLE with capped exponential
    backoff (0.5s, doubling, cap 5s).
  - Respect the caller's timeout. timeout in (0, None) means unbounded.
    The first call passes the user's timeout verbatim so behavior on a
    healthy runtime is unchanged. On retry, both the sleep and the
    per-call gRPC deadline are clamped to the remaining budget against
    a monotonic deadline anchored to the start of the loop — neither
    one can overshoot the user-provided timeout.
  - DEADLINE_EXCEEDED and budget exhaustion both surface as the public
    TimeoutError (preserved through a private _TransientTimeout
    sentinel; moved below the import block to satisfy E402).
  - Non-transient RpcErrors propagate immediately, unchanged.

Behavior on a healthy runtime is unchanged: the first call succeeds and
no retry loop runs.

Adds tests covering the retry behaviors: retry-then-succeed for both
transient codes, exhaustion surfacing as TimeoutError, and
non-transient codes propagating without retry.

Signed-off-by: Javier Aliaga <javier@diagrid.io>
@javier-aliaga javier-aliaga force-pushed the fix/workflow-client-transient-retry branch from dc5ee8c to 09f45af Compare June 1, 2026 15:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.

Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py Outdated
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/aio/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/aio/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py Outdated
Comment on lines 125 to 127
async def wait_for_orchestration_start(
self, instance_id: str, *, fetch_payloads: bool = False, timeout: int = 0
) -> Optional[WorkflowState]:
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/aio/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/client.py
Comment thread ext/dapr-ext-workflow/dapr/ext/workflow/_durabletask/aio/client.py
Cap continuous transient-error retries in unbounded mode (timeout=0/None)
at 30s via _MAX_TRANSIENT_RETRY_SECONDS, then re-raise the original
RpcError. This preserves the pre-retry contract: timeout=0 still waits
indefinitely for a healthy workflow and never raises TimeoutError, but a
permanently-unavailable sidecar now surfaces the original error instead
of retrying forever.

Also address review feedback:
  - Type wait_for_orchestration_* timeout as Optional[int] (None is a
    supported, tested input meaning unbounded).
  - Fix sync "up to Nones" log message to treat None as indefinite,
    matching the async client.
  - Correct the retry-helper docstring: the first call passes grpc_timeout
    (None when unbounded), not the timeout value verbatim.

Add a test covering unbounded-mode transient exhaustion surfacing as the
original RpcError (not TimeoutError, not a hang).

Signed-off-by: Javier Aliaga <javier@diagrid.io>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +347 to +351
sleep_for = min(backoff, 5.0)
if remaining is not None:
sleep_for = min(sleep_for, remaining)
if transient_deadline is not None:
sleep_for = min(sleep_for, transient_deadline - now)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional, keeping as-is. These transient codes (FAILED_PRECONDITION/UNAVAILABLE) return immediately rather than long-polling, so skipping the backoff near the deadline would turn the final window into a tight retry loop hammering the sidecar — up to ~5s of rapid-fire calls, since backoff caps at 5s. The clamp-then-stop behavior deliberately backs off instead. The cost is at most one skipped final attempt right at the deadline, which is an acceptable trade for not flooding a struggling sidecar.

Comment on lines +235 to +239
sleep_for = min(backoff, 5.0)
if remaining is not None:
sleep_for = min(sleep_for, remaining)
if transient_deadline is not None:
sleep_for = min(sleep_for, transient_deadline - now)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the sync client — keeping as-is. Skipping the clamped backoff would busy-loop against the sidecar because these transient codes return immediately; the clamp-then-stop behavior is intentional. See the explanation on the sync client.py thread.

Comment thread ext/dapr-ext-workflow/tests/durabletask/test_orchestration_wait.py
Copy link
Copy Markdown
Contributor

@sicoyle sicoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!!

@sicoyle sicoyle added this pull request to the merge queue Jun 1, 2026
Merged via the queue into dapr:main with commit 71b26be Jun 1, 2026
17 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants