fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration by javier-aliaga · Pull Request #1069 · dapr/python-sdk

javier-aliaga · 2026-06-01T13:49:30Z

What

Retry transient gRPC errors in wait_for_orchestration_start / wait_for_orchestration_completion (sync and async clients) instead of failing hard.

Why

Immediately after a Dapr sidecar restart, the sidecar can briefly return FAILED_PRECONDITION / UNAVAILABLE for a workflow whose state is fully intact (e.g. placement re-dissemination still propagating). Previously a client polling a long-running workflow would fail permanently even though the workflow was recoverable.

Behavior

Transient errors are retried with capped exponential backoff; non-transient errors propagate unchanged.
The caller's timeout is always respected, and a healthy runtime is unaffected — no retry loop runs.
With no timeout set, retries are bounded by a short grace window so a genuinely unavailable sidecar surfaces the original error instead of hanging forever.
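
The three rules above can be sketched as a small retry loop. This is only an illustration under stated assumptions, not the code from this PR: `FakeRpcError`, `call_with_retry`, and the string status codes are hypothetical stand-ins (the real client inspects gRPC status codes), and `max_attempts` replaces the time-based budget the PR describes, purely to keep the sketch short.

```python
import time


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError, carrying a status code name."""

    def __init__(self, code: str):
        super().__init__(code)
        self._code = code

    def code(self) -> str:
        return self._code


# The two codes the PR treats as transient.
TRANSIENT_CODES = {"FAILED_PRECONDITION", "UNAVAILABLE"}


def call_with_retry(call, *, max_attempts=5, initial_backoff=0.5,
                    max_backoff=5.0, sleep=time.sleep):
    """Retry `call` on transient codes; re-raise everything else at once."""
    backoff = initial_backoff
    for attempt in range(1, max_attempts + 1):
        try:
            # Healthy runtime: the first call returns and no sleep happens.
            return call()
        except FakeRpcError as err:
            # Non-transient errors, or the last allowed attempt, propagate unchanged.
            if err.code() not in TRANSIENT_CODES or attempt == max_attempts:
                raise
            sleep(min(backoff, max_backoff))  # capped exponential backoff
            backoff *= 2
```

Passing `sleep` as a parameter is a testing convenience in this sketch: a fake can record the delays instead of actually waiting.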

Checklist

Code compiles correctly
Created/updated tests
Extended the documentation

codecov · 2026-06-01T13:54:07Z

Codecov Report

❌ Patch coverage is 69.49153% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.62%. Comparing base (bffb749) to head (3686e28).
⚠️ Report is 138 commits behind head on main.

Files with missing lines	Patch %	Lines
...kflow/dapr/ext/workflow/_durabletask/aio/client.py	9.09%	50 Missing ⚠️
...-workflow/dapr/ext/workflow/_durabletask/client.py	92.72%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1069      +/-   ##
==========================================
- Coverage   86.63%   82.62%   -4.01%     
==========================================
  Files          84      146      +62     
  Lines        4473    14842   +10369     
==========================================
+ Hits         3875    12263    +8388     
- Misses        598     2579    +1981


sicoyle · 2026-06-01T14:15:18Z

You can ignore the codecov build failures, but yes pls fix the rest 😄

sicoyle · 2026-06-01T14:15:43Z

also pls fix DCO 🙏

Copilot

Pull request overview

This PR updates the durabletask workflow gRPC client in ext/dapr-ext-workflow to make wait_for_orchestration_start and wait_for_orchestration_completion resilient to transient sidecar/runtime unavailability (e.g., immediately after a daprd restart), by retrying specific gRPC status codes with exponential backoff.

Changes:

Adds a shared _call_with_transient_retry helper that retries FAILED_PRECONDITION and UNAVAILABLE with capped exponential backoff.
Routes both wait_for_orchestration_start and wait_for_orchestration_completion through the new retry helper and maps deadline/budget exhaustion to TimeoutError.
Introduces a private _TransientTimeout sentinel exception to preserve the public timeout behavior.
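
One plausible reading of the sentinel pattern mentioned above, sketched under assumptions: the helper signals "out of time" with a private exception, and the public method translates it into the built-in `TimeoutError` at the boundary. Only the name `_TransientTimeout` comes from the PR; `_call_with_budget`, `wait_for_state`, and the `budget_left` callable are hypothetical, and the real helper's structure may differ.

```python
class _TransientTimeout(Exception):
    """Private marker raised inside the retry helper; never shown to callers."""


def _call_with_budget(call, budget_left):
    # Hypothetical helper: give up with the private sentinel once no budget remains.
    if budget_left() <= 0:
        raise _TransientTimeout()
    return call()


def wait_for_state(call, budget_left):
    """Public-facing wrapper: callers only ever see the built-in TimeoutError."""
    try:
        return _call_with_budget(call, budget_left)
    except _TransientTimeout:
        # Translate at the public boundary so the documented exception type is unchanged.
        raise TimeoutError("timed out waiting for the orchestration") from None
```

Keeping the sentinel private lets the helper tell "the budget ran out" apart from any `TimeoutError` raised for other reasons inside `call`.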


…ion_*

wait_for_orchestration_start and wait_for_orchestration_completion call the workflow runtime through the local Dapr sidecar. Immediately after a sidecar restart (placement re-dissemination not yet applied, actor registration still propagating, etc.), the sidecar can return FAILED_PRECONDITION or UNAVAILABLE for an instance whose persistent state is intact. The previous implementation surfaced these as a hard error to the caller, so a client polling a long-running workflow would fail permanently even though the workflow itself was recoverable.

Apply the same fix to both the sync and async clients:

- TaskHubGrpcClient (sync) and AsyncTaskHubGrpcClient (async) both route their wait methods through a _call_with_transient_retry helper. The async variant uses asyncio.sleep; otherwise identical.
- Retry FAILED_PRECONDITION and UNAVAILABLE with capped exponential backoff (0.5s, doubling, cap 5s).
- Respect the caller's timeout. timeout in (0, None) means unbounded. The first call passes the user's timeout verbatim so behavior on a healthy runtime is unchanged. On retry, both the sleep and the per-call gRPC deadline are clamped to the remaining budget against a monotonic deadline anchored to the start of the loop, so neither one can overshoot the user-provided timeout.
- DEADLINE_EXCEEDED and budget exhaustion both surface as the public TimeoutError (preserved through a private _TransientTimeout sentinel; moved below the import block to satisfy E402).
- Non-transient RpcErrors propagate immediately, unchanged.

Behavior on a healthy runtime is unchanged: the first call succeeds and no retry loop runs.

Adds tests covering the retry behaviors: retry-then-succeed for both transient codes, exhaustion surfacing as TimeoutError, and non-transient codes propagating without retry.

Signed-off-by: Javier Aliaga <javier@diagrid.io>
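
The backoff schedule and the monotonic budget described in this commit message can be sketched as three small pure functions. The numbers (0.5s initial delay, doubling, 5s cap, and 0/None meaning unbounded) come from the message; the function names `backoff_delays`, `make_deadline`, and `remaining_budget` are hypothetical and are not taken from the SDK.

```python
from typing import Iterator, Optional


def backoff_delays(initial: float = 0.5, cap: float = 5.0) -> Iterator[float]:
    """Yield an endless doubling sequence of delays, each capped at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay *= 2


def make_deadline(timeout: Optional[float], start: float) -> Optional[float]:
    """Anchor the deadline to the loop start; a timeout of 0 or None means unbounded."""
    return None if not timeout else start + timeout


def remaining_budget(deadline: Optional[float], now: float) -> Optional[float]:
    """Seconds left before the deadline (never negative), or None when unbounded."""
    if deadline is None:
        return None
    return max(0.0, deadline - now)
```

In real use, `start` and `now` would come from `time.monotonic()`, so wall-clock adjustments cannot stretch or shrink the budget. Both the sleep and the per-call deadline would then be clamped with `min(..., remaining_budget(deadline, now))`.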

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.

    async def wait_for_orchestration_start(
        self, instance_id: str, *, fetch_payloads: bool = False, timeout: int = 0
    ) -> Optional[WorkflowState]:


Cap continuous transient-error retries in unbounded mode (timeout=0/None) at 30s via _MAX_TRANSIENT_RETRY_SECONDS, then re-raise the original RpcError. This preserves the pre-retry contract: timeout=0 still waits indefinitely for a healthy workflow and never raises TimeoutError, but a permanently-unavailable sidecar now surfaces the original error instead of retrying forever.

Also address review feedback:

- Type wait_for_orchestration_* timeout as Optional[int] (None is a supported, tested input meaning unbounded).
- Fix sync "up to Nones" log message to treat None as indefinite, matching the async client.
- Correct the retry-helper docstring: the first call passes grpc_timeout (None when unbounded), not the timeout value verbatim.

Add a test covering unbounded-mode transient exhaustion surfacing as the original RpcError (not TimeoutError, not a hang).

Signed-off-by: Javier Aliaga <javier@diagrid.io>
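
A rough sketch of the unbounded-mode grace window described in this commit, under assumptions: only the 30s cap, the two transient codes, and the "re-raise the original error" outcome come from the message. `FakeRpcError` and `retry_unbounded` are hypothetical, and starting the window at the first transient error is this sketch's choice, not a confirmed detail of the implementation.

```python
import time


class FakeRpcError(Exception):
    """Hypothetical stand-in for grpc.RpcError, carrying a status code name."""

    def __init__(self, code: str):
        super().__init__(code)
        self._code = code

    def code(self) -> str:
        return self._code


MAX_TRANSIENT_RETRY_SECONDS = 30.0  # cap taken from the commit message


def retry_unbounded(call, *, clock=time.monotonic, sleep=time.sleep):
    """Retry transient errors for at most 30s, then re-raise the original error."""
    transient_deadline = None
    backoff = 0.5
    while True:
        try:
            return call()
        except FakeRpcError as err:
            if err.code() not in ("FAILED_PRECONDITION", "UNAVAILABLE"):
                raise  # non-transient: propagate unchanged
            now = clock()
            if transient_deadline is None:
                # Assumption: the grace window opens at the first transient error.
                transient_deadline = now + MAX_TRANSIENT_RETRY_SECONDS
            if now >= transient_deadline:
                raise  # the original RpcError, not TimeoutError
            # Never sleep past the end of the grace window.
            sleep(min(backoff, 5.0, transient_deadline - now))
            backoff *= 2
```

Injecting `clock` and `sleep` lets a test advance a fake clock, so the 30s window can be exercised without waiting 30 real seconds.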

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

javier-aliaga · 2026-06-01T15:59:16Z

+                sleep_for = min(backoff, 5.0)
+                if remaining is not None:
+                    sleep_for = min(sleep_for, remaining)
+                if transient_deadline is not None:
+                    sleep_for = min(sleep_for, transient_deadline - now)


Intentional, keeping as-is. These transient codes (FAILED_PRECONDITION/UNAVAILABLE) return immediately rather than long-polling, so skipping the backoff near the deadline would turn the final window into a tight retry loop hammering the sidecar — up to ~5s of rapid-fire calls, since backoff caps at 5s. The clamp-then-stop behavior deliberately backs off instead. The cost is at most one skipped final attempt right at the deadline, which is an acceptable trade for not flooding a struggling sidecar.
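
The clamp in the quoted diff can be isolated as a pure function to make the behavior near the deadline easier to see. This is a sketch only: `next_sleep` and its parameter names are hypothetical, and `transient_remaining` stands in for `transient_deadline - now` from the diff.

```python
from typing import Optional


def next_sleep(backoff: float,
               remaining: Optional[float],
               transient_remaining: Optional[float]) -> float:
    """Return the delay before the next retry, clamped to every active budget."""
    sleep_for = min(backoff, 5.0)  # backoff never exceeds the 5s cap
    if remaining is not None:
        # Bounded wait: do not sleep past the caller's timeout.
        sleep_for = min(sleep_for, remaining)
    if transient_remaining is not None:
        # Unbounded wait: do not sleep past the transient grace window.
        sleep_for = min(sleep_for, transient_remaining)
    return sleep_for
```

With a 4s backoff and 1.2s of budget left, `next_sleep(4.0, 1.2, None)` gives 1.2: the client sleeps through the remaining window and then stops, rather than issuing back-to-back calls until the deadline.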

javier-aliaga · 2026-06-01T15:59:22Z

+                sleep_for = min(backoff, 5.0)
+                if remaining is not None:
+                    sleep_for = min(sleep_for, remaining)
+                if transient_deadline is not None:
+                    sleep_for = min(sleep_for, transient_deadline - now)


Same as the sync client — keeping as-is. Skipping the clamped backoff would busy-loop against the sidecar because these transient codes return immediately; the clamp-then-stop behavior is intentional. See the explanation on the sync client.py thread.

sicoyle

thank you!!

javier-aliaga requested review from a team as code owners June 1, 2026 13:49

javier-aliaga added the backport release-1.18 label Jun 1, 2026

sicoyle requested a review from Copilot June 1, 2026 14:14

Copilot started reviewing on behalf of sicoyle June 1, 2026 14:14 View session

javier-aliaga force-pushed the fix/workflow-client-transient-retry branch 2 times, most recently from e58655d to 7f9f3a3 Compare June 1, 2026 14:17

Copilot AI reviewed Jun 1, 2026


javier-aliaga force-pushed the fix/workflow-client-transient-retry branch from 7f9f3a3 to dc5ee8c Compare June 1, 2026 15:00

javier-aliaga requested a review from Copilot June 1, 2026 15:00

Copilot started reviewing on behalf of javier-aliaga June 1, 2026 15:01 View session

javier-aliaga force-pushed the fix/workflow-client-transient-retry branch from dc5ee8c to 09f45af Compare June 1, 2026 15:01

Copilot AI reviewed Jun 1, 2026


javier-aliaga requested a review from Copilot June 1, 2026 15:46

Copilot started reviewing on behalf of javier-aliaga June 1, 2026 15:47 View session

Copilot AI reviewed Jun 1, 2026


Merge branch 'main' into fix/workflow-client-transient-retry

3686e28

sicoyle approved these changes Jun 1, 2026


sicoyle added this pull request to the merge queue Jun 1, 2026

Merged via the queue into dapr:main with commit 71b26be Jun 1, 2026
17 of 19 checks passed

dapr-bot mentioned this pull request Jun 1, 2026

[Backport release-1.18] fix(ext-workflow): retry transient gRPC errors in wait_for_orchestration #1075

Open

3 participants
