fix: drain poll_task after cancel to prevent RuntimeError on closed loop in ActivityWorker/NexusWorker #1560

Open
kenanzamp wants to merge 3 commits into temporalio:main from kenanzamp:fix/shutdown-poll-task-race-condition

@kenanzamp kenanzamp commented May 28, 2026

fix: drain poll_task after cancel to prevent RuntimeError("Event loop is closed") in ActivityWorker and NexusWorker

Summary

_ActivityWorker.run() and _NexusWorker.run() both follow this shutdown pattern when a worker exception fires:

poll_task.cancel()
await exception_task   # re-raises the worker error

poll_task.cancel() only requests cancellation: it schedules delivery of a CancelledError and returns immediately, without waiting for the loop to process it. Meanwhile, the Rust bridge (pyo3-async-runtimes) may still be delivering a result via call_soon_threadsafe, which places a callback in the event loop's _ready queue.
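The "cancel() returns immediately" behaviour can be observed with plain asyncio, independent of the SDK. The sketch below is illustrative only; poll_forever and observe_cancel are hypothetical names and stand in for the real poll loop.

```python
import asyncio


async def poll_forever() -> None:
    # Hypothetical stand-in for a poll loop: blocks until cancelled.
    await asyncio.Event().wait()


async def observe_cancel() -> tuple[bool, bool]:
    task = asyncio.create_task(poll_forever())
    await asyncio.sleep(0)  # let the task start and block on the event

    task.cancel()  # only *requests* cancellation
    done_right_after_cancel = task.done()

    # Yield to the loop until the CancelledError has actually been delivered.
    await asyncio.wait([task])
    return done_right_after_cancel, task.cancelled()


result = asyncio.run(observe_cancel())
print(result)
```

The first element shows that the task is not yet finished when cancel() returns; the second shows it is cancelled once the loop has had a chance to run.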

At this point every Python-visible asyncio.Task is done. asyncio.run()'s _cancel_all_tasks sweep finds nothing to cancel and calls loop.close() immediately. If loop.close() races with the pending call_soon_threadsafe callback, Python raises:

RuntimeError: Event loop is closed
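The error itself is easy to reproduce without the Rust bridge. The following is a minimal simulation, assuming a plain Python thread as a stand-in for the foreign runtime; simulate_late_callback and its helpers are hypothetical names, and the threading.Event makes the ordering deterministic rather than racy.

```python
import asyncio
import threading


def simulate_late_callback() -> str:
    holder: dict[str, asyncio.AbstractEventLoop] = {}
    outcome: dict[str, str] = {}
    loop_closed = threading.Event()

    def foreign_thread() -> None:
        # Stand-in for a non-Python runtime delivering a result late:
        # wait until asyncio.run() has already closed the loop.
        loop_closed.wait()
        try:
            holder["loop"].call_soon_threadsafe(lambda: None)
            outcome["result"] = "delivered"
        except RuntimeError as err:
            outcome["result"] = str(err)

    async def main() -> None:
        holder["loop"] = asyncio.get_running_loop()

    thread = threading.Thread(target=foreign_thread)
    thread.start()
    asyncio.run(main())  # closes the loop on exit
    loop_closed.set()  # now let the "bridge" deliver its callback
    thread.join()
    return outcome["result"]


print(simulate_late_callback())
```

In the real worker the ordering is not forced like this; it depends on timing between the loop thread and the bridge thread, which is why the failure was intermittent before v1.13.0.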

This race existed before v1.13.0 but the window was tiny. The v1.13.0 Worker plugin-chain refactor (introduced in #1337 / 5e93e63) wrapped _ActivityWorker.run() in additional async layers, consistently widening the window and making the race reliably reproducible.

Root cause

The race is entirely on the Python side — it lives in the interaction between asyncio.Task.cancel(), asyncio.run() teardown, and call_soon_threadsafe. No changes to sdk-core (Rust) are required or proposed.

Fix

Insert await asyncio.wait([poll_task]) immediately after poll_task.cancel():

poll_task.cancel()
await asyncio.wait([poll_task])   # drain: lets any pending Rust callback fire while loop is still open
await exception_task

asyncio.wait() yields to the event loop for at least one iteration, flushing any queued call_soon_threadsafe callbacks before loop.close() is called. Unlike awaiting poll_task directly, it does not raise CancelledError for a cancelled task; it simply returns once the task is done. The same fix is applied symmetrically to _NexusWorker.run().
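The shape of the patched shutdown path can be sketched in isolation. This is not the SDK code: WorkerError, poll_forever, fail, and run_with_drain are hypothetical stand-ins, and the sketch only demonstrates that the drain step does not raise and that the worker error is still re-raised afterwards.

```python
import asyncio


class WorkerError(Exception):
    """Hypothetical stand-in for the error raised by a failing worker."""


async def poll_forever() -> None:
    await asyncio.Event().wait()  # blocks until cancelled


async def fail() -> None:
    raise WorkerError("boom")


async def run_with_drain() -> None:
    poll_task = asyncio.create_task(poll_forever())
    exception_task = asyncio.create_task(fail())

    # Wait until either task finishes; here the exception task fails first.
    await asyncio.wait(
        [poll_task, exception_task], return_when=asyncio.FIRST_COMPLETED
    )

    poll_task.cancel()
    await asyncio.wait([poll_task])  # drain: does not raise CancelledError
    await exception_task  # re-raises WorkerError


def demo() -> str:
    try:
        asyncio.run(run_with_drain())
    except WorkerError as err:
        return f"re-raised: {err}"
    return "no error"


print(demo())
```

Using asyncio.wait() rather than `await poll_task` avoids wrapping the drain in a try/except for CancelledError, which keeps the original error path unchanged.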

Testing

tests/worker/test_shutdown_race.py (new file) reproduces the race deterministically by patching call_soon_threadsafe to defer callbacks until after poll_task.cancel() returns. Without the fix, 10/10 iterations raise RuntimeError("Event loop is closed"). With the fix, all iterations complete cleanly.

Workaround context

This fix resolves a race first observed in production at Zamp (see workaround at https://github.com/Zampfi/pantheon/pull/5329). The workaround wrapped the worker in a retry loop; this fix removes the need for that workaround by addressing the root cause in the SDK itself.

Checklist

  • Reproducer confirmed (see tests/worker/test_shutdown_race.py)
  • Fix applied to _activity.py and _nexus.py
  • ruff check --select I — clean
  • ruff format --check — clean
  • mypy — pre-existing error in workflow_sandbox/_importer.py (unrelated, present on main)
  • No changes to sdk-core (Rust)
  • CLA: @kenanzamp (human) will sign the Temporal CLA

Fixes the RuntimeError: Event loop is closed shutdown race made consistently reproducible in v1.13.0.

When a worker exception fires, _ActivityWorker.run() (and the identical
path in _NexusWorker.run()) calls poll_task.cancel() and then immediately
awaits the exception task and returns.  The cancel() call schedules a
CancelledError callback on the event loop but does not wait for it to
be delivered.  At this point every Python-visible Task is done, so
asyncio.run()'s _cancel_all_tasks sweep finds nothing to wait for and
calls loop.close() immediately.

The Rust bridge (pyo3-async-runtimes) delivers its result via
call_soon_threadsafe(), which places a callback in the loop's _ready
queue.  If loop.close() races with that callback, the callback fires
against an already-closed loop and raises RuntimeError("Event loop is
closed").  This is a latent race that became consistently reproducible
starting in v1.13.0 after the Worker plugin-chain refactor wrapped
_ActivityWorker.run() in additional async layers, increasing the window
between poll_task.cancel() and loop teardown.

Fix: insert `await asyncio.wait([poll_task])` immediately after
poll_task.cancel().  asyncio.wait() yields to the event loop for at
least one iteration, allowing any pending Rust-side call_soon_threadsafe
callbacks to be processed while the loop is still open.  asyncio.wait()
does not raise even if the task was already cancelled, so it is safe
as a pure drain primitive.  The same fix is applied symmetrically to
_NexusWorker.run() which has an identical structure.

A regression test (tests/worker/test_shutdown_race.py) is included that
simulates the race by patching call_soon_threadsafe to defer callbacks
until after cancel(), confirming RuntimeError before the fix and clean
shutdown after it.
@kenanzamp kenanzamp requested a review from a team as a code owner May 28, 2026 08:53

CLAassistant commented May 28, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ kenanzamp
❌ pace-bot

kenanzamp and others added 2 commits May 28, 2026 11:10
….so needed)

The previous test_fixed_activity_worker_poll_exits_cleanly and
test_fixed_nexus_worker_poll_exits_cleanly tests tried to import
_ActivityWorker/_NexusWorker from the source tree, which fails in CI
environments where the Rust bridge (.so) is not compiled.  Replace the
dynamic import + inspect.getsource() approach with a direct open() read of
the source file, which works in any environment.  All 3 tests now pass
without the compiled bridge.