fix: cancel periodic status check task when process raises ConstraintError#228

Open
Copilot wants to merge 3 commits into main from copilot/fix-status-check-constraint-error
Copilot AI commented Mar 10, 2026

After a process fails (e.g. via ConstraintError), the _periodic_status_check background task on waiting components was not being cancelled — causing it to log "State backend not connected, skipping status check" every 20s indefinitely after process teardown.

Summary

Python's asyncio.wait() does not propagate cancellation to its child tasks when the outer task is cancelled. When LocalProcess.run()'s TaskGroup cancelled component B's task due to component A raising ConstraintError, the CancelledError raised in _io_read_with_status_check left the _periodic_status_check task orphaned in the event loop.
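
The orphaning behaviour described above can be reproduced without plugboard. The following is a minimal sketch using only the standard library; `heartbeat`, `wait_on_children`, and `demo_orphan` are hypothetical names invented for this illustration, and the long `asyncio.sleep` merely stands in for a blocking read.

```python
import asyncio


async def heartbeat() -> None:
    """Stand-in for a periodic background check: loops until cancelled."""
    while True:
        await asyncio.sleep(0.01)


async def wait_on_children(children: list[asyncio.Task]) -> None:
    """Create two child tasks and block in asyncio.wait(), with no cleanup."""
    children.append(asyncio.create_task(heartbeat()))
    children.append(asyncio.create_task(asyncio.sleep(3600)))  # stand-in for a read
    await asyncio.wait(children, return_when=asyncio.FIRST_COMPLETED)


async def demo_orphan() -> tuple[bool, bool]:
    children: list[asyncio.Task] = []
    outer = asyncio.create_task(wait_on_children(children))
    await asyncio.sleep(0.05)  # let the outer task reach asyncio.wait()
    outer.cancel()  # roughly what a TaskGroup does to sibling tasks on failure
    try:
        await outer
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0.05)  # give the event loop time to settle
    still_running = not children[0].done()
    # Tidy up so the demo itself does not leak tasks.
    for task in children:
        task.cancel()
    await asyncio.gather(*children, return_exceptions=True)
    return outer.cancelled(), still_running


if __name__ == "__main__":
    print(asyncio.run(demo_orphan()))
```

The outer task ends up cancelled while the heartbeat child is still running, which is the same shape as the leftover status check in the issue.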

Changes

  • plugboard/component/component.py: In _io_read_with_status_check, wrap asyncio.wait() in try/except BaseException and explicitly cancel status_task before re-raising. io_task is intentionally left uncancelled — cancelling it leaves stale entries in IOController._read_tasks that break subsequent reads.
# Before
done, pending = await asyncio.wait(
    (
        asyncio.create_task(self._periodic_status_check()),
        asyncio.create_task(self.io.read(timeout=read_timeout)),
    ),
    return_when=asyncio.FIRST_COMPLETED,
)

# After
status_task = asyncio.create_task(self._periodic_status_check())
io_task = asyncio.create_task(self.io.read(timeout=read_timeout))
try:
    done, pending = await asyncio.wait(
        (status_task, io_task),
        return_when=asyncio.FIRST_COMPLETED,
    )
except BaseException:
    status_task.cancel()
    raise
  • tests/integration/test_process_with_components_run.py: Added test_constraint_error_stops_background_status_check — patches IO_READ_TIMEOUT_SECONDS to 0.1s, runs a process where the producer raises ConstraintError, and asserts the consumer's background status check task count does not increase after the process fails.
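
As a rough, plugboard-free sketch of the two changes above: the first function mirrors the "After" pattern, and the second checks that the background call count stops growing once the outer task is cancelled, which is the idea behind the new test. `periodic_check`, `read_with_check`, and `count_after_cancel` are hypothetical names for this illustration and are not part of the PR.

```python
import asyncio


async def periodic_check(calls: list[float], interval: float) -> None:
    """Stand-in for a periodic status check: records a timestamp each interval."""
    while True:
        await asyncio.sleep(interval)
        calls.append(asyncio.get_running_loop().time())


async def read_with_check(calls: list[float]) -> None:
    """Cancel the status task explicitly if the wait is interrupted."""
    status_task = asyncio.create_task(periodic_check(calls, 0.01))
    io_task = asyncio.create_task(asyncio.sleep(3600))  # stand-in for a blocking read
    try:
        await asyncio.wait((status_task, io_task), return_when=asyncio.FIRST_COMPLETED)
    except BaseException:
        # As in the PR, only the status task is cancelled here. In this sketch
        # the leftover io_task is cleaned up by asyncio.run() at shutdown.
        status_task.cancel()
        raise


async def count_after_cancel() -> tuple[int, int]:
    calls: list[float] = []
    outer = asyncio.create_task(read_with_check(calls))
    await asyncio.sleep(0.05)  # let a few checks happen
    outer.cancel()
    try:
        await outer
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0)  # let the cancelled status task finish unwinding
    before = len(calls)
    await asyncio.sleep(0.05)  # several intervals: an orphan would keep counting
    return before, len(calls)


if __name__ == "__main__":
    before, after = asyncio.run(count_after_cancel())
    print(f"checks before: {before}, after: {after}")
```

Catching `BaseException` rather than `Exception` matters here because `asyncio.CancelledError` derives from `BaseException` on Python 3.8 and later, so a plain `except Exception` would not run the cleanup on cancellation.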

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED (dns block)
  • metadata.google.internal
    • Triggering command: /usr/bin/python3 /usr/bin/python3 /home/REDACTED/.local/lib/python3.12/site-packages/ray/dashboard/dashboard.py --host=127.0.0.1 --port=8265 --port-retries=50 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2026-03-10_12-24-30_696966_4740/logs --session-dir=/tmp/ray/session_2026-03-10_12-24-30_696966_4740 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:41467 --cluster-id-hex=315478065acf40787bae5a5a2d085e45ba453b1cdf3b6bb9491c8a13 --node-ip-address=127.0.0.1 --stdout-filepath=/tmp/ray/session_2026-03-10_12-24-30_696966_4740/logs/dashboard.out --stderr-filepath=/tmp/ray/session_2026-03-10_12-24-30_696966_4740/logs/dashboard.err de/node/bin/bash (dns block)
    • Triggering command: /usr/bin/python3 /usr/bin/python3 /home/REDACTED/.local/lib/python3.12/site-packages/ray/dashboard/dashboard.py --host=127.0.0.1 --port=8265 --port-retries=50 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2026-03-10_12-25-36_456083_5403/logs --session-dir=/tmp/ray/session_2026-03-10_12-25-36_456083_5403 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:36975 --cluster-id-hex=d4d2c57ba441af7a721e338a987f2c67a082d7e7dabcda1628e8b01c --node-ip-address=127.0.0.1 --stdout-filepath=/tmp/ray/session_2026-03-10_12-25-36_456083_5403/logs/dashboard.out --stderr-filepath=/tmp/ray/session_2026-03-10_12-25-36_456083_5403/logs/dashboard.err /home/REDACTED/.config/composer/vendor/bin/git (dns block)
    • Triggering command: /usr/bin/python3 /usr/bin/python3 /home/REDACTED/.local/lib/python3.12/site-packages/ray/dashboard/dashboard.py --host=127.0.0.1 --port=8265 --port-retries=50 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2026-03-10_12-28-25_364935_6030/logs --session-dir=/tmp/ray/session_2026-03-10_12-28-25_364935_6030 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:36649 --cluster-id-hex=0a7c71f1c8306fa045621fa679f13bfe448ffad9bf7a61dfae288243 --node-ip-address=127.0.0.1 --stdout-filepath=/tmp/ray/session_2026-03-10_12-28-25_364935_6030/logs/dashboard.out --stderr-filepath=/tmp/ray/session_2026-03-10_12-28-25_364935_6030/logs/dashboard.err sh credential.usernbash (dns block)

Original prompt

This section details the original issue you should resolve.

<issue_title>bug: Status check keeps running after process has raised ConstraintError</issue_title>
<issue_description>### Summary

Consider the following MRE:

from plugboard.component import Component, IOController as IO
from plugboard.process import LocalProcess
from plugboard.connector import AsyncioConnector
from plugboard.schemas import ConnectorSpec
from plugboard.exceptions import ConstraintError

class A(Component):
    io = IO(outputs=["out_1"])

    async def step(self) -> None:
        raise ConstraintError("This is a constraint error from component A.")

class B(Component):
    io = IO(inputs=["in_1"])

    async def step(self) -> None:
        pass

process = LocalProcess(
    components=[A(name="component-a"), B(name="component-b")],
    connectors=[
        AsyncioConnector(
            spec=ConnectorSpec(source="component-a.out_1", target="component-b.in_1"),
        ),
    ],
)
async with process:
    await process.run()

This will raise the ConstraintError as expected. However, it will continue to emit log messages afterwards every 20s, for example:

{"cls":"B","name":"component-b","job_id":"Job_I6UAzJs5zhQGCfCi","event":"State backend not connected, skipping status check","level":"warning","timestamp":"2026-03-10T12:02:38.077249Z","module":"component","process":17506}

The process should instead be cleaned up after the exception was raised.

Version Information

Plugboard version: 0.6.0
Platform: Linux-6.6.84.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version: 3.12.11
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI and others added 2 commits March 10, 2026 12:31
When a component's run task is cancelled (e.g. due to ConstraintError from
another component), the asyncio.wait() call in _io_read_with_status_check()
did not propagate the cancellation to its child tasks. This caused the
_periodic_status_check() background task to keep running indefinitely after
the process ended, logging 'State backend not connected, skipping status
check' every 20 seconds.

Fix: catch BaseException in _io_read_with_status_check() and explicitly cancel
the status_task (periodic status check) before re-raising. The io_task is
intentionally left uncancelled to avoid breaking existing behavior in
IOController._read_tasks state management.

Also adds an integration test to verify the fix using a short timeout patch.

Co-authored-by: toby-coleman <13170610+toby-coleman@users.noreply.github.com>
…bility

Co-authored-by: toby-coleman <13170610+toby-coleman@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix status check running after ConstraintError in process" to "fix: cancel periodic status check task when process raises ConstraintError" Mar 10, 2026
@github-actions

Benchmark comparison for 64adfa9a (base) vs b1e024b9 (PR)


------------------------------------------------------------------------------------------------------------------ benchmark: 2 tests ------------------------------------------------------------------------------------------------------------------
Name (time in ms)                                                                         Min                 Max                Mean             StdDev              Median               IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_process_run (main/.benchmarks/Linux-CPython-3.12-64bit/0001_base)     447.6668 (1.0)      458.3038 (1.0)      452.2086 (1.0)       4.0220 (1.0)      452.6681 (1.0)      4.9563 (1.0)           2;0  2.2114 (1.0)           5           1
test_benchmark_process_run (pr/.benchmarks/Linux-CPython-3.12-64bit/0001_pr)         457.8914 (1.02)     487.4245 (1.06)     464.8393 (1.03)     12.7069 (3.16)     459.4070 (1.01)     9.9649 (2.01)          1;1  2.1513 (0.97)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
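
The OPS and relative figures in the table follow directly from the mean times. A small sketch of that arithmetic, using the two mean values from the table (the helper name `ops` is made up for this illustration):

```python
# Mean run times in milliseconds, copied from the benchmark table above.
BASE_MEAN_MS = 452.2086
PR_MEAN_MS = 464.8393


def ops(mean_ms: float) -> float:
    """Operations per second: 1 / mean, with the mean converted to seconds."""
    return 1.0 / (mean_ms / 1000.0)


base_ops = ops(BASE_MEAN_MS)
pr_ops = ops(PR_MEAN_MS)
mean_ratio = PR_MEAN_MS / BASE_MEAN_MS

print(f"base OPS: {base_ops:.4f}")      # 2.2114
print(f"PR OPS:   {pr_ops:.4f}")        # 2.1513
print(f"mean ratio: {mean_ratio:.2f}")  # 1.03
```

The gap between the means is about 12.6 ms, close to the PR run's own standard deviation of 12.7 ms over 5 rounds, so this comparison alone probably does not establish a real slowdown.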

@toby-coleman toby-coleman marked this pull request as ready for review March 10, 2026 20:26
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Development

Successfully merging this pull request may close these issues.

bug: Status check keeps running after process has raised ConstraintError

2 participants