Skip to content

fix(stdio): recover from a reload-orphaned socket instead of hanging #1217

Open
mertakdut wants to merge 3 commits into CoplayDev:beta from mertakdut:beta

Conversation

@mertakdut

@mertakdut mertakdut commented Jun 27, 2026


Description

Every script recompile makes Unity do a domain reload, which tears down and rebuilds the bridge. The Unity process itself keeps running, so the OS never resets the old loopback connection, and the Python client is left holding a socket that still looks connected but is actually dead. The old code kept reusing that socket: each recv waited the full connection_timeout, and those waits piled up across retries with no overall limit, so a tool call could hang for several minutes. I ran into this with refresh_unity, run_tests, and get_test_job stuck for 10+ minutes at a time.

The fix does two things. When the client sees that Unity is reloading, it drops the stale socket so the next command reconnects to the fresh bridge instead of writing into the dead one. And it puts a single time limit on the whole command, so retries can't stack up forever.
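
The second half of the fix, one time limit for the whole command, can be sketched as follows. This is a minimal illustration and not the PR's actual code: `send_with_shared_deadline`, `CommandDeadlineExceeded`, and the `attempt` callable are hypothetical names, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import time


class CommandDeadlineExceeded(TimeoutError):
    """Raised when the whole command runs out of its total budget (hypothetical name)."""


def send_with_shared_deadline(attempt, total_timeout, max_attempts=5, clock=time.monotonic):
    """Retry `attempt`, but never start a new attempt past one shared deadline.

    `attempt` is any callable that returns a result or raises TimeoutError.
    """
    deadline = clock() + total_timeout  # computed once, shared by every retry
    for attempt_no in range(1, max_attempts + 1):
        if clock() >= deadline:
            raise CommandDeadlineExceeded(f"gave up before attempt {attempt_no}")
        try:
            return attempt()
        except TimeoutError:
            continue  # a per-attempt timeout; the shared deadline decides whether to retry
    raise CommandDeadlineExceeded("all attempts timed out")
```

Without the shared deadline, ten attempts that each wait 30 seconds would block for 300 seconds. With a 90-second budget the loop above stops after the third attempt.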

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Test update

Changes Made

  1. Server/src/transport/legacy/unity_connection.py
  • Reconnect-on-reload: when the preflight status reports reloading, drop the socket (under _io_lock, so it is never closed under a concurrent send/recv) so the next command reconnects to the rebuilt bridge instead of reusing a half-open one.
  • Per-command total deadline: new optional deadline arg on send_command, derived from command_total_timeout and enforced at the top of the attempt loop; threaded through send_command_with_retry (one shared deadline across the reload-wait loop) so the whole call has a single ceiling rather than one per inner send.
  2. Server/src/core/config.py: new command_total_timeout (default 90s).
  3. Server/tests/integration/test_connection_deadline.py: deterministic regression tests using a fake bridge that completes the handshake then goes silent.
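
A rough sketch of the reconnect-on-reload idea from item 1, assuming a status dict with a `reloading` key. `BridgeClient`, `drop_socket`, and `on_preflight_status` are hypothetical stand-ins for illustration; only the `_io_lock` name comes from the description above.

```python
import socket
import threading


class BridgeClient:
    """Hypothetical stand-in for the real connection class."""

    def __init__(self):
        self.sock = None
        self._io_lock = threading.Lock()  # also held by send/recv in this sketch's model

    def drop_socket(self):
        """Close and forget the current socket while holding the I/O lock."""
        with self._io_lock:
            sock, self.sock = self.sock, None
            if sock is not None:
                try:
                    sock.close()
                except OSError:
                    pass  # already closed or reset; nothing more to do

    def on_preflight_status(self, status):
        """Drop the socket when the (assumed) status dict reports a reload."""
        if status.get("reloading"):
            self.drop_socket()
            return True
        return False
```

Because the socket is closed only while the lock is held, a concurrent send or receive that also takes the lock can never see the socket disappear mid-call. The next command finds `sock` set to `None` and reconnects.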

Compatibility / Package Source

  • Unity version(s) tested: 6000.5.0f1 (Unity 6.5)
  • Package source used (#beta, #main, tag, branch, or file:): Python server run from local Server/ source (uvx --from ); the Unity C# package was unchanged (beta).
  • Resolved commit hash from Packages/packages-lock.json (if using a Git package URL): N/A -- this PR changes only the Python server; no Unity C# package change.

Testing/Screenshots/Recordings

  • Python tests (cd Server && uv run pytest tests/ -v)
  • Unity EditMode tests
  • Unity PlayMode tests
  • Package import/compile check
  • Not applicable (explain why in Additional Notes)

Documentation Updates

  • I have added/removed/modified tools or resources
  • If yes, I have updated all documentation files using:
    • The LLM prompt at tools/UPDATE_DOCS_PROMPT.md (recommended)
    • Manual review of the generated changes

Related Issues

Relates to #891 (MCP gets stuck after a Unity reload until manually nudged) and #657 (more deterministic / bounded reload-wait in unity_connection.py). The same reload/test-boundary connection-drop pattern also affects the HTTP transport (e.g. #1207, #1164) via a different code path; this PR addresses the stdio client only.

Additional Notes

  1. This change is Python-only (no C# / Unity API change), so the Unity EditMode/PlayMode and package-import/compile checks are not applicable.
  2. The bug is timing/state-dependent and does not reproduce on demand, so the new test reproduces it deterministically: a fake bridge that handshakes then goes silent drives the real connection code, failing (unbounded) before the fix and bounded after. It was also verified live on Unity 6.5 against the exact workflow that originally hung (repeated edit -> force refresh -> run_tests / get_test_job): previously wedged for minutes; now every call returns within ~10s and reconnects cleanly across each reload.
  3. Related but intentionally out of scope: after heavy reload churn the Unity-side bridge sometimes fails to resume (its status file is deleted on stop and not recreated, leaving the bridge down until the editor restarts). That is a separate, bridge-side issue.
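
The "handshakes then goes silent" fake bridge from note 2 might look roughly like this. It is a simplified sketch, not the fixture in test_connection_deadline.py: the greeting bytes, the `ping` payload, and both function names are invented for illustration.

```python
import socket
import threading
import time


def start_silent_bridge(greeting=b"HELLO\n"):
    """Listen on an ephemeral loopback port, send a greeting, then never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    stop = threading.Event()

    def serve():
        conn, _ = server.accept()
        with conn:
            conn.sendall(greeting)
            stop.wait()  # hold the connection open without sending anything

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return server, stop, thread


def call_with_deadline(port, total_timeout):
    """Connect, read the greeting, send a command, and wait at most total_timeout."""
    deadline = time.monotonic() + total_timeout
    with socket.create_connection(("127.0.0.1", port), timeout=total_timeout) as sock:
        sock.recv(64)  # greeting arrives immediately
        sock.sendall(b"ping\n")
        # Never pass 0 to settimeout(): that would switch the socket to non-blocking mode.
        sock.settimeout(max(deadline - time.monotonic(), 0.001))
        return sock.recv(64)  # raises socket.timeout because the bridge stays silent
```

A test built on this can assert that the call fails within the budget instead of hanging, which is the bounded-after-the-fix behaviour described above.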

Summary by CodeRabbit


  • New Features

    • Added a command_total_timeout configuration to cap how long commands can run across retries.
  • Bug Fixes

    • Improved Unity command sending so timeouts are enforced consistently, even after a reload leaves the socket half-open.
    • Strengthened reload handling to prevent concurrent send/receive races and to ensure timed-out sockets are reset.
  • Tests

    • Added integration coverage for silent/half-open TCP behavior and reload-related retry timing and error reasons.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3abcffd5-4c3a-49b7-8486-cf3c37310117

📥 Commits

Reviewing files that changed from the base of the PR and between eb91b98 and 36bf6c0.

📒 Files selected for processing (1)
  • Server/src/transport/legacy/unity_connection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • Server/src/transport/legacy/unity_connection.py

📝 Walkthrough


Adds a command_total_timeout setting and threads a shared deadline through Unity command sending and retry handling. Reload-detected sockets are reset under lock, and new integration tests cover wedged-socket timeout, bounded reload retries, and reconnect behavior.

Changes

Command Total Deadline

| Layer / File(s) | Summary |
| --- | --- |
| Config field and connection timeout parameters (Server/src/core/config.py, Server/src/transport/legacy/unity_connection.py) | Adds command_total_timeout: float = 90.0 to ServerConfig and lets UnityConnection.connect accept an optional timeout that defaults from config. |
| Deadline-aware send_command and retry flow (Server/src/transport/legacy/unity_connection.py) | Adds deadline capping for connect, receive, and backoff timeouts in send_command; resets reload-detected sockets under _io_lock; and passes a shared deadline through send_command_with_retry until it expires. |
| Integration tests for wedged socket and reload handling (Server/tests/integration/test_connection_deadline.py) | Adds a silent TCP bridge fixture plus three tests that cover deadline-bounded timeout, reload-bounded retry, and socket reset on reload. |
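
As a small illustration of how a config default can turn into an absolute deadline, here is a sketch. Only the `command_total_timeout: float = 90.0` field is taken from the summary above; modelling `ServerConfig` as a dataclass and the `deadline_for` helper are assumptions made for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class ServerConfig:
    # Default taken from the PR summary; the real class has other fields not shown here.
    command_total_timeout: float = 90.0


def deadline_for(config, override=None, clock=time.monotonic):
    """Return an absolute monotonic deadline; `override` replaces the config default."""
    total = config.command_total_timeout if override is None else override
    return clock() + total
```

Using `time.monotonic()` rather than wall-clock time keeps the deadline immune to system clock adjustments.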

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • CoplayDev/unity-mcp#510: Modifies the same send_command_with_retry reload/retry path and response-reason handling that this PR now bounds with a shared deadline.

Suggested labels

safe-to-test, full-matrix

Poem

🐇 A deadline ticks in quiet light,
The socket naps no more tonight.
Reload winds brush the lockful gate,
And tests declare the timing straight.
Hop, hop—timeouts keep things bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 68.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main fix: preventing hangs by recovering from a stale reload socket. |
| Description check | ✅ Passed | The description follows the template and includes the required sections, change summary, testing, compatibility, and related issues. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
Server/src/transport/legacy/unity_connection.py (1)

268-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make params explicitly nullable.

Ruff reports RUF013 because params defaults to None while annotated as dict[str, Any]. Use dict[str, Any] | None to keep the new signature type-correct.

Proposed fix
-    def send_command(self, command_type: str, params: dict[str, Any] = None, max_attempts: int | None = None, deadline: float | None = None) -> dict[str, Any]:
+    def send_command(self, command_type: str, params: dict[str, Any] | None = None, max_attempts: int | None = None, deadline: float | None = None) -> dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Server/src/transport/legacy/unity_connection.py` at line 268, The
send_command method in UnityConnection has a nullable default for params but its
type annotation is not nullable, which triggers the Ruff RUF013 warning. Update
the send_command signature so params is explicitly typed as dict[str, Any] |
None, keeping the existing default of None and leaving the rest of the method
behavior unchanged.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Server/src/transport/legacy/unity_connection.py`:
- Around line 349-352: The deadline in unity_connection.py is only enforced at
attempt boundaries, so blocking connect/receive calls and retry sleeps can still
exceed the total budget. Update the timeout logic in the command path around the
deadline check and the blocking I/O/retry handling to compute the remaining time
from deadline - time.monotonic(), then cap socket timeouts and sleep durations
to that remaining budget. Make sure the relevant command flow in UnityConnection
(including the connect/receive retry loop) never blocks longer than the
configured command_total_timeout.
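
The capping described in this comment could be expressed with two small helpers. This is a sketch of the suggestion, not the code that landed in the PR; `remaining_budget` and `cap_timeout` are hypothetical names.

```python
import time


def remaining_budget(deadline, clock=time.monotonic):
    """Seconds left before `deadline`, or None when no deadline is set."""
    if deadline is None:
        return None
    return max(deadline - clock(), 0.0)


def cap_timeout(timeout, deadline, clock=time.monotonic):
    """Return `timeout` shortened so it cannot run past `deadline`."""
    remaining = remaining_budget(deadline, clock)
    if remaining is None:
        return timeout
    return min(timeout, remaining)
```

A caller would pass the result to `sock.settimeout(...)` or `time.sleep(...)`. One caveat: a capped value of 0 puts a socket into non-blocking mode, so the exhausted-budget case should be checked and raised before calling `settimeout`.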


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2bdfee40-47a9-487b-873e-81b797828bf4

📥 Commits

Reviewing files that changed from the base of the PR and between dccecd6 and de43399.

📒 Files selected for processing (3)
  • Server/src/core/config.py
  • Server/src/transport/legacy/unity_connection.py
  • Server/tests/integration/test_connection_deadline.py

Comment thread Server/src/transport/legacy/unity_connection.py

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
Server/src/transport/legacy/unity_connection.py (1)

392-402: 🩺 Stability & Availability | 🔵 Trivial | 🏗️ Heavy lift

Thread the shared deadline through _read_exact
receive_full_response() can make many blocking recv() calls under one socket timeout. A slow peer that keeps dribbling bytes can keep the command alive past command_total_timeout, so the “hard ceiling” only applies to the first read. Pass the deadline into _read_exact and cap/check before each recv().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Server/src/transport/legacy/unity_connection.py` around lines 392 - 402, The
shared command deadline is only being applied before the first socket read in
receive_full_response() / _read_exact, so a slow peer can still extend past the
total timeout. Thread the deadline into _read_exact and, using the existing
helpers in UnityConnection, cap the recv timeout and re-check the deadline
before each recv() loop iteration so every blocking read respects the same hard
ceiling.
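
For reference, a deadline-aware exact read along the lines of this suggestion might look like the sketch below. It is a standalone function written for illustration, assuming length-known reads; it is not the `_read_exact` method in unity_connection.py, whose signature is not shown in this thread.

```python
import socket
import time


def read_exact(sock, count, deadline, clock=time.monotonic):
    """Read exactly `count` bytes, re-checking one shared deadline before every recv()."""
    chunks = []
    received = 0
    while received < count:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("command deadline exceeded while reading")
        sock.settimeout(remaining)  # each recv() may use only what is left of the budget
        chunk = sock.recv(count - received)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)
```

With this shape, a peer that dribbles one byte at a time cannot extend the call past the deadline, because the timeout shrinks on every loop iteration.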

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b3d9831b-306d-47d5-a671-bbadd92e4d85

📥 Commits

Reviewing files that changed from the base of the PR and between de43399 and eb91b98.

📒 Files selected for processing (2)
  • Server/src/transport/legacy/unity_connection.py
  • Server/tests/integration/test_connection_deadline.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • Server/tests/integration/test_connection_deadline.py

@mertakdut


> 🧹 Nitpick comments (1)
>
> Server/src/transport/legacy/unity_connection.py (1)
>
> 392-402: 🩺 Stability & Availability | 🔵 Trivial | 🏗️ Heavy lift
>
> Thread the shared deadline through _read_exact
>
> receive_full_response() can make many blocking recv() calls under one socket timeout. A slow peer that keeps dribbling bytes can keep the command alive past command_total_timeout, so the “hard ceiling” only applies to the first read. Pass the deadline into _read_exact and cap/check before each recv().
>
> 🤖 Prompt for AI Agents
>
> Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
>
> In `@Server/src/transport/legacy/unity_connection.py` around lines 392 - 402, The shared command deadline is only being applied before the first socket read in receive_full_response() / _read_exact, so a slow peer can still extend past the total timeout. Thread the deadline into _read_exact and, using the existing helpers in UnityConnection, cap the recv timeout and re-check the deadline before each recv() loop iteration so every blocking read respects the same hard ceiling.

@coderabbitai true in principle, but it can't really happen here. The peer is the Unity bridge on 127.0.0.1 and it sends each response in a single sendall, so over loopback _read_exact reads it in a couple of passes and never gets close to the timeout. The case that actually matters (socket wedged, no bytes at all) is already capped on the first recv, and a slow-but-alive Unity is covered by the heartbeat window. I'll leave _read_exact as-is for now, happy to revisit if this ever runs over a real network.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026


@mertakdut Good clarification — agreed. Given the current transport contract in Server/src/transport/legacy/unity_connection.py (Unity bridge on 127.0.0.1, responses written via a single sendall, and heartbeat handling covering the slow-but-alive case), my _read_exact deadline concern isn’t material for this PR. I’m withdrawing that nit.

(ᵕ—ᴗ—)


✏️ Learnings added
Learnt from: mertakdut
URL: https://github.com/CoplayDev/unity-mcp/pull/1217

Timestamp: 2026-06-27T17:43:57.674Z
Learning: In `Server/src/transport/legacy/unity_connection.py`, the peer is the Unity bridge on `127.0.0.1`, and it sends each response in a single `sendall`. Because this transport is loopback-only, `_read_exact` completes in only a few passes and the practically relevant timeout case is a wedged socket with no bytes arriving, which is already capped on the first `recv`; slow-but-alive Unity behavior is instead governed by the heartbeat window.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@mertakdut mertakdut marked this pull request as draft June 27, 2026 18:48
@mertakdut mertakdut marked this pull request as ready for review June 27, 2026 20:07
@fs-paulrahme


> 🧹 Nitpick comments (1)
>
> Server/src/transport/legacy/unity_connection.py (1)
>
> 392-402: 🩺 Stability & Availability | 🔵 Trivial | 🏗️ Heavy lift
>
> Thread the shared deadline through _read_exact
>
> receive_full_response() can make many blocking recv() calls under one socket timeout. A slow peer that keeps dribbling bytes can keep the command alive past command_total_timeout, so the “hard ceiling” only applies to the first read. Pass the deadline into _read_exact and cap/check before each recv().
>
> 🤖 Prompt for AI Agents
>
> Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.
>
> In `@Server/src/transport/legacy/unity_connection.py` around lines 392 - 402, The shared command deadline is only being applied before the first socket read in receive_full_response() / _read_exact, so a slow peer can still extend past the total timeout. Thread the deadline into _read_exact and, using the existing helpers in UnityConnection, cap the recv timeout and re-check the deadline before each recv() loop iteration so every blocking read respects the same hard ceiling.
>
> @coderabbitai true in principle, but it can't really happen here. The peer is the Unity bridge on 127.0.0.1 and it sends each response in a single sendall, so over loopback _read_exact reads it in a couple of passes and never gets close to the timeout. The case that actually matters (socket wedged, no bytes at all) is already capped on the first recv, and a slow-but-alive Unity is covered by the heartbeat window. I'll leave _read_exact as-is for now, happy to revisit if this ever runs over a real network.

We've been chasing intermittent infinite "Reloading Domain" hangs (sometimes on script recompile, sometimes on play-mode entry). We've been analyzing dumps of the not-responding Unity.exe, with an ongoing Claude session over the last 2-3 weeks to help with pattern matching and narrowing it down. I've re-edited most of this response myself, so I hope it doesn't still read too AI-generated, and apologies if you're already further along than us 😄 Just sharing in case it surfaces any further insight...

Our .exe dumps consistently show the main thread stuck in mono_domain_try_unload, waiting on a thread that never exits — and the "Thread Pool I/O Selector" thread blocked in ws2_32!select (via mono_poll) on a 127.0.0.1 bridge socket. That's Unity's documented domain-reload-hang cause: a thread in a blocking socket call can't be terminated during reload, so mono_domain_try_unload never completes.

That implies the editor-side bridge's pending socket receive isn't cancelled/closed before the domain unloads.

The Python-side recovery can help work around this in the meantime, but it looks like the root cause lies in the editor code. The fix there would be to close the bridge socket and stop/join its receive thread on AssemblyReloadEvents.beforeAssemblyReload (and EditorApplication.quitting), so the reload completes cleanly. This PR's client-side recovery handles the client holding a dead socket; the editor still needs to tear down its end on reload to avoid the hard hang.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026


@fs-paulrahme This is extremely helpful context — thank you. Your dump analysis makes the distinction much clearer: this PR addresses the Python client’s stale/dead-socket recovery path, but the underlying infinite reload hang appears to be on the Unity editor side when a blocking bridge receive survives into domain unload.

The mono_domain_try_unload + mono_poll/ws2_32!select pattern on the loopback bridge socket is a strong signal that the editor bridge needs explicit shutdown during reload/quit. I agree the right fix there is to tear down the bridge on AssemblyReloadEvents.beforeAssemblyReload and EditorApplication.quitting, including closing the socket and stopping/joining the receive thread so unload can complete cleanly.

I’ve noted this for future review context around the Unity bridge/reload path.
(•ㅅ•)ゞ


✏️ Learnings added
Learnt from: fs-paulrahme
URL: https://github.com/CoplayDev/unity-mcp/pull/1217

Timestamp: 2026-06-29T10:21:46.871Z
Learning: In the `CoplayDev/unity-mcp` Unity editor bridge, intermittent infinite "Reloading Domain" hangs correlate with `Unity.exe` dumps where the main thread is blocked in `mono_domain_try_unload` while the "Thread Pool I/O Selector" thread is blocked in `ws2_32!select` via `mono_poll` on the `127.0.0.1` bridge socket. This indicates the editor-side bridge is leaving a blocking socket receive alive during domain unload, preventing the reload from completing.

Learnt from: fs-paulrahme
URL: https://github.com/CoplayDev/unity-mcp/pull/1217

Timestamp: 2026-06-29T10:21:46.871Z
Learning: For `CoplayDev/unity-mcp`, the Python-side fix in PR `#1217` is a client-side mitigation for stale/dead sockets after Unity reloads, but the underlying Unity-side hard hang requires explicit bridge teardown in the Unity C# editor code on `AssemblyReloadEvents.beforeAssemblyReload` and `EditorApplication.quitting`: close the bridge socket and stop/join the receive thread before domain unload.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.



2 participants