fix(stdio): recover from a reload-orphaned socket instead of hanging#1217
fix(stdio): recover from a reload-orphaned socket instead of hanging#1217mertakdut wants to merge 3 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a ChangesCommand Total Deadline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Server/src/transport/legacy/unity_connection.py (1)
268-268: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake
paramsexplicitly nullable.Ruff reports RUF013 because
paramsdefaults toNonewhile annotated asdict[str, Any]. Usedict[str, Any] | Noneto keep the new signature type-correct.Proposed fix
- def send_command(self, command_type: str, params: dict[str, Any] = None, max_attempts: int | None = None, deadline: float | None = None) -> dict[str, Any]: + def send_command(self, command_type: str, params: dict[str, Any] | None = None, max_attempts: int | None = None, deadline: float | None = None) -> dict[str, Any]:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Server/src/transport/legacy/unity_connection.py` at line 268, The send_command method in UnityConnection has a nullable default for params but its type annotation is not nullable, which triggers the Ruff RUF013 warning. Update the send_command signature so params is explicitly typed as dict[str, Any] | None, keeping the existing default of None and leaving the rest of the method behavior unchanged.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Server/src/transport/legacy/unity_connection.py`:
- Around line 349-352: The deadline in unity_connection.py is only enforced at
attempt boundaries, so blocking connect/receive calls and retry sleeps can still
exceed the total budget. Update the timeout logic in the command path around the
deadline check and the blocking I/O/retry handling to compute the remaining time
from deadline - time.monotonic(), then cap socket timeouts and sleep durations
to that remaining budget. Make sure the relevant command flow in UnityConnection
(including the connect/receive retry loop) never blocks longer than the
configured command_total_timeout.
---
Nitpick comments:
In `@Server/src/transport/legacy/unity_connection.py`:
- Line 268: The send_command method in UnityConnection has a nullable default
for params but its type annotation is not nullable, which triggers the Ruff
RUF013 warning. Update the send_command signature so params is explicitly typed
as dict[str, Any] | None, keeping the existing default of None and leaving the
rest of the method behavior unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2bdfee40-47a9-487b-873e-81b797828bf4
📒 Files selected for processing (3)
Server/src/core/config.pyServer/src/transport/legacy/unity_connection.pyServer/tests/integration/test_connection_deadline.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
Server/src/transport/legacy/unity_connection.py (1)
392-402: 🩺 Stability & Availability | 🔵 Trivial | 🏗️ Heavy liftThread the shared deadline through
_read_exact
receive_full_response()can make many blockingrecv()calls under one socket timeout. A slow peer that keeps dribbling bytes can keep the command alive pastcommand_total_timeout, so the “hard ceiling” only applies to the first read. Pass the deadline into_read_exactand cap/check before eachrecv().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Server/src/transport/legacy/unity_connection.py` around lines 392 - 402, The shared command deadline is only being applied before the first socket read in receive_full_response() / _read_exact, so a slow peer can still extend past the total timeout. Thread the deadline into _read_exact and, using the existing helpers in UnityConnection, cap the recv timeout and re-check the deadline before each recv() loop iteration so every blocking read respects the same hard ceiling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@Server/src/transport/legacy/unity_connection.py`:
- Around line 392-402: The shared command deadline is only being applied before
the first socket read in receive_full_response() / _read_exact, so a slow peer
can still extend past the total timeout. Thread the deadline into _read_exact
and, using the existing helpers in UnityConnection, cap the recv timeout and
re-check the deadline before each recv() loop iteration so every blocking read
respects the same hard ceiling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b3d9831b-306d-47d5-a671-bbadd92e4d85
📒 Files selected for processing (2)
Server/src/transport/legacy/unity_connection.pyServer/tests/integration/test_connection_deadline.py
🚧 Files skipped from review as they are similar to previous changes (1)
- Server/tests/integration/test_connection_deadline.py
@coderabbitai true in principle, but it can't really happen here. The peer is the Unity bridge on 127.0.0.1 and it sends each response in a single sendall, so over loopback _read_exact reads it in a couple of passes and never gets close to the timeout. The case that actually matters (socket wedged, no bytes at all) is already capped on the first recv, and a slow-but-alive Unity is covered by the heartbeat window. I'll leave _read_exact as-is for now, happy to revisit if this ever runs over a real network. |
|
Tip For best results, initiate chat on the files or code changes.
(ᵕ—ᴗ—) ✏️ Learnings added
|
We've been chasing intermittent infinite "Reloading Domain" hangs (sometimes script recompile, sometimes and play-mode entry). Dumps of the not-responding Our Thus the editor-side bridge's pending socket receive isn't cancelled/closed before the domain unloads. The Python-side recovery can help recover from / work around this in the meantime, but it looks like the cause goes deeper into editor code: on |
|
Tip For best results, initiate chat on the files or code changes.
The I’ve noted this for future review context around the Unity bridge/reload path. ✏️ Learnings added
|
Description
Every script recompile makes Unity do a domain reload, which tears down and rebuilds the bridge. The Unity process itself keeps running, so the OS never resets the old loopback connection, and the Python client is left holding a socket that still looks connected but is actually dead. The old code kept reusing that socket: each recv waited the full connection_timeout, and those waits piled up across retries with no overall limit, so a tool call could hang for several minutes. I ran into this with refresh_unity, run_tests, and get_test_job stuck for 10+ minutes at a time.
The fix does two things. When the client sees that Unity is reloading, it drops the stale socket so the next command reconnects to the fresh bridge instead of writing into the dead one. And it puts a single time limit on the whole command, so retries can't stack up forever.
Type of Change
Changes Made
Compatibility / Package Source
#beta,#main, tag, branch, orfile:): Python server run from local Server/ source (uvx --from ); the Unity C# package was unchanged (beta).Packages/packages-lock.json(if using a Git package URL): N/A -- this PR changes only the Python server; no Unity C# package change.Testing/Screenshots/Recordings
cd Server && uv run pytest tests/ -v)Documentation Updates
tools/UPDATE_DOCS_PROMPT.md(recommended)Related Issues
Relates to #891 (MCP gets stuck after a Unity reload until manually nudged) and #657 (more deterministic / bounded reload-wait in unity_connection.py). The same reload/test-boundary connection-drop pattern also affects the HTTP transport (e.g. #1207, #1164) via a different code path; this PR addresses the stdio client only.
Additional Notes
Summary by CodeRabbit
Summary by CodeRabbit
New Features
command_total_timeoutconfiguration to cap how long commands can run across retries.Bug Fixes
Tests