Summary
agents.sandbox file materialization deadlocks whenever the Docker daemon is reached over TLS, e.g. a Docker-in-Docker sidecar or a remote DOCKER_HOST=tcp://…:2376 with DOCKER_TLS_VERIFY=1. session.write() (and therefore apply_manifest() during workspace setup) hangs and never returns.
Root cause
DockerSandboxSession._stream_into_exec (src/agents/sandbox/sandboxes/docker.py) writes the payload into a docker exec running tar -x / cat reading from stdin, then signals end-of-input with:
try:
    if hasattr(raw_sock, "shutdown"):
        raw_sock.shutdown(socket.SHUT_WR)
    ...
except Exception:
    pass
Over a TLS transport, a half-close on the raw socket does not deliver a clean stdin-EOF to the container (no TLS close_notify is sent, and any failure of the attempt is silently swallowed by the except Exception: pass). The in-container tar -x / cat therefore blocks forever waiting for an end-of-input that never arrives, the exec never exits, the daemon never closes the hijacked stream, and the client's drain loop (while raw_sock.recv(...)) blocks indefinitely.
This is not hypothetical — it reproduces reliably against:
- a Docker-in-Docker sidecar exposing only TLS on :2376 (common in CI / Kubernetes dev environments), and
- any remote DOCKER_HOST reached over TLS.
Over a unix socket the half-close works, which is why local runs don't hit it.
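For contrast, the working local case can be illustrated without Docker at all. The following is a minimal sketch, not code from the repository: it uses socket.socketpair() (a local socket pair) and a hypothetical helper name, half_close_delivers_eof, to show that SHUT_WR on a plain socket makes the reader see EOF. It says nothing about TLS beyond what is described above.

```python
import socket


def half_close_delivers_eof(payload: bytes) -> bytes:
    """Send payload over a local socket pair, half-close the writer, read to EOF.

    Hypothetical demo helper; only suitable for small payloads that fit in the
    socket buffer, because the write happens before any read.
    """
    writer, reader = socket.socketpair()
    try:
        writer.sendall(payload)
        # Half-close: the writer promises to send nothing more, but could still receive.
        writer.shutdown(socket.SHUT_WR)

        reader.settimeout(5.0)  # guard so the demo fails instead of hanging
        chunks = []
        while True:
            chunk = reader.recv(4096)
            if not chunk:  # an empty read means the peer's write side is closed (EOF)
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        writer.close()
        reader.close()
```

On a plain local socket the read loop ends as soon as the half-close is seen, which is the behaviour the current end-of-input signalling relies on.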
Why not put_archive()
The obvious "use docker cp" fix is explicitly avoided in this file (see the comments in read()/write()): with volume-driver-backed mounts attached, daemon archive operations can re-run volume mount setup and some plugins reject the duplicate Mount call for the same container id. So the fix should keep the exec+stdin approach.
Proposed fix
Make the in-container reader terminate on a byte count instead of a stdin half-close: measure the payload length and pipe the real command through head -c <n>:
payload_length, stream = _measure_stream(stream)
framed_cmd = ["sh", "-c", 'n=$1; shift; head -c "$n" | "$@"', "sh", str(payload_length), *cmd]
head -c <n> stops after exactly <n> bytes and closes its stdout, so the downstream tar/cat gets EOF from the pipe regardless of whether the exec-stdin half-close is ever delivered. This works identically over unix sockets, TLS TCP, and DinD, and keeps the deliberate avoidance of put_archive().
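The byte-count framing can be tried locally without a Docker daemon. The sketch below is an illustration under stated assumptions, not the actual patch: measure_stream is a hypothetical stand-in for _measure_stream (it simply buffers the stream in memory), frame_command and run_framed_without_eof are made-up helper names, and it assumes sh, head -c, and cat are available on the host. The key point is that stdin is deliberately left open, mimicking a half-close that never arrives.

```python
import io
import subprocess
from typing import BinaryIO, List, Tuple


def measure_stream(stream: BinaryIO) -> Tuple[int, BinaryIO]:
    """Hypothetical stand-in for _measure_stream: buffer and return (length, rewound copy)."""
    data = stream.read()
    return len(data), io.BytesIO(data)


def frame_command(payload_length: int, cmd: List[str]) -> List[str]:
    """Wrap cmd so that it reads exactly payload_length bytes via head -c."""
    return ["sh", "-c", 'n=$1; shift; head -c "$n" | "$@"', "sh", str(payload_length), *cmd]


def run_framed_without_eof(payload: bytes, cmd: List[str], timeout: float = 10.0) -> bytes:
    """Run the framed command, write the payload, and never signal EOF on stdin.

    Demo only: waits for exit before reading stdout, so output must stay small
    enough to fit in the pipe buffer.
    """
    length, stream = measure_stream(io.BytesIO(payload))
    proc = subprocess.Popen(
        frame_command(length, cmd), stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    try:
        proc.stdin.write(stream.read())
        proc.stdin.flush()
        # stdin is intentionally NOT closed here; head -c must end the pipeline itself.
        proc.wait(timeout=timeout)
        return proc.stdout.read()
    finally:
        if proc.poll() is None:  # still running after a timeout or error
            proc.kill()
            proc.wait()
        proc.stdin.close()
        proc.stdout.close()
```

Calling run_framed_without_eof(b"hello", ["cat"]) returns once head has consumed the five bytes, even though the writer never closes its end. Running a bare cat the same way would wait until the timeout.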
Repro (minimal)
# DOCKER_HOST=tcp://<tls-daemon>:2376, DOCKER_TLS_VERIFY=1, DOCKER_CERT_PATH=...
import io
from pathlib import Path

session = await <bring up a DockerSandboxSession>
await session.write(Path("/workspace/x"), io.BytesIO(b"hello"))  # hangs forever
Environment
- openai-agents (reproduced on main @ current, and on 0.14.6 as pinned by downstream strix-agent)
- Docker daemon reached via TLS (DOCKER_HOST=tcp://…:2376, DOCKER_TLS_VERIFY=1)