
Docker sandbox hangs on session.write()/apply_manifest() over a TLS DOCKER_HOST (DinD, remote daemon) #3718

Description

@imran31415

Summary

agents.sandbox file materialization deadlocks whenever the Docker daemon is reached over TLS, e.g. a Docker-in-Docker sidecar or a remote DOCKER_HOST=tcp://…:2376 with DOCKER_TLS_VERIFY=1. session.write() (and therefore apply_manifest() during workspace setup) never returns.

Root cause

DockerSandboxSession._stream_into_exec (src/agents/sandbox/sandboxes/docker.py) writes the payload into a docker exec running tar -x / cat reading from stdin, then signals end-of-input with:

try:
    if hasattr(raw_sock, "shutdown"):
        raw_sock.shutdown(socket.SHUT_WR)
    ...
except Exception:
    pass

Over a TLS transport, a half-close on the raw socket does not deliver a clean stdin-EOF to the container: no TLS close_notify is sent, and any failure of the attempt is silently swallowed by the except Exception: pass. The chain of blocking is then: the in-container tar -x / cat waits forever for input that never terminates, so the exec never exits, so the daemon never closes the hijacked stream, so the client's drain loop (while raw_sock.recv(...)) blocks indefinitely.

This is not hypothetical; it reproduces reliably against:

  • a Docker-in-Docker sidecar exposing only TLS on :2376 (common in CI / Kubernetes dev environments), and
  • any remote DOCKER_HOST reached over TLS.

Over a unix socket the half-close works, which is why local runs don't hit it.
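
For reference, the half-close behaviour that the current code relies on can be sketched with a plain socket pair, outside Docker entirely. This is only an illustration of the mechanism, not the SDK's code: `consume_then_reply` and `half_close_roundtrip` are hypothetical names, and the peer thread stands in for the in-container tar -x / cat. The peer only replies once it sees EOF on its read side, which is what shutdown(SHUT_WR) delivers on a non-TLS socket.

```python
import socket
import threading


def consume_then_reply(sock: socket.socket) -> None:
    """Stand-in for the in-container reader: read until EOF, then reply."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # b"" means the peer half-closed its write side
            break
        chunks.append(chunk)
    sock.sendall(b"got " + b"".join(chunks))
    sock.close()  # closing lets the client's drain loop terminate


def half_close_roundtrip(payload: bytes) -> bytes:
    """Send payload, half-close, then drain the reply until EOF."""
    client, peer = socket.socketpair()
    client.settimeout(5)  # fail loudly instead of hanging if EOF never arrives
    worker = threading.Thread(target=consume_then_reply, args=(peer,))
    worker.start()
    try:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)  # signal end-of-input, keep reading
        reply = []
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            reply.append(chunk)
        return b"".join(reply)
    finally:
        client.close()
        worker.join(timeout=5)
```

If the shutdown call were skipped or had no effect on the peer, `consume_then_reply` would never leave its read loop and the drain loop would only end via the timeout, which mirrors the hang described above.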

Why not put_archive()

The obvious "use docker cp" fix is explicitly avoided in this file (see the comments in read()/write()): with volume-driver-backed mounts attached, daemon archive operations can re-run volume mount setup and some plugins reject the duplicate Mount call for the same container id. So the fix should keep the exec+stdin approach.

Proposed fix

Make the in-container reader terminate on a byte count instead of a stdin half-close: measure the payload length and pipe stdin through head -c <n> into the real command:

payload_length, stream = _measure_stream(stream)
framed_cmd = ["sh", "-c", 'n=$1; shift; head -c "$n" | "$@"', "sh", str(payload_length), *cmd]

head -c <n> stops after exactly <n> bytes and closes its stdout, so the downstream tar/cat gets EOF from the pipe regardless of whether the exec-stdin half-close is ever delivered. This works identically over unix sockets, TLS TCP, and DinD, and keeps the deliberate avoidance of put_archive().
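
The framing can be tried locally without Docker. The sketch below is an assumption-laden illustration, not the proposed patch: `measure_stream`, `frame_command`, and `run_framed` are hypothetical helpers (the real `_measure_stream` may behave differently), and a local subprocess stands in for the docker exec. The point it demonstrates is that the command finishes even though the writer never closes stdin.

```python
import io
import subprocess
from typing import BinaryIO, List, Tuple


def measure_stream(stream: BinaryIO) -> Tuple[int, BinaryIO]:
    """Return (remaining byte count, stream positioned where it started).

    Hypothetical stand-in for _measure_stream; non-seekable streams are
    buffered in memory, which is only reasonable for small payloads.
    """
    if stream.seekable():
        start = stream.tell()
        end = stream.seek(0, io.SEEK_END)
        stream.seek(start)
        return end - start, stream
    data = stream.read()
    return len(data), io.BytesIO(data)


def frame_command(payload_length: int, cmd: List[str]) -> List[str]:
    """Wrap cmd so it only ever sees the first payload_length bytes of stdin."""
    script = 'n=$1; shift; head -c "$n" | "$@"'
    return ["sh", "-c", script, "sh", str(payload_length), *cmd]


def run_framed(stream: BinaryIO, cmd: List[str]) -> bytes:
    """Run cmd with the payload on stdin, deliberately never closing stdin."""
    length, stream = measure_stream(stream)
    proc = subprocess.Popen(
        frame_command(length, cmd),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        proc.stdin.write(stream.read())
        proc.stdin.flush()
        # No stdin close here: head -c exits on the byte count alone.
        # Small payloads only; large output would need concurrent reads.
        proc.wait(timeout=10)
        return proc.stdout.read()
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.stdin.close()
        proc.stdout.close()
```

For example, `run_framed(io.BytesIO(b"hello"), ["cat"])` returns once `head` has consumed five bytes. Running `cat` unframed with stdin left open would instead hit the timeout.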

Repro (minimal)

import io
from pathlib import Path

# DOCKER_HOST=tcp://<tls-daemon>:2376, DOCKER_TLS_VERIFY=1, DOCKER_CERT_PATH=...
session = await <bring up a DockerSandboxSession>
await session.write(Path("/workspace/x"), io.BytesIO(b"hello"))  # hangs forever

Environment

  • openai-agents (reproduced on main @ current, and on 0.14.6 as pinned by downstream strix-agent)
  • Docker daemon reached via TLS (DOCKER_HOST=tcp://…:2376, DOCKER_TLS_VERIFY=1)
