test(e2e): verify opensandbox runtime with codex via CI sidecar#12

Open
zpzjzj wants to merge 4 commits into main from dreamy-noether-ca1913

Collaborator

@zpzjzj zpzjzj commented May 15, 2026

Summary

Adds executed CI coverage for the OpenSandbox runtime. Until now the
opensandbox path never ran in CI — the existing codex opensandbox test
always skipped for lack of an external sandbox service. This PR adds a CI
job that self-hosts the OpenSandbox server on the runner, so the path is
exercised on every PR.

The opensandbox job runs the codex agent rather than claude_code.
claude_code already has real-model coverage in the none-runtime e2e job,
whereas codex is otherwise exercised only against a fake binary. This job is
therefore the only real-model coverage for codex, and it brings one more
agent under real e2e testing.

Changes

  • .github/workflows/ci.yml: new e2e-opensandbox job. Installs
    opensandbox-server via uv, starts it as a background process (Docker
    runtime, host network, host docker.sock), health-checks it, then runs
    TestAgent_Codex_OpenSandboxRuntime against http://127.0.0.1:8080.
  • e2e/agent_test.go: new TestAgent_ClaudeCode_OpenSandboxRuntime
    (mirrors the codex test; runnable locally / in future CI) and an
    openSandboxE2EImage() helper that reads OPENSANDBOX_IMAGE.
  • Both opensandbox tests preserve the in-sandbox agent workspace as a CI
    artifact for post-mortem when execution fails inside the sandbox.

Test plan

  • make test / make verify pass
  • go test -tags e2e -run OpenSandbox ./e2e passes locally (skips
    cleanly without OPENSANDBOX_API_KEY)
  • New e2e-opensandbox CI job is green on this PR
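
The "skips cleanly" behaviour can be sketched as a small environment gate. This is an illustration only: missingEnv is a hypothetical name, not a helper from e2e/agent_test.go, and the exact set of variables the real tests check is not shown in this description.

```go
package main

import (
	"fmt"
	"os"
)

// missingEnv returns the names in keys that are unset or empty according
// to lookup. Passing the lookup function in keeps the logic testable
// without touching the real process environment.
func missingEnv(lookup func(string) (string, bool), keys ...string) []string {
	var missing []string
	for _, k := range keys {
		if v, ok := lookup(k); !ok || v == "" {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// OPENSANDBOX_API_KEY is the variable named in the test plan.
	if m := missingEnv(os.LookupEnv, "OPENSANDBOX_API_KEY"); len(m) > 0 {
		fmt.Println("would skip; missing:", m)
		return
	}
	fmt.Println("would run")
}
```

Inside a test, the same result would typically feed t.Skipf, so the run reports a skip with the missing variable names rather than a failure.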

Notes for reviewers

  • The sandbox image is node:22 because skill-up bootstraps the agent CLI
    inside the sandbox (nvm/node/npm install), which needs curl/git/node.
    Bare ubuntu:latest lacks these. Override via OPENSANDBOX_IMAGE.
  • The OpenSandbox server is self-hosted, so no OPENSANDBOX_* secret is
    needed — but the agent still calls a real model, so the job gates on
    DASHSCOPE_API_KEY and skips on fork PRs (same pattern as the e2e job).
  • execd_image is pinned to opensandbox/execd:v1.0.16; bump if the
    server log reports an execd compatibility error.
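
The image override described in the first note amounts to an environment lookup with a fallback. The sketch below assumes node:22 is the fallback when OPENSANDBOX_IMAGE is unset. The body of the PR's openSandboxE2EImage() helper is not shown here, so sandboxImage and defaultSandboxImage are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"os"
)

// defaultSandboxImage is the image named in the reviewer notes.
const defaultSandboxImage = "node:22"

// sandboxImage returns the OPENSANDBOX_IMAGE override if set, otherwise the
// default. getenv is injected so callers can test without real env vars.
// Hypothetical stand-in for the PR's openSandboxE2EImage() helper.
func sandboxImage(getenv func(string) string) string {
	if img := getenv("OPENSANDBOX_IMAGE"); img != "" {
		return img
	}
	return defaultSandboxImage
}

func main() {
	fmt.Println(sandboxImage(os.Getenv))
}
```

Any override image would presumably still need curl, git and node for the in-sandbox bootstrap to succeed.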

🤖 Generated with Claude Code

zpzjzj and others added 4 commits May 15, 2026 16:46
Add an end-to-end test that exercises the claude_code engine against a real
OpenSandbox runtime, and a CI job that self-hosts the OpenSandbox server on
the runner so the opensandbox path is verified on every PR instead of always
skipping for lack of an external service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

astral-sh/setup-uv has no floating v8 major tag, so @v8 fails to resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The opensandbox runtime bootstraps the claude CLI inside the sandbox, so
skipIfClaudeUnavailable wrongly skipped the test when the runner lacked a
host claude binary. The codex opensandbox test already omits this check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude_code already has real-model coverage in the none-runtime e2e job,
whereas codex is otherwise only exercised against a fake binary. Running
codex here makes the opensandbox job the sole real codex coverage and
covers one more agent overall.

Also preserve the in-sandbox agent workspace as a CI artifact so failures
inside the sandbox (agent bootstrap, model calls) are debuggable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zpzjzj zpzjzj changed the title from "test(e2e): verify opensandbox runtime with claude_code via CI sidecar" to "test(e2e): verify opensandbox runtime with codex via CI sidecar" May 15, 2026
@zpzjzj zpzjzj requested review from jwx0925, lbfsc and lijunfeng722 May 15, 2026 10:20