Closed
157 commits
084c93b
fix(installer): repair dev install package and service setup (#1252)
drew May 7, 2026
62619ee
fix(docker): use supervisor image entrypoint path (#1259)
drew May 8, 2026
8ab5ee8
fix(vm): harden compute driver socket (#1248)
drew May 8, 2026
52097f2
ci(release): run package release canaries (#1256)
drew May 8, 2026
645b880
feat(install): add rpm dev installer support (#1262)
drew May 8, 2026
1f35abb
feat(sandbox): add Kubernetes user namespace isolation (hostUsers: fa…
mrunalp May 8, 2026
a4efc0b
feat(server): add generate-certs subcommand; replace alpine PKI hook …
TaylorMutch May 8, 2026
b74d24b
fix(docs): constrain landing terminal height (#1269)
drew May 8, 2026
3cfc915
ci(os-132): remove stale remote buildx mode (#1267)
jtoelke2 May 8, 2026
1d3b741
feat(providers): support sandbox provider attach lifecycle (#1242)
johntmyers May 8, 2026
31f0345
ci(os-132): remove obsolete shadow workflows (#1273)
jtoelke2 May 8, 2026
daa2a36
fix(packaging): enable mTLS for local packages (#1271)
drew May 8, 2026
eec949d
fix(installer): stop forcing Homebrew VM driver (#1277)
drew May 8, 2026
316c788
fix(helm): derive grpcEndpoint from chart context (#1241)
TaylorMutch May 8, 2026
b8e8743
fix(e2e): isolate kubernetes user namespace test (#1276)
drew May 8, 2026
7ad823e
fix(install): register local gateway before probing listener (#1280)
drew May 8, 2026
4041798
fix(helm): derive sandboxNamespace from Release.Namespace instead of …
sauagarwa May 8, 2026
529be37
chore(installer): promote package install script (#1261)
drew May 9, 2026
1c79b21
feat: agent-driven policy management MVP (#1151)
zredlined May 9, 2026
af60d4e
docs: document OPENSHELL_SSH_HANDSHAKE_SECRET in Getting Started (#1287)
russellb May 9, 2026
57a80ed
fix(gateway): update Podman supervisor build task name (#1288)
russellb May 9, 2026
072f227
fix(installer): guard incompatible v0.0.37 upgrades (#1294)
drew May 9, 2026
8d83776
fix(docker): add SELinux labeling to bind mounts (#1291)
derekwaynecarr May 9, 2026
4350482
docs(readme): add roadmap and RFC issue guidance (#1284)
drew May 9, 2026
ca63841
docs(rfc): move policy management RFC to 0002 (#1283)
drew May 9, 2026
24cbaa1
feat(driver-kubernetes): disable service account token auto-mounting …
derekwaynecarr May 10, 2026
977be31
fix(docker): route VM-Docker runtimes through host-gateway (#1301)
laitingsheng May 11, 2026
dfd4768
(feat) early snap support (#1238)
zyga May 11, 2026
5c98604
feat(gpu): honor device IDs in Docker and Podman (#1253)
elezar May 11, 2026
6184d24
feat(k8s): support ImageVolumeSource for supervisor sideload (#1300)
mrunalp May 11, 2026
59475aa
ci(kubernetes): add kube gateway e2e tests and gated CI workflow (#1251)
TaylorMutch May 11, 2026
b9b8bc3
fix(driver-kubernetes): propagate log_level as OPENSHELL_LOG_LEVEL en…
mesutoezdil May 11, 2026
957daa0
docs(helm): document supervisor.sideloadMethod and sandboxNamespace d…
mesutoezdil May 11, 2026
764d930
fix(vm): use bash 3.2-safe empty array expansion in supervisor build …
benoitf May 11, 2026
b33bbd2
fix(vm): correct /sandbox ownership when rootfs is built by non-root …
benoitf May 11, 2026
3f0a058
docs(rfc): add gateway configuration file RFC (#951)
TaylorMutch May 11, 2026
5abc36c
feat(relay): route forwarding through ForwardTcp (#1029)
pimlock May 12, 2026
9ea94b6
fix(sandbox): rewrite messaging credential placeholders (#1286)
ericksoa May 12, 2026
df5a8b9
fix(providers): read opencode config file during credential discovery…
ericcurtin May 12, 2026
3b61c9c
feat(k8s): support nodeSelector and tolerations from platform_config …
Arnonrgo May 12, 2026
ba77967
refactor(docker): split gateway/supervisor Dockerfiles and use native…
TaylorMutch May 12, 2026
8322e4f
docs: style fixes (#1341)
miyoungc May 12, 2026
afcd3a9
fix(cli): use OS trust store for reqwest TLS verification (#1342)
sjenning May 12, 2026
2532687
fix(secret): Add custom derive Debug for SecretResolver to prevent se…
alangou May 13, 2026
0797fef
feat(gateway): add local-domain service routing (#1101)
pimlock May 13, 2026
5159ebc
fix(server): restrict SQLite database file permissions to 0o600 (#1359)
alangou May 13, 2026
96d909d
feat(ci): add helm-unittest mise task and CI step (#1367)
mesutoezdil May 13, 2026
ea2fddb
feat(policy): agent-driven policy management — the agent half (#1323)
zredlined May 13, 2026
0c8c723
fix(images): remove image-specific owner and mode set for gateway bin…
sjenning May 13, 2026
c99849b
fix(cli): cp-style sandbox download and workspace-boundary check (#1353)
laitingsheng May 14, 2026
bbfcac8
test(e2e): add bypass detection test for sandbox REJECT rules (#1368)
russellb May 14, 2026
52c7757
feat(helm): support custom CA for OIDC issuer TLS verification (#1373)
sjenning May 14, 2026
0471c6d
fix(gateway): keep vm driver opt-in (#1375)
drew May 14, 2026
f855c3d
feat(cli): add sandbox resource flags (#1376)
drew May 14, 2026
1c31764
docs: replace --sync with --upload . in sync-files example (#1366)
mesutoezdil May 14, 2026
6deb1f0
refactor(core): eliminate duplicate utilities across crates (#1381)
ericcurtin May 14, 2026
668c712
perf(build): speed up local CLI rebuilds (#1387)
johntmyers May 14, 2026
94025d8
fix(server): downgrade expected connection teardown errors to debug (…
russellb May 14, 2026
7a0c444
refactor!(auth): drop SSH handshake secret (#1274)
TaylorMutch May 14, 2026
0dee90a
fix(vm): make /sandbox chown non-fatal for virtiofs rootless hosts (#…
russellb May 14, 2026
f5b546e
Revert "perf(build): speed up local CLI rebuilds (#1387)" (#1395)
drew May 15, 2026
f58a434
fix(installer): dump gateway logs on startup timeout (#1396)
drew May 15, 2026
c27dd88
fix(vm): enable NFT_LOG kernel module for nftables bypass detection (…
russellb May 15, 2026
44e843e
feat(vm): fall back to Podman socket when Docker is unavailable (#1370)
russellb May 15, 2026
9a7c0df
fix(sandbox): remove DNS resolution from mechanistic mapper to preven…
russellb May 15, 2026
9f8edb5
docs(installation): add container gateway page with docker run and co…
ericcurtin May 15, 2026
c94cddb
feat(server): separate HTTPS from mTLS authentication (#1351)
sjenning May 15, 2026
590acde
fix(vm): collapse nested if blocks in container engine connect (#1406)
TaylorMutch May 15, 2026
c8bf387
feat(tui): add OIDC authentication support (#1405)
sjenning May 15, 2026
f672f75
chore: remove SSH handshake secret residuals and fix agent memory (#1…
maxamillion May 15, 2026
a1fb9bd
fix(scripts): replace mapfile with bash 3.2-compatible read loop (#1334)
benoitf May 15, 2026
63bdcd1
fix(sandbox): exempt host gateway from SSRF block for rootless Podman…
maxamillion May 15, 2026
442b0b6
feat(exec): add bidirectional streaming for interactive TTY sessions …
benoitf May 15, 2026
283defd
fix(sandbox): allow HEAD where GET is permitted in L7 policy (#1382)
mesutoezdil May 15, 2026
b61a98d
feat(gateway): add TOML configuration file (RFC 0003) (#1317)
TaylorMutch May 15, 2026
910d3f0
feat(vm): boot sandboxes from ext4 root disks (#1263)
drew May 15, 2026
403c754
fix(ci): skip helm plugin verification in CI image (#1411)
drew May 15, 2026
f819f7d
fix(vm): restore sandboxes after gateway restart (#1407)
drew May 15, 2026
09bd8a9
fix(vm): preserve guest TLS hostname (#1416)
drew May 17, 2026
0cbd2d6
feat(cli): add -o json/yaml output format to sandbox list (#1422)
benoitf May 18, 2026
71209e6
feat(rpm): replace init-pki.sh with openshell-gateway generate-certs …
maxamillion May 18, 2026
b4c7bc4
fix(sandbox): stabilize forked socket owner test (#1417)
derekwaynecarr May 18, 2026
555680c
fix(docker): fall back to host arch for local builds (#1420)
elezar May 18, 2026
7f16d60
feat(persistence): implement optimistic concurrency control with CAS …
derekwaynecarr May 18, 2026
f257ed0
refactor(packaging): rely on gateway runtime defaults (#1415)
drew May 18, 2026
dbba580
fix(security): refresh CI and gateway image dependencies (#1432)
johntmyers May 18, 2026
a54758c
test(sandbox): cover inference stream truncation errors (#1418)
mjamiv May 18, 2026
3cd238a
feat(e2e): enable mTLS for Podman compute driver (#1430)
russellb May 18, 2026
a7cd160
docs(helm): add chart readme generation (#1437)
TaylorMutch May 18, 2026
702bb56
ci: extend artifact attestations to all release binaries (#1398)
mesutoezdil May 18, 2026
c5d1d76
refactor(sandbox): replace iptables with nftables for network policy …
russellb May 18, 2026
436c59a
fix(rpm): restore 0.0.0.0 bind address for Podman via default gateway…
maxamillion May 18, 2026
65a3a7c
test(e2e): close Podman driver test coverage gaps (#1439)
russellb May 19, 2026
d620d65
feat(sandbox): inject DENO_CERT into sandbox child environment (#1441)
theFong May 19, 2026
04a39ca
fix(build): add z3 include path for RHEL/Fedora bindgen compatibility…
russellb May 19, 2026
f9435b4
chore(ci): pin all GitHub Actions to SHA digests (#1233)
fcanogab May 19, 2026
d255cdd
feat(providers): add credential refresh foundation (#1349)
johntmyers May 19, 2026
c527341
feat(k8s): make default workspace PVC storage size configurable (#1436)
sjenning May 19, 2026
10af3e6
refactor: deduplicate shared test helpers (#1399)
ericcurtin May 19, 2026
2a5a449
fix(ci): require PR checks to pass (#1461)
pimlock May 19, 2026
2cef120
chore(deps): bump actions/download-artifact from 4.3.0 to 8.0.1 (#1459)
dependabot[bot] May 19, 2026
37ca269
chore(deps): bump softprops/action-gh-release from 2.6.2 to 3.0.0 (#1…
dependabot[bot] May 19, 2026
0a8b35c
fix(build): install binaries built in part build tree (#1462)
zyga May 19, 2026
3b53184
test(persistence): make CAS conflict test deterministic (#1464)
pimlock May 19, 2026
3c87393
feat(agents): add LSM compatibility checks to review and spike skills…
derekwaynecarr May 20, 2026
2a065a5
ci(canary): add kind-based helm chart smoke test (#1336)
TaylorMutch May 20, 2026
cade0bb
test(e2e): default GPU probe image (#1450)
elezar May 20, 2026
be6ac9e
docs(agents): add Docker GPU CDI debug hints (#1448)
elezar May 20, 2026
14c5329
docs(providers): add Providers v2 guide (#1442)
johntmyers May 20, 2026
b332ffd
refactor: deduplicate shared driver and provider constants (#1474)
ericcurtin May 20, 2026
c600b11
docs(agents): add release canary testing skill (#1440)
TaylorMutch May 20, 2026
3cde651
chore(deps): bump azure/setup-helm from 4 to 5 (#1468)
dependabot[bot] May 20, 2026
bdaa08f
fix(server): add ConnectSupervisor and RelayStream to SANDBOX_METHODS…
zanetworker May 20, 2026
2b13bfa
fix(ci): eliminate image-tag race between concurrent workflows (#1413)
mesutoezdil May 20, 2026
77e6c7a
test(server): cover service endpoint plaintext security (#1352)
drew May 20, 2026
c143c81
fix(cli): add auth and TLS support to completion client (#1489)
sjenning May 21, 2026
b93a3d8
fix(scripts): use portable lowercase in normalize_bool for Bash 3.2 (…
benoitf May 21, 2026
e3f009f
refactor(server): extract shared relay-await and sandbox-scan helpers…
ericcurtin May 21, 2026
2d9e532
fix(sandbox): skip fork-exec socket ambiguity test on SELinux-enforci…
derekwaynecarr May 21, 2026
528fb29
fix(sandbox): allow first-label L7 host wildcards (#1304)
mjamiv May 21, 2026
9e8610f
feat(cli): add JSON/YAML output format to gateway list (#1500)
benoitf May 21, 2026
5620c8b
refactor: deduplicate repeated patterns across crates (#1499)
ericcurtin May 21, 2026
f8e3f9b
fix(ci): resolve mirror gate statuses for fork PRs (#1504)
pimlock May 21, 2026
af75374
fix(server): respect OPENSHELL_PODMAN_SOCKET env var in embedded driv…
russellb May 21, 2026
e7f965a
refactor(sandbox,driver-vm): Start moving to rustix (esp over libc un…
cgwalters May 21, 2026
f5b0ad7
fix(packaging): add upgrade migration docs and podman socket retry (#…
maxamillion May 21, 2026
5238937
ci: deduplicate e2e workflows (#1512)
TaylorMutch May 22, 2026
a3b16c1
feat(auth): per-sandbox authentication to gateway (#1404)
TaylorMutch May 22, 2026
c5c3f03
docs(sandboxes): add policy advisor guide (#1480)
johntmyers May 22, 2026
68d4280
fix(docker): use host-gateway callbacks on macOS (#1516)
TaylorMutch May 22, 2026
57b71c6
ci(e2e): load single-arch images into kind (#1518)
TaylorMutch May 22, 2026
18988bd
docs(rfc): add sandbox resource requirements proposal (#1360)
elezar May 22, 2026
48333e5
ci(canary): keep helm jwt secret generation enabled (#1521)
TaylorMutch May 22, 2026
686b24d
fix(cli): add json output for policy get (#1410)
mjamiv May 22, 2026
0cef265
feat(providers): derive discovery from profiles (#1503)
johntmyers May 22, 2026
603b3e2
docs: update NemoClaw/OpenClaw references (#1529)
drew May 22, 2026
521eccd
ci: seed shared Rust caches from main (#1530)
pimlock May 22, 2026
0dc08a1
fix(release): build host Linux binaries with glibc floor (#1490)
pimlock May 22, 2026
7d38aa8
fix(homebrew): repair local driver bootstrap state (#1527)
TaylorMutch May 22, 2026
fbd580b
ci: install cargo-zigbuild from release binaries (#1533)
pimlock May 22, 2026
f0f17bf
fix(cli): propagate --gateway-insecure to OIDC auth flows (#1535)
zanetworker May 22, 2026
c8d405c
ci(release): smoke test rpm artifacts on fedora (#1558)
pimlock May 25, 2026
5c3a1f7
chore(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#1554)
dependabot[bot] May 25, 2026
863d2a2
chore(helm): add missing SPDX header to gateway-config template (#1545)
mesutoezdil May 25, 2026
286ce7c
ci(release): skip python rpm in gateway smoke test (#1559)
pimlock May 25, 2026
cd70249
ci: pin azure/setup-helm and helm/kind-action to commit SHAs (#1544)
mesutoezdil May 25, 2026
9857fa1
refactor: deduplicate shared code across ocsf builders and driver cra…
ericcurtin May 26, 2026
4848c40
fix(python): raise SandboxError instead of FileNotFoundError or KeyEr…
mesutoezdil May 26, 2026
88508a0
fix(scripts): replace mapfile with bash 3.2-compatible read loop in h…
mesutoezdil May 26, 2026
3460e5f
docs: add macOS compiler troubleshooting (#1569)
amfred May 26, 2026
fa84e43
fix(gateway): configure local dev auth (#1575)
krishicks May 26, 2026
9e5aee4
docs: add Pi as supported sandbox (#1572)
vegarsti May 26, 2026
7174983
fix(sandbox): add mechanistic smoke test for L4 deny and document the…
mesutoezdil May 26, 2026
47d208c
docs(readme): whitespace (#1578)
krishicks May 26, 2026
2e03faf
fix(cli): replace outdated name reference (#1582)
krishicks May 26, 2026
a3ed421
fix(sandbox): probe Landlock before build, skip on unsupported kernel…
dims May 27, 2026
c9056bb
fix(sandbox): decouple GPU baseline from network policy (#1524)
elezar May 27, 2026
1 change: 1 addition & 0 deletions .agents/skills/build-from-issue/SKILL.md
@@ -148,6 +148,7 @@ In the prompt, instruct the reviewer to:
- **Medium**: Multiple files/components, some design decisions, but well-scoped
- **High**: Cross-cutting changes, architectural decisions needed, significant unknowns
8. Call out risks, unknowns, and decisions that need stakeholder input.
9. Assess **LSM compatibility** — if the change touches process identity, `/proc` filesystem access, binary execution, or inter-process visibility, flag whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. In particular, tests that fork+exec into system binaries will fail on SELinux-enforcing hosts due to cross-label `/proc/<pid>/exe` access restrictions.
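
As a rough aid for the LSM assessment in item 9, a reviewer could first check which security module is active on the test host. The sketch below is an illustration rather than part of the skill: the function name `lsm_status` is hypothetical, and it assumes the standard kernel interfaces `/sys/fs/selinux/enforce` and `/sys/module/apparmor/parameters/enabled` are where the host exposes its state.

```shell
# Hypothetical helper: report which LSM (if any) is likely to affect
# cross-process /proc access on this host.
# The optional argument is a sysfs root, so the function can be pointed at a
# fixture directory; it defaults to the real /sys.
lsm_status() {
  lsm_sysfs="${1:-/sys}"
  lsm_enforce="$lsm_sysfs/fs/selinux/enforce"
  lsm_apparmor="$lsm_sysfs/module/apparmor/parameters/enabled"

  if [ -r "$lsm_enforce" ]; then
    # The enforce file holds 1 in enforcing mode and 0 in permissive mode.
    if [ "$(cat "$lsm_enforce")" = "1" ]; then
      echo "selinux-enforcing"
    else
      echo "selinux-permissive"
    fi
  elif [ -r "$lsm_apparmor" ] && [ "$(cat "$lsm_apparmor")" = "Y" ]; then
    echo "apparmor"
  else
    echo "none"
  fi
}
```

Calling `lsm_status` with no arguments inspects the running host. A result of `selinux-enforcing` suggests that tests which fork+exec into system binaries deserve a closer look.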

### A2: Post the Plan Comment

4 changes: 3 additions & 1 deletion .agents/skills/create-spike/SKILL.md
@@ -91,7 +91,9 @@ The prompt to the reviewer **must** instruct it to:

9. **Check architecture docs** in the `architecture/` directory for relevant documentation about the affected subsystems.

10. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.
10. **Assess Linux Security Module (LSM) impact.** If the change involves process identity, `/proc` filesystem access, file labeling, binary execution, or inter-process visibility, call out whether it will behave differently on hosts running SELinux (enforcing) or AppArmor. For example: reading `/proc/<pid>/exe` across an SELinux domain boundary returns ENOENT, not EACCES. Tests that fork+exec into system binaries (different SELinux label) will fail on enforcing hosts. Flag any LSM-sensitive code paths and recommend mitigations.

11. **Determine the issue type:** `feat`, `fix`, `refactor`, `chore`, `perf`, or `docs`.

### What makes a good investigation prompt

60 changes: 57 additions & 3 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -63,15 +63,46 @@ Use gateway metadata, deployment values, or the user's setup notes to identify t
docker info
docker ps --filter name=openshell
docker logs <container> --tail=200
docker run --rm --entrypoint /openshell-sandbox "${OPENSHELL_DOCKER_SUPERVISOR_IMAGE:-ghcr.io/nvidia/openshell/supervisor:latest}" --version
openshell status
```

For Docker GPU failures, check CDI support and NVIDIA CDI discovery separately:

```bash
docker info --format '{{json .CDISpecDirs}}'
docker info --format '{{json .DiscoveredDevices}}'
for dir in /etc/cdi /var/run/cdi; do
  if [ -d "$dir" ]; then
    find "$dir" -maxdepth 1 -type f \( -name '*.yaml' -o -name '*.json' \) -print
  else
    echo "$dir missing"
  fi
done
systemctl is-enabled nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
systemctl is-active nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path --no-pager --lines=50
journalctl -u nvidia-cdi-refresh.service --no-pager --lines=100
```

When the NVIDIA Container Toolkit CDI refresh units are not enabled or no NVIDIA CDI spec has been generated, enable them and trigger a refresh:

```bash
sudo systemctl enable --now nvidia-cdi-refresh.path
sudo systemctl enable --now nvidia-cdi-refresh.service
sudo systemctl restart nvidia-cdi-refresh.service
docker info --format '{{json .DiscoveredDevices}}'
```

Common findings:

- Docker daemon unavailable: start Docker Desktop or Docker Engine.
- Gateway process stopped: inspect exit status and logs.
- Sandbox image missing or pull denied: verify image reference and registry credentials.
- Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`.
- Sandbox never registers: check gateway logs and supervisor callback endpoint.
- Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`.
- `mise run e2e:docker:gpu` fails with `docker info --format json did not report any discovered NVIDIA CDI GPU devices`: Docker may report `CDISpecDirs` while still having no generated NVIDIA CDI specs. Verify `.DiscoveredDevices` contains entries such as `nvidia.com/gpu=all`, verify `/etc/cdi` or `/var/run/cdi` contains a generated NVIDIA spec, and check that `nvidia-cdi-refresh.service` and `nvidia-cdi-refresh.path` from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so `inactive (dead)` can be normal after a successful run; use `systemctl status` and `journalctl` to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units, and restarting `nvidia-cdi-refresh.service` to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check `docker info`.
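
The `.DiscoveredDevices` check from the last finding can be scripted. This is a minimal sketch, assuming only that a discovered NVIDIA CDI device shows up in the JSON as a string containing `nvidia.com/gpu=` (for example `nvidia.com/gpu=all`); the function name `has_nvidia_cdi_device` is hypothetical, and the match is a plain substring search rather than a JSON parse.

```shell
# Hypothetical helper: succeeds if the JSON on stdin mentions at least one
# NVIDIA CDI device, such as nvidia.com/gpu=all. Substring match only.
has_nvidia_cdi_device() {
  grep -q 'nvidia\.com/gpu='
}

# Example (requires a running Docker daemon):
#   docker info --format '{{json .DiscoveredDevices}}' | has_nvidia_cdi_device \
#     && echo "NVIDIA CDI devices discovered" \
#     || echo "no NVIDIA CDI devices discovered"
```

A substring match keeps the check independent of the exact JSON shape, which may differ between Docker versions.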

For source checkout development, restart the local gateway with:

@@ -111,20 +142,30 @@ Check required Helm deployment secrets:

```bash
kubectl -n openshell get secret \
openshell-ssh-handshake \
openshell-server-tls \
openshell-server-client-ca \
openshell-client-tls
openshell-client-tls \
openshell-jwt-keys
```

If the gateway exits with `failed to read sandbox JWT signing key from
/etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains
`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the
`sandbox-jwt` secret at `/etc/openshell-jwt`. The sandbox JWT mount is required
even when local Helm values disable TLS.
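
To confirm that the secret carries the three keys named above, something like the following could be used. It is a sketch under stated assumptions: the function name `missing_jwt_keys` is hypothetical, and the `kubectl` invocation in the comment assumes the secret lives in the `openshell` namespace as in the command above.

```shell
# Hypothetical helper: reads secret key names from stdin (one per line) and
# prints each required key that is absent. Prints nothing when signing.pem,
# public.pem, and kid are all present.
missing_jwt_keys() {
  jwt_present="$(cat)"
  for jwt_key in signing.pem public.pem kid; do
    # -x matches the whole line, -F treats the key as a literal string.
    printf '%s\n' "$jwt_present" | grep -qxF "$jwt_key" || echo "$jwt_key"
  done
}

# Example (requires cluster access):
#   kubectl -n openshell get secret openshell-jwt-keys \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_jwt_keys
```

Empty output means the secret has the expected keys, so the next thing to check is the `sandbox-jwt` mount on the StatefulSet.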

Check the image references currently used by the gateway deployment:

```bash
kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
```

The gateway image and `server.supervisorImage` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.

For local/external pull mode (the default local path via `mise run cluster`), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the `registries.yaml` mirror endpoint. The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).

Gateway image builds stage a partial Rust workspace from `deploy/docker/Dockerfile.images`. If cargo fails with a missing manifest under `/build/crates/...`, or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including `openshell-driver-docker`, `openshell-driver-kubernetes`, and `openshell-ocsf`, is copied into the staged workspace there.

For plaintext local evaluation, confirm the chart has:

@@ -171,6 +212,18 @@ helm -n openshell get values openshell | grep sandboxNamespace

Then inspect sandbox resources in that namespace.

Check the configured sandbox service account when TokenReview bootstrap or
sandbox registration fails. Helm creates a dedicated sandbox service account by
default and writes it to `[openshell.drivers.kubernetes].service_account_name`;
the gateway rejects projected tokens from other service accounts.

```bash
helm -n openshell get values openshell | grep -A3 sandboxServiceAccount
kubectl -n <sandbox-namespace> get serviceaccount openshell-sandbox
kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\.toml}'
kubectl -n <sandbox-namespace> get sandbox <sandbox-name> -o jsonpath='{.spec.template.spec.serviceAccountName}{"\n"}'
```

### Step 6: Check VM-Backed Gateways

Use the VM driver logs and host diagnostics available in the user's environment. Verify:
@@ -194,6 +247,7 @@ openshell logs <sandbox-name>
| `openshell status` fails | Gateway endpoint unreachable or auth mismatch | `openshell gateway info`, gateway logs |
| Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs |
| Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
| Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` |
| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
10 changes: 7 additions & 3 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -26,8 +26,9 @@ mise run helm:k3s:create
```

Creates a k3d cluster and merges its kubeconfig into the worktree-local `kubeconfig` file.
Also applies base manifests (`deploy/kube/manifests/agent-sandbox.yaml`). Traefik is
disabled at cluster creation time.
Also applies base manifests (`deploy/kube/manifests/agent-sandbox.yaml`) and preloads the
default community sandbox image into k3d so the first sandbox create does not wait on a
large registry pull. Traefik is disabled at cluster creation time.

**Multi-worktree support:** the cluster name is derived from the last component of the
current git branch (e.g. branch `kube-support/local-dev/tmutch` → cluster
@@ -43,6 +44,8 @@ Port mappings created at cluster time (cannot be changed without recreating):

Override with env vars before running `helm:k3s:create`:
- `HELM_K3S_LB_HOST_PORT` (default: `8080`)
- `HELM_K3S_PRELOAD_SANDBOX_IMAGE` (default:
`ghcr.io/nvidia/openshell-community/sandboxes/base:latest`; set to an empty value to skip)

### 2. Deploy OpenShell

@@ -57,7 +60,8 @@ mise run helm:skaffold:run
```

Both commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm
chart. The `pkiInitJob` hook runs on first install to generate mTLS secrets. Envoy Gateway opt-in; see the Optional Add-ons section below.
chart. The `pkiInitJob` hook (a pre-install Job that runs `openshell-gateway generate-certs`)
generates mTLS secrets on first install. Envoy Gateway is opt-in; see the Optional Add-ons section below.

The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or `kubectl port-forward`.

1 change: 1 addition & 0 deletions .agents/skills/openshell-cli/SKILL.md
@@ -141,6 +141,7 @@ openshell sandbox create \
Key flags:
- `--provider`: Attach one or more providers (repeatable)
- `--policy`: Custom policy YAML (otherwise uses built-in default or `OPENSHELL_SANDBOX_POLICY` env var)
- `--cpu`, `--memory`: Set per-sandbox compute sizing. Docker/Podman apply limits; Kubernetes applies matching requests and limits.
- `--upload <PATH>[:<DEST>]`: Upload local files into the sandbox (default dest: `/sandbox`)
- `--no-keep`: Delete the sandbox after the initial command or shell exits
- `--forward <PORT>`: Forward a local port and keep the sandbox alive
2 changes: 2 additions & 0 deletions .agents/skills/openshell-cli/cli-reference.md
@@ -143,6 +143,8 @@ Create a sandbox through the active gateway, wait for readiness, then connect or
| `--no-keep` | Delete sandbox after the initial command or shell exits |
| `--provider <NAME>` | Provider to attach (repeatable) |
| `--policy <PATH>` | Path to custom policy YAML |
| `--cpu <QUANTITY>` | CPU amount for the sandbox (for example: `500m`, `1`, `2.5`) |
| `--memory <QUANTITY>` | Memory amount for the sandbox (for example: `512Mi`, `4Gi`, `8G`) |
| `--forward <PORT>` | Forward local port to sandbox (keeps the sandbox alive) |
| `--tty` | Force pseudo-terminal allocation |
| `--no-tty` | Disable pseudo-terminal allocation |
122 changes: 122 additions & 0 deletions .agents/skills/test-release-canary/SKILL.md
@@ -0,0 +1,122 @@
---
name: test-release-canary
description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary.
---

# Test Release Canary

The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests the artifacts a `Release Dev` run just published. It is the last automated checkpoint before tagging a public release: if the canary is red, the published `dev` artifacts do not install on a stock environment.

## What the canary verifies

| Job | Runner | Verifies |
|---|---|---|
| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. |
| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. |
| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. |
| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. |

`install.sh` defaults to the *latest tagged* release — the canary is therefore checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images.

## Trigger paths

The workflow has two triggers:

```yaml
on:
  workflow_dispatch:
  workflow_run:
    workflows: ["Release Dev"]
    types: [completed]
```

- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary.
- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition.

When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL.
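
The fallback can be pictured as a shell default expansion. This is only an illustration of the behavior, not the workflow's actual expression: the function name `install_sh_url` is hypothetical, and the `https://` scheme is assumed for the `raw.githubusercontent.com` path that this skill mentions elsewhere.

```shell
# Hypothetical helper: build the install.sh URL, preferring the workflow_run
# head SHA (first argument) and falling back to the dispatching ref's SHA
# (second argument) when the first is empty.
install_sh_url() {
  url_head_sha="$1"
  url_fallback_sha="$2"
  echo "https://raw.githubusercontent.com/NVIDIA/OpenShell/${url_head_sha:-$url_fallback_sha}/install.sh"
}
```

For a manual dispatch the first argument would be empty, so `install_sh_url "" "$GITHUB_SHA"` resolves to the branch tip.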

## Manual dispatch

Run the canary as-is on the current branch:

```shell
gh workflow run release-canary.yml --ref "$(git branch --show-current)"
```

Watch the run that starts:

```shell
sleep 5 # let GitHub register the dispatch
gh run list --workflow release-canary.yml --limit 1
gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
```

View only failed jobs after completion:

```shell
gh run view <run-id> --log-failed
```

## Iterating on the canary itself

When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary — you're validating that the canary still works against known-good artifacts.

Note `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads are from the latest public tag.

## Testing artifacts from a specific SHA

`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`):

- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` — floating, overwritten on every main push.
- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.<sha>` — immutable, `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:<sha>` and `ghcr.io/nvidia/openshell/supervisor:<sha>`.

To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs.

## Local kind reproduction

The `kubernetes` job can be reproduced on any machine with Docker and `mise install`-provided `kubectl` + `helm`:

```shell
kind create cluster --name release-canary-local

helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
  --version 0.0.0-dev \
  --namespace openshell --create-namespace \
  --set server.disableTls=true \
  --wait --timeout 5m

kubectl wait --namespace openshell \
  --for=condition=Ready pod \
  --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \
  --timeout=300s

kubectl port-forward --namespace openshell svc/openshell 8080:8080 &
openshell gateway add http://127.0.0.1:8080 --local --name kind
openshell status
```

Keep `pkiInitJob.enabled=true` (the chart default), even when
`server.disableTls=true`. The hook also generates the sandbox JWT signing
secret that the gateway pod always mounts.

Swap `0.0.0-dev` for `0.0.0-dev.<sha>` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`.
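
When scripting the reproduction, the version swap can be wrapped in a small helper. This is a sketch: the function name `chart_version` is hypothetical, and the SHA must be in exactly the form `Release Dev` used when publishing, which this document does not specify.

```shell
# Hypothetical helper: print the SHA-pinned chart version when a SHA is
# given, otherwise the floating dev version.
chart_version() {
  chart_sha="$1"
  if [ -n "$chart_sha" ]; then
    echo "0.0.0-dev.$chart_sha"
  else
    echo "0.0.0-dev"
  fi
}
```

The result can be passed straight to Helm, for example `--version "$(chart_version "$SHA")"`, where an unset `SHA` falls back to the floating chart.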

Loopback registration auto-derives the gateway name `openshell` when `--name` is omitted, which collides with the `install.sh`-installed local gateway. Always pass `--name kind` (or another distinct name) when registering on a machine that also has a local install.

## Diagnosing failures

| Symptom | Likely cause | Where to look |
|---|---|---|
| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. |
| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. |
| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min — usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. |
| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. |
| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. |

The `kubernetes` job's diagnostics step (runs only `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top to bottom; most failures show up in the rendered manifest or the pod logs.

## Related

- `helm-dev-environment` skill — local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts).
- `watch-github-actions` skill — generic `gh run` workflow monitoring.
- `debug-openshell-cluster` skill — runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump.