feat(vm): add vm life cycle extensions by cheese-head · Pull Request #1583 · NVIDIA/OpenShell

cheese-head · 2026-05-26T22:47:26Z

Summary

Adds lifecycle extension points to the VM driver so future integrations can inspect and adjust VM launch plans around sandbox startup and cleanup. The default registry is empty, so existing VM sandbox behavior remains unchanged.

Related Issue

Not linked

Changes

Added a VM lifecycle extension trait, registry, launch plan, and lifecycle error type.
Wired VM provisioning to run lifecycle hooks before launcher spawn, on launch failure, and after sandbox deletion.
Kept backend selection stable by rejecting non-GPU QEMU extension requirements until concrete PCI transport exists.
Tracked QEMU network allocation separately from GPU assignment so cleanup does not depend on GPU state.
Added unit coverage for lifecycle hook ordering, cleanup hooks, backend validation, and resource exhaustion handling.

Testing

mise run pre-commit passes (not run; mise is not available in this shell)
Unit tests added/updated
E2E tests added/updated (not applicable)

Additional validation run:

cargo fmt -p openshell-driver-vm
cargo test -p openshell-driver-vm lifecycle -- --nocapture
cargo test -p openshell-driver-vm qemu -- --nocapture
cargo clippy -p openshell-driver-vm --all-targets -- -D warnings
cargo test -p openshell-driver-vm
git diff --check

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (not applicable)

Signed-off-by: Patrick Riel <priel@nvidia.com>

copy-pr-bot · 2026-05-26T22:47:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Patrick Riel <priel@nvidia.com>

maxamillion · 2026-05-29T20:51:31Z

+        rc=$?
+        set -e
+        if [ "$rc" -ne 0 ]; then
+            ts "WARNING: OpenShell VM init drop-in ${dropin##*/} failed with exit code ${rc}"


Should this warn on failure or halt?

drew · 2026-05-29T22:25:31Z

/ok to test 328047d

drew

Generally looks good. Some failing branch checks and Codex picked up a couple reasonable looking findings

[P2] QEMU subnet allocations can leak during cancellable lifecycle hooks crates/openshell-driver-vm/src/driver.rs:662 allocates the QEMU subnet for GPU sandboxes, but crates/openshell-driver-vm/src/driver.rs:730 marks qemu_network_allocated only after configure_launch().await. If deletion aborts provisioning during that hook, delete cleanup sees the flag as false and skips subnet release.
[P2] Lifecycle cleanup is skipped when the VM helper exits
/Users/anewberry/.isotope/worktrees/pr-1573-feat-vfio-multi-device-passthrough/pr-1583-feat-vfio-multi-device-passthrough/pr-1583-feat-vm-lifecycle-extensions/crates/openshell-driver-vm/src/driver.rs:2791 releases driver GPU/subnet allocations on helper exit, but does not invoke after_launch_failed. Extensions that allocated host resources in before_launch leak them until manual delete.

drew · 2026-05-29T23:54:34Z

+#[derive(Debug, Clone)]
+pub struct LaunchPlan {
+    pub backend: VmBackend,
+    pub vcpus: u8,
+    pub mem_mib: u32,
+    pub required_backends: Vec<VmBackend>,
+    pub required_backend_features: Vec<BackendFeature>,
+    pub kernel_profile: Option<String>,
+    pub kernel_image: Option<PathBuf>,
+    pub gpu_bdf: Option<String>,
+    pub tap_device: Option<String>,
+    pub guest_ip: Option<String>,
+    pub host_ip: Option<String>,
+    pub vsock_cid: Option<u32>,
+    pub guest_mac: Option<String>,
+    pub gateway_port: Option<u16>,
+    pub guest_init_dropins: Vec<GuestInitDropin>,
+    pub env: Vec<String>,
+}


drew · 2026-05-30T00:07:18Z

 const GUEST_TLS_CERT_PATH: &str = "/opt/openshell/tls/tls.crt";
 const GUEST_TLS_KEY_PATH: &str = "/opt/openshell/tls/tls.key";
 const GUEST_SANDBOX_TOKEN_PATH: &str = "/opt/openshell/auth/sandbox.jwt";
+const GUEST_INIT_DROPIN_DIR: &str = "/opt/openshell/init.d";


Starting a 🧵 for another finding

[P1] Guest images can inject root init drop-ins crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh:601 scans the merged /opt/openshell/init.d, so user-controlled template.image contents can run as PID 1/root before supervisor policy enforcement. The runner should execute only driver-injected drop-ins, ideally via a per-launch manifest.

I'm trying to decide if this is a bug or a feature. Giving some level of control over VM setup via sandbox imagers could be useful. Early on I tinkered with the idea of bringing in something like cloud-init.

What'd you think?

feat(vm): add vm life cycle extensions

3fe9f59

Signed-off-by: Patrick Riel <priel@nvidia.com>

cheese-head marked this pull request as ready for review May 26, 2026 23:00

cheese-head requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 26, 2026 23:00

cheese-head added 2 commits May 28, 2026 00:49

feat(vm): add additional hooks for configuring vm lifecycle

8f4273a

Signed-off-by: Patrick Riel <priel@nvidia.com>

rename for readability

328047d

Signed-off-by: Patrick Riel <priel@nvidia.com>

maxamillion reviewed May 29, 2026

View reviewed changes

drew reviewed May 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(vm): add vm life cycle extensions#1583

feat(vm): add vm life cycle extensions#1583
cheese-head wants to merge 3 commits into
NVIDIA:mainfrom
cheese-head:feat/vm-lifecycle-extensions

cheese-head commented May 26, 2026

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

maxamillion May 29, 2026

Uh oh!

drew commented May 29, 2026

Uh oh!

drew left a comment

Uh oh!

drew May 29, 2026

Uh oh!

drew May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cheese-head commented May 26, 2026

Summary

Related Issue

Changes

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

maxamillion May 29, 2026

Choose a reason for hiding this comment

Uh oh!

drew commented May 29, 2026

Uh oh!

drew left a comment

Choose a reason for hiding this comment

Uh oh!

drew May 29, 2026

Choose a reason for hiding this comment

Uh oh!

drew May 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants