Skip to content

feat(vm): add vm life cycle extensions#1583

Open
cheese-head wants to merge 3 commits into
NVIDIA:mainfrom
cheese-head:feat/vm-lifecycle-extensions
Open

feat(vm): add vm life cycle extensions#1583
cheese-head wants to merge 3 commits into
NVIDIA:mainfrom
cheese-head:feat/vm-lifecycle-extensions

Conversation

@cheese-head
Copy link
Copy Markdown
Contributor

Summary

Adds lifecycle extension points to the VM driver so future integrations can inspect and adjust VM launch plans around sandbox startup and cleanup. The default registry is empty, so existing VM sandbox behavior remains unchanged.

Related Issue

Not linked

Changes

  • Added a VM lifecycle extension trait, registry, launch plan, and lifecycle error type.
  • Wired VM provisioning to run lifecycle hooks before launcher spawn, on launch failure, and after sandbox deletion.
  • Kept backend selection stable by rejecting non-GPU QEMU extension requirements until concrete PCI transport exists.
  • Tracked QEMU network allocation separately from GPU assignment so cleanup does not depend on GPU state.
  • Added unit coverage for lifecycle hook ordering, cleanup hooks, backend validation, and resource exhaustion handling.

Testing

  • mise run pre-commit passes (not run; mise is not available in this shell)
  • Unit tests added/updated
  • E2E tests added/updated (not applicable)

Additional validation run:

  • cargo fmt -p openshell-driver-vm
  • cargo test -p openshell-driver-vm lifecycle -- --nocapture
  • cargo test -p openshell-driver-vm qemu -- --nocapture
  • cargo clippy -p openshell-driver-vm --all-targets -- -D warnings
  • cargo test -p openshell-driver-vm
  • git diff --check

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (not applicable)

Signed-off-by: Patrick Riel <priel@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cheese-head cheese-head marked this pull request as ready for review May 26, 2026 23:00
Signed-off-by: Patrick Riel <priel@nvidia.com>
Signed-off-by: Patrick Riel <priel@nvidia.com>
rc=$?
set -e
if [ "$rc" -ne 0 ]; then
ts "WARNING: OpenShell VM init drop-in ${dropin##*/} failed with exit code ${rc}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this warn on failure or halt?

@drew
Copy link
Copy Markdown
Collaborator

drew commented May 29, 2026

/ok to test 328047d

Copy link
Copy Markdown
Collaborator

@drew drew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. Some failing branch checks and Codex picked up a couple reasonable looking findings

  • [P2] QEMU subnet allocations can leak during cancellable lifecycle hooks crates/openshell-driver-vm/src/driver.rs:662 allocates the QEMU subnet for GPU sandboxes, but crates/openshell-driver-vm/src/driver.rs:730 marks qemu_network_allocated only after configure_launch().await. If deletion aborts provisioning during that hook, delete cleanup sees the flag as false and skips subnet release.

  • [P2] Lifecycle cleanup is skipped when the VM helper exits
    /Users/anewberry/.isotope/worktrees/pr-1573-feat-vfio-multi-device-passthrough/pr-1583-feat-vfio-multi-device-passthrough/pr-1583-feat-vm-lifecycle-extensions/crates/openshell-driver-vm/src/driver.rs:2791 releases driver GPU/subnet allocations on helper exit, but does not invoke after_launch_failed. Extensions that allocated host resources in before_launch leak them until manual delete.

Comment on lines +194 to +212
#[derive(Debug, Clone)]
pub struct LaunchPlan {
pub backend: VmBackend,
pub vcpus: u8,
pub mem_mib: u32,
pub required_backends: Vec<VmBackend>,
pub required_backend_features: Vec<BackendFeature>,
pub kernel_profile: Option<String>,
pub kernel_image: Option<PathBuf>,
pub gpu_bdf: Option<String>,
pub tap_device: Option<String>,
pub guest_ip: Option<String>,
pub host_ip: Option<String>,
pub vsock_cid: Option<u32>,
pub guest_mac: Option<String>,
pub gateway_port: Option<u16>,
pub guest_init_dropins: Vec<GuestInitDropin>,
pub env: Vec<String>,
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 💯

const GUEST_TLS_CERT_PATH: &str = "/opt/openshell/tls/tls.crt";
const GUEST_TLS_KEY_PATH: &str = "/opt/openshell/tls/tls.key";
const GUEST_SANDBOX_TOKEN_PATH: &str = "/opt/openshell/auth/sandbox.jwt";
const GUEST_INIT_DROPIN_DIR: &str = "/opt/openshell/init.d";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting a 🧵 for another finding

  • [P1] Guest images can inject root init drop-ins crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh:601 scans the merged /opt/openshell/init.d, so user-controlled template.image contents can run as PID 1/root before supervisor policy enforcement. The runner should execute only driver-injected drop-ins, ideally via a per-launch manifest.

I'm trying to decide if this is a bug or a feature. Giving some level of control over VM setup via sandbox imagers could be useful. Early on I tinkered with the idea of bringing in something like cloud-init.

What'd you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants