feat(vm): add vm life cycle extensions #1583
Signed-off-by: Patrick Riel <priel@nvidia.com>
    rc=$?
    set -e
    if [ "$rc" -ne 0 ]; then
      ts "WARNING: OpenShell VM init drop-in ${dropin##*/} failed with exit code ${rc}"
Should this warn on failure or halt?
/ok to test 328047d
drew left a comment
Generally looks good. There are some failing branch checks, and Codex picked up a couple of reasonable-looking findings:
- [P2] QEMU subnet allocations can leak during cancellable lifecycle hooks. crates/openshell-driver-vm/src/driver.rs:662 allocates the QEMU subnet for GPU sandboxes, but crates/openshell-driver-vm/src/driver.rs:730 marks qemu_network_allocated only after configure_launch().await. If deletion aborts provisioning during that hook, delete cleanup sees the flag as false and skips subnet release.
- [P2] Lifecycle cleanup is skipped when the VM helper exits. /Users/anewberry/.isotope/worktrees/pr-1573-feat-vfio-multi-device-passthrough/pr-1583-feat-vfio-multi-device-passthrough/pr-1583-feat-vm-lifecycle-extensions/crates/openshell-driver-vm/src/driver.rs:2791 releases driver GPU/subnet allocations on helper exit, but does not invoke after_launch_failed. Extensions that allocated host resources in before_launch leak them until manual delete.
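One possible way to close the first leak is to tie the subnet to a guard value that releases it on drop, since a cancelled future drops its locals. The sketch below is not the driver's code: `SubnetPool`, `SubnetLease`, and `commit` are hypothetical names, and the pool is a plain set of indices standing in for the real allocator.

```rust
use std::collections::BTreeSet;
use std::sync::{Arc, Mutex};

/// Hypothetical pool of subnet indices; a stand-in for the driver's allocator.
#[derive(Default)]
pub struct SubnetPool {
    in_use: Mutex<BTreeSet<u8>>,
}

impl SubnetPool {
    /// Number of subnets currently handed out.
    pub fn in_use(&self) -> usize {
        self.in_use.lock().unwrap().len()
    }
}

/// Lease that returns its subnet to the pool when dropped, unless committed.
pub struct SubnetLease {
    pool: Arc<SubnetPool>,
    index: u8,
    committed: bool,
}

impl SubnetLease {
    /// Allocate the lowest free index, or None if the pool is exhausted.
    pub fn allocate(pool: &Arc<SubnetPool>) -> Option<Self> {
        let mut used = pool.in_use.lock().unwrap();
        let index = (0..=u8::MAX).find(|i| !used.contains(i))?;
        used.insert(index);
        Some(Self {
            pool: Arc::clone(pool),
            index,
            committed: false,
        })
    }

    /// Hand ownership to the sandbox record. After this call the lease no
    /// longer releases on drop; the regular delete path is responsible.
    pub fn commit(mut self) -> u8 {
        self.committed = true;
        self.index
    }
}

impl Drop for SubnetLease {
    fn drop(&mut self) {
        // Runs on early return, panic unwinding, or a dropped (cancelled) future.
        if !self.committed {
            self.pool.in_use.lock().unwrap().remove(&self.index);
        }
    }
}
```

With this shape, the lease would be held across `configure_launch().await` and committed only once the sandbox record owns the subnet, so an abort in between releases it without consulting a flag. The same idea could cover the second finding if extension-owned resources were wrapped in similar guards, though that depends on how the hooks are wired.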
    #[derive(Debug, Clone)]
    pub struct LaunchPlan {
        pub backend: VmBackend,
        pub vcpus: u8,
        pub mem_mib: u32,
        pub required_backends: Vec<VmBackend>,
        pub required_backend_features: Vec<BackendFeature>,
        pub kernel_profile: Option<String>,
        pub kernel_image: Option<PathBuf>,
        pub gpu_bdf: Option<String>,
        pub tap_device: Option<String>,
        pub guest_ip: Option<String>,
        pub host_ip: Option<String>,
        pub vsock_cid: Option<u32>,
        pub guest_mac: Option<String>,
        pub gateway_port: Option<u16>,
        pub guest_init_dropins: Vec<GuestInitDropin>,
        pub env: Vec<String>,
    }
    const GUEST_TLS_CERT_PATH: &str = "/opt/openshell/tls/tls.crt";
    const GUEST_TLS_KEY_PATH: &str = "/opt/openshell/tls/tls.key";
    const GUEST_SANDBOX_TOKEN_PATH: &str = "/opt/openshell/auth/sandbox.jwt";
    const GUEST_INIT_DROPIN_DIR: &str = "/opt/openshell/init.d";
Starting a 🧵 for another finding
- [P1] Guest images can inject root init drop-ins. crates/openshell-driver-vm/scripts/openshell-vm-sandbox-init.sh:601 scans the merged /opt/openshell/init.d, so user-controlled template.image contents can run as PID 1/root before supervisor policy enforcement. The runner should execute only driver-injected drop-ins, ideally via a per-launch manifest.
I'm trying to decide if this is a bug or a feature. Giving some level of control over VM setup via sandbox images could be useful. Early on I tinkered with the idea of bringing in something like cloud-init.
What do you think?
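If the answer ends up being "bug", the per-launch manifest from the finding could look roughly like the sketch below. It is written in Rust for illustration only, since the actual runner is a shell script. `select_dropins` and the idea of a manifest holding bare file names are assumptions, not anything in the PR.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the drop-ins in `dir` whose file names appear in `manifest`,
/// sorted by path. Files present in the directory but not listed are ignored.
pub fn select_dropins(dir: &Path, manifest: &BTreeSet<String>) -> io::Result<Vec<PathBuf>> {
    let mut selected = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // file_type() does not follow symlinks, so only regular files pass.
        if !entry.file_type()?.is_file() {
            continue;
        }
        let name = entry.file_name();
        match name.to_str() {
            Some(n) if manifest.contains(n) => selected.push(entry.path()),
            _ => {}
        }
    }
    // Deterministic execution order, e.g. 10-foo.sh before 20-bar.sh.
    selected.sort();
    Ok(selected)
}
```

A name check alone would not stop an image from shipping a file with the same name as a driver drop-in, so a real manifest would probably also need to live outside image-controlled paths and carry a digest per entry.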
Summary
Adds lifecycle extension points to the VM driver so future integrations can inspect and adjust VM launch plans around sandbox startup and cleanup. The default registry is empty, so existing VM sandbox behavior remains unchanged.
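As a rough illustration of the shape described above, here is a minimal sketch of a hook registry. It is not the PR's API: the hook names `before_launch` and `after_launch_failed` come from the review comments, but the trait signature, `ExtensionRegistry`, `AddEnv`, and the two-field `LaunchPlan` are assumptions made for the example.

```rust
/// Trimmed-down stand-in for the PR's LaunchPlan; only two fields are modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct LaunchPlan {
    pub vcpus: u8,
    pub env: Vec<String>,
}

/// Hypothetical extension trait; the real signatures may differ.
pub trait LifecycleExtension {
    /// Inspect or adjust the plan before the VM is started.
    fn before_launch(&self, plan: &mut LaunchPlan) -> Result<(), String>;
    /// Release anything acquired in before_launch; default is a no-op.
    fn after_launch_failed(&self, _plan: &LaunchPlan) {}
}

/// Registry that is empty by default, so an unmodified driver runs no hooks.
#[derive(Default)]
pub struct ExtensionRegistry {
    extensions: Vec<Box<dyn LifecycleExtension>>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, ext: Box<dyn LifecycleExtension>) {
        self.extensions.push(ext);
    }

    /// Run every before_launch hook in registration order. On error, notify
    /// the extensions that already ran, in reverse order, then return the error.
    pub fn run_before_launch(&self, plan: &mut LaunchPlan) -> Result<(), String> {
        for (i, ext) in self.extensions.iter().enumerate() {
            if let Err(e) = ext.before_launch(plan) {
                for earlier in self.extensions[..i].iter().rev() {
                    earlier.after_launch_failed(&*plan);
                }
                return Err(e);
            }
        }
        Ok(())
    }
}

/// Example extension that appends one environment variable to the plan.
pub struct AddEnv(pub String);

impl LifecycleExtension for AddEnv {
    fn before_launch(&self, plan: &mut LaunchPlan) -> Result<(), String> {
        plan.env.push(self.0.clone());
        Ok(())
    }
}
```

With `ExtensionRegistry::default()`, `run_before_launch` returns `Ok(())` and leaves the plan untouched, which matches the claim that the default behavior is unchanged.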
Related Issue
Not linked
Changes
Testing
- mise run pre-commit passes (not run; mise is not available in this shell)

Additional validation run:
- cargo fmt -p openshell-driver-vm
- cargo test -p openshell-driver-vm lifecycle -- --nocapture
- cargo test -p openshell-driver-vm qemu -- --nocapture
- cargo clippy -p openshell-driver-vm --all-targets -- -D warnings
- cargo test -p openshell-driver-vm
- git diff --check

Checklist