support firecracker as hypervisor backend#16

Open
CMGS wants to merge 69 commits into master from feature/firecracker

CMGS commented Apr 7, 2026

Summary

Add Firecracker as an alternative hypervisor backend, selected via the --fc persistent root flag. Both CH and FC implement the same hypervisor.Hypervisor interface; shared store operations are extracted into a common hypervisor.Backend struct.

What's included

  • Full VM lifecycle: create, start, stop, delete, inspect, list — OCI images only (direct kernel boot)
  • Networking: single-queue virtio-net via CNI TAP devices with IFF_NO_PI flag; SingleQueueNet VMConfig flag controls TAP creation
  • Console: PTY + Unix socket relay (self-exec background process with broadcaster pattern)
  • Snapshot / Clone / Restore: FC snapshot API + COW symlink redirect + network_overrides (FC v1.14+); withCOWPathLocked closure for serialization
  • Balloon: 25% memory with deflate-on-OOM and free-page-reporting (matches CH behavior)
  • GC: full blob pinning via RegisterGC; both backends registered during cocoon gc
  • Debug: cocoon --fc vm debug outputs FC launch command + REST API curl sequence
  • Doctor: FC v1.15 detection/install, zstd optional check
  • Kernel extraction: automatic vmlinuz → vmlinux decompression (zstd/gzip), atomic cache write
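
The zstd/gzip detection in the last bullet can be pictured with a small sketch. This is not the PR's `ensureVmlinux`; `kernelFormat` is a hypothetical helper that only shows the idea of checking the 4-byte ELF magic first and otherwise looking for the first gzip or zstd frame magic inside the image. A real extractor would probably also need to retry at later offsets, since a magic sequence can appear by chance.

```go
package main

import "bytes"

// Magic byte sequences for the three formats named in the PR description.
var (
	elfMagic  = []byte{0x7f, 'E', 'L', 'F'}
	gzipMagic = []byte{0x1f, 0x8b, 0x08}
	zstdMagic = []byte{0x28, 0xb5, 0x2f, 0xfd}
)

// kernelFormat is a hypothetical helper (not taken from the PR).
// It returns "elf" when the image is already an uncompressed kernel,
// otherwise the compression whose magic appears first in the image,
// together with the offset of that magic. It returns "unknown", -1
// when none of the magics is found.
func kernelFormat(image []byte) (string, int) {
	if bytes.HasPrefix(image, elfMagic) {
		return "elf", 0
	}
	gz := bytes.Index(image, gzipMagic)
	zs := bytes.Index(image, zstdMagic)
	switch {
	case gz < 0 && zs < 0:
		return "unknown", -1
	case zs < 0 || (gz >= 0 && gz < zs):
		return "gzip", gz
	default:
		return "zstd", zs
	}
}
```

Checking the ELF prefix before scanning matches the optimization mentioned in the commit log (read 4 bytes before loading the whole vmlinuz).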

Shared code extraction (hypervisor/ layer)

Extracted ~150 lines of duplicated code from CH and FC into shared packages:

  • hypervisor.Backend struct: Inspect, List, ToVM, ResolveRef(s), LoadRecord, WithRunningVM, UpdateStates, MarkError, ReserveVM, RollbackCreate, ForEachVM, AbortLaunch, BatchMarkStarted, CleanStalePlaceholders
  • hypervisor/shared.go: EnterNetns, WaitForSocket, ExtractBlobIDs, BuildIPParams, PrefixToNetmask, CopyFile, RemoveVMDirs, CleanupRuntimeFiles, VerifyBaseFiles, BlobHexFromPath
  • Shared constants: CowSerial, CreatingStateGCGrace, APISocketName, ConsoleSockName
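
As an illustration of the kind of helper that moved into hypervisor/shared.go, here is a minimal sketch of a prefix-to-netmask conversion. The signature is an assumption; the PR only names `PrefixToNetmask`, and the real function may differ. The dotted form is what a kernel `ip=` parameter expects in its netmask field.

```go
package main

import (
	"fmt"
	"net"
)

// prefixToNetmask converts an IPv4 prefix length (0-32) into dotted
// notation. The name mirrors PrefixToNetmask from the PR, but the
// signature here is assumed for illustration.
func prefixToNetmask(prefix int) (string, error) {
	mask := net.CIDRMask(prefix, 32) // nil when prefix is out of range
	if mask == nil {
		return "", fmt.Errorf("invalid IPv4 prefix length %d", prefix)
	}
	return net.IP(mask).String(), nil
}
```

For a /24 network this yields 255.255.255.0; an out-of-range prefix such as 33 returns an error instead of an empty mask.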

Feature comparison

| Feature | Cloud Hypervisor | Firecracker |
|---|---|---|
| OCI images (direct boot) | Y | Y |
| Cloud images (UEFI boot) | Y | N |
| Windows guests | Y | N |
| Snapshot / Clone / Restore | Y | Y |
| CPU/memory override on clone/restore | Y | N |
| Multi-queue networking | Y | N |
| Memory balloon | Y | Y |
| qcow2 storage | Y | N |
| Interactive console | Y | Y |
| HugePages | Y | Y |
| Boot time | ~200-500ms | ~125ms |
| Memory overhead | ~10-20 MiB/VM | <5 MiB/VM |

Key design decisions

  1. PTY relay: FC serial is stdin/stdout; a self-exec background relay process bridges PTY master to console.sock via broadcaster pattern (single persistent reader, no per-session goroutine leaks)
  2. vmlinux extraction: FC requires uncompressed ELF kernel; automatic zstd/gzip detection and decompression with atomic cache write (temp file + rename)
  3. COW symlink redirect: FC vmstate has absolute drive paths baked in; clone creates temporary symlink from source COW → clone COW during snapshot/load, serialized via withCOWPathLocked
  4. network_overrides: FC v1.14+ feature replaces TAP devices during snapshot/load, avoiding TAP flag mismatch on clone
  5. Absolute paths: FC snapshots store absolute paths (no relative path translation); cross-host portability requires same root_dir/run_dir layout
  6. SingleQueueNet: Generic VMConfig flag for single-queue TAP creation with IFF_NO_PI; set by cmd layer for FC, consumed by CNI layer

Known limitations (documented in KNOWN_ISSUES.md)

  • FC snapshots require same directory layout across hosts (vmstate is binary, not patchable)
  • FC cannot change CPU/memory after snapshot/load (overrides rejected before destructive ops)
  • FC uses symlink redirect for clone COW paths — upstream PR #5774 (drive_overrides) will eliminate this
  • FC does not support virtio-blk serial numbers; OCI init script uses /dev/vdX device paths

Testing

E2E tested on 35.240.182.52 (FC v1.15.0):

  • VM lifecycle: create → start → list → inspect → stop → delete ✅
  • Networking: CNI TAP with IP allocation ✅
  • Console: PTY relay with boot output + login prompt ✅
  • Snapshot → Clone (independent IP, post-clone hints) ✅
  • Snapshot → Restore (same VM, state reverted) ✅
  • CH smoke test: no regression ✅

Review history

  • 7 rounds of Codex review (18 findings fixed)
  • 2 rounds of /simplify (9 findings fixed)
  • Multiple PR review comment rounds
  • Final /code audit: 0 lint issues (dual platform), 10/10 tests pass

Test plan

  • FC create/start/stop/delete with OCI image
  • FC networking (single-queue TAP, IP allocation)
  • FC console (PTY relay, boot output visible)
  • FC snapshot save
  • FC clone (independent IP, network_overrides)
  • FC restore (same VM, state revert)
  • FC balloon (25% memory)
  • FC debug command (curl sequence)
  • CH regression test (no impact)
  • --fc --windows mutual exclusion
  • --fc with cloudimg rejection
  • Lint: 0 issues on linux + darwin
  • Tests: 10/10 pass

close #15 #17

CMGS added a commit that referenced this pull request Apr 7, 2026
P1 GC: Implement RegisterGC for FC backend — protects blob IDs
referenced by FC VMs from garbage collection, mirroring CH's GC module.

P1 Clone paths: Save cocoon.json metadata (StorageConfigs + BootConfig)
in snapshot tar. Create temporary symlinks from source drive paths to
clone paths before snapshot/load so FC finds drives at expected locations.
Symlinks are cleaned up after load + reconfigure.

P2 Rebuild: Replace fragile rebuildFromSnapshot (searched live VM records)
with self-contained metadata from cocoon.json. Clones no longer depend
on the source VM or any sibling VM existing in the DB.

P2 Console relay: Add 3s timeout on second goroutine wait after client
disconnect to prevent blocking the accept loop when PTY read is stuck.
CMGS changed the title from "Feature/firecracker" to "support firecracker as hypervisor backend" Apr 8, 2026
CMGS added 26 commits April 8, 2026 11:58
Add --fc flag to select Firecracker as hypervisor backend.
Validates mutual exclusion with --windows and rejects cloudimg
(UEFI boot) since Firecracker only supports direct kernel boot.
InitHypervisor dispatches based on config; FC returns stub error
until the backend is implemented.
Create Firecracker backend package with Config (path helpers),
main Firecracker struct (constructor, Inspect, List, Watchable),
and helper utilities (toVM, path functions). Wire up InitHypervisor
to create FC backend when --fc is set. Lifecycle methods are stubs
pending implementation.
Add FC REST API client (pre-boot config model), Create (COW disk +
device-path cmdline), and Start (launch process → REST API config
sequence → InstanceStart). FC references disks by /dev/vdX path
since it lacks virtio serial support. Update overlay.sh init script
to resolve both device paths and serial names.
Add Stop (SendCtrlAltDel → SIGTERM → SIGKILL) and Delete
(stop-if-running → cleanup dirs → remove DB record) for the
Firecracker backend. Follows the same patterns as CH.
FC binds serial to process stdin/stdout. Create PTY pair at launch:
slave → FC stdin/stdout, master → background relay process. The
relay (self-exec with env var detection) listens on console.sock
and bridges connections to the PTY master. Auto-exits when FC dies.
Console() connects to console.sock, consistent with CH backend.
Add full snapshot lifecycle for FC backend:
- Snapshot: pause → PUT /snapshot/create (vmstate+mem) → reflink COW → resume
- Clone: extract → launch new FC → PUT /snapshot/load → reconfigure drives/NICs → resume
- Restore: kill running → extract → new FC → snapshot/load → reconfigure → resume
- Direct: hardlink mem, reflink COW, copy vmstate for local snapshots

FC snapshot/load does not preserve drive/NIC config, so drives and
networks are re-attached after load. Implements hypervisor.Direct
interface for reflink-optimized local snapshot operations.
Add FC_VERSION variable (v1.12.0), firecracker binary detection in
check_binary, and auto-install from GitHub releases in --upgrade mode.
Add --fc flag to global flags, Firecracker section with feature
comparison matrix, limitations, OCI image compatibility notes.
Update requirements, doctor, VM lifecycle, and shutdown behavior
sections to reflect dual-backend support.
- Pre-create FC log file (FC requires O_WRONLY|O_APPEND, no O_CREATE)
- Use underscores in drive/iface IDs (FC rejects hyphens)
- Add vmlinux extraction from vmlinuz (FC needs uncompressed ELF kernel)
- Support zstd and gzip compressed kernels via CLI decompressor
- Fix FC download URL in doctor/check.sh (tarball format)
- Guard boot pointer nil dereference in prepareOCI
- Fix relayBidirectional goroutine leak: buffer 2, close conn, wait
- Optimize ensureVmlinux: check ELF magic (4 bytes) and cache before
  reading full vmlinuz into memory
- Extract magic strings to constants (driveIDFmt, ifaceIDFmt,
  cowFileName, FC action types, VM state strings)
- Deep-copy SnapshotIDs map in toVM to prevent shared DB mutation
- Return real error from decompressZstd when output is empty
Extract ~650 lines of duplicated code from CH and FC backends into
shared hypervisor/ layer:

- Backend struct with BackendConfig interface: provides Inspect, List,
  ToVM, ResolveRef(s), LoadRecord, WithRunningVM, UpdateStates,
  MarkError, ReserveVM, RollbackCreate, ForEachVM, AbortLaunch
- shared.go: EnterNetns, WaitForSocket, ExtractBlobIDs, BuildIPParams,
  PrefixToNetmask, CopyFile, RemoveVMDirs, CleanupRuntimeFiles,
  BlobHexFromPath, SocketPath, ConsoleSockPath
- config.HypervisorType enum + switch-case in InitHypervisor
- FC version updated to v1.15.0
P1: GC now registers ALL hypervisor backends (CH + FC) via
InitAllHypervisors, protecting blobs from both backends on
mixed-backend hosts regardless of --fc flag.

P2: doctor/check.sh treats firecracker as optional — warns instead
of failing when not installed, since it's only needed for --fc.

P3: vm debug rejects --fc with a clear error since it only generates
Cloud Hypervisor launch commands.
…el path

P1: createDriveRedirects now unconditionally redirects the source COW
path to the clone's copy. When the source VM is still running, its
cow.raw is renamed to a temporary backup, a symlink is placed, and
after snapshot/load the backup is restored. This prevents FC from
reopening the live source VM's disk state.

P2: saveSnapshotMeta stores the portable vmlinuz path instead of the
host-local vmlinux cache. cloneAfterExtract runs ensureVmlinux on
the clone host to (re)create vmlinux from vmlinuz, making FC
snapshots fully portable across hosts.
P2 redirect: createDriveRedirects now returns error. On symlink
failure after backup rename, the backup is immediately restored
and all prior redirects are cleaned up, preventing source VM disk
corruption from a half-installed redirect.

P2 portable paths: snapshot metadata (cocoon.json) now stores paths
relative to root_dir using filepath.Rel. loadSnapshotMeta resolves
them against the local host's root_dir. Snapshots exported from one
host can be imported on another with a different Cocoon directory
layout, as long as the same OCI image has been pulled.
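
A minimal sketch of the relative-path round trip, assuming hypothetical helper names (`toPortable`, `fromPortable`) and example directories. The commit only states that `filepath.Rel` is used on save and that paths are resolved against the local root_dir on load.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// toPortable converts an absolute path under rootDir into a path relative
// to rootDir, suitable for storing in snapshot metadata.
// Hypothetical helper name.
func toPortable(rootDir, abs string) (string, error) {
	rel, err := filepath.Rel(rootDir, abs)
	if err != nil {
		return "", err
	}
	// Refuse paths that escape rootDir; they cannot be resolved on another host.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%s is outside root dir %s", abs, rootDir)
	}
	return rel, nil
}

// fromPortable resolves a stored relative path against the local rootDir.
func fromPortable(rootDir, rel string) string {
	return filepath.Join(rootDir, rel)
}
```

With a source root of /var/lib/cocoon and a target root of /data/cocoon (both made up for this example), a layer stored as oci/blobs/abc.erofs resolves to /data/cocoon/oci/blobs/abc.erofs on the importing host.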
P1: SnapshotConfig now carries a Hypervisor field ("cloud-hypervisor"
or "firecracker") set during Snapshot(). Clone validates that the
snapshot's backend matches the active backend before proceeding,
with a clear error suggesting the correct flag.

P2: COW redirect during clone is now serialized via a per-source-COW
flock (.clone.lock). Concurrent snapshot/restore/clone operations on
the source VM block until the redirect is cleaned up, preventing
them from following the temporary symlink to the wrong disk.
saveSnapshotMeta now stores ALL drive entries (RO layers + RW COW),
not just RO entries. Without the source COW path, createDriveRedirects
had no old→new mapping to redirect, so snapshot/load would reopen
the live source cow.raw (if source VM exists) or fail (if deleted).
…creation

P1: acquireCOWLock (via lockCOWPath) now creates the parent directory
before locking, fixing ENOENT when source VM has been deleted.

P2: snapshotMeta stores SourceRootDir. vmstatePaths() reconstructs
the original absolute paths baked into FC's vmstate binary.
createDriveRedirects uses vmstate paths as symlink targets, so
cross-host clones redirect at the correct (source host) paths.

P2: COW flock is now taken in Snapshot and Restore too (via shared
lockCOWPath helper), not just Clone. Concurrent snapshot/restore
operations on the source VM are serialized with clone redirects.
…esign

P1: InitAllHypervisors now returns error instead of silently skipping
failed backends. GC aborts if any hypervisor can't be loaded, preventing
blob deletion when pinning data is incomplete.

P2: ensureVmlinux writes to a temp file and renames atomically,
preventing concurrent readers from observing a truncated kernel cache.

P2: Added zstd to doctor/check.sh binary checks — required by FC's
kernel decompression but was previously an undeclared dependency.

P2: Redesigned console relay to use a single persistent PTY reader
goroutine with broadcaster pattern. Each session subscribes/unsubscribes
via setSink(). No per-session read goroutines on the PTY master,
eliminating stale goroutine data theft after disconnect.
P1: vmstatePaths() now reconstructs from raw relative paths saved
before local resolution, so cross-host clones correctly redirect
at source-host paths even when root_dir differs.

P2: zstd treated as optional in doctor/check.sh (like firecracker),
warns instead of failing on CH-only hosts.

P3: FC Stop now honors --force (skip SendCtrlAltDel, immediate kill)
and --timeout (wait for guest response before escalating). Added
gracefulStop with SendCtrlAltDel → poll → forceTerminate pattern.
…ath >26

P2: snapshotRecordToConfig now copies the Hypervisor field so
export/import preserves the backend tag. Clone validation works
correctly after a round-trip.

P2: devPath handles >26 drives with Linux-style multi-letter naming
(vda..vdz, vdaa..vdaz, ...) for OCI images with deep layer stacks.
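
The multi-letter naming is bijective base 26. A minimal sketch follows; the PR names the function `devPath`, but the signature and error handling here are assumptions.

```go
package main

import "fmt"

// devPath returns the virtio-blk device path for a zero-based drive index,
// using Linux-style suffixes: 0 → vda, 25 → vdz, 26 → vdaa, 702 → vdaaa.
// Signature assumed for illustration.
func devPath(index int) (string, error) {
	if index < 0 {
		return "", fmt.Errorf("negative drive index %d", index)
	}
	suffix := ""
	// Bijective base 26: after emitting a letter, subtract one from the
	// quotient so that "aa" follows "z" rather than "ba".
	for n := index; n >= 0; n = n/26 - 1 {
		suffix = string(rune('a'+n%26)) + suffix
	}
	return "/dev/vd" + suffix, nil
}
```

The `n/26 - 1` step is what separates this from ordinary base 26, where there would be no way to write a leading "a" digit.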
…install

P1: FC clone/restore now clamp CPU/memory to snapshot's original
values since FC cannot PATCH machine-config after snapshot/load.
Snapshot metadata stores CPU/Memory for clone to use. Prevents
metadata from advertising overrides FC didn't actually apply.

P2: doctor --upgrade now installs zstd via apt-get/yum when missing,
so fresh FC setups don't silently break on zstd-compressed kernels.
P2: Set VM.ID in synthetic VMRecord for clone launchProcess so FC
gets a valid --id flag instead of empty string.

P2: Drive redirects now only apply for same-host clones (where
SourceRootDir matches local rootDir). Cross-host clones skip
redirects entirely — they require the same rootDir layout, and
creating symlinks under a foreign path tree would be incorrect.
…keep PTY

P1: Always create drive redirects from vmstate paths → local paths,
including cross-host clones. COW flock only on same-host (where
source VM may be running). Cross-host redirects are safe since no
live VM owns those paths on the target host.

P2: FC clone/restore now reject --cpu/--memory overrides with a
clear error instead of silently clamping, since FC cannot PATCH
machine-config after snapshot/load.

P2: Keep PTY master open (intentional fd leak) when console relay
fails, preventing the slave-side hangup that would crash FC's
serial console output during boot.
… ops

Move FC CPU/memory override rejection to before any destructive
operations. Clone validates against snapshot metadata before launch.
Restore validates against current VM record before killing the
running VM (via validateRestoreOverrides helper). Prevents downtime
from unsupported override requests.
CMGS added 15 commits April 8, 2026 11:58
SetUnlinkOnClose(false) before closing the Go listener so the
socket file persists on disk for the relay child process.
Without this, net.UnixListener.Close() removes the socket file,
making console.sock disappear before the relay starts accepting.
Network:
- Add SingleQueueNet flag to VMConfig for FC single-queue TAPs
- CNI creates TAPs with IFF_NO_PI when SingleQueueNet is set (FC requires it)
- Set SingleQueueNet in both createVM and prepareClone paths

Console:
- Fix SetUnlinkOnClose(false) so console.sock persists for relay

Snapshot/Clone:
- Use FC network_overrides (v1.14+) during snapshot/load to provide
  clone's TAP devices, avoiding TAP flag mismatch
- Skip drive reconfiguration after snapshot/load (FC opens drives via
  fd during load, fds survive symlink cleanup)
- Remove unused reconfigureDrives function

Restore:
- Skip drive reconfiguration (same VM, paths unchanged)
- Pass nil network_overrides (same TAP)

COW lock:
- Rewrite lockCOWPath to withCOWPathLocked closure form
- Update all callers (snapshot, clone, restore, direct)

All e2e tests pass: FC create/start/network/console/snapshot/clone/
restore/stop/delete + CH smoke test (no regression).
…tempty

- prepareClone: move ctx before cmd per Go convention
- create_linux.go: re-read link after LinkSetHardwareAddr to get
  the actual MAC (link.Attrs() is stale after override)
- types/vm.go: add omitempty to FirstBooted for consistent JSON
- debug.go: normalize nolint comment alignment
Remove SingleQueueNet from VMConfig — FC queue decision stays at
the cmd layer via tapQueues parameter to initNetwork. The network
layer uses vmCfg.CPU for TAP queues, which initNetwork temporarily
overrides to 1 for FC.

Also add IFF_NO_PI to all TAPs unconditionally — both CH and FC
open TAPs with IFF_NO_PI, so the flag must always be set at
creation time for TUNSETIFF to succeed.
P2: FC clone now rejects --nics > snapshot NIC count since FC can't
hot-add NICs after snapshot/load (only network_overrides for existing).

P3: Debug command runs EnsureVmlinux to resolve vmlinuz → vmlinux
before printing the FC boot-source curl, so the output is runnable.
Export EnsureVmlinux for use by cmd/vm/debug.go.
Add Hypervisor field to types.VM so each VM carries its backend identity.
Move --fc from root PersistentFlags to create/run/debug subcommands only.
Commands like list/inspect/console/stop/rm now auto-detect the backend by
querying all registered backends — no --fc needed for existing VMs.

Clone infers the backend from the snapshot's Hypervisor field.
Snapshot save and list --vm auto-detect from the VM ref.
Status merges watchers from all backends via fan-in channel.
Validate --cpu/--memory/--nics overrides at cmd layer before creating
network and VM dirs, avoiding late failure and unnecessary rollback.

Add MAC change instructions to FC clone post-clone hints since FC
vmstate bakes in the source VM's guest MAC.
CMGS force-pushed the feature/firecracker branch from fe2c700 to 29139fe April 8, 2026 03:59
FC has no ACPI PM on x86 — the only shutdown/reboot signal path is the
i8042 keyboard controller reset. Without reboot=k, guest reboot hangs
(FC doesn't recognize the signal) and SendCtrlAltDel-based vm stop
times out after 30s before falling back to SIGTERM.

CMGS commented Apr 8, 2026

E2E Regression Test Results

Full lifecycle test across both backends, both network types, single and multi CPU.

Test Matrix

# Backend Network CPU Run Stop Snapshot Clone Hints Reboot Network Consistent
1 FC Static IP (bridge+host-local) 1 ✅ 0.5s MAC+Network ✅ Same IP
2 FC DHCP (dhcp-noipam) 2 ✅ 0.5s MAC+DHCP ✅ MAC preserved
3 CH Static IP (bridge+host-local) 2 ✅ hot-swap N/A ✅ Same IP
4 CH DHCP (dhcp-noipam) 1 ✅ hot-swap N/A ✅ MAC preserved

Key Findings

  • reboot=k fix verified: FC vm stop now completes in ~0.5s (previously it waited out a 30s timeout before the SendCtrlAltDel → SIGTERM fallback)
  • FC clone MAC hints: Correctly outputs ip link set dev ethN address <new-mac> instructions; CH clone does not (NIC hot-swap handles it)
  • TAP naming (tap{vmID[:8]}-{nic}): Works correctly on both backends after master rebase
  • vm.Hypervisor field: Set correctly (firecracker / cloud-hypervisor) on create, clone, and inspect
  • --fc auto-detect: vm list, inspect, stop, rm, console, snapshot save, snapshot list --vm all work without --fc flag
  • Clone backend inference: vm clone reads snapshot's Hypervisor field, no --fc needed
  • FC clone resource validation: --cpu, --memory, --nics overrides correctly rejected at cmd layer before resource creation
  • Reboot network consistency: After stop+start, MAC addresses are preserved and networkd configs are regenerated by the initramfs cocoon-network script. DHCP VMs may get a new IP on reboot (tracked in #17, "OCI VM: initramfs IP=dhcp causes DHCP lease to be persisted as static config")
  • Cross-backend vm list: Shows both FC and CH VMs in a single list

Known Issues

CMGS added 12 commits April 8, 2026 13:04
GC orchestrator holds the module's flock for the entire cycle. Collect
called LoadRecord which called DB.With → locker.Lock on the same flock,
causing self-deadlock since flock is not re-entrant.

Replace LoadRecord (lock-acquiring) with DB.ReadRaw (lock-free) in both
FC and CH GC Collect. This is safe because the GC orchestrator already
holds the lock, preventing concurrent DB mutations.
… issues

IP=dhcp caused three problems:
1. --nics 0 VMs hung forever (dhcpcd retries every 120s with no interface)
2. DHCP network VMs had leases persisted as static configs by
   systemd-network-generator, breaking DHCP semantics on reboot
3. Source VMs and cloned VMs had inconsistent network behavior

IP=off tells initramfs to skip networking entirely. Kernel ip= parameters
(when present for static IP networks) override this setting and still
trigger ipconfig. DHCP networks rely on systemd-networkd via the existing
20-wired.network (DHCP=yes) fallback, or cocoon-network's MAC-based
DHCP config generation.

Fixes #17
configure_networking probes for devices and waits for udev even when
IP=off, adding ~180s delay on VMs with no NICs. Only call it when a
kernel ip= parameter is present on the cmdline.
…fs fixes

- Move --fc from Global Flags to VM Flags (only create/run/debug)
- Update FC examples to show auto-detect for list/console/stop/clone
- Fix debug command description
- Add initramfs IP=off note to DHCP networking section
- Add /dev/vdX direct path branch to Android overlay.sh resolve_disk()
  so FC VMs can find disks (FC has no virtio serial support)
- Skip configure_networking unless kernel ip= param is present
- Extract GC Collect to shared Backend.GCCollect() (was duplicated)
- Fix goroutine leak in mergeWatchChannels (missing ctx.Done check)
Remove ndc dependency — ndc network interface add causes netd to take
over eth0 and clear existing routes from the main table. Instead:

- Static IP: kernel ip= routes already in main table, copy to policy tables
- DHCP: udhcpc obtains lease and configures main table, then same copy logic

Both paths use ip route replace into legacy_system/legacy_network/local_network
policy tables. Add /proc/1/cmdline fallback for SELinux-restricted /proc/cmdline.
Add guard file to prevent repeated execution on netd restart.