microvm MVP cleanup: align rootfs behavior to gvisor, ...#313
Benjamin Elder (BenTheElder) wants to merge 5 commits into
Davanum Srinivas (@dims) this should significantly unblock large images. Still needs follow-ups to optimize image pull / unpack. I tested with the 3GB openshell image.
1a621d5 to 6fc1b32
Benjamin Elder (@BenTheElder) does this need me to build/use
Unfortunately yes but the scripts will handle it.
6fc1b32 to c6b05ed
// it with a tmpfs upper. On failure it dumps the guest overlay state.
func startOverlayContainer(ctx context.Context, ac *kata.AgentClient, vsockPath string, c actorContainer) error {
	carrierCtx, carrierCancel := context.WithTimeout(ctx, 30*time.Second)
	err := ac.CreateCarrier(carrierCtx, c.name, c.spec)
Will there be a problem if the user container itself has -ovl as a suffix? Do we need to make sure there is no collision between carrier and workload ids?
Yes, we can switch to _ovl since k8s DNS-label restrictions shouldn't apply here but do apply to the user specified containers. Testing / fixing.
Addressed this, I think that fix works well. It seems unlikely we'll want to allow invalid k8s container names ... later we might consider something like the dynamic containers KEP.
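A minimal sketch of why the `_ovl` suffix avoids the collision, assuming user container names are validated as DNS labels (as Kubernetes does for container names). The `carrierID` helper and `carrierSuffix` constant are hypothetical names for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches an RFC 1123 DNS label: lowercase alphanumerics and '-',
// starting and ending with an alphanumeric.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// carrierSuffix is a hypothetical constant. '_' can never appear in a DNS
// label, so a carrier ID can never equal a valid user container name.
const carrierSuffix = "_ovl"

// carrierID derives the carrier container ID from a workload container name.
// It rejects names that are not DNS labels, since the no-collision argument
// depends on that restriction.
func carrierID(name string) (string, error) {
	if len(name) > 63 || !dnsLabel.MatchString(name) {
		return "", fmt.Errorf("container name %q is not a valid DNS label", name)
	}
	return name + carrierSuffix, nil
}

func main() {
	for _, n := range []string{"app", "app-ovl", "app_ovl"} {
		id, err := carrierID(n)
		fmt.Printf("name=%q carrier=%q err=%v\n", n, id, err)
	}
}
```

A user container named `app-ovl` is still valid and gets its own distinct carrier ID, while a name that already contains `_` is rejected before it could shadow a carrier.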
c6b05ed to 6d6a4f7
I need to rework the branch atop the CI changes now that we have those, so we get coverage on this PR :-) Will do that before we merge (i.e., even if this LGTM to reviewers, let's wait until that is done; pointing an agent at it ...)
The overlay rootfs serves each container's image read-only over virtio-fs, which needs a vhost-user fs device (a virtiofsd socket) and more than one PCI segment (the fs device sits on segment 1, kata's convention). Add FsConfig + PlatformConfig and the Fs/Platform fields to VmConfig; both are omitempty, so a config without them is serialized exactly as before.
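As a rough illustration of the `omitempty` claim, the sketch below shows a config without the new fields serializing unchanged. The field names and JSON keys are assumptions modeled on Cloud Hypervisor's API naming, and `Kernel` is a stand-in for the existing `VmConfig` fields; the real structs in the PR may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FsConfig describes a vhost-user fs device (field set is an assumption).
type FsConfig struct {
	Tag        string `json:"tag"`
	Socket     string `json:"socket"`
	NumQueues  int    `json:"num_queues"`
	QueueSize  int    `json:"queue_size"`
	PciSegment int    `json:"pci_segment,omitempty"`
}

// PlatformConfig carries the PCI segment count (field set is an assumption).
type PlatformConfig struct {
	NumPciSegments int `json:"num_pci_segments,omitempty"`
}

// VmConfig is reduced to one stand-in field plus the two new optional ones.
type VmConfig struct {
	Kernel   string          `json:"kernel"`
	Fs       []FsConfig      `json:"fs,omitempty"`
	Platform *PlatformConfig `json:"platform,omitempty"`
}

// marshal returns the JSON encoding of c, panicking on the (unexpected) error.
func marshal(c VmConfig) string {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Without Fs/Platform, neither key appears in the output.
	fmt.Println(marshal(VmConfig{Kernel: "vmlinux"}))

	// With them, the fs device is placed on PCI segment 1.
	fmt.Println(marshal(VmConfig{
		Kernel:   "vmlinux",
		Fs:       []FsConfig{{Tag: "kataShared", Socket: "/run/vm/virtiofsd.sock", NumQueues: 1, QueueSize: 1024, PciSegment: 1}},
		Platform: &PlatformConfig{NumPciSegments: 2},
	}))
}
```

A nil slice and a nil pointer are both dropped by `omitempty`, which is what keeps older configs byte-for-byte stable.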
…s upper)
Helpers to assemble a container's rootfs as an overlay: its OCI image served
read-only over virtio-fs (the lower) plus a guest tmpfs (the writable upper).
- StartVirtiofsd: run virtiofsd in find-paths migration mode (so the fs device
survives CH snapshot/restore), serving the per-sandbox shared dir.
- ReconstructSharedDirFromImage: bind-mount a container's image into <cid>/rootfs
under the shared dir (no host-side copy; virtiofsd serves it to the guest on
demand), ensure the standard OCI mountpoints exist, and remount it read-only so the
lower is immutable and byte-identical on every node (find-paths re-opens its inodes
by path on restore).
- CreateSandboxForActor: create the sandbox with the kataShared virtio-fs mount.
- CreateCarrier: a created-but-unstarted container that binds the base to a stable
per-container path the overlay uses as its lowerdir.
- StartOverlayWorkload: create + start the container with an overlayfs rootfs whose
upper/work live on a guest tmpfs.
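The overlay assembled by these helpers boils down to an overlayfs mount whose lowerdir is the virtio-fs path and whose upperdir and workdir sit on the same guest tmpfs (overlayfs requires upper and work to share a filesystem). A minimal sketch of building the mount options, with a hypothetical `overlayOptions` helper and illustrative guest paths that are not taken from the PR:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// overlayOptions builds the overlayfs mount data string for one container.
// lower is the read-only image path as seen in the guest; tmpfsRoot is a
// directory on a guest tmpfs that holds both upper and work.
func overlayOptions(lower, tmpfsRoot string) string {
	opts := []string{
		"lowerdir=" + lower,
		"upperdir=" + path.Join(tmpfsRoot, "upper"),
		"workdir=" + path.Join(tmpfsRoot, "work"),
	}
	return strings.Join(opts, ",")
}

func main() {
	// Both paths are made up for the example.
	fmt.Println(overlayOptions("/run/lower/app", "/run/ovl/app"))
}
```

Because the upper lives in guest memory, nothing in this string refers to a host-side writable path.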
Run all of an actor's containers in the one micro-VM (the pod sandbox), each with its
own overlay rootfs (virtio-fs RO lower + guest-tmpfs upper) rather than a per-container
disk. Because the writable upper is a guest tmpfs, rootfs writes are part of the CH
memory snapshot and persist across suspend/resume alongside process memory.
- RunWorkload: bind each container's image into the shared dir and start one
virtiofsd; create the sandbox, then a carrier + overlay workload per container.
- CheckpointWorkload: pause + snapshot memory; the tmpfs upper rides along, so there
is no per-container disk to ship.
- RestoreWorkload: reconstruct each read-only lower from the local OCI bundle, start
virtiofsd, repoint the snapshot config's per-VMDir paths (vsock, serial, fs socket),
and OnDemand-restore + resume.
This replaces the per-container disk path: remove the disk builder (BuildExt4Image),
the blk workload (StartBlkWorkload), and the now-obsolete blk integration test.
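The "repoint the snapshot config's per-VMDir paths" step in RestoreWorkload can be pictured as a prefix rewrite over the config JSON. This is only a sketch under the assumption that the config is plain JSON and that every per-VMDir path starts with the old directory; `repointConfig` and the sample keys are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// repoint walks a decoded JSON value and rewrites every string equal to
// oldDir, or located under it, so that it points at newDir instead.
func repoint(v any, oldDir, newDir string) any {
	switch t := v.(type) {
	case string:
		if t == oldDir || strings.HasPrefix(t, oldDir+"/") {
			return newDir + strings.TrimPrefix(t, oldDir)
		}
		return t
	case map[string]any:
		for k, e := range t {
			t[k] = repoint(e, oldDir, newDir)
		}
		return t
	case []any:
		for i, e := range t {
			t[i] = repoint(e, oldDir, newDir)
		}
		return t
	default:
		return v
	}
}

// repointConfig decodes raw, rewrites the paths, and re-encodes it.
// Note: decoding into `any` turns numbers into float64, so a real
// implementation would likely use typed structs or json.Number.
func repointConfig(raw []byte, oldDir, newDir string) ([]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(repoint(v, oldDir, newDir))
}

func main() {
	in := []byte(`{"vsock":{"socket":"/old/vm/vsock.sock"},"fs":[{"socket":"/old/vm/virtiofsd.sock"}],"kernel":"/boot/vmlinux"}`)
	out, err := repointConfig(in, "/old/vm", "/new/vm")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Paths outside the old VM directory, such as the kernel path in the example, are left untouched.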
…overlay rootfs
The overlay rootfs serves the image over virtio-fs, so the asset set gains virtiofsd and moves to kata 3.32. virtiofsd is built from a pinned source commit because the vhost-0.16 snapshot/restore fix isn't in a release tag yet (tracking: gitlab.com/virtio-fs/virtiofsd work_items/236). assemble.sh builds it and the stagers upload it to rustfs (kind) and GCS (GKE); the counter-microvm SandboxConfig lists the virtiofsd asset for arm64 + amd64, and the sandboxconfig-assets VAP (with its envtest) now requires virtiofsd for every microvm architecture.
The overlay formats nothing on the host, so it runs on the committed debian:stable-slim worker base: drop the custom worker base (hack/ateom-base) and its use in run-microvm-demo.sh.
Update the asset README and architecture doc for the overlay.
Terminology and accuracy fixup in files the overlay change doesn't otherwise touch: the runtime no longer resets the rootfs to golden (the overlay's tmpfs upper persists in the memory snapshot), and "owned-boot" was local jargon for ateom booting cloud-hypervisor itself. Comments only.
6d6a4f7 to 06ef4ed
OK, should be synced with the e2e changes now
cmd := exec.Command(bin,
	"--socket-path="+o.SocketPath,
	"--shared-dir="+o.SharedDir,
	"--cache=auto",
Deferring: currently we only mount this read-only, so we could use cache=always, but it's not clear yet how we will handle projected volumes. cc Taahir Ahmed (@ahmedtd)
I think we might want to do a special writer for those and leverage the ability to exec into the sandboxes, in which case we could leave the virtiofsd mount fully read-only with no host <> guest churn and aggressively cache in the guest to improve performance.
But until we decide, I'm just going to leave this as is. Using it is already a massive startup improvement versus writing to an ext4 for multi-gigabyte disks; it may be worse for read-heavy workloads, but we can iterate.
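If the cache mode does become a decision to revisit, one way to keep it easy to flip is to make it an option that defaults to the current `auto`. A minimal sketch, assuming a hypothetical `Cache` field and `virtiofsdArgs` helper; only the three flags already present in this diff are used:

```go
package main

import (
	"fmt"
	"os/exec"
)

// virtiofsdOpts mirrors the options used in the diff, plus a hypothetical
// Cache field. An empty Cache means "auto", the current behavior.
type virtiofsdOpts struct {
	SocketPath string
	SharedDir  string
	Cache      string
}

// virtiofsdArgs builds the virtiofsd argument list. Only the two cache
// modes discussed in this thread are accepted.
func virtiofsdArgs(o virtiofsdOpts) ([]string, error) {
	cache := o.Cache
	if cache == "" {
		cache = "auto"
	}
	switch cache {
	case "auto", "always":
	default:
		return nil, fmt.Errorf("unsupported cache mode %q", cache)
	}
	return []string{
		"--socket-path=" + o.SocketPath,
		"--shared-dir=" + o.SharedDir,
		"--cache=" + cache,
	}, nil
}

func main() {
	args, err := virtiofsdArgs(virtiofsdOpts{SocketPath: "/run/vm/virtiofsd.sock", SharedDir: "/run/vm/shared"})
	if err != nil {
		panic(err)
	}
	cmd := exec.Command("virtiofsd", args...) // constructed only, not started
	fmt.Println(cmd.Args)
}
```

Switching to `always` later would then be a one-field change once the projected-volume question is settled.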
Follow-up to #287 / #123