
microvm MVP cleanup: align rootfs behavior to gvisor, ...#313

Open
Benjamin Elder (BenTheElder) wants to merge 5 commits into
agent-substrate:mainfrom
BenTheElder:microvm-overlay-tmpfs

Conversation

@BenTheElder

Collaborator

Follow-up to #287 / #123

  • align snapshot behavior: use a tmpfs (for writes) on top of a read-only virtio-fs mount for the container image rootfs instead of ext4 images
  • enable multiple container support
  • clean up stale comments

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@BenTheElder Benjamin Elder (BenTheElder) changed the title from "microvm MVP cleanup: align rootfs behavior to gvisor, improve large container image performance, ..." to "microvm MVP cleanup: align rootfs behavior to gvisor, ..." on Jun 26, 2026
@BenTheElder Benjamin Elder (BenTheElder) marked this pull request as ready for review June 26, 2026 05:28
@BenTheElder

Collaborator Author

Davanum Srinivas (@dims) this should significantly unblock large images. Still needs follow-ups to optimize image pull / unpack. I tested with the 3GB openshell image.

@dims

Collaborator

Benjamin Elder (@BenTheElder) does this need me to build/use virtiofsd from source?

Comment thread cmd/ateom-microvm/internal/kata/overlay_linux.go
@BenTheElder

Collaborator Author

> Benjamin Elder (@BenTheElder) does this need me to build/use virtiofsd from source?

Unfortunately yes but the scripts will handle it.

Comment thread cmd/ateom-microvm/run.go
// it with a tmpfs upper. On failure it dumps the guest overlay state.
func startOverlayContainer(ctx context.Context, ac *kata.AgentClient, vsockPath string, c actorContainer) error {
carrierCtx, carrierCancel := context.WithTimeout(ctx, 30*time.Second)
err := ac.CreateCarrier(carrierCtx, c.name, c.spec)

Collaborator

Will there be a problem if the user container itself has -ovl as a suffix? Do we need to make sure there is no collision between carrier and workload ids?

@BenTheElder Benjamin Elder (BenTheElder) Jun 26, 2026

Collaborator Author

Yes, we can switch to _ovl since k8s DNS-label restrictions shouldn't apply here but do apply to the user specified containers. Testing / fixing.

Collaborator Author

Addressed this, I think that fix works well. It seems unlikely we'll want to allow invalid k8s container names ... later we might consider something like the dynamic containers KEP.
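To illustrate why an `_ovl` suffix avoids the collision discussed in this thread, here is a minimal sketch. It assumes carrier IDs are derived by appending the suffix to the workload name; `carrierID` and `isValidContainerName` are hypothetical helpers, not the PR's code. The regular expression is the standard Kubernetes DNS-1123 label rule, which rejects underscores.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Label is the Kubernetes DNS-1123 label pattern that container
// names must satisfy: lowercase alphanumerics and '-', no underscores.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isValidContainerName reports whether name could be a user-specified
// Kubernetes container name (DNS-1123 label, at most 63 characters).
func isValidContainerName(name string) bool {
	return len(name) <= 63 && dns1123Label.MatchString(name)
}

// carrierID is a hypothetical helper: it derives the carrier's ID from the
// workload name. Because '_' is not allowed in a DNS-1123 label, the result
// can never equal a valid user-specified container name.
func carrierID(workload string) string {
	return workload + "_ovl"
}

func main() {
	for _, name := range []string{"app", "app-ovl"} {
		id := carrierID(name)
		fmt.Printf("workload=%q validName=%t carrier=%q carrierIsValidName=%t\n",
			name, isValidContainerName(name), id, isValidContainerName(id))
	}
}
```

With a `-ovl` suffix, a user container named `app-ovl` would be a legal name and could collide with the carrier for `app`; with `_ovl`, the carrier ID falls outside the set of legal container names.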

@BenTheElder

Collaborator Author

I need to rework the branch atop the CI changes now that we have those, so we get coverage on this PR :-)

Will do that before we merge (i.e., even if this LGTM to reviewers, let's wait until that is done; pointing an agent at it ...)

The overlay rootfs serves each container's image read-only over virtio-fs, which
needs a vhost-user fs device (a virtiofsd socket) and more than one PCI segment (the
fs device sits on segment 1, kata's convention). Add FsConfig + PlatformConfig and the
Fs/Platform fields to VmConfig; both are omitempty, so a config without them is
serialized exactly as before.
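The omitempty behavior described above can be sketched as follows. The type names `FsConfig`, `PlatformConfig`, and `VmConfig` come from the commit message, but every field, JSON key, and the `Name` stand-in for the pre-existing fields are assumptions made for illustration, not the PR's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FsConfig describes a vhost-user fs device. Field names and JSON keys
// are assumptions for this sketch.
type FsConfig struct {
	Tag        string `json:"tag"`
	Socket     string `json:"socket"`
	PciSegment uint16 `json:"pci_segment,omitempty"`
}

// PlatformConfig carries the PCI segment count (assumed field).
type PlatformConfig struct {
	NumPciSegments uint16 `json:"num_pci_segments,omitempty"`
}

// VmConfig is a reduced stand-in: Name represents the pre-existing fields.
// Fs and Platform are omitempty, so a nil slice or nil pointer is dropped
// from the serialized output.
type VmConfig struct {
	Name     string          `json:"name"`
	Fs       []FsConfig      `json:"fs,omitempty"`
	Platform *PlatformConfig `json:"platform,omitempty"`
}

// mustJSON marshals v and panics on error; fine for a sketch.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Without the new fields, the output contains no "fs" or "platform" key.
	fmt.Println(mustJSON(VmConfig{Name: "demo"}))

	// With them set: the fs device on segment 1 needs at least two segments.
	fmt.Println(mustJSON(VmConfig{
		Name:     "demo",
		Fs:       []FsConfig{{Tag: "kataShared", Socket: "/run/vm/virtiofsd.sock", PciSegment: 1}},
		Platform: &PlatformConfig{NumPciSegments: 2},
	}))
}
```

Using a pointer for `Platform` matters here: `omitempty` does not drop a zero-valued struct, only a nil pointer, so a non-pointer field would always be serialized.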
…s upper)

Helpers to assemble a container's rootfs as an overlay: its OCI image served
read-only over virtio-fs (the lower) plus a guest tmpfs (the writable upper).

  - StartVirtiofsd: run virtiofsd in find-paths migration mode (so the fs device
    survives CH snapshot/restore), serving the per-sandbox shared dir.
  - ReconstructSharedDirFromImage: bind-mount a container's image into <cid>/rootfs
    under the shared dir (no host-side copy; virtiofsd serves it to the guest on
    demand), ensure the standard OCI mountpoints exist, and remount it read-only so the
    lower is immutable and byte-identical on every node (find-paths re-opens its inodes
    by path on restore).
  - CreateSandboxForActor: create the sandbox with the kataShared virtio-fs mount.
  - CreateCarrier: a created-but-unstarted container that binds the base to a stable
    per-container path the overlay uses as its lowerdir.
  - StartOverlayWorkload: create + start the container with an overlayfs rootfs whose
    upper/work live on a guest tmpfs.
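As a rough illustration of the overlay layout these helpers assemble, the sketch below builds the overlayfs mount options for one container. The `overlayLayout` type, its constructor, and the example paths are hypothetical; only the `lowerdir`/`upperdir`/`workdir` option names are standard overlayfs.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// overlayLayout holds the three directories an overlayfs mount needs.
// This type is hypothetical and exists only for this sketch.
type overlayLayout struct {
	Lower string // read-only image rootfs, served over virtio-fs
	Upper string // writable layer on the guest tmpfs
	Work  string // overlayfs scratch dir; must share a filesystem with Upper
}

// newOverlayLayout places upper and work side by side under tmpfsDir so
// both live on the same guest tmpfs, as overlayfs requires.
func newOverlayLayout(lowerDir, tmpfsDir string) overlayLayout {
	return overlayLayout{
		Lower: lowerDir,
		Upper: path.Join(tmpfsDir, "upper"),
		Work:  path.Join(tmpfsDir, "work"),
	}
}

// mountOptions renders the option string passed to an overlay mount.
func (l overlayLayout) mountOptions() string {
	return strings.Join([]string{
		"lowerdir=" + l.Lower,
		"upperdir=" + l.Upper,
		"workdir=" + l.Work,
	}, ",")
}

func main() {
	// Example guest paths; both are assumptions.
	l := newOverlayLayout("/run/lower/app/rootfs", "/run/ovl/app")
	fmt.Println(l.mountOptions())
}
```

Because reads fall through to the lower and writes land in the upper, the image itself is never modified, which is what allows the lower to be remounted read-only on the host.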

Run all of an actor's containers in the one micro-VM (the pod sandbox), each with its
own overlay rootfs (virtio-fs RO lower + guest-tmpfs upper) rather than a per-container
disk. Because the writable upper is a guest tmpfs, rootfs writes are part of the CH
memory snapshot and persist across suspend/resume alongside process memory.

  - RunWorkload: bind each container's image into the shared dir and start one
    virtiofsd; create the sandbox, then a carrier + overlay workload per container.
  - CheckpointWorkload: pause + snapshot memory; the tmpfs upper rides along, so there
    is no per-container disk to ship.
  - RestoreWorkload: reconstruct each read-only lower from the local OCI bundle, start
    virtiofsd, repoint the snapshot config's per-VMDir paths (vsock, serial, fs socket),
    and OnDemand-restore + resume.
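The path repointing step in RestoreWorkload could look roughly like the sketch below. It assumes the snapshot config stores absolute paths under the VM directory used at checkpoint time; `repoint` is a hypothetical helper and the example directories are made up.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// repoint rewrites p so that it lives under newDir instead of oldDir.
// Paths outside oldDir are returned unchanged. Using filepath.Rel rather
// than a string prefix check avoids matching sibling directories such as
// "/run/vm10" when oldDir is "/run/vm1".
func repoint(p, oldDir, newDir string) string {
	rel, err := filepath.Rel(oldDir, p)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return p
	}
	return filepath.Join(newDir, rel)
}

func main() {
	oldDir, newDir := "/run/node-a/vm1", "/run/node-b/vm1" // assumed example dirs
	for _, p := range []string{
		"/run/node-a/vm1/vsock.sock",
		"/run/node-a/vm1/virtiofsd.sock",
		"/usr/share/kernel/vmlinux", // not per-VM, left alone
	} {
		fmt.Printf("%s -> %s\n", p, repoint(p, oldDir, newDir))
	}
}
```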

This replaces the per-container disk path: remove the disk builder (BuildExt4Image),
the blk workload (StartBlkWorkload), and the now-obsolete blk integration test.
…overlay rootfs

The overlay rootfs serves the image over virtio-fs, so the asset set gains virtiofsd
and moves to kata 3.32. virtiofsd is built from a pinned source commit because the
vhost-0.16 snapshot/restore fix isn't in a release tag yet (tracking:
gitlab.com/virtio-fs/virtiofsd work_items/236). assemble.sh builds it and the stagers
upload it to rustfs (kind) and GCS (GKE); the counter-microvm SandboxConfig lists the
virtiofsd asset for arm64 + amd64, and the sandboxconfig-assets VAP (with its envtest)
now requires virtiofsd for every microvm architecture.

The overlay formats nothing on the host, so it runs on the committed debian:stable-slim
worker base: drop the custom worker base (hack/ateom-base) and its use in
run-microvm-demo.sh. Update the asset README and architecture doc for the overlay.

Terminology and accuracy fixup in files the overlay change doesn't otherwise touch:
the runtime no longer resets the rootfs to golden (the overlay's tmpfs upper persists
in the memory snapshot), and "owned-boot" was local jargon for ateom booting
cloud-hypervisor itself. Comments only.
@BenTheElder

Collaborator Author

OK, should be synced with the e2e changes now

cmd := exec.Command(bin,
"--socket-path="+o.SocketPath,
"--shared-dir="+o.SharedDir,
"--cache=auto",

Collaborator Author

Deferring: currently we only mount this read-only, so we could use cache=always, but it's not clear yet how we will handle projected volumes. cc Taahir Ahmed (@ahmedtd)

I think we might want to do a special writer for those and leverage the ability to exec into the sandboxes, in which case we could leave the virtiofsd mount fully read-only with no host <> guest churn and aggressively cache in the guest to improve performance.

But until we decide, I'm just going to leave this as is: using it is already a massive startup improvement over writing multi-gigabyte disks to an ext4 image. It may be worse for read-heavy workloads, but we can iterate.


2 participants