
microvm MVP #287

Merged
Bowei Du (bowei) merged 26 commits into agent-substrate:main from BenTheElder:microvm-blk-rootfs
Jun 25, 2026

Conversation

@BenTheElder Benjamin Elder (BenTheElder) commented Jun 23, 2026

Collaborator

Fixes #123

cloud-hypervisor + kata (guest only).

  • Tests pass
  • Appropriate changes to documentation are included in the PR

TODO:

  • CI coverage ... not sure if we can pull this off in GHA.

There are some other obvious TODOs that should be resolvable but are not resolved yet:

  • We require a custom base image (for mkfs), and we have nowhere to host it currently, so it has to be built on demand.
  • Only supports one container. (Should be a quick fix ... EDIT: it is but it could stay a follow-up given the PR size ...)
  • We have to hardcode the kvm mount etc in the workerpool controller. Punting generalizing that as pluggability is a long-term goal vs the short term POC-2 goals. We can think about how much config we want to expose there.
  • cmd/setup-gcp isn't provisioning with nested virt yet; I added a node pool manually.
  • We probably need docs for setting up local dev. I'm using limactl on an M4 Mac, with nested virt enabled (non-default config). The kind script will handle this IF the container env has KVM available.
  • No /run/ate projected mount. Easy enough to fix ™️

I've started on these but I'm not blocking on them.

@BenTheElder Benjamin Elder (BenTheElder) marked this pull request as ready for review June 23, 2026 06:37
Davanum Srinivas (dims) added a commit to dims/substrate that referenced this pull request Jun 23, 2026
agent-substrate#287's micro-VM (kata + cloud-hypervisor) MVP builds the actor's virtio-blk
rootfs synchronously with mkfs.ext4 -d, bound by the resume RPC deadline. For a
small image that is sub-second; a real multi-GB image (OpenShell's ~3.2 GB
helpdesk supervisor) takes minutes, so the context is cancelled and mkfs is
SIGKILLed mid-write -- the actor never boots.

- ateom-microvm RunWorkload (golden boot): build the rootfs on a deadline-
  detached context to a temp file + atomic rename; skip if it already exists.
  Idempotent + crash-safe so the controller retries converge.
- ateom-microvm RunWorkload (restore reconstruct): detach the rebuild from the
  RPC deadline too (reset-each-restore, so no idempotency).
- ateapi ResumeActor: hold the actor lock long enough to cover the build
  (30s -> 5m; suspend/delete stay short).
- atenet router implicit-resume: 15s -> 5m, so a request that triggers a resume
  waits for the multi-GB rebuild + VM restore instead of cancelling it.

Verified on CPU kind: the OpenShell helpdesk golden snapshot reaches Ready
(supervisor boots inside the kata guest + is checkpointed) and a resumed user
actor's agent serves /status, where before every path looped on
'mkfs.ext4 ... signal: killed'.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims

Collaborator

Benjamin Elder (@BenTheElder) I was able to test this with my OpenShell integration with some changes, please see:
c3aa089

It's mostly timeout bumps but there's a suggestion on making building Ext4 images a bit more robust.

Verified on bigbox (CPU kind cluster, /dev/kvm + nested virt):
- #287's own counter-microvm demo passes (boot → suspend → resume, in-RAM count survives the VM snapshot).
- OpenShell helpdesk golden snapshot reaches Ready — the supervisor boots inside a kata/cloud-hypervisor VM and is checkpointed.
- A resumed user actor's agent serves traffic: GET /status → {"history_turns":0,"uptime_seconds":79.7,"model":"gpt-oss:20b-cloud"}.

What was broken + the fix

#287's MVP builds each actor's virtio-blk rootfs synchronously with mkfs.ext4 -d, bound by the resume RPC deadline. Sub-second for a tiny image; for OpenShell's 3.2 GB helpdesk image it takes minutes → the context cancels → mkfs … signal: killed, at four spots. The branch addresses all:
1. ateom-microvm golden-boot build → deadline-detached + idempotent (temp→atomic-rename).
2. ateom-microvm restore-path rebuild → deadline-detached.
3. ateapi ResumeActor lock TTL 30s→5m.
4. atenet router implicit-resume 15s→5m.
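
The first fix in the list above (a deadline-detached, idempotent build published by temp file plus atomic rename) can be sketched roughly as follows. This is a minimal illustration, not the code in c3aa089: `ensureImage`, `buildFunc`, the `budget` parameter, and the example helper are hypothetical names, and the builder is a stand-in for the real `mkfs.ext4 -d` invocation. It assumes Go 1.21+ for `context.WithoutCancel`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// buildFunc writes a complete image to path. It is a hypothetical stand-in for
// the real builder (for example an exec of mkfs.ext4 -d).
type buildFunc func(ctx context.Context, path string) error

// ensureImage builds dst at most once and publishes it atomically, so a retry
// after a crash or a cancelled RPC converges instead of starting over.
func ensureImage(ctx context.Context, dst string, budget time.Duration, build buildFunc) error {
	if _, err := os.Stat(dst); err == nil {
		return nil // a previous attempt already published the image
	} else if !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("stat %s: %w", dst, err)
	}

	// Keep the caller's context values but drop its cancellation and deadline,
	// then apply a separate, longer budget for the build itself.
	buildCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), budget)
	defer cancel()

	tmp, err := os.CreateTemp(filepath.Dir(dst), filepath.Base(dst)+".tmp-*")
	if err != nil {
		return fmt.Errorf("creating temp image: %w", err)
	}
	tmpPath := tmp.Name()
	if err := tmp.Close(); err != nil {
		return err
	}
	defer os.Remove(tmpPath) // fails harmlessly once the rename has succeeded

	if err := build(buildCtx, tmpPath); err != nil {
		return fmt.Errorf("building %s: %w", dst, err)
	}
	// Same directory, so the rename is atomic: readers see no file or a complete one.
	return os.Rename(tmpPath, dst)
}

// exampleEnsureImage calls ensureImage twice with an already-cancelled parent
// context and reports how many times the builder actually ran.
func exampleEnsureImage() (int, error) {
	dir, err := os.MkdirTemp("", "rootfs-example")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	builds := 0
	build := func(ctx context.Context, path string) error {
		if err := ctx.Err(); err != nil {
			return err // would trigger if the build were still tied to the parent
		}
		builds++
		return os.WriteFile(path, []byte("image"), 0o600)
	}

	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulate the resume RPC deadline having already fired

	dst := filepath.Join(dir, "rootfs.img")
	for i := 0; i < 2; i++ {
		if err := ensureImage(parent, dst, time.Minute, build); err != nil {
			return builds, err
		}
	}
	return builds, nil
}
```

In this sketch the second call finds the published file and skips the build, which is the property that lets controller retries converge. The restore-path rebuild described in item 2 would skip the existence check, since that disk is reset on every restore.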

Benjamin Elder (BenTheElder) commented Jun 23, 2026

Collaborator Author

> It's mostly timeout bumps but there's a suggestion on making building Ext4 images a bit more robust.

ACK on making ext4 more robust. I'm also working on moving the ext4 provisioning into Go (without exec) so we can drop the custom base image and make the images more deterministic, but it's a complex change. I think it works fine now but I haven't battle-tested it yet.

Our mock workloads have pretty small disks, will have to reconsider this aspect. Those timeouts seem abysmal!

I think I know how we want to fix this ... let me test some things.

Comment thread hack/update/licenses.sh
@BenTheElder

Collaborator Author

[just trivial rebases for now]

Comment thread cmd/atelet/main.go
if err != nil {
	return nil, fmt.Errorf("while calling ateom.CheckpointWorkload: %w", err)
}
sandboxRec.SnapshotFiles = resp.GetSnapshotFiles()

Collaborator

Are there more files stored in the checkpoint directory besides the snapshot files? Wondering if we could require that the entire directory only contain the necessary files.

Collaborator Author

It should only have the snapshot files, but the specific list is currently the restore contract / recorded to the manifest. If we want to pivot to e.g. tarring them up, that should be fine.

Julian Gutierrez Oschmann (juli4n) pushed a commit that referenced this pull request Jun 23, 2026
)

While working on multi-container support for uVM in
#287 I noticed how
difficult it is to debug multi-container workloads.

This makes it a lot easier. TODO: We could support filtering on this in
kubectl ate (cc @HavenXia)

go-toml/v2 parses the kata configuration.toml; ttrpc (and its log dependency)
backs the kata-agent client the micro-VM runtime drives. Includes their licenses.

atelet passes the runtime-fetched sandbox asset paths to the ateom worker and
records the exact set of files in each snapshot, so snapshots are self-describing
and the worker image bakes in no sandbox toolchain.

A copy of the kata-containers kata-agent protocol buffers (agent/oci/types/csi),
used to drive the guest agent over ttrpc. Copied verbatim from kata-containers
3.31.0; see PROVENANCE.md for the upstream source, version, and license.

WorkerPool renders the micro-VM worker pod shape, and the SandboxConfig
validating admission policy enforces the micro-VM asset set (cloud-hypervisor,
guest kernel, guest image, kata config).

…ansfer

Ship/restore exactly the files each runtime records in the snapshot manifest.
The guest memory image is mostly free (zero) RAM, so compress and download only
the populated extents (versioned sparse-extent format, backward compatible with
plain zstd). On restore, download the snapshot concurrently with unpacking the
OCI image; for streaming object stores, pipe compression straight into the upload.

Move the ActorLogger (and SyncedWriter) out of cmd/ateom-gvisor into a shared
internal package so both ateom runtimes forward actor container logs to the pod
log with the same ate.dev/* labels. Add serverboot.InitLoggerWithWriter so a
runtime can route its own slog through the same synchronized writer as the actor
log forwarder (no interleaved lines).

A small REST client to own the guest boot (vm.create / vm.add-net with tap FDs /
vm.boot) and to snapshot and restore it. Restore uses cloud-hypervisor's OnDemand
(userfaultfd) memory restore to avoid densifying the guest memory, and a sparse
diff-merge overlays the post-restore delta back onto its source so each snapshot
stays complete and re-restorable.

A ttrpc kata-agent client (sandbox create, container create/start, guest
interface/route/ARP setup, stdout/stderr read), an mkfs.ext4 builder that turns
the OCI bundle rootfs into a virtio-blk disk image, OCI-to-agent spec conversion,
and a go-toml reader for the guest sizing + kernel params.

A second ateom runtime that runs the actor in a kata 3.31 guest under
cloud-hypervisor. ateom owns the CH boot itself (no kata shim) and gives the
actor a writable boot-time virtio-blk /dev/vdb rootfs built from the OCI bundle,
so rootfs writes land off guest RAM and the snapshot is memory-only (no balloon).
It drives the guest agent for sandbox setup + networking, snapshots/restores the
VM with reset-to-golden disk semantics (rootfs writes discarded across
suspend/resume, in-RAM state preserved) including cross-node restore, and forwards
the actor's stdout/stderr to the pod log with ate.dev/* labels.

Assemble + stage the micro-VM runtime assets, an ateom-base image (debian-slim +
e2fsprogs for mkfs.ext4), and run-microvm-demo.sh to build + deploy the
counter-microvm demo end to end (overriding the worker base via KO_CONFIG_PATH so
no committed file is edited). Document the micro-VM sandbox class.

Comment thread go.mod
github.com/hashicorp/go-reap v0.0.0-20260220095743-4e27870b4f51
github.com/klauspost/compress v1.18.5
github.com/opencontainers/runtime-spec v1.3.0
github.com/pelletier/go-toml/v2 v2.4.0

Collaborator

looks like we need this for reading kata's configuration.toml

@BenTheElder Benjamin Elder (BenTheElder) Jun 24, 2026

Collaborator Author

yes. and ttrpc is for talking to kata-agent (should be in the commit message)

@@ -0,0 +1,52 @@
# third_party/kata — vendored kata-containers sources

Source copied from [kata-containers](https://github.com/kata-containers/kata-containers)

Collaborator

Anything we can ask kata folks to do here to make it easier for us to consume?

Collaborator Author

I suppose we could ask for it to be split to another module?

I'm not sure if this is even supported, the kata-agent is something of an implementation detail currently, we could pretty easily substitute our own.

At the moment the most annoying part would be having another binary to plumb, especially without hosted builds.

Collaborator Author

It seemed premature to ask for that now. We can consider it for a follow-up, but it's functionally pretty similar to vendoring. If we need to upgrade this often, that's a signal that we should move to a more stable integration point.

I was originally using the kata shim, but ... the gap to do snapshot/restore is pretty big.

`CreateSandbox`, `CreateContainer`, `StartContainer`, `UpdateInterface`, `UpdateRoutes`,
`AddARPNeighbors`, `ReadStdout`, `ReadStderr`.

### Regenerating

Collaborator

if we touch this again, we'll need to add a Makefile or script!

Collaborator Author

~agents :-)

Collaborator

Let's take a TODO -- agents sort of do it but it's hard to repeatedly verify exactly what was done to go from the Kata code to this codebase. It's easy to leave things out of a readme file because that is not executed deterministically.

Comment thread cmd/atecontroller/internal/controllers/workerpool_apply.go Outdated

// WaitReady blocks until the api-socket answers vmm.ping or the deadline passes.
func (c *Client) WaitReady(ctx context.Context, deadline time.Duration) error {
	end := time.Now().Add(deadline)

Collaborator

why not use context deadline?
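
One way to read this suggestion is to drop the separate `deadline` argument and let the caller bound the wait through `ctx`. The sketch below assumes that reading; `waitReady`, the `ping` callback (a stand-in for the vmm.ping call), and the polling interval are illustrative names, not the PR's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitReady polls ping until it succeeds or ctx is done. The caller controls
// the overall budget with context.WithTimeout or context.WithDeadline.
func waitReady(ctx context.Context, interval time.Duration, ping func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastErr error
	for {
		if lastErr = ping(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("VMM not ready: %w (last ping error: %v)", ctx.Err(), lastErr)
		case <-ticker.C:
			// try again
		}
	}
}

// exampleWaitReady exercises both outcomes: a ping that succeeds on the third
// attempt, and a ping that never succeeds before the context expires.
func exampleWaitReady() (readyErr, timeoutErr error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	attempts := 0
	readyErr = waitReady(ctx, time.Millisecond, func(context.Context) error {
		attempts++
		if attempts < 3 {
			return errors.New("socket not answering yet")
		}
		return nil
	})

	short, cancelShort := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancelShort()
	timeoutErr = waitReady(short, time.Millisecond, func(context.Context) error {
		return errors.New("socket not answering")
	})
	return readyErr, timeoutErr
}
```

Because the returned error wraps `ctx.Err()`, callers can still distinguish a timeout from a cancellation with `errors.Is`.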

Comment thread cmd/ateom-microvm/internal/ch/merge.go Outdated
if err := m.Close(); err != nil {
	return err
}
// Put the merged image at delta's name. We UNLINK CH's old delta FIRST, then

Collaborator

these comments are somewhat distracting detail-wise and could be summarized by something like "partially written back data will be discarded so there is no need to worry about sync'ing pages fully to disk".

Same with comment above.

Comment thread cmd/ateom-microvm/internal/ch/merge.go Outdated
}

// MergeDeltaIntoBase produces the same COMPLETE merged snapshot as
// MergeSparseOverlay(base, delta, delta) — base with delta's populated pages

Collaborator

I found this comment confusing -- especially since it describes what MergeSparseOverlay does. Can we have the comment just describe what MergeDeltaIntoBase does directly?

Also, some weirdness: MergeSparseOverlay(base, delta, ?delta?)

Comment thread cmd/ateom-microvm/internal/ch/merge.go Outdated
// checkpoint-state/), so the renames are same-filesystem (metadata-only). On the
// off chance they straddle a mount boundary (EXDEV), it falls back to the copying
// MergeSparseOverlay (base is untouched until the first rename succeeds).
func MergeDeltaIntoBase(ctx context.Context, base, delta string) error {

Collaborator

suggest: base -> baseFile, delta -> deltaFile

Comment thread cmd/ateom-microvm/internal/ch/merge.go Outdated
return os.Rename(merged, delta)
}

// overlayDataRegions copies every populated (non-hole) region of src onto dst at

Collaborator

suggest just calling this copySparseRegions. There is no overlay being produced, you are overwriting the dst.

}
return copied, fmt.Errorf("SEEK_DATA: %w", err)
}
de, err := unix.Seek(sfd, ds, unix.SEEK_HOLE)

Collaborator

minor: there is an iterator-based impl over sparse that probably would be easier to understand.

// Raw HTTP/1.1 over the unix socket: net/http cannot attach SCM_RIGHTS, and
// CH's micro_http collects fds from the recvmsg ancillary data of the
// request that carries them.
req := fmt.Sprintf("PUT /api/v1/vm.restore HTTP/1.1\r\nHost: localhost\r\nAccept: */*\r\nContent-Type: application/json\r\nContent-Length: %d\r\n\r\n%s", len(body), body)

Collaborator

this is kind of gross -- can we not just use the standard http transport here?

Collaborator Author

I don't think so, because we need to do the SCM_RIGHTS stuff ...

Collaborator

You might be able to hack it with a custom round tripper, but probably save that for later. For right now, can we make it a bit more structured:

func formatHTTPRequest(method, path string, headers map[string]string, body string) string

and

func parseHTTPResponse(...) ...

so we can unit test this?
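
A rough sketch of what those two helpers might look like. `formatHTTPRequest` follows the signature proposed above; the `parseHTTPResponse` signature is an assumption, since the comment elides it. Headers are emitted in sorted order (Go map iteration is unordered) so the output is stable enough to compare in a test, and Content-Length is derived from the body as in the existing `fmt.Sprintf` calls.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"sort"
	"strings"
)

// formatHTTPRequest renders a raw HTTP/1.1 request. Content-Length is computed
// from body, so callers should not pass it in headers.
func formatHTTPRequest(method, path string, headers map[string]string, body string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s %s HTTP/1.1\r\n", method, path)

	keys := make([]string, 0, len(headers))
	for k := range headers {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic header order
	for _, k := range keys {
		fmt.Fprintf(&b, "%s: %s\r\n", k, headers[k])
	}
	fmt.Fprintf(&b, "Content-Length: %d\r\n\r\n%s", len(body), body)
	return b.String()
}

// parseHTTPResponse reads one response from r and returns its status code and
// body. The signature is an assumption; the review comment leaves it open.
func parseHTTPResponse(r io.Reader) (int, string, error) {
	resp, err := http.ReadResponse(bufio.NewReader(r), nil)
	if err != nil {
		return 0, "", fmt.Errorf("reading response: %w", err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", fmt.Errorf("reading response body: %w", err)
	}
	return resp.StatusCode, string(body), nil
}

// exampleParse parses a canned response, as a unit test might.
func exampleParse() (int, string, error) {
	return parseHTTPResponse(strings.NewReader("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"))
}
```

The formatted string can still be written with `sendmsg` and SCM_RIGHTS ancillary data exactly as before; only the string construction and response parsing move into testable functions.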

} else {
	_ = conn.SetDeadline(time.Now().Add(30 * time.Second))
}
req := fmt.Sprintf("PUT /api/v1/vm.add-net HTTP/1.1\r\nHost: localhost\r\nAccept: */*\r\nContent-Type: application/json\r\nContent-Length: %d\r\n\r\n%s", len(body), body)

Collaborator

same here... is there a good reason why we have to do this completely raw?

if _, err := br.ReadString('\n'); err != nil { // the "OK <n>" line
	return "debug-console CONNECT reply: " + err.Error()
}
// The kata debug console is an INTERACTIVE shell on a PTY (console.rs spawns

Collaborator

or you could look for the sentinel appearing exactly twice?

Collaborator Author

Sure. Didn't seem super important. This approach works fine.

Collaborator

It relies on bash string pasting semantics with the single quote, not the most obvious thing in the world.

// DialAgent connects to the kata-agent through the hybrid-vsock socket at
// vsockPath (VsockSocketPath(id)): plain-text "CONNECT <port>" handshake with
// the VMM, then ttrpc over the stream.
func DialAgent(ctx context.Context, vsockPath string) (*AgentClient, error) {

Collaborator

Some of this stuff is in agent/protocols/client/client.go in the Kata repo? Any reason we replicate it vs import from Kata?

Collaborator Author

We could third_party it as well, the dependency tree is huge and we just want a small subset for the client. This file is basically all crud on top of the protos + TTRPC.

Collaborator

Would be good to document that very obviously...
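
For readers unfamiliar with the hybrid-vsock handshake that DialAgent's doc comment mentions, the plain-text step can be sketched as below. This is an illustration of the handshake only, not the PR's DialAgent: `connectHybridVsock`, `fakeVMM`, and the port value used in the example are assumptions, and the ttrpc layer that would run over the stream afterwards is omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// connectHybridVsock performs the "CONNECT <port>" handshake on a stream that
// is already connected to the VMM's vsock unix socket. It returns the buffered
// reader used for the reply; later reads must go through it, because it may
// already hold bytes that arrived after the "OK <n>" line.
func connectHybridVsock(conn io.ReadWriter, port uint32) (*bufio.Reader, error) {
	if _, err := fmt.Fprintf(conn, "CONNECT %d\n", port); err != nil {
		return nil, fmt.Errorf("sending CONNECT: %w", err)
	}
	br := bufio.NewReader(conn)
	line, err := br.ReadString('\n')
	if err != nil {
		return nil, fmt.Errorf("reading CONNECT reply: %w", err)
	}
	if !strings.HasPrefix(line, "OK ") {
		return nil, fmt.Errorf("unexpected CONNECT reply %q", strings.TrimSpace(line))
	}
	return br, nil
}

// fakeVMM is a test double: it replays a canned reply and records what was sent.
type fakeVMM struct {
	reply   *strings.Reader
	written strings.Builder
}

func (f *fakeVMM) Read(p []byte) (int, error)  { return f.reply.Read(p) }
func (f *fakeVMM) Write(p []byte) (int, error) { return f.written.Write(p) }

// exampleHandshake runs the handshake against a canned reply and returns what
// the client wrote. The port number is illustrative.
func exampleHandshake(reply string) (string, error) {
	f := &fakeVMM{reply: strings.NewReader(reply)}
	_, err := connectHybridVsock(f, 1024)
	return f.written.String(), err
}
```

Taking an `io.ReadWriter` rather than dialing inside the function keeps the handshake testable without a running VMM.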

// sandbox id (= actor id) then collides on the next attempt: "listen unix
// .../virtiofsd.sock: bind: address already in use", "Could not bind mount
// .../shared/sandboxes/<id>/mounts", "directory not empty". Calling this
// before each run gives kata a clean slate. Safe when nothing exists.

Collaborator

[minor] delete "Safe when nothing exists"; the text reads somewhat weird

}
argv0 := strings.SplitN(string(cmdline), "\x00", 2)[0]
if strings.Contains(argv0, "cloud-hypervisor") || strings.Contains(argv0, "virtiofsd") || strings.Contains(argv0, "containerd-shim-kata") {
	_ = unix.Kill(pid, unix.SIGKILL)

Collaborator

we should log on error instead of eating it

// Deepest paths first so child mounts unmount before their parents.
sort.Slice(mounts, func(i, j int) bool { return len(mounts[i]) > len(mounts[j]) })
for _, mp := range mounts {
	_ = unix.Unmount(mp, unix.MNT_DETACH)

Collaborator

we should log on error, if we eat it and there is something wrong, then we would not know what happened?

}
}
for _, d := range dirs {
	_ = os.RemoveAll(d)

Collaborator

same here
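
The pattern requested in these three comments (log each failure, keep going, and still report the failures) might look roughly like this for the unmount loop. It is a sketch under assumptions: `unmountAll` is a hypothetical helper, and the unmount call is injected so the example stays portable; real code would pass a wrapper around `unix.Unmount(mp, unix.MNT_DETACH)`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"sort"
)

// unmountAll unmounts every path, deepest (longest) first so child mounts go
// before their parents. Failures are logged and collected rather than
// swallowed, and the loop continues so cleanup stays best effort.
func unmountAll(ctx context.Context, mounts []string, unmount func(string) error) error {
	sorted := append([]string(nil), mounts...) // do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return len(sorted[i]) > len(sorted[j]) })

	var errs []error
	for _, mp := range sorted {
		if err := unmount(mp); err != nil {
			slog.WarnContext(ctx, "Failed to unmount", slog.String("path", mp), slog.Any("err", err))
			errs = append(errs, fmt.Errorf("unmount %s: %w", mp, err))
		}
	}
	return errors.Join(errs...) // nil when every unmount succeeded
}

// exampleUnmountAll records the order of attempts and fails one of them.
func exampleUnmountAll() ([]string, error) {
	var order []string
	err := unmountAll(context.Background(), []string{"/run/a", "/run/a/b/c", "/run/a/b"}, func(mp string) error {
		order = append(order, mp)
		if mp == "/run/a/b" {
			return errors.New("device or resource busy")
		}
		return nil
	})
	return order, err
}
```

Returning the joined error leaves the policy decision (warn and continue, or fail hard after all cleanup has been attempted) to the caller.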

// configuration.toml. memDefault/vcpuDefault are substituted when the key is
// absent or non-positive (kata also accepts default_vcpus = -1 meaning "all host
// CPUs", which the owned boot does not support).
func ParseConfig(base []byte, memDefault, vcpuDefault int) (KataConfig, error) {

Collaborator

*KataConfig? avoid copying by default

Collaborator Author

2 ints and a string? probably not worth it?
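
To make the trade-off concrete, here is a small sketch of the default substitution described in the ParseConfig doc comment, returning the struct by value. The field names and the `withDefaults` helper are hypothetical (the thread only says the struct holds two ints and a string), and the TOML parsing itself is left out.

```go
package main

// KataConfig is an assumed shape: two ints and a string, as described in the
// thread. The real field names may differ.
type KataConfig struct {
	MemoryMiB    int
	VCPUs        int
	KernelParams string
}

// withDefaults substitutes the defaults when a value is absent (zero) or
// non-positive, for example default_vcpus = -1. It takes and returns the
// struct by value; at this size the copy is a few machine words.
func withDefaults(cfg KataConfig, memDefault, vcpuDefault int) KataConfig {
	if cfg.MemoryMiB <= 0 {
		cfg.MemoryMiB = memDefault
	}
	if cfg.VCPUs <= 0 {
		cfg.VCPUs = vcpuDefault
	}
	return cfg
}
```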

Comment thread cmd/ateom-microvm/main.go Outdated

var (
podUID = flag.String("pod-uid", "", "The UID of the current pod")

Collaborator

[minor] any reason to space this out like this?

Collaborator Author

no, some weird artifact, I AI split the commits (aside from generating much of the code). cleaning up along with most of these comments.

Comment thread cmd/ateom-microvm/main.go Outdated

// Share one synchronized writer between the runtime logger and the actor-log
// forwarder (created below) so the two log streams to the pod's stdout don't
// interleave-corrupt each other's lines (mirrors ateom-gvisor).

Collaborator

[minor] probably don't want to leave a bunch of "mirrors ateom-gvisor" in the code.

Collaborator Author

We can pull this out to high level docs. I want the behaviors to align so switching sandboxClass is ~easy.

}

func lastLines(s string, n int) string {
	lines := []string{}

Collaborator

var lines []string

Collaborator Author

Any particular reason to prefer that format?

Collaborator

mostly style -- it also doesn't alloc memory by default
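
For context, the two forms behave the same under `len`, `range`, and `append`; the observable differences are that `var lines []string` is nil and encodes differently in JSON. Any allocation difference is small and depends on the compiler, so the sketch below only demonstrates the nil and encoding behavior. `sliceForms` is an illustrative helper name.

```go
package main

import "encoding/json"

// sliceForms reports how a nil slice and an empty slice literal differ.
func sliceForms() (nilIsNil, emptyIsNil bool, nilJSON, emptyJSON string) {
	var a []string  // nil slice: no backing array
	b := []string{} // empty, non-nil slice

	ja, _ := json.Marshal(a) // "null"
	jb, _ := json.Marshal(b) // "[]"
	return a == nil, b == nil, string(ja), string(jb)
}
```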

t.Logf("reset-to-golden OK: discarded the rootfs write (disk sentinel gone) while RAM continuity held: %q", strings.TrimSpace(got))
}

func lastLines(s string, n int) string {

Collaborator

any reason why not

lines := strings.Split(s, "\n")
if len(lines) < n { return strings.Join(lines, "\n") }
etc
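
Filled out, the suggestion above could look like the following sketch. The handling of `n <= 0` is an assumption, since the comment stops at "etc".

```go
package main

import "strings"

// lastLines returns the final n lines of s, or all of s when it has n lines or
// fewer. A trailing newline counts as an empty last line, so callers that do
// not want that may trim s first.
func lastLines(s string, n int) string {
	if n <= 0 {
		return ""
	}
	lines := strings.Split(s, "\n")
	if len(lines) <= n {
		return s
	}
	return strings.Join(lines[len(lines)-n:], "\n")
}
```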

Comment thread cmd/ateom-microvm/net.go
Comment thread cmd/ateom-microvm/net.go Outdated
if err != nil {
	return err
}
hostMAC, err := net.ParseMAC(hostVethMAC)

Collaborator

can we just do this in init()
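
A sketch of the parse-once approach this comment points toward: fixed addresses are parsed into package-level variables during package initialization, so an invalid constant fails at startup instead of on each activation. The helper name and the address values here are placeholders, not the ones in net.go.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// mustParseMAC panics on an invalid address. That is acceptable here because
// the inputs are compile-time constants, so a failure is a programming error.
func mustParseMAC(s string) net.HardwareAddr {
	mac, err := net.ParseMAC(s)
	if err != nil {
		panic(fmt.Sprintf("invalid built-in MAC %q: %v", s, err))
	}
	return mac
}

// Package-level variables are initialized once, before main or any init().
// The values below are hypothetical placeholders.
var (
	hostMAC  = mustParseMAC("02:00:00:00:00:01")
	hostCIDR = netip.MustParsePrefix("169.254.0.1/30")
)
```

Per-activation code can then use `hostMAC` and `hostCIDR` directly, with no error path to handle.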

Comment thread cmd/ateom-microvm/run.go Outdated
return nil, fmt.Errorf("while writing %s: %w", baseIDFile, err)
}

// NB: the snapshot is MEMORY-ONLY (config/state/memory-ranges + base-id). The

Collaborator

do we need this comment text repeated in multiple places?

Collaborator Author

just AI nonsense. will clean up.

Comment thread cmd/ateom-microvm/run.go Outdated
}
dSnapshot := time.Since(tSnapshot)

// Diff-snapshot completion for an OnDemand-restored actor: CH's snapshot here is

Collaborator

same here -- are you reminding yourself of this?

Comment thread cmd/ateom-microvm/run.go Outdated

// Tear down the per-activation actor network (mirrors gVisor).
if err := s.cleanupActorNetwork(ctx); err != nil {
	slog.WarnContext(ctx, "Failed to clean up actor network after checkpoint", slog.Any("err", err))

Collaborator

I'm wondering if we keep going or just panic if the environment seems to have failed to be cleaned up

Collaborator Author

I think we want to best effort tear down everything. Maybe after attempting to do all cleanup it panics.

Comment thread cmd/ateom-microvm/run.go Outdated
// guest ext4 cache:
// - same-node: a verbatim golden template (copyDiskFile) — guaranteed identical.
// - cross-node: rebuild from the OCI image atelet unpacked to the bundle at
// restore (mkfs.ext4 -d is LAYOUT-deterministic for identical inputs, so the

Collaborator

is this assumption always true for mkfs.ext4? probably ok in practice, although it would be kind of hard to diagnose if there were subtle shifts

Collaborator Author

Yeah, this is an explicit TODO. I have a stab at moving to Go, but really we will just switch to virtiofsd + tmpfs overlay upper in the short term (perhaps before this merges even ...)

Comment thread cmd/ateom-microvm/run.go Outdated
_ = chCmd.Process.Kill()
}
}()
// OnDemand (userfaultfd) memory restore: ~75ms vs ~1.8s eager, and it keeps the

Collaborator

is this comment needed here?

Collaborator Author

We need these details somewhere and we're setting the mode here; if there's temptation to switch to "Copy" mode it will be problematic for repeated roundtrips.

Bowei Du (bowei) commented Jun 24, 2026

Collaborator

I took a pass through everything...

Given the current MVP state where we still have to hash out a bunch of things, I didn't find anything show-stopping in the review. I'm sure we need to morph quite a few things with the merging of the gvisor and CHV shared code. One thing that would be good to fix is to break up some of the really long funcs and files into more self-contained (understandable) pieces that are unit tested.

- cleanup_linux: log on error (unmount/RemoveAll/kill) via ctx slog instead of
  swallowing; drop the stale kata-shim leftovers (no shim/containerd) while
  keeping the virtiofsd path: remove the /run/vc/sbs dir and the
  containerd-shim-kata kill-arm, fix the doc comment.
- run.go: include checkpointDir in the clear/create error messages; rename the
  resolvedRuntime.ch field to chBinary; make firstNonEmpty variadic + handle
  all-empty; drop stale "shim-owned"/"eager/shim paths" wording.
- main.go: compact the flag var block; trim "mirrors ateom-gvisor" comments.
- service_integration_test: simplify lastLines with strings.Split.
- ateom-base Dockerfile: full apt cleanup (apt-get clean + rm tmp).
- stage-to-rustfs.sh: fail fast when the aws CLI is missing.
- Parse the fixed veth CIDRs/MACs/gateway once into package vars (mustParse* at
  init) instead of re-parsing per activation (bowei).
- cleanupActorNetwork: gather the removeActorNftablesRules error and keep going
  (errors.Join + warn) instead of returning early.
- enableIPv4Forwarding: open the sysctl O_WRONLY rather than os.WriteFile (do not
  create it if missing).
- actorVethMTU: log a warning when the veth link can't be read before defaulting.

Drop the no-longer-used kata-shim from the example asset list; keep virtiofsd.
@BenTheElder

Collaborator Author

I'm working on addressing the comments tonight.

> I'm sure we need to morph quite a few things with the merging of the gvisor and CHV shared code

Yeah. We could do more code dedupe here.

I'm slightly concerned about lots of low level gvisor changes (e.g. networking stuff) landing into microvm in the short term, since we will have no CI coverage at least for now. But I suppose either way it has to be dealt with ASAP.

"Production Grade uVM" is a Beta target, FWIW.

I already have a bunch of local changes waiting to rebase and stack on this.

Leaning towards folding in some of the smaller and more critical changes given the timing.

@dims

Collaborator

Benjamin Elder (@BenTheElder) yes let's land this and iterate!

- Rename params base/delta/out -> baseFile/deltaFile/outFile.
- Rename overlayDataRegions -> copySparseRegions (it overwrites dst, not an
  overlay).
- Rewrite MergeDeltaIntoBase doc to describe it directly; condense the no-fsync
  and rename-to-free-name comments.
- workerpool_apply: rename applyMicroVMPodShape -> maybeApplyMicroVMPodShape and
  pass wp.Spec.SandboxClass instead of the whole WorkerPool.
- sandboxconfig validation test: add an arm64 micro-VM asset-set case.
- specconv: TODO to forward Seccomp/Sysctl + Apparmor/SELinux for OCI parity.
- run.go: clarify the dialAgentRetry per-attempt cap vs retry-gap comment.
- roadmap: drop the microVM line (shipped).
- readSparseZstd: validate totalSize >= 0 and that each extent falls within size
  (the header is read from the downloaded snapshot) — guards Truncate/Seek/CopyN.
- copyZstdSparse: Truncate(0) up front so skipped (hole) regions can't expose
  stale bytes; it is a sparse write-out, not an in-place overlay.
- Rename sendToGCSWithZstd -> sendZstd, sendToGCSStreaming -> sendStreamingZstd
  (the package is already ategcs; it also handles S3/rustfs).

The memory-only snapshot holds because the rootfs is a host-backed virtio-blk
disk rather than a guest tmpfs overlay-upper (block writes still transit the
guest page cache transiently). Reword the 'off guest RAM'/'NOT guest RAM'
phrasing accordingly and trim the repeated memory-only NB.

run.go was ~1060 lines. Split the file (bowei) into cohesive units in the same
package, no logic changes:
- checkpoint.go: CheckpointWorkload + listFiles + teardownActor
- restore.go:    RestoreWorkload + rewriteSnapshotSocketPaths + repointActorRootfsDisk
- spec.go:       ensureKataCompatibleSpec + defaultKataMounts + defaultKataResources
run.go keeps RunWorkload + the shared boot/agent/net helpers (~500 lines).

RunWorkload was ~400 lines. Extract the dense, self-contained blocks into helpers
(no logic change); RunWorkload stays the orchestrator (the retErr-tied cleanup
defers must live there):
- guestConfig:         parse kata config -> mem/vcpus/kernel-params
- buildVMConfig:       assemble the CH VmConfig (cmdline + disks + vsock)
- startActorContainer: post-boot agent setup (sandbox, guest net, start container)
RunWorkload is now ~163 lines.

Restructure the ategcs object paths so the compress/decompress logic is in small,
unit-testable funcs that only touch io.Reader/io.Writer (bowei):
- writeContent: sparse-extent (file) vs plain zstd; returns a writeContentResult
  ({logicalBytes, populatedBytes, sparse}) instead of multi-returns + side vars.
- decodeContent: the symmetric download half (auto-detect sparse vs plain).
- sendZstd is now a thin dispatcher; the temp-file path is sendBufferedZstd,
  symmetric with sendStreamingZstd (both call writeContent).
- Rename the confusing logical/dataBytes -> logicalBytes/populatedBytes (log keys
  too). Add a direct writeContent<->decodeContent round-trip test.

Make the sparse-extent format streamable (bowei): instead of writing numExtents +
the extent table up front, emit (off,len,data) frames terminated by an end-offset
sentinel, with the metadata compressed alongside the data in the single zstd
stream. The writer discovers + emits extents incrementally (no up-front scan to
count them) and drops the in-memory extent table; the reader replays frames until
the sentinel. Bump sparseVersion 1->2 (readers reject older snapshots); keep the
per-extent bounds validation. Round-trip tests cover it.
Both implementers (gcsClient, test streamingMemStore) returned true and s3Client
doesn't implement it at all, so the SupportsStreamingPut() bool was redundant with
the type assertion (dims). Make streamingPutter a marker: presence of the
(unexported) method is the signal; the call site checks only the assertion.
Add sparsezstd_test.go (bowei): table-driven writeSparseZstd<->readSparseZstd
round-trips across hole/data layouts (empty, all-hole, all-data, leading/trailing
holes, single + multi extent), plus malformed-input coverage of the reader's
validation (bad version, negative size, extent past end, negative offset) and a
truncated stream (missing end sentinel).

The existing TestCopyZstdSparse used a fresh dst, so it never exercised the
defensive Truncate(0). Add TestCopyZstdSparseClearsStaleData: a dst pre-filled
with stale non-zero bytes (larger than the new content) must come back byte-exact
with holes zeroed and shrunk to the logical size.
@BenTheElder

Collaborator Author

I'm pushing commits on top to make them more reviewable ... in theory. But I think it's probably going to be worth folding back into a logical stream again before merge ...

This is a lot for a single commit, so I don't think we should squash merge it, but I also don't necessarily think we want a ton of fixup commits at the end either.

The 'NB: the snapshot is memory-only...' note duplicated the CheckpointWorkload
doc comment (bowei). Removed it. (main.go's var-block spacing artifact was already
compacted in an earlier fixup.)

@BenTheElder

Collaborator Author

main...BenTheElder:substrate:microvm-blk-rootfs-review-address-snapshot has the commits as reviewed + the review addressing commits ... I'll leave that as is for reference and then clean up the history here.

@bowei

Collaborator

mega pr for micro vm

@bowei Bowei Du (bowei) merged commit e44249f into agent-substrate:main Jun 25, 2026
9 checks passed
@BenTheElder

Collaborator Author

... I meant to squash back in all the "Address review" commits before merge, got too tied up with oncall ... oh well.

cleaning up some follow-up in smaller PRs ...

Development

Successfully merging this pull request may close these issues.

[Feature] MicroVM support

5 participants