feat: use node-local overlayfs rootfs cache to eliminate per-restore untar (#228) by chenggui53 · Pull Request #283 · agent-substrate/substrate

chenggui53 (chenggui53) · 2026-06-22T07:33:23Z

Summary

Implements #228: cache one extracted, read-only rootfs per immutable image digest on each node, and materialize each actor's bundle as a thin overlayfs mount instead of re-untarring the whole image on every restore.

Problem

Every Restore call in atelet fully reconstructs the rootfs by pulling and untarring the OCI image — even when the same image digest has already been extracted on the same node many times before. Observed cost:

prepareOCIDirectory (untar rootfs): ~15–20s
runsc restore (checkpoint restore): ~268ms

Rootfs extraction accounts for roughly 98–99% of resume latency (15–20s out of a total of about 15.3–20.3s).

Solution

Change the scaling behavior from "every restore pays extraction cost" to "first restore of a digest on a node pays extraction; later restores pay only an overlay mount."

On first use of an image digest on a node:

Pull + extract the flattened rootfs once into a node-local, read-only cache directory keyed by the image digest: /var/lib/ateom-gvisor/rootfs-cache/<sha256>/lower/
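As a rough illustration of the digest-keyed layout above, the sketch below derives the lower directory from an image reference. The helper names `digestHexFromRef` and `lowerDirFor` are hypothetical and are not the PR's `extractDigestFromRef` or `RootfsCacheLowerDir()`; the strict hex check is an assumption about how the digest would be validated before it is used as a path component.

```go
package main

import (
	"path/filepath"
	"regexp"
	"strings"
)

// rootfsCacheDir mirrors the cache root named in the description.
const rootfsCacheDir = "/var/lib/ateom-gvisor/rootfs-cache"

// sha256Hex matches exactly 64 lowercase hex characters.
var sha256Hex = regexp.MustCompile(`^[a-f0-9]{64}$`)

// digestHexFromRef returns the hex part of a "repo@sha256:<hex>" reference,
// or "" when the reference carries no well-formed sha256 digest (for
// example a tag-only reference such as "repo:latest").
func digestHexFromRef(ref string) string {
	i := strings.LastIndex(ref, "@sha256:")
	if i < 0 {
		return ""
	}
	hex := ref[i+len("@sha256:"):]
	if !sha256Hex.MatchString(hex) {
		return "" // rejects anything that could escape the cache root
	}
	return hex
}

// lowerDirFor maps an already validated digest to its read-only lower directory.
func lowerDirFor(hex string) string {
	return filepath.Join(rootfsCacheDir, hex, "lower")
}
```

Validating the digest before joining it into a path keeps a malformed reference from pointing outside the cache root; an empty result is the signal to take the untar path.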

On every restore using the same digest:

Instead of RemoveAll + untar, set up an overlayfs mount for the actor bundle's rootfs:
- lowerdir = the cached, read-only extracted rootfs (shared, never mutated)
- upperdir + workdir = per-actor, actor-private writable layers
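A minimal sketch of how the overlay mount data for this layout could be assembled. `overlayMountData` is a hypothetical helper, not the PR's `setupOverlayfs`; it only builds the option string and conservatively rejects paths containing `,` or `:`, which overlayfs treats as separators.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayMountData builds the data string for an overlay mount with one
// shared read-only lower layer and a per-actor upper and work directory.
func overlayMountData(lowerDir, upperDir, workDir string) (string, error) {
	for _, p := range []string{lowerDir, upperDir, workDir} {
		// ',' separates mount options and ':' separates lower layers,
		// so this sketch refuses paths that contain either.
		if strings.ContainsAny(p, ",:") {
			return "", fmt.Errorf("path %q contains an overlayfs separator", p)
		}
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lowerDir, upperDir, workDir), nil
}
```

The resulting string would be passed as the last argument of `unix.Mount("overlay", rootPath, "overlay", 0, data)` from `golang.org/x/sys/unix`. The call needs mount privileges, and `upperdir` and `workdir` must live on the same filesystem.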

Changes

| File | Operation | Description |
| --- | --- | --- |
| `internal/ateompath/ateompath.go` | Modified | Added `RootfsCacheDir` and `RootfsCacheLowerDir()` |
| `cmd/atelet/internal/rootfscache/rootfscache.go` | New | Core cache module: `EnsureRootfs`, `Untar`, `ValidateTarName`, LRU eviction, concurrent dedup |
| `cmd/atelet/internal/rootfscache/rootfscache_test.go` | New | Unit tests: cache miss/hit, concurrent safety, partial cleanup, eviction, digest validation |
| `cmd/atelet/overlay.go` | New | overlayfs mount/unmount helpers + `isOverlayfsAvailable()` |
| `cmd/atelet/oci.go` | Modified | `prepareOCIDirectory` integrates overlayfs path with untar fallback; `extractDigestFromRef`; `unmountActorRootfs` |
| `cmd/atelet/main.go` | Modified | Creates rootfs cache, wires into `AteomHerder`, `resetActorDirs` adds unmount before cleanup |

Key Design Decisions

Fallback safety: tag-only refs (no digest) automatically fall back to the existing untar path
Concurrent safety: per-digest inflightEntry dedup — N goroutines requesting the same digest only trigger 1 untar
Crash safety: .ready sentinel file; loadIndex auto-cleans partial entries from previous crashes
Eviction: LRU by .last_access timestamp, async trigger, 20 GiB default cap
Unmount cleanup: resetActorDirs does MNT_DETACH unmount on overlayfs rootfs before RemoveAll
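The per-digest dedup described above could look roughly like the sketch below. The `deduper` and `call` types are hypothetical stand-ins for the PR's `inflightEntry`, whose actual shape is not shown here; `golang.org/x/sync/singleflight` offers a similar primitive.

```go
package main

import "sync"

// call tracks one in-flight extraction for a digest.
type call struct {
	done chan struct{} // closed when the extraction finishes
	err  error         // result shared with every waiter
}

// deduper ensures at most one extraction per digest runs at a time.
type deduper struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newDeduper() *deduper {
	return &deduper{inflight: make(map[string]*call)}
}

// do runs extract for digest unless another goroutine is already running
// it, in which case it waits for that run and returns its error.
func (d *deduper) do(digest string, extract func() error) error {
	d.mu.Lock()
	if c, ok := d.inflight[digest]; ok {
		d.mu.Unlock()
		<-c.done
		return c.err
	}
	c := &call{done: make(chan struct{})}
	d.inflight[digest] = c
	d.mu.Unlock()

	c.err = extract() // runs outside the lock so other digests proceed

	d.mu.Lock()
	delete(d.inflight, digest)
	d.mu.Unlock()
	close(c.done)
	return c.err
}
```

In a real cache, callers arriving after the extraction has finished would find the `.ready` sentinel and skip `do` entirely; this sketch covers only the overlapping-request case.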

Expected Impact

| Scenario | Before | After |
| --- | --- | --- |
| First restore (same node + digest) | ~15-20s | ~15-20s (populates cache) |
| Subsequent restore (cache hit) | ~15-20s | <1s (overlayfs mount) |
| Checkpoint restore | ~268ms | ~268ms (unchanged) |
| Total resume latency (cache hit) | ~15-20s | <1.3s |

Testing

✅ All unit tests pass (go test ./cmd/atelet/...)
✅ Rootfscache tests: cache miss, cache hit, concurrent misses, partial entry cleanup, LRU eviction, digest validation
✅ Existing oci_test.go tests continue to pass (untar, path traversal, symlink escape, hardlink escape)
✅ Built and deployed to kind cluster — atelet running, rootfs-cache directory initialized, overlayfs kernel module available

Open Questions (for follow-up)

Eviction policy tuning (size cap, reference counting)
Tag-based images without digest — should we resolve digest via HEAD request?
Observability: cache hit/miss rate metrics (counters already added, need Prometheus dashboard)

google-cla · 2026-06-22T07:33:40Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Implements issue agent-substrate#228: cache extracted rootfs per image digest on each node, and materialize per-actor bundles as overlayfs mounts instead of re-untarring on every restore.

Changes:
- New rootfscache package (cmd/atelet/internal/rootfscache)
- New overlay.go with mount/unmount helpers
- Modified prepareOCIDirectory for overlayfs integration
- Updated resetActorDirs to unmount before cleanup
- Added rootfs cache paths to ateompath
- Unit tests for cache hit/miss, concurrent access, eviction

Benjamin Elder (BenTheElder) · 2026-06-22T19:02:12Z

I'd like to hold this one momentarily while we land #123 (one of our POC2 / alpha milestone requirements), as we need to make it work there too and there's already a lot in flight.

I agree that we should do something roughly like this though, and appreciate the PR.

rogeroger (rogeroger-yu) · 2026-06-23T02:09:28Z

+)
+
+// DefaultMaxCacheBytes is the default disk budget for the rootfs cache (20 GiB).
+const DefaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024


The cache size needs to be configurable.
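One way this could be addressed is a command-line flag that defaults to the current constant. This is only a sketch: the flag name `rootfs-cache-max-bytes` and the helper `parseCacheFlags` are hypothetical and do not exist in the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultMaxCacheBytes matches the 20 GiB default from the diff above.
const defaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024

// parseCacheFlags reads the cache budget from args, falling back to the
// default when the (hypothetical) flag is not given.
func parseCacheFlags(args []string) (int64, error) {
	fs := flag.NewFlagSet("atelet", flag.ContinueOnError)
	maxBytes := fs.Int64("rootfs-cache-max-bytes", defaultMaxCacheBytes,
		"disk budget for the rootfs cache, in bytes")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *maxBytes <= 0 {
		return 0, fmt.Errorf("rootfs-cache-max-bytes must be positive, got %d", *maxBytes)
	}
	return *maxBytes, nil
}
```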

rogeroger (rogeroger-yu) · 2026-06-23T02:25:06Z

+
+	var oldest *entryState
+	for _, e := range c.entries {
+		if oldest == nil || e.lastAccess.Before(oldest.lastAccess) {


There is no reference count linking a cache entry to the number of overlay mounts currently using its lowerDir. lastAccess records the most recent restore, not whether the entry is still mounted and in use.
As a result, a rootfs that is still in use can be evicted, breaking the running container.
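A minimal sketch of reference-aware victim selection, assuming each entry carried a count of live overlay mounts. The `entry` type and its `refs` field are hypothetical; the PR's `entryState` has no such field.

```go
package main

import "time"

// entry is a simplified cache record for this sketch.
type entry struct {
	digest     string
	lastAccess time.Time
	refs       int // live overlay mounts using this entry's lower dir (assumed field)
}

// pickVictim returns the least recently used entry that has no live
// mounts, or nil when every entry is still in use.
func pickVictim(entries []*entry) *entry {
	var oldest *entry
	for _, e := range entries {
		if e.refs > 0 {
			continue // never evict a lower dir that is still mounted
		}
		if oldest == nil || e.lastAccess.Before(oldest.lastAccess) {
			oldest = e
		}
	}
	return oldest
}
```

The count would have to be incremented on mount, decremented on unmount, and rebuilt after an atelet restart, for example by scanning existing mounts.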

rogeroger (rogeroger-yu) · 2026-06-23T02:30:59Z

+
+// isOverlayfsAvailable checks whether the overlayfs kernel module is available
+// by attempting a mount on a temporary directory.
+func isOverlayfsAvailable() bool {


This function is unused.

rogeroger (rogeroger-yu) · 2026-06-23T02:33:51Z

+	// entirely.  On a miss, the cache extracts and caches for next time.
+	digest := extractDigestFromRef(ref)
+	if rootfsCache != nil && digest != "" {
+		tarData, err := pullCache.Fetch(ctx, ref)


Does a cache hit still call pullCache.Fetch and fetch the entire tar?
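One possible ordering is to consult the cache first and fetch only on a miss. In the sketch below, `ensureLower` and the three function types are hypothetical; they stand in for the PR's cache lookup, `pullCache.Fetch`, and extraction, whose real signatures are not shown in this thread.

```go
package main

// lookupFunc reports the cached lower dir for a digest, if one is ready.
type lookupFunc func(digest string) (lowerDir string, ok bool)

// fetchFunc pulls the image tar (the expensive step on a miss).
type fetchFunc func() ([]byte, error)

// extractFunc unpacks the tar into the cache and returns the lower dir.
type extractFunc func(digest string, tar []byte) (string, error)

// ensureLower returns the lower dir for digest, fetching and extracting
// only when the cache does not already hold it.
func ensureLower(digest string, lookup lookupFunc, fetch fetchFunc, extract extractFunc) (string, error) {
	if dir, ok := lookup(digest); ok {
		return dir, nil // hit: no fetch and no tar I/O
	}
	tar, err := fetch()
	if err != nil {
		return "", err
	}
	return extract(digest, tar)
}
```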

rogeroger (rogeroger-yu) · 2026-06-23T02:35:36Z

+		upperDir := path.Join(bundlePath, "upper")
+		workDir := path.Join(bundlePath, "work")
+		if err := setupOverlayfs(rootPath, lowerDir, upperDir, workDir); err != nil {
+			return fmt.Errorf("setting up overlayfs (lower=%s, target=%s): %w", lowerDir, rootPath, err)


If setupOverlayfs (i.e. unix.Mount("overlay", ...)) fails (kernel without overlayfs, no CAP_SYS_ADMIN/EPERM, nested gVisor/runsc, mount-count limits, etc.), the whole restore returns an error instead of falling back to untar.
The PR description advertises a fallback to the old untar path, but that fallback only triggers when there is no digest or no cache; it does nothing for a mount failure.
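A sketch of what a mount-failure fallback could look like. `prepareRootfs` and `errOverlayUnavailable` are hypothetical names; the `overlayAvailable` argument is assumed to come from a one-time probe such as `isOverlayfsAvailable()`.

```go
package main

import (
	"errors"
	"fmt"
)

// errOverlayUnavailable marks a node where the overlay probe failed.
var errOverlayUnavailable = errors.New("overlayfs unavailable on this node")

// prepareRootfs tries the overlay mount first and falls back to a full
// untar when the mount is unavailable or fails. It returns the method
// used so the caller can record it (for example as "rootfs_method").
func prepareRootfs(overlayAvailable bool, mountOverlay, untar func() error) (string, error) {
	mountErr := errOverlayUnavailable
	if overlayAvailable {
		if mountErr = mountOverlay(); mountErr == nil {
			return "overlay", nil
		}
	}
	if err := untar(); err != nil {
		return "", fmt.Errorf("overlay unusable (%v); untar fallback failed: %w", mountErr, err)
	}
	return "untar", nil
}
```

Before the fallback runs, any directories created for the failed mount attempt would need to be cleaned up so the untar starts from an empty rootfs path.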

rogeroger (rogeroger-yu) · 2026-06-23T02:41:23Z

+		span.SetAttributes(attribute.String("rootfs_method", "overlay"))
+	} else {
+		// Fallback: no digest or no cache — extract directly (original path).
+		if err := os.RemoveAll(rootPath); err != nil {


The normal flow relies on resetActorDirs deleting the whole bundleDir between uses — but if prepareOCIDirectory is ever called without a preceding reset, a leftover non-empty upper leaks the previous container instance's writes into the new container (isolation bug), or stacks a second overlay on an already-mounted rootPath.
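A defensive variant could check for an existing mount and reset the scratch directories before mounting. Both helpers below are hypothetical and are not part of the PR; `isMountPoint` takes the contents of `/proc/self/mountinfo` as a string and does not decode escaped characters (such as `\040` for spaces) in mount paths.

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// isMountPoint reports whether target appears as a mount point in the
// given mountinfo text. The mount point is the fifth field of each line.
func isMountPoint(mountinfo, target string) bool {
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 5 && fields[4] == target {
			return true
		}
	}
	return false
}

// resetScratchDirs removes and recreates the per-actor upper and work
// directories so no writes from an earlier instance survive.
func resetScratchDirs(upperDir, workDir string) error {
	for _, d := range []string{upperDir, workDir} {
		if err := os.RemoveAll(d); err != nil {
			return err
		}
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	return nil
}
```

With these in place, the overlay path could unmount first when `isMountPoint` returns true for the rootfs path, then call `resetScratchDirs` before setting up the new mount, so it no longer depends on a preceding `resetActorDirs`.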

chenggui53 (chenggui53) force-pushed the worktree-overlayfs-rootfs-cache branch from 2b2efc3 to 970a309 Compare June 22, 2026 07:40

rogeroger (rogeroger-yu) reviewed Jun 23, 2026

chenggui53 (chenggui53) wants to merge 1 commit into agent-substrate:main from chenggui53:worktree-overlayfs-rootfs-cache


3 participants
