feat: use node-local overlayfs rootfs cache to eliminate per-restore untar (#228) #283

Open
chenggui53 (chenggui53) wants to merge 1 commit into agent-substrate:main from chenggui53:worktree-overlayfs-rootfs-cache

@chenggui53

Summary

Implements #228: cache one extracted, read-only rootfs per immutable image digest on each node, and materialize each actor's bundle as a thin overlayfs mount instead of re-untarring the whole image on every restore.

Problem

Every Restore call in atelet fully reconstructs the rootfs by pulling and untarring the OCI image — even when the same image digest has already been extracted on the same node many times before. Observed cost:

  • prepareOCIDirectory (untar rootfs): ~15–20s
  • runsc restore (checkpoint restore): ~268ms

Rootfs extraction accounts for roughly 98–99% of resume latency.

Solution

Change the scaling behavior from "every restore pays extraction cost" to "first restore of a digest on a node pays extraction; later restores pay only an overlay mount."

On first use of an image digest on a node:

  • Pull + extract the flattened rootfs once into a node-local, read-only cache directory keyed by the image digest: /var/lib/ateom-gvisor/rootfs-cache/<sha256>/lower/

On every restore using the same digest:

  • Instead of RemoveAll + untar, set up an overlayfs mount for the actor bundle's rootfs:
    • lowerdir = the cached, read-only extracted rootfs (shared, never mutated)
    • upperdir + workdir = per-actor, actor-private writable layers
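
As an illustration of the mount described above, here is a minimal sketch of how the overlayfs option string could be assembled. The helper name `overlayOptions` and the bundle path are hypothetical and are not taken from the PR; the actual mount would be a call along the lines of `unix.Mount("overlay", target, "overlay", 0, opts)`, which needs root privileges and is therefore only shown in a comment.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// overlayOptions builds the data string for an overlayfs mount.
// lower is the shared, read-only cached rootfs; upper and work are per-actor.
// (Hypothetical helper, not the PR's setupOverlayfs.)
func overlayOptions(lower, upper, work string) (string, error) {
	for _, p := range []string{lower, upper, work} {
		// Commas separate mount options and colons separate stacked
		// lowerdirs, so this sketch rejects them instead of escaping.
		if strings.ContainsAny(p, ",:") {
			return "", fmt.Errorf("path %q contains an overlayfs separator", p)
		}
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work), nil
}

func main() {
	bundle := "/tmp/example-bundle" // hypothetical actor bundle path
	opts, err := overlayOptions(
		"/var/lib/ateom-gvisor/rootfs-cache/<sha256>/lower",
		filepath.Join(bundle, "upper"),
		filepath.Join(bundle, "work"),
	)
	if err != nil {
		panic(err)
	}
	// With privileges, opts would be passed as the data argument:
	//   unix.Mount("overlay", filepath.Join(bundle, "rootfs"), "overlay", 0, opts)
	fmt.Println(opts)
}
```

Note that upperdir and workdir must live on the same filesystem for the kernel to accept the mount, which is why both sit under the bundle directory here.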

Changes

| File | Operation | Description |
| --- | --- | --- |
| internal/ateompath/ateompath.go | Modified | Added RootfsCacheDir and RootfsCacheLowerDir() |
| cmd/atelet/internal/rootfscache/rootfscache.go | New | Core cache module: EnsureRootfs, Untar, ValidateTarName, LRU eviction, concurrent dedup |
| cmd/atelet/internal/rootfscache/rootfscache_test.go | New | Unit tests: cache miss/hit, concurrent safety, partial cleanup, eviction, digest validation |
| cmd/atelet/overlay.go | New | overlayfs mount/unmount helpers + isOverlayfsAvailable() |
| cmd/atelet/oci.go | Modified | prepareOCIDirectory integrates overlayfs path with untar fallback; extractDigestFromRef; unmountActorRootfs |
| cmd/atelet/main.go | Modified | Creates rootfs cache, wires into AteomHerder, resetActorDirs adds unmount before cleanup |

Key Design Decisions

  1. Fallback safety: tag-only refs (no digest) automatically fall back to the existing untar path
  2. Concurrent safety: per-digest inflightEntry dedup — N goroutines requesting the same digest only trigger 1 untar
  3. Crash safety: .ready sentinel file; loadIndex auto-cleans partial entries from previous crashes
  4. Eviction: LRU by .last_access timestamp, async trigger, 20 GiB default cap
  5. Unmount cleanup: resetActorDirs does MNT_DETACH unmount on overlayfs rootfs before RemoveAll

Expected Impact

| Scenario | Before | After |
| --- | --- | --- |
| First restore (same node + digest) | ~15-20s | ~15-20s (populates cache) |
| Subsequent restore (cache hit) | ~15-20s | <1s (overlayfs mount) |
| Checkpoint restore | ~268ms | ~268ms (unchanged) |
| Total resume latency (cache hit) | ~15-20s | <1.3s |

Testing

  • ✅ All unit tests pass (go test ./cmd/atelet/...)
  • ✅ Rootfscache tests: cache miss, cache hit, concurrent misses, partial entry cleanup, LRU eviction, digest validation
  • ✅ Existing oci_test.go tests continue to pass (untar, path traversal, symlink escape, hardlink escape)
  • ✅ Built and deployed to kind cluster — atelet running, rootfs-cache directory initialized, overlayfs kernel module available

Open Questions (for follow-up)

  • Eviction policy tuning (size cap, reference counting)
  • Tag-based images without digest — should we resolve digest via HEAD request?
  • Observability: cache hit/miss rate metrics (counters already added, need Prometheus dashboard)

@google-cla

google-cla Bot commented Jun 22, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Implements issue agent-substrate#228: cache extracted rootfs per image digest on each
node, and materialize per-actor bundles as overlayfs mounts instead of
re-untarring on every restore.

Changes:
- New rootfscache package (cmd/atelet/internal/rootfscache)
- New overlay.go with mount/unmount helpers
- Modified prepareOCIDirectory for overlayfs integration
- Updated resetActorDirs to unmount before cleanup
- Added rootfs cache paths to ateompath
- Unit tests for cache hit/miss, concurrent access, eviction
@chenggui53 chenggui53 (chenggui53) force-pushed the worktree-overlayfs-rootfs-cache branch from 2b2efc3 to 970a309 Compare June 22, 2026 07:40
@BenTheElder

Collaborator

I'd like to hold this one momentarily while we land #123 (one of our POC2 / alpha milestone requirements), as we need to make it work there too and there's already a lot in flight.

I agree that we should do something roughly like this though, and appreciate the PR.

)

// DefaultMaxCacheBytes is the default disk budget for the rootfs cache (20 GiB).
const DefaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024

The cache size needs to be configurable.
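
One possible shape for this, sketched with the standard `flag` package. The flag name `rootfs-cache-max-bytes` and the helper `parseMaxCacheBytes` are hypothetical and do not exist in the PR; only `DefaultMaxCacheBytes` is taken from the diff.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// DefaultMaxCacheBytes mirrors the constant in the diff (20 GiB).
const DefaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024

// parseMaxCacheBytes reads a hypothetical --rootfs-cache-max-bytes flag and
// falls back to the default when the flag is absent.
func parseMaxCacheBytes(args []string) (int64, error) {
	fs := flag.NewFlagSet("atelet", flag.ContinueOnError)
	limit := fs.Int64("rootfs-cache-max-bytes", DefaultMaxCacheBytes,
		"disk budget for the rootfs cache, in bytes")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *limit <= 0 {
		return 0, fmt.Errorf("rootfs-cache-max-bytes must be positive, got %d", *limit)
	}
	return *limit, nil
}

func main() {
	limit, err := parseMaxCacheBytes(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("rootfs cache budget (bytes):", limit)
}
```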


var oldest *entryState
for _, e := range c.entries {
	if oldest == nil || e.lastAccess.Before(oldest.lastAccess) {

There is no reference count linking a cache entry to the number of overlay mounts currently using its lowerDir. lastAccess records "the most recent restore", not "whether it is still mounted and in use".
As a result, eviction can delete a rootfs that is still in use, breaking the running container.

Comment thread cmd/atelet/overlay.go

// isOverlayfsAvailable checks whether the overlayfs kernel module is available
// by attempting a mount on a temporary directory.
func isOverlayfsAvailable() bool {

This function is unused

Comment thread cmd/atelet/oci.go
// entirely. On a miss, the cache extracts and caches for next time.
digest := extractDigestFromRef(ref)
if rootfsCache != nil && digest != "" {
	tarData, err := pullCache.Fetch(ctx, ref)

Do cache hits still call pullCache.Fetch for the entire tar?
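
One way to avoid that would be to pass the fetch as a callback that the cache invokes only on a miss. The sketch below is a hypothetical model, not the PR's API: the `cache` type, its `ensure` method, and the in-memory map stand in for the real on-disk cache and extraction.

```go
package main

import "fmt"

// cache is a hypothetical in-memory stand-in for the rootfs cache; the map
// value plays the role of the extracted lower dir path.
type cache struct {
	lower map[string]string
}

// ensure returns the lower dir for digest and reports whether it was a hit.
// fetch is only called on a miss, so a hit never pulls the image tar.
func (c *cache) ensure(digest string, fetch func() ([]byte, error)) (string, bool, error) {
	if dir, ok := c.lower[digest]; ok {
		return dir, true, nil
	}
	tarData, err := fetch()
	if err != nil {
		return "", false, fmt.Errorf("fetching image for %s: %w", digest, err)
	}
	_ = tarData // a real cache would extract tarData into dir here
	dir := "/cache/" + digest + "/lower"
	c.lower[digest] = dir
	return dir, false, nil
}

func main() {
	c := &cache{lower: map[string]string{}}
	fetches := 0
	fetch := func() ([]byte, error) {
		fetches++
		return []byte("tar bytes"), nil
	}
	for i := 0; i < 3; i++ {
		_, hit, err := c.ensure("abc123", fetch)
		if err != nil {
			panic(err)
		}
		fmt.Println("hit:", hit)
	}
	fmt.Println("fetches:", fetches)
}
```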

Comment thread cmd/atelet/oci.go
upperDir := path.Join(bundlePath, "upper")
workDir := path.Join(bundlePath, "work")
if err := setupOverlayfs(rootPath, lowerDir, upperDir, workDir); err != nil {
	return fmt.Errorf("setting up overlayfs (lower=%s, target=%s): %w", lowerDir, rootPath, err)

If setupOverlayfs (i.e. unix.Mount("overlay", ...)) fails (kernel without overlayfs, no CAP_SYS_ADMIN/EPERM, nested gVisor/runsc, mount-count limits, etc.), the whole restore returns an error instead of falling back to untar.
The PR description advertises a "digest-only fallback to the old untar path," but that fallback only triggers when there is no digest or no cache; it does nothing for a mount failure.

Comment thread cmd/atelet/oci.go
	span.SetAttributes(attribute.String("rootfs_method", "overlay"))
} else {
	// Fallback: no digest or no cache — extract directly (original path).
	if err := os.RemoveAll(rootPath); err != nil {

The normal flow relies on resetActorDirs deleting the whole bundleDir between uses — but if prepareOCIDirectory is ever called without a preceding reset, a leftover non-empty upper leaks the previous container instance's writes into the new container (isolation bug), or stacks a second overlay on an already-mounted rootPath.

3 participants