feat: use node-local overlayfs rootfs cache to eliminate per-restore untar (#228) #283
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Implements issue agent-substrate#228: cache extracted rootfs per image digest on each node, and materialize per-actor bundles as overlayfs mounts instead of re-untarring on every restore.
Changes:
- New rootfscache package (cmd/atelet/internal/rootfscache)
- New overlay.go with mount/unmount helpers
- Modified prepareOCIDirectory for overlayfs integration
- Updated resetActorDirs to unmount before cleanup
- Added rootfs cache paths to ateompath
- Unit tests for cache hit/miss, concurrent access, eviction
2b2efc3 to 970a309
I'd like to hold this one momentarily while we land #123 (one of our POC2 / alpha milestone requirements), as we need to make it work there too and there's already a lot in flight. I agree that we should do something roughly like this though, and appreciate the PR.
)

// DefaultMaxCacheBytes is the default disk budget for the rootfs cache (20 GiB).
const DefaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024
The cache size needs to be configurable.
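One way this could look, as a minimal sketch: a command-line flag overrides the budget and a non-positive value falls back to the default. The flag name `rootfs-cache-max-bytes` and the helper `resolveMaxCacheBytes` are assumptions made for illustration; neither exists in the PR.

```go
package main

import (
	"flag"
	"fmt"
)

// DefaultMaxCacheBytes mirrors the 20 GiB default from the diff above.
const DefaultMaxCacheBytes int64 = 20 * 1024 * 1024 * 1024

// resolveMaxCacheBytes is a hypothetical helper: a non-positive
// configured value means "not set" and falls back to the default.
func resolveMaxCacheBytes(configured int64) int64 {
	if configured <= 0 {
		return DefaultMaxCacheBytes
	}
	return configured
}

func main() {
	// The flag name is an assumption, not an existing atelet flag.
	maxBytes := flag.Int64("rootfs-cache-max-bytes", 0,
		"disk budget for the rootfs cache in bytes (0 = default)")
	flag.Parse()
	fmt.Println("rootfs cache budget:", resolveMaxCacheBytes(*maxBytes))
}
```

The resolved value would then be passed to whatever constructs the cache, in place of reading the constant directly.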
var oldest *entryState
for _, e := range c.entries {
	if oldest == nil || e.lastAccess.Before(oldest.lastAccess) {
There is no reference count linking a cache entry to the number of overlay mounts currently using it as their lowerDir. lastAccess records "the most recent restore", not "whether it is still mounted and in use".
As a result, a rootfs that is still in use can be evicted and deleted, breaking the running container.
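A minimal sketch of what refcount-aware eviction might look like. The `entry` type is a simplified stand-in for the PR's `entryState`, and the `refs` field, `pickVictim`, and `sampleEntries` are hypothetical names introduced here, not code from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a simplified stand-in for entryState; refs is the
// hypothetical addition suggested in the comment above.
type entry struct {
	digest     string
	lastAccess time.Time
	refs       int // live overlay mounts using this entry as lowerdir
}

// pickVictim returns the least recently used entry that has no live
// mounts, or nil when every entry is still referenced.
func pickVictim(entries []*entry) *entry {
	var oldest *entry
	for _, e := range entries {
		if e.refs > 0 {
			continue // still backing a running container
		}
		if oldest == nil || e.lastAccess.Before(oldest.lastAccess) {
			oldest = e
		}
	}
	return oldest
}

// sampleEntries builds demo data: the oldest entry is still mounted.
func sampleEntries() []*entry {
	now := time.Now()
	return []*entry{
		{digest: "sha256:aaa", lastAccess: now.Add(-3 * time.Hour), refs: 1},
		{digest: "sha256:bbb", lastAccess: now.Add(-2 * time.Hour), refs: 0},
		{digest: "sha256:ccc", lastAccess: now.Add(-1 * time.Hour), refs: 0},
	}
}

func main() {
	if v := pickVictim(sampleEntries()); v != nil {
		fmt.Println("evict:", v.digest) // skips sha256:aaa, which is older but in use
	}
}
```

A real implementation would also need to increment `refs` under the cache lock when an overlay is mounted and decrement it on unmount, so that eviction and mounting cannot race.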
// isOverlayfsAvailable checks whether the overlayfs kernel module is available
// by attempting a mount on a temporary directory.
func isOverlayfsAvailable() bool {
This function is unused
// entirely. On a miss, the cache extracts and caches for next time.
digest := extractDigestFromRef(ref)
if rootfsCache != nil && digest != "" {
	tarData, err := pullCache.Fetch(ctx, ref)
Do cache hits still call pullCache.Fetch for the entire tar?
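To illustrate the ordering being asked for, here is a rough sketch that consults the cache first and only fetches the tar on a miss. The `rootfsCache` type, its `Lookup` method, and `ensureLower` are toy stand-ins invented for this example; the PR's rootfscache package may not expose anything like them, and the `fetch` callback merely stands in for `pullCache.Fetch`.

```go
package main

import (
	"errors"
	"fmt"
)

// rootfsCache is a toy stand-in for the PR's cache: digest -> lowerdir.
type rootfsCache struct{ lower map[string]string }

// Lookup is a hypothetical method that reports a hit without any I/O.
func (c *rootfsCache) Lookup(digest string) (string, bool) {
	dir, ok := c.lower[digest]
	return dir, ok
}

// ensureLower checks the cache first and calls fetch only on a miss.
func ensureLower(c *rootfsCache, digest string, fetch func() ([]byte, error)) (string, error) {
	if dir, ok := c.Lookup(digest); ok {
		return dir, nil // hit: no tar is fetched
	}
	tarData, err := fetch()
	if err != nil {
		return "", fmt.Errorf("fetching image tar: %w", err)
	}
	if len(tarData) == 0 {
		return "", errors.New("empty image tar")
	}
	// Extraction is elided; the path below is illustrative only.
	dir := "/tmp/rootfs-cache/" + digest + "/lower"
	c.lower[digest] = dir
	return dir, nil
}

func main() {
	c := &rootfsCache{lower: map[string]string{}}
	fetches := 0
	fetch := func() ([]byte, error) { fetches++; return []byte("tar"), nil }
	for i := 0; i < 3; i++ {
		if _, err := ensureLower(c, "sha256:abc", fetch); err != nil {
			panic(err)
		}
	}
	fmt.Println("fetches:", fetches) // three restores, one fetch
}
```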
upperDir := path.Join(bundlePath, "upper")
workDir := path.Join(bundlePath, "work")
if err := setupOverlayfs(rootPath, lowerDir, upperDir, workDir); err != nil {
	return fmt.Errorf("setting up overlayfs (lower=%s, target=%s): %w", lowerDir, rootPath, err)
If setupOverlayfs (i.e. unix.Mount("overlay", ...)) fails (kernel without overlayfs, no CAP_SYS_ADMIN/EPERM, nested gVisor/runsc, mount-count limits, etc.), the whole restore returns an error instead of falling back to untar.
The PR description advertises a "digest-only fallback to the old untar path," but that fallback only triggers when there is no digest or no cache. It does nothing for a mount failure.
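A sketch of the control flow this suggests: try the overlay mount, and on failure run the original untar path rather than failing the restore. The function `prepareRootfs` and the sentinel `errMountDenied` are hypothetical; the two callbacks stand in for `setupOverlayfs` and the existing extraction code so the example runs without mount privileges.

```go
package main

import (
	"errors"
	"fmt"
)

// errMountDenied simulates an EPERM-style overlay mount failure.
var errMountDenied = errors.New("mount overlay: operation not permitted")

// prepareRootfs tries the overlay mount first and falls back to a plain
// untar when the mount fails. It reports which method was used.
func prepareRootfs(mountOverlay, untar func() error) (string, error) {
	mountErr := mountOverlay()
	if mountErr == nil {
		return "overlay", nil
	}
	if untarErr := untar(); untarErr != nil {
		// Surface both failures so neither is lost.
		return "", fmt.Errorf("untar fallback failed: %w", errors.Join(mountErr, untarErr))
	}
	return "untar", nil
}

func main() {
	failMount := func() error { return errMountDenied }
	okUntar := func() error { return nil }
	method, err := prepareRootfs(failMount, okUntar)
	fmt.Println(method, err)
}
```

The returned method name could feed the existing `rootfs_method` span attribute, so traces show when a node is silently running on the slow path.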
	span.SetAttributes(attribute.String("rootfs_method", "overlay"))
} else {
	// Fallback: no digest or no cache; extract directly (original path).
	if err := os.RemoveAll(rootPath); err != nil {
The normal flow relies on resetActorDirs deleting the whole bundleDir between uses. But if prepareOCIDirectory is ever called without a preceding reset, a leftover non-empty upper leaks the previous container instance's writes into the new container (an isolation bug), or a second overlay gets stacked on an already-mounted rootPath.
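One defensive option, sketched under assumptions: recreate the upper and work directories empty right before mounting, so stale writes cannot carry over even without a preceding reset. `freshOverlayDirs` and `demo` are hypothetical names, and the sketch does not cover the second hazard (an already-mounted rootPath), which would still need an unmount or a mount-point check first.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// freshOverlayDirs is a hypothetical helper: it discards any leftover
// upper/work directories under bundlePath and recreates them empty.
func freshOverlayDirs(bundlePath string) (string, string, error) {
	upperDir := filepath.Join(bundlePath, "upper")
	workDir := filepath.Join(bundlePath, "work")
	for _, dir := range []string{upperDir, workDir} {
		if err := os.RemoveAll(dir); err != nil {
			return "", "", fmt.Errorf("removing stale %s: %w", dir, err)
		}
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return "", "", fmt.Errorf("creating %s: %w", dir, err)
		}
	}
	return upperDir, workDir, nil
}

// demo leaves a stale directory in upper, resets, and returns how many
// entries remain in upper afterwards.
func demo() (int, error) {
	bundle, err := os.MkdirTemp("", "bundle-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(bundle)

	// Simulate a write left behind by a previous container instance.
	if err := os.MkdirAll(filepath.Join(bundle, "upper", "etc"), 0o755); err != nil {
		return 0, err
	}
	upper, _, err := freshOverlayDirs(bundle)
	if err != nil {
		return 0, err
	}
	leftovers, err := os.ReadDir(upper)
	if err != nil {
		return 0, err
	}
	return len(leftovers), nil
}

func main() {
	n, err := demo()
	fmt.Println("entries left in upper:", n, "err:", err)
}
```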
Summary
Implements #228: cache one extracted, read-only rootfs per immutable image digest on each node, and materialize each actor's bundle as a thin overlayfs mount instead of re-untarring the whole image on every restore.
Problem
Every Restore call in atelet fully reconstructs the rootfs by pulling and untarring the OCI image, even when the same image digest has already been extracted on the same node many times before. Observed cost:
- prepareOCIDirectory (untar rootfs): ~15–20s
- runsc restore (checkpoint restore): ~268ms
Rootfs extraction accounts for ~99% of resume latency.
Solution
Change the scaling behavior from "every restore pays extraction cost" to "first restore of a digest on a node pays extraction; later restores pay only an overlay mount."
On first use of an image digest on a node:
- Extract the rootfs once into /var/lib/ateom-gvisor/rootfs-cache/<sha256>/lower/
On every restore using the same digest:
- Instead of RemoveAll+untar, set up an overlayfs mount for the actor bundle's rootfs:
  - lowerdir = the cached, read-only extracted rootfs (shared, never mutated)
  - upperdir + workdir = per-actor, actor-private writable layers
Changes
- internal/ateompath/ateompath.go: RootfsCacheDir and RootfsCacheLowerDir()
- cmd/atelet/internal/rootfscache/rootfscache.go: EnsureRootfs, Untar, ValidateTarName, LRU eviction, concurrent dedup
- cmd/atelet/internal/rootfscache/rootfscache_test.go
- cmd/atelet/overlay.go: isOverlayfsAvailable()
- cmd/atelet/oci.go: prepareOCIDirectory integrates overlayfs path with untar fallback; extractDigestFromRef; unmountActorRootfs
- cmd/atelet/main.go: AteomHerder, resetActorDirs adds unmount before cleanup
Key Design Decisions
- inflightEntry dedup: N goroutines requesting the same digest trigger only 1 untar.
- ready sentinel file; loadIndex auto-cleans partial entries from previous crashes.
- last_access timestamp, async trigger, 20 GiB default cap
- resetActorDirs does MNT_DETACH unmount on overlayfs rootfs before RemoveAll
Expected Impact
Testing
go test ./cmd/atelet/...)
Open Questions (for follow-up)