DurableDir support#295
Conversation
| enum SnapshotScope { | ||
| // Not valid option; should never happen. | ||
| SNAPSHOT_SCOPE_UNSPECIFIED = 0; | ||
| // Snapshot memory and the full rootfs (including homedir content). |
There was a problem hiding this comment.
Do we actually want the full rootfs persistence, or just process memory?
This is something I've been toying with in #123 & various implemenations ...
AIUI we're implementing this at the cost of all writes having to live in memory on a tmpfs (which is bounded and easily exhausted, and can significantly bloat the snapshot if you only wanted to preserve process memory)
If we want full rootfs + process, I think we should probably come up with a clearer name for this mode, and keep "SNAPSHOT_SCOPE_PROCESS" for just the process memory later 🤔
There was a problem hiding this comment.
There was a problem hiding this comment.
About the name, do you have better name in the mind other than PROCESS?
There was a problem hiding this comment.
I tend to agree with Benjamin Elder (@BenTheElder) here. The full rootfs, even diffs, could pick up A LOT of things if the user is not careful. Egress controls might save them from downloading too many things I suppose, but definitely a potential footgun
|
|
||
| // Homedir snapshots are temporarily stored in subfolder relative to process checkpoint path. | ||
| // This is because gVisor missing capability to separete homedir content from rest of rootfs upon checkpointing. | ||
| HomedirSnapshotsSubfoldderName = "homedir" |
There was a problem hiding this comment.
Why does anything other than ateom-gvisor need to know this path? Shouldn't the result for atelet just be another snapshot artifact to upload? Can't ateom create the homedir internally without atelet?
There was a problem hiding this comment.
this supposed to be a temporary solution for gVisor. Today it can take either memory+rootfs or just homedir.
The gVisor team is working on a more generic snapshot file, that at time of memory+rootfs checkpoint, it will separate files to chunks, and homedir content will be stored separately from the rest of rootfs. It will allow to take one "snapshot" operation and later to upload homedir to bucket without rest of rootfs.
We actually need design how this feature will be supported in microVM.
I was expecting we might just have the actorTemplate specify what path is homedir and if specified we would internally provision the appropriate backing storage technology for the sandbox implementation and handle preserving it. We might not even need to mount a volume vs just tarring up that path from a shared filesystem after suspending the actor. If we're supporting multiple homedirs per container, we might want to think about a more appropriate name? |
Ax team explicitly asked for both 1&2.
This is a good point. Do you have any other name in a ming you would like to suggest? |
| // | ||
| // +kubebuilder:validation:ExactlyOneOf={homeDir} | ||
| type VolumeSource struct { | ||
| // homeDir represents a directory on rootfs that will participate in snapshots. |
There was a problem hiding this comment.
It'll be good to explain in more detail the difference between rootfs and homedir.
There was a problem hiding this comment.
Agree. We should add them to the glossary
|
|
||
| const ( | ||
| // Process memory plus the full rootfs (homedir included). | ||
| SnapshotScopeProcess SnapshotScope = "process" |
There was a problem hiding this comment.
Instead of the scope being inclusive of other scopes, I wonder if it would be better to make them mutually exclusive and require that the user specify multiple.
For example, in the future we probably also want to support external volume mount snapshots.
There was a problem hiding this comment.
I agree that it MIGHT be useful to allow homedir+rootfs without process at some point in the future. I don't know if rootfs without homedir is useful. So the combinations are not arbitrary, nor are they successively inclusive.
We can either convert it to 3 bools in a struct or enumerate Everything vs HomedirOnly and later RootfsAndHomedir and maybe RootfsOnly. Writing it out, I am now leaning back to 3 bools?
There was a problem hiding this comment.
For example, in the future we probably also want to support external volume mount snapshots.
All snapshots are supposed to be in sync; otherwise, it will be a mess, and some data will be out of sync with other data.
From a layers point of view, the dependencies are supposed to be (from min to max):
- None
- External only
- Homedir(s)
- Rootfs (including homedir) (TBD???)
- Memory
- None - no snapshot. It means the actor always starts fresh. This is like deleting an actor and creating a new one. Therefore, "None" does not need to be included.
- External Storage - this becomes the minimal one and always needs to be taken. This is the difference between Pause and Commit. On Pause, we will most likely just detach it from the actor and attach it on resume. On Commit, we will need to detach it and take a snapshot of the latest state. I did not add it to the list since we don't have support for external storage yet.
- Homedir - these are local files. Taking a snapshot of a local file and not taking a snapshot of external storage logically does not make sense (see the statement above).
- Rootfs - this is an interesting case; we don't have a clear requirement for this capability yet. However, if we add it in the future, the homedir will be included in it in any case.
- Memory - this is a full snapshot of everything. Taking a memory snapshot with no rootfs or external storage does not work, since memory is cached based on either the rootfs, external storage, or both.
I agree that it MIGHT be useful to allow homedir+rootfs without process at some point in the future.
See the comments above; we might add it in the future if it is required.
We can either convert it to 3 bools in a struct or enumerate Everything vs HomedirOnly and later RootfsAndHomedir and maybe RootfsOnly. Writing it out, I am now leaning back to 3 bools?
I wonder if it would be better to make them mutually exclusive and require that the user specify multiple.
Based on the explanation above, this does not look like a mutually exclusive pattern; it is supposed to be one of them.
There was a problem hiding this comment.
I think it should stay an enum because bools make kinda bad APIs and we probably won't accept all combinations which just makes it an enum in need of good names?
We should support:
- RootFS + Process (Also "process" isn't quite right if we also snapshot the uVM kernel ... it's more like FS + Memory)
- FS
- Homedir
As a follow-up might also want to support some kind of "temp" volume in FS and FS+Memory modes that is explicitly not persisted. Maybe "emptyDir" again from k8s? (which can also explicitly be memory or disk backed?)
There was a problem hiding this comment.
As a follow-up might also want to support some kind of "temp" volume in FS and FS+Memory modes that is explicitly not persisted. Maybe "emptyDir" again from k8s? (which can also explicitly be memory or disk backed?)
Taking memory snapshot with not "temp dir" is logically make snapshot in-consistent. We are coming to the same discussion again, what if memory has opened file description form "temp" folder.
Having full file system snapshot minus "temp" dir - what is a use case for it? "Homedir" is exactly designed for those cases - snapshot will include folders with application data - everything else is disposed.
There was a problem hiding this comment.
Taking memory snapshot with not "temp dir" is logically make snapshot in-consistent. We are coming to the same discussion again, what if memory has opened file description form "temp" folder.
Yeah I don't think we should do only memory. I am suggested we offer modes for "memory + rootfs writes", "rootfs writes", and whatever we rename "homedir"
Having full file system snapshot minus "temp" dir - what is a use case for it? "Homedir" is exactly designed for those cases - snapshot will include folders with application data - everything else is disposed.
Full rootfs means you don't have to know where writes land up front. That might be convenient. It's functionally an overlay homedir on top of the container root (vs homedir may be a whiteout mount).
There was a problem hiding this comment.
Multi-response:
None - no snapshot. It means the actor always starts fresh. This is like deleting an actor and creating a new one. Therefore, "None" does not need to be included.
I am not convince they are the same. There's always going to be a small cost to "create + resume" vs. "resume". Plus it shifts responsibility onto the consumer of the substrate when it doesn't need to. I think "none" makes sense (we don't have to implement it now, but we should not rule it out).
For example, in the future we probably also want to support external volume mount snapshots.
All snapshots are supposed to be in sync;
Devil's advocate -- do we really need snapshots of external storage? I do not have a good handle on the real requirements here.
Homedir - these are local files. Taking a snapshot of a local file and not taking a snapshot of external storage logically does not make sense (see the statement above).
Do we think that substrate-managed snapshots are the ONLY way to implement homedir? Or could we. hypothetically, use some REALLY fast network-attached device? Does that change anything here?
Rootfs - this is an interesting case; we don't have a clear requirement for this capability yet. However, if we add it in the future, the homedir will be included in it in any case.
In theory I could make an argument for rootfs without homedir. In practice maybe not interesting.
bools make kinda bad APIs
While I wrote that kube convention, there are places where bools can be good, and it's exactly this pattern (a set of N non-stacking, independent decisions. But ultimately it's about ergonomics.
As a follow-up might also want to support some kind of "temp" volume in FS and FS+Memory modes that is explicitly not persisted.
Can I say YAGN I for this? I'll be content to be wrong, but let's not get too hypothetical.
As I wrote this out I had a thought.
Somewhere (ActorTemplate -> Actor) we have a "map" of which volumes we want mounted on which dirs. If you don't want homedir semantics, don't mount a "homedir" volume. If you don't want external volumes, don't mount external volumes. We don't need MORE config for them We only need 3 options:
- Snapshot Nothing
- Snapshot Everything (memory + rootfs + anything you have mounted)
- Snapshot Volumes (anything you have mounted: homedir, external, maybe something else in the future)
There was a problem hiding this comment.
..If you don't want homedir semantics, don't mount a "homedir" volume. If you don't want external volumes, don't mount external volumes
- Snapshot Nothing
- Snapshot Everything (memory + rootfs + anything you have mounted)
- Snapshot Volumes (anything you have mounted: homedir, external, maybe something else in the future)
I actually like the framing (Snapshot Everything / Snapshot Volumes). This is very accurate summary.
"Snapshot Nothing" - I still not convinced about it if we really need it. However, with the proposed concept above, we alway can add it later if we find a right use case for it.
bools make kinda bad APIs....
The bool discussion is relevant either OnCommit/OnPause is a single value or multiple values. This kind of critical for design, since later on, API will be broken to make it array.
However, if we agree on Concept of
- Snapshot Nothing (potential future extension)
- Snapshot Everything
- Snapshot Volumes
the discussion becomes irrelevant, because Nothing+Everything or Everything+Volumes does not make sense.
| // | ||
| // +kubebuilder:validation:ExactlyOneOf={homeDir} | ||
| type VolumeSource struct { | ||
| // homeDir represents a directory on rootfs that will participate in snapshots. |
There was a problem hiding this comment.
Agree. We should add them to the glossary
|
|
||
| const ( | ||
| // Process memory plus the full rootfs (homedir included). | ||
| SnapshotScopeProcess SnapshotScope = "process" |
There was a problem hiding this comment.
I agree that it MIGHT be useful to allow homedir+rootfs without process at some point in the future. I don't know if rootfs without homedir is useful. So the combinations are not arbitrary, nor are they successively inclusive.
We can either convert it to 3 bools in a struct or enumerate Everything vs HomedirOnly and later RootfsAndHomedir and maybe RootfsOnly. Writing it out, I am now leaning back to 3 bools?
|
Let's add validation to reject homedir mode when I'm pretty sure I know how we'll do it but I'm already getting complaints about the size of the uVM POC which is not merged yet :-) |
I'm pretty sure I know how we'll do it but I'm already getting complaints about the size of the uVM POC which is not merged yet :-) Added a "+kubebuilder:validation:XValidation" validation and adjusted unit tests. |
Eitan Yarmush (EItanya)
left a comment
There was a problem hiding this comment.
Just some high level comments, this is looking great
| enum SnapshotScope { | ||
| // Not valid option; should never happen. | ||
| SNAPSHOT_SCOPE_UNSPECIFIED = 0; | ||
| // Snapshot memory and the full rootfs (including homedir content). |
There was a problem hiding this comment.
I tend to agree with Benjamin Elder (@BenTheElder) here. The full rootfs, even diffs, could pick up A LOT of things if the user is not careful. Egress controls might save them from downloading too many things I suppose, but definitely a potential footgun
| // - if OnPause is "homedir", then OnCommit must be "homedir". | ||
| // | ||
| // +optional | ||
| OnCommit SnapshotScope `json:"onCommit,omitempty"` |
There was a problem hiding this comment.
Pause vs commit are slightly confusing to me as verbs. I had to dig through the code to figure out the difference. I wonder if we should think through this for 2 reasons:
- This is a very low level concept being surfaced in the API, node local snapshotting vs remote. I wonder how relevant this difference is for most users, and whether we should actually surface it somewhere else?
- Pause is very similar to Suspend, and commit is very generic. I think we should try and come up with words that are a bit more clear to purpose.
There was a problem hiding this comment.
The pattern I imagine is PAUSE if you want to do a short suspension. The benefit of it, memory and rootfs can be preserved and resume will be fast. The "Commit" is somethings that been persistence and can be always back to it. For those cases, process snapshot might be too expensive and it collect a lot of data that might be non relevant for execution.
When we did PoC for mega-atelet, idea was in PAUSE not "kill process" at all, just keep it running and give to process 0 CPU. It allowed linux kernel to offload memory pages to SSD when it was no available. The #119 has more info what is a difference between pause and suspend.
I kind of imagine, OnPause developers will use "process" as a default behavior and OncCommit is "homedir".
There was a problem hiding this comment.
The difference between "pause" and "commit" (happy to rename) is meaningful. Commit means "push my changes to perma-storage", which has a cost (possibly charged-back?). Pause means "I'm idle now but I am not at a place where I want to push my changes yet". A committed snapshot is eligible for a fork, but a pause is not.
We chose "commit" because it felt vaguely git-ish. We chose "pause" because we kept mixing up "snapshot", "checkpoint", "suspend" in all our discussions :) "Pause" feels more temporary, which is the intended semantic.
Again, happy to rename.
There was a problem hiding this comment.
…copes to Full/Data
…pe definitions to glossary
88a7925 to
411cddb
Compare
|
Benjamin Elder (@BenTheElder) I have rebased code. Following last 3 commits are changes in names & fixes of the code after rebase: PTAL |
Eitan Yarmush (EItanya)
left a comment
There was a problem hiding this comment.
DurDir :)
Tim Hockin (thockin)
left a comment
There was a problem hiding this comment.
Overall LGTM Some of my comments are minor
| const ( | ||
| // Full captures process memory plus the entire filesystem delta on top of | ||
| // the OCI image (including any attached DurableDir volumes). | ||
| SnapshotScopeFull SnapshotScope = "full" |
There was a problem hiding this comment.
Kube convention for string constants is "Full" and "Data" (leading capital letter)
| // If not provided, the "full" behavior is used by default. | ||
| // | ||
| // +optional | ||
| OnPause SnapshotScope `json:"onPause,omitempty"` |
There was a problem hiding this comment.
+kubebuilder:validation:Enum=Full,Data
+kubebuilder:default=Data
Or should the default be Full?
AFAICT there's not an easy way to do conditional defaulting here, so we can't easily set the default for OnCommit to "whatever OnPause is", unless we use an MAP for this. How important is it?
What we generally want to avoid is having consumers of this API do things like "if obj.X == nil { /* assume "full" */ } -- that default is then baked into a million places and if you ever iterate on the API can break.
| SnapshotScopeData SnapshotScope = "data" | ||
| ) | ||
|
|
||
| // +kubebuilder:validation:XValidation:rule="(has(self.onPause) ? self.onPause : 'full') == 'full' || (has(self.onCommit) ? self.onCommit : 'full') == (has(self.onPause) ? self.onPause : 'full')",message="OnCommit must be a subset of OnPause" |
There was a problem hiding this comment.
It seems that defaults apply AFTER validations which is...unfortunate. You are embedding the default value all over
| OnPause SnapshotScope `json:"onPause,omitempty"` | ||
|
|
||
| // OnCommit specifies what to include in the snapshot when a commit is requested. | ||
| // If not provided, the "full" behavior is used by default. |
There was a problem hiding this comment.
We should be more precise with default behaviors. If a user specifies onPause: Data then this can't default to Full unless you intend for that to be an error?
|
|
||
| // ActorTemplateSpec defined desired spec of an actor. | ||
| // | ||
| // +kubebuilder:validation:XValidation:rule="!has(self.containers) || self.containers.all(c, !has(c.volumeMounts) || c.volumeMounts.filter(vm, has(self.volumes) && self.volumes.exists(v, v.name == vm.name && has(v.durableDir))).size() <= 1)",message="A container may mount at most one DurableDir-typed volume" |
There was a problem hiding this comment.
I thought you wanted to allow multiple DurDir?
| SnapshotScopeData SnapshotScope = "data" | ||
| ) | ||
|
|
||
| // +kubebuilder:validation:XValidation:rule="(has(self.onPause) ? self.onPause : 'full') == 'full' || (has(self.onCommit) ? self.onCommit : 'full') == (has(self.onPause) ? self.onPause : 'full')",message="OnCommit must be a subset of OnPause" |
There was a problem hiding this comment.
error strings should refer to fields in their jsonName style - onCommit and onPause
There was a problem hiding this comment.
I will accumulate some smaller changes as I flag them for a followup
| {Name: "vol1", VolumeSource: VolumeSource{DurableDir: &DurableDirVolumeSource{}}}, | ||
| } | ||
| at.Spec.Containers[0].VolumeMounts = []VolumeMount{ | ||
| {Name: "vol1", MountPath: "/home/.config"}, |
There was a problem hiding this comment.
We should also try something like ..config or x..y
| {Name: "vol1", VolumeSource: VolumeSource{DurableDir: &DurableDirVolumeSource{}}}, | ||
| } | ||
| at.Spec.Containers[0].VolumeMounts = []VolumeMount{ | ||
| {Name: "vol1", MountPath: "/home\t/user"}, |
There was a problem hiding this comment.
Strictly speaking tabs, spaces, etc are valid in pathnames. Is there a good reason to prevent them?
$ ls -l "a b"
ls: cannot access 'a'$'\t''b': No such file or directory
$ touch "a b"
$ ls -l "a b"
-rw-r--r-- 1 thockin primarygroup 0 Jun 26 08:56 'a'$'\t''b'
| {Name: "vol1", VolumeSource: VolumeSource{DurableDir: &DurableDirVolumeSource{}}}, | ||
| } | ||
| at.Spec.Containers[0].VolumeMounts = []VolumeMount{ | ||
| {Name: "vol1", MountPath: "/home\t/user"}, |
There was a problem hiding this comment.
How about a testcase for "/my home directory" (spaces) which should be allowed
Implement a layer allowing to take durableDir snapshot only, without taking snapshot of entire memory. This is a part of the #119 feature.
The "durableDir" support is introduced via adding Volumes support to ActorTemplate.
Knowns issues / limitations
No MicroVM support for homedir yet. Will be implemented in following PR.
Resume from SNAPSHOT_SCOPE_DATA temporary behavior
Current behavior of resume from durableStorage returns immediately after gVisor process is started, it means container's bootstrap is not completed by the time resume API returned to the caller. It causes the issue that first curl request for suspended actor throws an HTTP exception, since webserver listening on the port inside container is not running yet.
PR feat: implement container readiness probes with custom HTTP endpoint … #330 fixes the issue, for a meantime the workaround is to call
kubectl ate resume actorand after it call to the curl itself.