Back system distro overlay with a per-instance scratch vhd#40739
Draft
benhillis wants to merge 3 commits into
The WSLg system distro runs from a read-only vhd with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes -- e.g. compiling the Linux kernel in the system distro -- could exhaust RAM and swap and trigger the OOM killer VM-wide.

Back the overlay read/write layer with a per-instance temporary ext4 "scratch" vhd (dynamically expanding, 64 GB cap) instead, mirroring the swap vhd. Writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean ENOSPC instead of an OOM kill.

The host creates and attaches a scratch-<InstanceId>.vhdx per instance when GUI apps are enabled, passes its LUN to the guest in LX_MINI_INIT_MESSAGE, and ejects + deletes it when the instance terminates (with the per-VM temp dir teardown as a backstop). The guest formats the device as ext4, mounts it, and bind-mounts a unique subdirectory as the overlay rw layer.

If the scratch vhd cannot be created, attached, or mounted, the overlay transparently falls back to the previous tmpfs behavior so the distro still launches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR changes how the WSLg system distro’s writable overlay is backed: instead of using tmpfs (guest memory), it introduces a per-instance, dynamically-expanding scratch VHD that is formatted/mounted in the guest and used as the overlayfs upper/work backing store. This aims to prevent heavy write workloads in the system distro from exhausting VM RAM/swap and triggering VM-wide OOM.
Changes:
- Windows host: create/attach a per-instance `scratch-<InstanceId>.vhdx`, track it per runtime instance, pass its SCSI LUN to the guest, and clean it up on failed startup and on termination.
- Wire protocol: add `ScratchLun` to `LX_MINI_INIT_MESSAGE` so it can be specified per instance.
- Linux guest init: format/mount the scratch device as ext4 at `/scratch`, and teach overlay mounting to optionally use a scratch-backed upper/work layer with fallback to tmpfs on failure.
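The guest-side choice between a scratch-backed rw layer and the tmpfs fallback can be sketched as below. This is a minimal illustration, not the code in `util.cpp`: `RwLayerMount`, `PlanRwLayer`, `OverlayOptions`, and the example paths are hypothetical names. Only the `lowerdir`/`upperdir`/`workdir` overlayfs option names are standard.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical description of what gets mounted on the overlay rw directory.
struct RwLayerMount
{
    bool bind;           // true: bind-mount `source`; false: mount a fresh tmpfs
    std::string source;  // scratch subdirectory for a bind mount, "tmpfs" otherwise
};

// Picks the backing for the rw layer. `scratchSubdir` is empty when the
// scratch device could not be created, attached, or mounted.
RwLayerMount PlanRwLayer(const std::optional<std::string>& scratchSubdir)
{
    if (scratchSubdir && !scratchSubdir->empty())
    {
        return {true, *scratchSubdir}; // disk-backed, reclaimable page cache
    }

    return {false, "tmpfs"}; // previous behavior: memory-backed
}

// Builds the overlayfs option string. upper and work must live on the same
// filesystem, so both are placed under the same rw directory.
std::string OverlayOptions(const std::string& lowerDir, const std::string& rwDir)
{
    assert(!lowerDir.empty() && !rwDir.empty());
    return "lowerdir=" + lowerDir + ",upperdir=" + rwDir + "/upper,workdir=" + rwDir + "/work";
}
```

Because the overlay options are the same in both cases, a failure anywhere in the scratch setup only changes what is mounted on the rw directory, which is what lets the fallback stay transparent to the rest of the mount path.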
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/windows/service/exe/WslCoreVm.h | Adds APIs and per-instance tracking for scratch VHD paths. |
| src/windows/service/exe/WslCoreVm.cpp | Implements scratch VHD creation/attach, LUN passing, and cleanup logic. |
| src/windows/service/exe/WslCoreInstance.h | Exposes instance id for cleanup coordination. |
| src/windows/service/exe/WslCoreInstance.cpp | Implements GetInstanceId(). |
| src/windows/service/exe/LxssUserSession.cpp | Ensures scratch cleanup on post-create startup failure and on terminate. |
| src/shared/inc/lxinitshared.h | Extends mini-init message with ScratchLun. |
| src/linux/init/util.h | Extends overlay mount helper signature; adds temp dir helper decl. |
| src/linux/init/util.cpp | Adds temp dir helper and implements scratch-backed overlay upper/work setup. |
| src/linux/init/main.cpp | Formats/mounts scratch ext4 and uses it for system distro overlay with tmpfs fallback. |
Comment on lines +1234 to +1236

    scratchLun = AttachDiskLockHeld(scratchPath.c_str(), DiskType::VHD, MountFlags::None, {}, false, m_userToken.get());
    m_instanceScratchVhds.emplace(InstanceId, scratchPath);
    cleanupScratch.release();
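The quoted lines follow a guard-then-release pattern: cleanup is armed as soon as the file exists and is disarmed only after every later step has succeeded. A minimal sketch of that pattern in standard C++ follows. `ScopeExit` and `LaunchWithScratch` are hypothetical stand-ins (the real code presumably uses a library `scope_exit`), and the "create", "attach", and "delete" steps are simulated by log entries.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal scope guard: runs the callback on destruction unless
// release() was called first.
class ScopeExit
{
public:
    explicit ScopeExit(std::function<void()> fn) : m_fn(std::move(fn)) {}
    ~ScopeExit()
    {
        if (m_fn)
        {
            m_fn();
        }
    }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;
    void release() { m_fn = nullptr; }

private:
    std::function<void()> m_fn;
};

// Simulated launch: records each step in `log`. Returns false on failure.
bool LaunchWithScratch(bool attachSucceeds, std::vector<std::string>& log)
{
    try
    {
        log.push_back("create");
        ScopeExit cleanup([&log] { log.push_back("delete"); }); // armed right after create

        if (!attachSucceeds)
        {
            throw std::runtime_error("attach failed"); // guard fires during unwinding
        }

        log.push_back("attach");
        cleanup.release(); // success: ownership passes to instance teardown
        return true;
    }
    catch (const std::runtime_error&)
    {
        return false;
    }
}
```

The ordering matters: any statement that can fail between arming the guard and calling `release()` is covered, so the window in which a created file has no owner is limited to the statements after `release()`.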
The test hard-coded mkfs.ext4 /dev/sde for its bare-mounted 20MB disk. The per-instance system distro overlay scratch vhd now occupies an earlier /dev/sd* node, shifting the bare disk and causing the test to format the wrong device. Detect the disk by size instead, and promote MountTests' GetBlockDeviceInWsl helper to a shared Common.h/Common.cpp function. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
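Detecting the disk by size rather than by a fixed node name can be sketched as follows. This is an illustration only: `FindDiskBySize` and `SizeQuery` are hypothetical names, not the shared `GetBlockDeviceInWsl` helper, and the size lookup is injected as a callback so the sketch does not depend on running `blockdev --getsize64` inside WSL.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Hypothetical callback: returns the size in bytes of a block device, or
// nothing if the device does not exist.
using SizeQuery = std::function<std::optional<std::uint64_t>(const std::string&)>;

// Scans /dev/sda through /dev/sdz (inclusive) and returns the first device
// whose size matches exactly.
std::optional<std::string> FindDiskBySize(std::uint64_t wantedBytes, const SizeQuery& querySize)
{
    assert(wantedBytes > 0);
    for (char name = 'a'; name <= 'z'; name++)
    {
        const std::string device = std::string("/dev/sd") + name;
        const auto size = querySize(device);
        if (size && *size == wantedBytes)
        {
            return device;
        }
    }

    return std::nullopt;
}
```

Matching on an exact byte count assumes the test disk has a size no other attached disk shares, which holds for a small bare disk next to a 64 GB scratch vhd but would need a tie-break if two disks of equal size were attached.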
Comment on lines +1440 to +1441

    const auto scratchPath = std::move(search->second);
    m_instanceScratchVhds.erase(search);
Comment on lines +2485 to +2488

    for (wchar_t name = 'a'; name < 'z'; name++)
    {
        std::wstring cmd = L"-u root blockdev --getsize64 /dev/sd";
        cmd += name;
The per-instance scratch vhd path is deterministic from the instance id, so remove the m_instanceScratchVhds map and derive the path via GetInstanceScratchPath wherever it is needed (create, failure cleanup, terminate). This makes cleanup idempotent and removes the untracked-leak and const-std::move review findings. Also fix the block-device scan in the test helper to include /dev/sdz. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
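Why a derived path makes cleanup idempotent can be shown with a small sketch. The names `ScratchPathFor` and `CleanupScratchFile` are hypothetical (the commit names `GetInstanceScratchPath`, whose signature is not shown here), and the sketch covers only the file deletion, not ejecting the disk from the VM.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Derives the scratch file path from the per-VM temp directory and the
// instance id, so no lookup table has to be kept in sync.
std::filesystem::path ScratchPathFor(const std::filesystem::path& tempDir, const std::string& instanceId)
{
    assert(!instanceId.empty());
    return tempDir / ("scratch-" + instanceId + ".vhdx");
}

// Deletes the scratch file if it exists. Returns true only when a file was
// actually removed; a missing file is not an error, so repeated calls are safe.
bool CleanupScratchFile(const std::filesystem::path& tempDir, const std::string& instanceId)
{
    std::error_code ec;
    return std::filesystem::remove(ScratchPathFor(tempDir, instanceId), ec);
}
```

With no map entry to erase, the create, failure-cleanup, and terminate paths can all call the same function without first checking whether an earlier path already ran.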
Summary of the Pull Request
The WSLg system distro runs from a read-only VHD with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes — e.g. compiling the Linux kernel inside the system distro — could exhaust RAM and swap and trigger the OOM killer VM-wide.
This change backs the overlay read/write layer with a per-instance temporary ext4 "scratch" VHD (dynamically expanding, 64 GB cap) instead, mirroring the existing swap VHD. Overlay writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean `ENOSPC` instead of an OOM kill.
PR Checklist
Detailed Description of the Pull Request / Additional comments
Host (`src/windows/service/exe/`)
- (… `LaunchInit` message), `WslCoreVm::CreateInstance` creates a sparse, dynamically-expanding `scratch-<InstanceId>.vhdx` (`c_scratchVhdSizeBytes` = 64 GB) under user impersonation, attaches it via `AttachDiskLockHeld` to get a SCSI LUN, tracks it in `m_instanceScratchVhds` (keyed by the runtime instance GUID), and passes the LUN to the guest in `LX_MINI_INIT_MESSAGE.ScratchLun` (default `ULONG_MAX` = none).
- `LxssUserSession::_TerminateInstanceInternal` captures the instance id and calls `WslCoreVm::CleanupInstanceScratch`, which ejects the disk, deletes the VHDX, and erases the map entry. The per-VM temp directory teardown on VM shutdown is the backstop for any leak.
- `scope_exit` guards cover every failure path (create-but-attach-fails, attach-but-track-fails, and post-registration startup failures), so a failed launch never leaks an attached disk, LUN, or file, and never advertises a torn-down LUN to the guest.

Message (`src/shared/inc/lxinitshared.h`)
- `ScratchLun` moved from `LX_MINI_INIT_EARLY_CONFIG_MESSAGE` to `LX_MINI_INIT_MESSAGE` (it is now per-instance, not per-VM).

Guest (`src/linux/init/`)
- `CreateOverlayScratch(Lun)` formats the scratch device as ext4 (no journal, lazy inode init; the data is disposable) and mounts it at `/scratch`.
- `UtilMountOverlayFs` gained an optional scratch-root parameter: when present, the overlay rw layer is a unique subdirectory bind-mounted from `/scratch` (disk-backed, reclaimable); otherwise it falls back to tmpfs.
- The `/scratch` mount and the overlay are torn down automatically when the instance exits.

No new user-facing strings; no IDL/ABI changes (the `ScratchLun` field moved within the internal mini_init wire protocol, which is versioned together).
Validation Steps Performed
Built, deployed to host, and validated end-to-end inside the system distro (`wsl --system -d <distro>`):
- […] (`df -h /`) reports 64 GB (the scratch size); `mount` shows `upperdir=/system/rw/upper` on the bind-mounted ext4 scratch.
- […] `buff/cache` (freed by `echo 3 > /proc/sys/vm/drop_caches`) and not in `shared`/`shmem`, unlike the previous tmpfs behavior.
- […] `scratch-<id>.vhdx` from ~0.07 GB to ~8.07 GB; only the VHD for the instance written to grows (per-instance isolation confirmed with two running distros).
- `wsl --terminate <distro>` ejects and deletes that instance's `scratch-*.vhdx` while the VM stays running; the per-VM temp directory is removed on `wsl --shutdown`.

Constrained-memory OOM A/B (the headline win): capped the VM at `memory=4GB, swap=0` and wrote 6 GiB into the system-distro overlay on the same build, comparing the ext4 scratch against the old tmpfs behavior (`mount -t tmpfs`):

[Results table not recoverable. Surviving cells: "(`/` 10% used, data on disk)"; `Xwayland`, `WSLGd`/`init()`, `pulseaudio`, `GnsEngine`, …; `vmmemWSL` (host)]

With ext4 the kernel writes overlay data back to the scratch VHD and frees the pages, so 6 GiB fits in a 4 GiB VM. With tmpfs every byte is unevictable shmem, so at the memory ceiling the kernel OOM-kills the WSLg stack and the write never completes. That is the exact failure this change eliminates.

Memory reclaimability A/B (unconstrained VM): for the same 8 GiB write, ext4 keeps guest `Shmem` flat and `buff/cache` fully reclaimable (host `vmmemWSL` working set drops ~9.7 GB → ~5.4 GB after `drop_caches`), whereas tmpfs parks the full 8 GiB in `Shmem` that `drop_caches` cannot reclaim (working set stays pinned at ~9.5 GB).

The change was additionally reviewed across several rounds of multi-model code review; all findings were addressed.
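The reclaimability comparison comes down to reading two fields from the guest's `/proc/meminfo` before and after a write: `Shmem` (where tmpfs data is accounted) and `Cached` (page cache). A minimal sketch of extracting those values follows. `MeminfoKb` is a hypothetical helper, not part of the PR, and it takes the file contents as a string so it can be exercised without a Linux guest.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Returns the value in kB of one /proc/meminfo field (for example "Shmem"),
// or nothing if the field is absent or malformed. The key must match from the
// start of the line, so "Cached" does not match "SwapCached".
std::optional<std::uint64_t> MeminfoKb(const std::string& meminfo, const std::string& key)
{
    assert(!key.empty());
    const std::string prefix = key + ":";
    std::istringstream lines(meminfo);
    std::string line;
    while (std::getline(lines, line))
    {
        if (line.compare(0, prefix.size(), prefix) != 0)
        {
            continue;
        }

        std::istringstream fields(line.substr(prefix.size()));
        std::uint64_t kb = 0;
        if (fields >> kb)
        {
            return kb;
        }

        return std::nullopt;
    }

    return std::nullopt;
}
```

Sampling both fields before the write, after the write, and after `drop_caches` separates the two behaviors described above: growth in `Cached` that later shrinks indicates disk-backed data, while growth in `Shmem` that persists indicates memory-backed data.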