Back system distro overlay with a per-instance scratch vhd by benhillis · Pull Request #40739 · microsoft/WSL

benhillis · 2026-06-07T20:05:48Z

Summary of the Pull Request

The WSLg system distro runs from a read-only VHD with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes — e.g. compiling the Linux kernel inside the system distro — could exhaust RAM and swap and trigger the OOM killer VM-wide.

This change backs the overlay read/write layer with a per-instance temporary ext4 "scratch" VHD (dynamically expanding, 64 GB cap) instead, mirroring the existing swap VHD. Overlay writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean ENOSPC instead of an OOM kill.

PR Checklist

Closes: Link to issue #xxx
Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
Tests: Added/updated if needed and all pass
Localization: All end user facing strings can be localized
Dev docs: Added/updated if needed
Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx

Detailed Description of the Pull Request / Additional comments

Host (src/windows/service/exe/)

When GUI apps are enabled (LaunchInit message), WslCoreVm::CreateInstance creates a sparse, dynamically-expanding scratch-<InstanceId>.vhdx (c_scratchVhdSizeBytes = 64 GB) under user impersonation, attaches it via AttachDiskLockHeld to get a SCSI LUN, tracks it in m_instanceScratchVhds (keyed by the runtime instance GUID), and passes the LUN to the guest in LX_MINI_INIT_MESSAGE.ScratchLun (default ULONG_MAX = none).
On terminate, LxssUserSession::_TerminateInstanceInternal captures the instance id and calls WslCoreVm::CleanupInstanceScratch, which ejects the disk, deletes the VHDX, and erases the map entry. The per-VM temp directory teardown on VM shutdown is the backstop for any leak.
scope_exit guards cover every failure path (create-but-attach-fails, attach-but-track-fails, and post-registration startup failures), so a failed launch never leaks an attached disk, LUN, or file, and never advertises a torn-down LUN to the guest.
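
The guard ordering described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in WslCoreVm.cpp: `ScopeGuard` is a hand-rolled stand-in for the scope_exit helper, and `FakeHost` and `LaunchWithScratch` are hypothetical names that model the file and the attached disk as in-memory sets.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hand-rolled stand-in for a scope-exit guard (assumption, not the helper used in the PR):
// runs the callback on destruction unless release() was called first.
class ScopeGuard
{
public:
    explicit ScopeGuard(std::function<void()> onExit) : m_onExit(std::move(onExit)) {}
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
    ~ScopeGuard()
    {
        if (m_onExit)
        {
            m_onExit();
        }
    }
    void release() { m_onExit = nullptr; }

private:
    std::function<void()> m_onExit;
};

// Hypothetical in-memory model of the host: which VHDX files exist and which are attached.
struct FakeHost
{
    std::set<std::string> files;
    std::set<std::string> attached;
    bool failAttach = false;
    bool failStartup = false;
};

// Returns true if the launch succeeded. On any failure, every step taken so far is undone.
bool LaunchWithScratch(FakeHost& host, const std::string& path)
{
    assert(!path.empty());
    try
    {
        host.files.insert(path); // "create the VHDX"
        ScopeGuard deleteFile([&] { host.files.erase(path); });

        if (host.failAttach)
        {
            throw std::runtime_error("attach failed");
        }
        host.attached.insert(path); // "attach the disk"
        ScopeGuard detach([&] { host.attached.erase(path); });

        if (host.failStartup)
        {
            throw std::runtime_error("startup failed");
        }

        // Success: disarm both guards so the disk and file outlive this scope.
        detach.release();
        deleteFile.release();
        return true;
    }
    catch (const std::runtime_error&)
    {
        // Guards have already run in reverse order: detach first, then delete.
        return false;
    }
}
```

Because the guards are destroyed in reverse order of construction, the disk is detached before its file is deleted, which is the order a real cleanup would likely need.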

Message (src/shared/inc/lxinitshared.h)

ScratchLun moved from LX_MINI_INIT_EARLY_CONFIG_MESSAGE to LX_MINI_INIT_MESSAGE (it is now per-instance, not per-VM).
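
A minimal sketch of the "ULONG_MAX means no scratch disk" convention, assuming a heavily trimmed message. `MiniInitMessageSketch`, `SetScratchLun`, and `GetScratchLun` are hypothetical names; the real LX_MINI_INIT_MESSAGE layout in lxinitshared.h is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Sentinel meaning "no scratch disk". ULONG is 32 bits wide on Windows,
// so this is the same value as ULONG_MAX there.
constexpr std::uint32_t c_noScratchLun = std::numeric_limits<std::uint32_t>::max();

// Hypothetical stand-in for the per-instance message; only the new field is shown.
struct MiniInitMessageSketch
{
    std::uint32_t ScratchLun = c_noScratchLun;
};

// Host side: record the LUN returned by the attach step.
void SetScratchLun(MiniInitMessageSketch& message, std::uint32_t lun)
{
    assert(lun != c_noScratchLun); // the sentinel is never a valid LUN
    message.ScratchLun = lun;
}

// Guest side: translate the sentinel into "use the tmpfs fallback".
std::optional<std::uint32_t> GetScratchLun(const MiniInitMessageSketch& message)
{
    if (message.ScratchLun == c_noScratchLun)
    {
        return std::nullopt;
    }
    return message.ScratchLun;
}
```

Defaulting the field to the sentinel means a host that never attaches a scratch disk needs no extra code to get the old tmpfs behavior.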

Guest (src/linux/init/)

CreateOverlayScratch(Lun) formats the scratch device as ext4 (no journal, lazy inode init — the data is disposable) and mounts it at /scratch.
UtilMountOverlayFs gained an optional scratch-root parameter: when present, the overlay rw layer is a unique subdirectory bind-mounted from /scratch (disk-backed, reclaimable); otherwise it falls back to tmpfs.
If the scratch VHD cannot be created, attached, formatted, or mounted — or the overlay rw setup fails (e.g. backing disk full) — the overlay transparently falls back to the previous tmpfs behavior so the distro still launches.
Each instance runs in a private mount namespace, so the /scratch mount and the overlay are torn down automatically when the instance exits.
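
The scratch-or-tmpfs decision above can be sketched as a pure planning function. This is an illustration only: `OverlayPlan` and `PlanOverlay` are hypothetical names, the `work` subdirectory and the lower directory path in the usage note are assumptions, and the real UtilMountOverlayFs performs the mounts rather than returning a plan. The `lowerdir`, `upperdir`, and `workdir` option names are the standard overlayfs mount options.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical description of how one overlay would be set up.
struct OverlayPlan
{
    std::string rwSource;       // what gets mounted at the rw root
    std::string rwFsType;       // "tmpfs", or empty for a bind mount
    bool diskBacked = false;    // true when the rw layer lives on the scratch disk
    std::string overlayOptions; // option string for the overlayfs mount
};

// scratchSubdir is the unique per-overlay directory under /scratch, when one is available.
OverlayPlan PlanOverlay(const std::string& lowerDir, const std::string& rwRoot, const std::optional<std::string>& scratchSubdir)
{
    assert(!lowerDir.empty() && !rwRoot.empty());

    OverlayPlan plan;
    if (scratchSubdir.has_value())
    {
        // Disk-backed: bind-mount the scratch subdirectory over the rw root.
        plan.rwSource = *scratchSubdir;
        plan.diskBacked = true;
    }
    else
    {
        // Fallback: the previous behavior, a tmpfs mounted at the rw root.
        plan.rwSource = "tmpfs";
        plan.rwFsType = "tmpfs";
    }

    // The overlay options are identical in both cases; only the backing of rwRoot differs.
    plan.overlayOptions = "lowerdir=" + lowerDir + ",upperdir=" + rwRoot + "/upper,workdir=" + rwRoot + "/work";
    return plan;
}
```

For example, `PlanOverlay("/system/lower", "/system/rw", std::string("/scratch/instance-1"))` yields `upperdir=/system/rw/upper`, matching the mount output quoted in the validation steps, while passing `std::nullopt` produces the tmpfs plan with the same option string.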

No new user-facing strings; no IDL/ABI changes (the ScratchLun field moved within the internal mini_init wire protocol, which is versioned together).

Validation Steps Performed

Built, deployed to host, and validated end-to-end inside the system distro (wsl --system -d <distro>):

ext4-backed overlay: root overlay (df -h /) reports 64 GB (the scratch size); mount shows upperdir=/system/rw/upper on the bind-mounted ext4 scratch.
RAM not pinned: a 4 GiB write lands in reclaimable buff/cache (freed by echo 3 > /proc/sys/vm/drop_caches) and not in shared/shmem, unlike the previous tmpfs behavior.
Host disk grows on demand: an 8 GiB random write grows the host scratch-<id>.vhdx from ~0.07 GB to ~8.07 GB; only the VHD for the instance written to grows (per-instance isolation confirmed with two running distros).
Cleanup: wsl --terminate <distro> ejects and deletes that instance's scratch-*.vhdx while the VM stays running; the per-VM temp directory is removed on wsl --shutdown.
Fallback: verified the overlay still launches via tmpfs when no scratch device is available.

Constrained-memory OOM A/B (the headline win): capped the VM at memory=4GB, swap=0 and wrote 6 GiB into the system-distro overlay on the same build, comparing the ext4 scratch against the old tmpfs behavior (mount -t tmpfs):

| | ext4 scratch (this change) | tmpfs (old behavior) |
| --- | --- | --- |
| 6 GiB overlay write | succeeds (`/` 10% used, data on disk) | fails |
| Global OOM killer | none | fires repeatedly |
| Processes killed | none | `Xwayland`, `WSLGd`/`init()`, `pulseaudio`, `GnsEngine`, … |
| `vmmemWSL` (host) | bounded under the 4 GB cap | pinned at the cap, then OOM |

With ext4 the kernel writes overlay data back to the scratch VHD and frees the pages, so 6 GiB fits in a 4 GiB VM. With tmpfs every byte is unevictable shmem, so at the memory ceiling the kernel OOM-kills the WSLg stack and the write never completes — the exact failure this change eliminates.

Memory reclaimability A/B (unconstrained VM): same 8 GiB write — ext4 keeps guest Shmem flat and buff/cache fully reclaimable (host vmmemWSL working set drops ~9.7 GB → ~5.4 GB after drop_caches), whereas tmpfs parks the full 8 GiB in Shmem that drop_caches cannot reclaim (working set stays pinned at ~9.5 GB).

The change was additionally reviewed across several rounds of multi-model code review; all findings were addressed.

The WSLg system distro runs from a read-only vhd with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes -- e.g. compiling the Linux kernel in the system distro -- could exhaust RAM and swap and trigger the OOM killer VM-wide.

Back the overlay read/write layer with a per-instance temporary ext4 "scratch" vhd (dynamically expanding, 64 GB cap) instead, mirroring the swap vhd. Writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean ENOSPC instead of an OOM kill.

The host creates and attaches a scratch-<InstanceId>.vhdx per instance when GUI apps are enabled, passes its LUN to the guest in LX_MINI_INIT_MESSAGE, and ejects + deletes it when the instance terminates (with the per-VM temp dir teardown as a backstop). The guest formats the device as ext4, mounts it, and bind-mounts a unique subdirectory as the overlay rw layer. If the scratch vhd cannot be created, attached, or mounted, the overlay transparently falls back to the previous tmpfs behavior so the distro still launches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR changes how the WSLg system distro’s writable overlay is backed: instead of using tmpfs (guest memory), it introduces a per-instance, dynamically-expanding scratch VHD that is formatted/mounted in the guest and used as the overlayfs upper/work backing store. This aims to prevent heavy write workloads in the system distro from exhausting VM RAM/swap and triggering VM-wide OOM.

Changes:

Windows host: create/attach a per-instance scratch-<InstanceId>.vhdx, track it per runtime instance, pass its SCSI LUN to the guest, and clean it up on failed startup and on termination.
Wire protocol: add ScratchLun to LX_MINI_INIT_MESSAGE so it can be specified per instance.
Linux guest init: format/mount the scratch device as ext4 at /scratch, and teach overlay mounting to optionally use a scratch-backed upper/work layer with fallback to tmpfs on failure.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/windows/service/exe/WslCoreVm.h | Adds APIs and per-instance tracking for scratch VHD paths. |
| src/windows/service/exe/WslCoreVm.cpp | Implements scratch VHD creation/attach, LUN passing, and cleanup logic. |
| src/windows/service/exe/WslCoreInstance.h | Exposes instance id for cleanup coordination. |
| src/windows/service/exe/WslCoreInstance.cpp | Implements `GetInstanceId()`. |
| src/windows/service/exe/LxssUserSession.cpp | Ensures scratch cleanup on post-create startup failure and on terminate. |
| src/shared/inc/lxinitshared.h | Extends mini-init message with `ScratchLun`. |
| src/linux/init/util.h | Extends overlay mount helper signature; adds temp dir helper decl. |
| src/linux/init/util.cpp | Adds temp dir helper and implements scratch-backed overlay upper/work setup. |
| src/linux/init/main.cpp | Formats/mounts scratch ext4 and uses it for system distro overlay with tmpfs fallback. |

+            scratchLun = AttachDiskLockHeld(scratchPath.c_str(), DiskType::VHD, MountFlags::None, {}, false, m_userToken.get());
+            m_instanceScratchVhds.emplace(InstanceId, scratchPath);
+            cleanupScratch.release();


The test hard-coded mkfs.ext4 /dev/sde for its bare-mounted 20MB disk. The per-instance system distro overlay scratch vhd now occupies an earlier /dev/sd* node, shifting the bare disk and causing the test to format the wrong device. Detect the disk by size instead, and promote MountTests' GetBlockDeviceInWsl helper to a shared Common.h/Common.cpp function.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
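
Detecting the disk by size rather than by a fixed node name might look roughly like the sketch below. It is not the shared GetBlockDeviceInWsl helper: `FindBlockDeviceBySize` and `SizeQuery` are hypothetical names, and the size lookup is injected as a callback so the sketch runs without a WSL instance (in the test it would be backed by something like `blockdev --getsize64`).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Hypothetical callback: returns the size in bytes of a device node,
// or nullopt if the node does not exist or cannot be queried.
using SizeQuery = std::function<std::optional<std::uint64_t>(const std::string&)>;

// Scans /dev/sda through /dev/sdz (inclusive) and returns the first node
// whose size matches wantedBytes exactly.
std::optional<std::string> FindBlockDeviceBySize(const SizeQuery& querySize, std::uint64_t wantedBytes)
{
    assert(querySize);
    for (char name = 'a'; name <= 'z'; name++)
    {
        const std::string device = std::string("/dev/sd") + name;
        const auto size = querySize(device);
        if (size.has_value() && *size == wantedBytes)
        {
            return device;
        }
    }
    return std::nullopt;
}
```

Matching on an exact byte count only works when no other attached disk has the same size, which holds for a 20 MB test disk next to a 64 GB scratch disk.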

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

+    const auto scratchPath = std::move(search->second);
+    m_instanceScratchVhds.erase(search);


+        for (wchar_t name = 'a'; name < 'z'; name++)
+        {
+            std::wstring cmd = L"-u root blockdev --getsize64 /dev/sd";
+            cmd += name;


The per-instance scratch vhd path is deterministic from the instance id, so remove the m_instanceScratchVhds map and derive the path via GetInstanceScratchPath wherever it is needed (create, failure cleanup, terminate). This makes cleanup idempotent and removes the untracked-leak and const-std::move review findings. Also fix the block-device scan in the test helper to include /dev/sdz.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
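
Why a derived path makes cleanup idempotent can be sketched with std::filesystem. The names `GetInstanceScratchPathSketch` and `CleanupInstanceScratchSketch` are hypothetical, the instance id is modeled as a plain string rather than a GUID, and only the file deletion is shown (the real cleanup also ejects the disk).

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Derive the scratch path from the per-VM temp directory and the instance id.
// The same inputs always give the same path, so no map is needed to remember it.
std::filesystem::path GetInstanceScratchPathSketch(const std::filesystem::path& tempDir, const std::string& instanceId)
{
    assert(!instanceId.empty());
    return tempDir / ("scratch-" + instanceId + ".vhdx");
}

// Returns true if something was deleted, false if there was nothing to delete.
// Calling it again for the same instance is harmless, so every cleanup path
// (failed create, failed startup, terminate) can call it unconditionally.
bool CleanupInstanceScratchSketch(const std::filesystem::path& tempDir, const std::string& instanceId)
{
    std::error_code error;
    return std::filesystem::remove(GetInstanceScratchPathSketch(tempDir, instanceId), error);
}
```

With the map gone there is no window in which a disk is attached but not yet tracked, which is presumably what the untracked-leak finding was about.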

Copilot AI review requested due to automatic review settings June 7, 2026 20:05

Copilot started reviewing on behalf of benhillis June 7, 2026 20:05 View session

benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 9f44d5b to 7c4a74e Compare June 7, 2026 20:08

Copilot AI reviewed Jun 7, 2026


Comment thread src/windows/service/exe/WslCoreVm.cpp Outdated

Comment on lines +1234 to +1236

scratchLun = AttachDiskLockHeld(scratchPath.c_str(), DiskType::VHD, MountFlags::None, {}, false, m_userToken.get());

m_instanceScratchVhds.emplace(InstanceId, scratchPath);

cleanupScratch.release();

Copilot AI review requested due to automatic review settings June 8, 2026 16:29

Copilot started reviewing on behalf of benhillis June 8, 2026 16:30 View session

Copilot AI reviewed Jun 8, 2026

benhillis wants to merge 3 commits into microsoft:master from benhillis:benhillis/scratch-vhd-overlay
