
Back system distro overlay with a per-instance scratch vhd #40739

Draft
benhillis wants to merge 3 commits into microsoft:master from benhillis:benhillis/scratch-vhd-overlay

@benhillis benhillis commented Jun 7, 2026

Member

Summary of the Pull Request

The WSLg system distro runs from a read-only VHD with a writable overlay on top. That overlay's read/write layer was backed by tmpfs, so everything written into it (logs, temp files, copied-up files, build output) consumed guest memory and could spill into swap. Heavy writes — e.g. compiling the Linux kernel inside the system distro — could exhaust RAM and swap and trigger the OOM killer VM-wide.

This change backs the overlay read/write layer with a per-instance temporary ext4 "scratch" VHD (dynamically expanding, 64 GB cap) instead, mirroring the existing swap VHD. Overlay writes now land on reclaimable disk page cache rather than pinned guest memory, and a runaway write gets a clean ENOSPC instead of an OOM kill.

PR Checklist

  • Closes: Link to issue #xxx
  • Communication: I've discussed this with core contributors already. If work hasn't been agreed, this work might be rejected
  • Tests: Added/updated if needed and all pass
  • Localization: All end user facing strings can be localized
  • Dev docs: Added/updated if needed
  • Documentation updated: If checked, please file a pull request on our docs repo and link it here: #xxx

Detailed Description of the Pull Request / Additional comments

Host (src/windows/service/exe/)

  • When GUI apps are enabled (LaunchInit message), WslCoreVm::CreateInstance creates a sparse, dynamically-expanding scratch-<InstanceId>.vhdx (c_scratchVhdSizeBytes = 64 GB) under user impersonation, attaches it via AttachDiskLockHeld to get a SCSI LUN, tracks it in m_instanceScratchVhds (keyed by the runtime instance GUID), and passes the LUN to the guest in LX_MINI_INIT_MESSAGE.ScratchLun (default ULONG_MAX = none).
  • On terminate, LxssUserSession::_TerminateInstanceInternal captures the instance id and calls WslCoreVm::CleanupInstanceScratch, which ejects the disk, deletes the VHDX, and erases the map entry. The per-VM temp directory teardown on VM shutdown is the backstop for any leak.
  • scope_exit guards cover every failure path (create-but-attach-fails, attach-but-track-fails, and post-registration startup failures), so a failed launch never leaks an attached disk, LUN, or file, and never advertises a torn-down LUN to the guest.
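
The guard ordering described above can be sketched as follows. This is a minimal illustration, not the PR's code: `ScopeExit`, `FakeHost`, `CreateAndAttachScratch`, and the returned LUN value are hypothetical stand-ins, and the real host code uses its own scope_exit helper together with AttachDiskLockHeld.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a scope-exit guard: runs the callback on destruction
// unless release() was called first.
class ScopeExit
{
public:
    explicit ScopeExit(std::function<void()> fn) : m_fn(std::move(fn)) {}
    ~ScopeExit() { if (m_fn) { m_fn(); } }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;
    void release() { m_fn = nullptr; }

private:
    std::function<void()> m_fn;
};

// Hypothetical fake host that records each step and can inject failures.
struct FakeHost
{
    bool failAttach = false;
    bool failAfterAttach = false;
    std::vector<std::string> log;
};

// Creates and attaches a scratch disk; on any failure the guards undo the
// completed steps in reverse order (eject first, then delete the file).
std::uint32_t CreateAndAttachScratch(FakeHost& host, const std::string& path)
{
    assert(!path.empty());

    host.log.push_back("create " + path);
    ScopeExit deleteFile([&] { host.log.push_back("delete " + path); });

    if (host.failAttach)
    {
        throw std::runtime_error("attach failed");
    }
    host.log.push_back("attach " + path);
    ScopeExit ejectDisk([&] { host.log.push_back("eject " + path); });

    if (host.failAfterAttach)
    {
        throw std::runtime_error("post-attach step failed");
    }

    // Success: disarm both guards so the disk and file outlive this function.
    ejectDisk.release();
    deleteFile.release();
    return 3; // hypothetical LUN
}
```

Because the guards are declared in acquisition order, C++ destroys them in reverse, so a failure after attach ejects the disk before deleting the backing file.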

Message (src/shared/inc/lxinitshared.h)

  • ScratchLun moved from LX_MINI_INIT_EARLY_CONFIG_MESSAGE to LX_MINI_INIT_MESSAGE (it is now per-instance, not per-VM).
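
As a rough illustration of the sentinel convention, the sketch below models the field with a 32-bit value whose maximum means "no scratch disk". `MiniInitMessageSketch`, `SetScratchLun`, and `GetScratchLun` are hypothetical names; the real LX_MINI_INIT_MESSAGE layout is not shown in this PR description, and modeling ULONG_MAX as the 32-bit maximum is an assumption based on the Windows ULONG type.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Sentinel meaning "no scratch disk attached" (assumed 32-bit, like a Windows ULONG).
constexpr std::uint32_t c_noScratchLun = std::numeric_limits<std::uint32_t>::max();

// Hypothetical, trimmed-down stand-in for the per-instance message.
struct MiniInitMessageSketch
{
    std::uint32_t ScratchLun = c_noScratchLun;
};

// Host side: record the LUN returned by the attach step.
void SetScratchLun(MiniInitMessageSketch& message, std::uint32_t lun)
{
    assert(lun != c_noScratchLun); // the sentinel is reserved for "none"
    message.ScratchLun = lun;
}

// Guest side: translate the sentinel into an empty optional so callers
// cannot mistake it for a real LUN.
std::optional<std::uint32_t> GetScratchLun(const MiniInitMessageSketch& message)
{
    if (message.ScratchLun == c_noScratchLun)
    {
        return std::nullopt;
    }
    return message.ScratchLun;
}
```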

Guest (src/linux/init/)

  • CreateOverlayScratch(Lun) formats the scratch device as ext4 (no journal, lazy inode init — the data is disposable) and mounts it at /scratch.
  • UtilMountOverlayFs gained an optional scratch-root parameter: when present, the overlay rw layer is a unique subdirectory bind-mounted from /scratch (disk-backed, reclaimable); otherwise it falls back to tmpfs.
  • If the scratch VHD cannot be created, attached, formatted, or mounted — or the overlay rw setup fails (e.g. backing disk full) — the overlay transparently falls back to the previous tmpfs behavior so the distro still launches.
  • Each instance runs in a private mount namespace, so the /scratch mount and the overlay are torn down automatically when the instance exits.
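
The guest-side flow above can be sketched as two pure helpers: one that builds an mkfs.ext4 command line for a disposable filesystem, and one that picks the rw-layer backing and builds the overlayfs mount options. The function names, the `/upper` and `/work` subdirectory layout, and the exact mkfs flags are assumptions for illustration; the PR only states "no journal, lazy inode init" and shows `upperdir=/system/rw/upper` in its validation notes.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Arguments for formatting a disposable ext4 filesystem: journal disabled and
// inode tables initialized lazily (assumed flags, both are standard mke2fs options).
std::vector<std::string> BuildScratchMkfsArgs(const std::string& device)
{
    assert(!device.empty());
    return {"mkfs.ext4", "-O", "^has_journal", "-E", "lazy_itable_init=1", device};
}

enum class RwBacking
{
    Scratch, // rw layer bind-mounted from the ext4 scratch disk
    Tmpfs    // previous behavior, used as the fallback
};

struct OverlayPlan
{
    RwBacking backing;
    std::string options; // overlayfs mount options
};

// Chooses the rw backing from an optional scratch root and builds the
// overlayfs options. The options are the same either way, because only
// what is mounted at rwDir changes.
OverlayPlan PlanOverlay(const std::string& lowerDir, const std::string& rwDir, const std::optional<std::string>& scratchRoot)
{
    const RwBacking backing = scratchRoot.has_value() ? RwBacking::Scratch : RwBacking::Tmpfs;
    const std::string options = "lowerdir=" + lowerDir + ",upperdir=" + rwDir + "/upper,workdir=" + rwDir + "/work";
    return {backing, options};
}
```

Keeping the overlay options independent of the backing is what makes the tmpfs fallback cheap in this sketch: a failed scratch setup only changes what gets mounted at the rw directory.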

No new user-facing strings; no IDL/ABI changes (the ScratchLun field moved within the internal mini_init wire protocol, whose host and guest sides are versioned together).

Validation Steps Performed

Built, deployed to host, and validated end-to-end inside the system distro (wsl --system -d <distro>):

  • ext4-backed overlay: root overlay (df -h /) reports 64 GB (the scratch size); mount shows upperdir=/system/rw/upper on the bind-mounted ext4 scratch.
  • RAM not pinned: a 4 GiB write lands in reclaimable buff/cache (freed by echo 3 > /proc/sys/vm/drop_caches) and not in shared/shmem, unlike the previous tmpfs behavior.
  • Host disk grows on demand: an 8 GiB random write grows the host scratch-<id>.vhdx from ~0.07 GB to ~8.07 GB; only the VHD for the instance written to grows (per-instance isolation confirmed with two running distros).
  • Cleanup: wsl --terminate <distro> ejects and deletes that instance's scratch-*.vhdx while the VM stays running; the per-VM temp directory is removed on wsl --shutdown.
  • Fallback: verified the overlay still launches via tmpfs when no scratch device is available.

Constrained-memory OOM A/B (the headline win): capped the VM at memory=4GB, swap=0 and wrote 6 GiB into the system-distro overlay on the same build, comparing the ext4 scratch against the old tmpfs behavior (mount -t tmpfs):

| | ext4 scratch (this change) | tmpfs (old behavior) |
| --- | --- | --- |
| 6 GiB overlay write | succeeds (/ 10% used, data on disk) | fails |
| Global OOM killer | none | fires repeatedly |
| Processes killed | none | Xwayland, WSLGd/init(), pulseaudio, GnsEngine, … |
| vmmemWSL (host) | bounded under the 4 GB cap | pinned at the cap, then OOM |

With ext4 the kernel writes overlay data back to the scratch VHD and frees the pages, so 6 GiB fits in a 4 GiB VM. With tmpfs every byte is unevictable shmem, so at the memory ceiling the kernel OOM-kills the WSLg stack and the write never completes — the exact failure this change eliminates.

Memory reclaimability A/B (unconstrained VM): same 8 GiB write — ext4 keeps guest Shmem flat and buff/cache fully reclaimable (host vmmemWSL working set drops ~9.7 GB → ~5.4 GB after drop_caches), whereas tmpfs parks the full 8 GiB in Shmem that drop_caches cannot reclaim (working set stays pinned at ~9.5 GB).
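
One way to observe the Shmem versus page-cache difference from inside the guest is to read fields out of /proc/meminfo before and after the write. The helper below is a small sketch of that parsing; `ReadMeminfoKb` is a hypothetical name and is not part of the PR or its tests.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// Returns the value (in kB) of one /proc/meminfo field, e.g. "Shmem" or
// "Cached", from the file's text. The match includes the trailing colon, so
// "Shmem" does not match "ShmemHugePages".
std::optional<std::uint64_t> ReadMeminfoKb(const std::string& meminfoText, const std::string& field)
{
    assert(!field.empty());

    const std::string prefix = field + ":";
    std::istringstream stream(meminfoText);
    std::string line;
    while (std::getline(stream, line))
    {
        if (line.compare(0, prefix.size(), prefix) == 0)
        {
            std::istringstream value(line.substr(prefix.size()));
            std::uint64_t kilobytes = 0;
            if (value >> kilobytes)
            {
                return kilobytes;
            }
            return std::nullopt;
        }
    }
    return std::nullopt;
}
```

With tmpfs backing, the `Shmem` value would be expected to grow by roughly the amount written; with the ext4 scratch, the growth should show up under `Cached` and shrink again after drop_caches.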

The change was additionally reviewed across several rounds of multi-model code review; all findings were addressed.

Copilot AI review requested due to automatic review settings June 7, 2026 20:05
The WSLg system distro runs from a read-only vhd with a writable overlay on
top. That overlay's read/write layer was backed by tmpfs, so everything
written into it (logs, temp files, copied-up files, build output) consumed
guest memory and could spill into swap. Heavy writes -- e.g. compiling the
Linux kernel in the system distro -- could exhaust RAM and swap and trigger
the OOM killer VM-wide.

Back the overlay read/write layer with a per-instance temporary ext4 "scratch"
vhd (dynamically expanding, 64 GB cap) instead, mirroring the swap vhd. Writes
now land on reclaimable disk page cache rather than pinned guest memory, and a
runaway write gets a clean ENOSPC instead of an OOM kill.

The host creates and attaches a scratch-<InstanceId>.vhdx per instance when
GUI apps are enabled, passes its LUN to the guest in LX_MINI_INIT_MESSAGE, and
ejects + deletes it when the instance terminates (with the per-VM temp dir
teardown as a backstop). The guest formats the device as ext4, mounts it, and
bind-mounts a unique subdirectory as the overlay rw layer. If the scratch vhd
cannot be created, attached, or mounted, the overlay transparently falls back
to the previous tmpfs behavior so the distro still launches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@benhillis benhillis force-pushed the benhillis/scratch-vhd-overlay branch from 9f44d5b to 7c4a74e Compare June 7, 2026 20:08

Copilot AI left a comment

Contributor


Pull request overview

This PR changes how the WSLg system distro’s writable overlay is backed: instead of using tmpfs (guest memory), it introduces a per-instance, dynamically-expanding scratch VHD that is formatted/mounted in the guest and used as the overlayfs upper/work backing store. This aims to prevent heavy write workloads in the system distro from exhausting VM RAM/swap and triggering VM-wide OOM.

Changes:

  • Windows host: create/attach a per-instance scratch-<InstanceId>.vhdx, track it per runtime instance, pass its SCSI LUN to the guest, and clean it up on failed startup and on termination.
  • Wire protocol: add ScratchLun to LX_MINI_INIT_MESSAGE so it can be specified per instance.
  • Linux guest init: format/mount the scratch device as ext4 at /scratch, and teach overlay mounting to optionally use a scratch-backed upper/work layer with fallback to tmpfs on failure.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/windows/service/exe/WslCoreVm.h | Adds APIs and per-instance tracking for scratch VHD paths. |
| src/windows/service/exe/WslCoreVm.cpp | Implements scratch VHD creation/attach, LUN passing, and cleanup logic. |
| src/windows/service/exe/WslCoreInstance.h | Exposes instance id for cleanup coordination. |
| src/windows/service/exe/WslCoreInstance.cpp | Implements GetInstanceId(). |
| src/windows/service/exe/LxssUserSession.cpp | Ensures scratch cleanup on post-create startup failure and on terminate. |
| src/shared/inc/lxinitshared.h | Extends mini-init message with ScratchLun. |
| src/linux/init/util.h | Extends overlay mount helper signature; adds temp dir helper decl. |
| src/linux/init/util.cpp | Adds temp dir helper and implements scratch-backed overlay upper/work setup. |
| src/linux/init/main.cpp | Formats/mounts scratch ext4 and uses it for system distro overlay with tmpfs fallback. |

Comment thread src/windows/service/exe/WslCoreVm.cpp Outdated
Comment on lines +1234 to +1236
scratchLun = AttachDiskLockHeld(scratchPath.c_str(), DiskType::VHD, MountFlags::None, {}, false, m_userToken.get());
m_instanceScratchVhds.emplace(InstanceId, scratchPath);
cleanupScratch.release();
The test hard-coded mkfs.ext4 /dev/sde for its bare-mounted 20MB disk. The
per-instance system distro overlay scratch vhd now occupies an earlier
/dev/sd* node, shifting the bare disk and causing the test to format the
wrong device. Detect the disk by size instead, and promote MountTests'
GetBlockDeviceInWsl helper to a shared Common.h/Common.cpp function.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
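
Detecting the disk by size rather than by a fixed device node can be sketched as below. This is an illustration only: `FindBlockDeviceBySize` and the `SizeQuery` callback are hypothetical, and the real helper runs `blockdev --getsize64` inside WSL instead of taking a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Callback that returns the size in bytes of a device node, or nullopt if the
// node does not exist. In the real test this would shell out to blockdev.
using SizeQuery = std::function<std::optional<std::uint64_t>(const std::string&)>;

// Scans /dev/sda through /dev/sdz (inclusive) and returns the first device
// whose size matches exactly.
std::optional<std::string> FindBlockDeviceBySize(std::uint64_t sizeBytes, const SizeQuery& getSize)
{
    assert(static_cast<bool>(getSize));

    for (char name = 'a'; name <= 'z'; name++)
    {
        const std::string device = std::string("/dev/sd") + name;
        const auto size = getSize(device);
        if (size.has_value() && *size == sizeBytes)
        {
            return device;
        }
    }
    return std::nullopt;
}
```

Matching on size keeps the test independent of how many other disks (swap, scratch, distro VHDs) were attached before the bare disk, as long as the test disk's size is unique.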
Copilot AI review requested due to automatic review settings June 8, 2026 16:29

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread src/windows/service/exe/WslCoreVm.cpp Outdated
Comment on lines +1440 to +1441
const auto scratchPath = std::move(search->second);
m_instanceScratchVhds.erase(search);
Comment thread test/windows/Common.cpp Outdated
Comment on lines +2485 to +2488
for (wchar_t name = 'a'; name < 'z'; name++)
{
std::wstring cmd = L"-u root blockdev --getsize64 /dev/sd";
cmd += name;
The per-instance scratch vhd path is deterministic from the instance id, so
remove the m_instanceScratchVhds map and derive the path via GetInstanceScratchPath
wherever it is needed (create, failure cleanup, terminate). This makes cleanup
idempotent and removes the untracked-leak and const-std::move review findings.

Also fix the block-device scan in the test helper to include /dev/sdz.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
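
The idea of deriving the path from the instance id, so that cleanup needs no tracking map and can safely run more than once, can be sketched like this. The function names carry a `Sketch` suffix because they are hypothetical: the real GetInstanceScratchPath signature is not shown here, the exact GUID formatting in the file name is an assumption, and the real cleanup also ejects the disk before deleting the file.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Derives the scratch VHD path purely from the per-VM temp directory and the
// instance id, following the scratch-<InstanceId>.vhdx naming from the PR.
std::filesystem::path GetInstanceScratchPathSketch(const std::filesystem::path& vmTempDir, const std::string& instanceId)
{
    assert(!instanceId.empty());
    return vmTempDir / ("scratch-" + instanceId + ".vhdx");
}

// Deletes the scratch file if it exists. A missing file is not an error, so
// calling this twice (or after a failed create) is harmless.
// Returns true when no filesystem error occurred.
bool CleanupInstanceScratchSketch(const std::filesystem::path& vmTempDir, const std::string& instanceId)
{
    std::error_code error;
    std::filesystem::remove(GetInstanceScratchPathSketch(vmTempDir, instanceId), error);
    return !error;
}
```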