sys_madvise silently accepted unmapped holes for every advice except MADV_DONTNEED: MADV_FREE/COLD/PAGEOUT/HUGEPAGE/NOHUGEPAGE/NORMAL/RANDOM/SEQUENTIAL/WILLNEED returned 0 even when the range extended into unmapped address space. Linux's madvise_walk_vmas reports -ENOMEM when the range contains a gap, so guests that used these advices to probe for unmapped ranges got the wrong answer. MADV_FREE additionally accepted file-backed mappings, which real Linux rejects via vma_is_anonymous.
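To make the hole check concrete, here is a minimal sketch of a range walk that fails with -ENOMEM at the first gap. The `struct region` layout and the `range_mapped` name are hypothetical and chosen for illustration; elfuse's real region table and locking are not shown, and the sketch assumes the regions are sorted by start address and do not overlap.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region record; the real struct in elfuse will differ. */
struct region {
    uint64_t start; /* inclusive */
    uint64_t end;   /* exclusive */
};

/*
 * Return 0 if [addr, addr + len) is fully covered by regions[], which is
 * assumed to be sorted by start and non-overlapping. Return -ENOMEM as
 * soon as a gap is found. A zero-length range is trivially covered.
 */
static int range_mapped(const struct region *regions, size_t n,
                        uint64_t addr, uint64_t len)
{
    uint64_t end = addr + len;
    uint64_t cursor = addr;

    if (end < addr)
        return -EINVAL; /* wrapping range; rejected up front in this sketch */

    for (size_t i = 0; i < n && cursor < end; i++) {
        if (regions[i].end <= cursor)
            continue;       /* region lies entirely below the cursor */
        if (regions[i].start > cursor)
            return -ENOMEM; /* hole between cursor and this region */
        cursor = regions[i].end;
    }
    /* Anything left uncovered past the last region is also a hole. */
    return cursor >= end ? 0 : -ENOMEM;
}
```

Because the walk only reads the table, a caller would presumably hold whatever lock protects the region list for the duration of the call.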
DONTNEED had two latent bugs that survived because tests never covered the relevant paths. First, the PROT_NONE skip meant that a guest doing mprotect PROT_NONE -> madvise DONTNEED -> mprotect RW saw stale data on re-grant, whereas Linux detaches the pages and zero-fills lazily. Second, for MAP_SHARED file-backed regions, memset+pread overwrote unsynced in-memory writes with stale file contents: Linux can't discard pages from a shared page-cache mapping at all, and elfuse's CoW-shared model has nothing on disk to refill from. The single pread also lacked EINTR retry and was vulnerable to short reads, unlike the loop already used by sys_mmap for file-backed initial fill.
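The difference between a single pread and an EINTR-tolerant loop can be sketched as follows. This is not the loop from sys_mmap, which is not shown here; `pread_full` is a hypothetical name, and zero-filling the tail past end-of-file is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fill buf[0..len) from fd starting at offset off.
 * Retries when pread is interrupted (EINTR) and keeps reading after a
 * short read. Bytes past end-of-file are zero-filled (sketch assumption).
 * Returns 0 on success or -errno on a hard read error.
 */
static int pread_full(int fd, void *buf, size_t len, off_t off)
{
    unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, p + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data: try again */
            return -errno;  /* real failure */
        }
        if (n == 0)
            break;          /* end of file */
        done += (size_t)n;  /* short read: loop for the remainder */
    }
    memset(p + done, 0, len - done);
    return 0;
}
```

A single pread call can legitimately return fewer bytes than requested, so checking only for a negative return value would leave part of the buffer holding old data.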
The new madvise_range_mapped helper walks g->regions[] under mmap_lock and is invoked once per advice that Linux validates via madvise_walk_vmas. DONTNEED additionally drops the PROT_NONE skip, skips r->shared file regions, and replaces the pread call with the same EINTR-tolerant loop sys_mmap uses. MADV_FREE checks for an anonymous private mapping via flag bits to match the spirit of vma_is_anonymous.
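A flag-bit check in the spirit of the MADV_FREE rule might look like the sketch below. The numeric values are the usual Linux ABI values for MAP_SHARED, MAP_PRIVATE, and MAP_ANONYMOUS on x86-64 and arm64; `LINUX_MAP_SHARED`, `LINUX_MAP_PRIVATE`, and the `madv_free_check` function are hypothetical names, and how elfuse actually stores the flags per region is an assumption.

```c
#include <errno.h>
#include <stdbool.h>

/* Linux ABI values (x86-64/arm64); the macro names are illustrative. */
#define LINUX_MAP_SHARED    0x01
#define LINUX_MAP_PRIVATE   0x02
#define LINUX_MAP_ANONYMOUS 0x20

/*
 * Accept MADV_FREE only for anonymous private mappings.
 * Returns 0 if the mapping qualifies, -EINVAL otherwise.
 */
static int madv_free_check(int map_flags)
{
    bool anon   = (map_flags & LINUX_MAP_ANONYMOUS) != 0;
    bool shared = (map_flags & LINUX_MAP_SHARED) != 0;

    return (anon && !shared) ? 0 : -EINVAL;
}
```

A check like this only works if the anonymous bit is preserved when the region is recorded, which is why the flag handling at mmap time matters for this rule.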
Summary by cubic
Aligns `sys_madvise` with Linux by rejecting holes/OOB ranges, fixing `MADV_DONTNEED` semantics (zero-fill even under `PROT_NONE`, preserve writable `MAP_SHARED`, robust file refill), and restricting `MADV_FREE` to anonymous private mappings. Adds tests to lock in these behaviors and fixes `sys_mmap` region flags to reliably mark anonymous mappings.

- Returns `ENOMEM` for unmapped gaps and addresses outside the guest space.
- `MADV_DONTNEED`: zero anonymous pages even if `PROT_NONE`; skip writable `MAP_SHARED`; reload clean read-only/shared pages from file with an EINTR/short-read safe loop.
- `MADV_FREE`: allow only on anonymous private mappings; return `EINVAL` for file-backed or shared (incl. closed-fd) mappings.
- Other advice values (`NORMAL`/`RANDOM`/`SEQUENTIAL`/`WILLNEED`/`HUGEPAGE`/`NOHUGEPAGE`/`COLD`/`PAGEOUT`) walk mappings and return `ENOMEM` on holes; otherwise accept silently.
- `sys_mmap`: preserve `LINUX_MAP_ANONYMOUS` in region flags while tracking `MAP_SHARED`/`MAP_PRIVATE`.
- Tests cover `ENOMEM`, `PROT_NONE` zero-fill, writable vs read-only `MAP_SHARED`, shared-anon and file-backed `MADV_FREE` rejections (incl. closed-fd), unknown advice, and `length=0`.

Written for commit 06c3dae. Summary will update on new commits.