Tighten madvise Linux parity for hole/anon/shared #3

Merged: jserv merged 1 commit into main from madvise on May 4, 2026


@jserv jserv commented May 4, 2026

sys_madvise silently accepted unmapped holes for every advice except MADV_DONTNEED: MADV_FREE/COLD/PAGEOUT/HUGEPAGE/NOHUGEPAGE/NORMAL/RANDOM/SEQUENTIAL/WILLNEED returned 0 even when the range extended into unmapped territory. Linux's madvise_walk_vmas returns -ENOMEM when the range contains a gap, so guests that used these advices to probe for unmapped memory saw the wrong answer. MADV_FREE additionally accepted file-backed mappings, which real Linux rejects via vma_is_anonymous.
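A minimal guest-side sketch of the kind of probe described above, assuming a Linux host (or a guest environment that matches Linux here). The function name `probe_hole_errno` is hypothetical and not part of this PR; it maps three pages, unmaps the middle one, and reports what a hint advice returns over the whole range.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise under strict -std */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical probe: returns the errno from madvise(MADV_NORMAL) over a
 * three-page range whose middle page is unmapped, 0 if the call succeeds,
 * or -1 if the setup itself fails. */
int probe_hole_errno(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = 3 * (size_t) page;

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Punch a hole in the middle of the range. */
    if (munmap(base + page, (size_t) page) != 0) {
        munmap(base, len);
        return -1;
    }

    int rc = madvise(base, len, MADV_NORMAL);
    int err = (rc == 0) ? 0 : errno;

    /* Release the two pages that are still mapped. */
    munmap(base, (size_t) page);
    munmap(base + 2 * page, (size_t) page);
    return err;
}
```

On Linux this should report ENOMEM; before this change the same probe inside the guest would have reported success.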

DONTNEED had two latent bugs that survived because tests never covered the relevant paths. First, the PROT_NONE skip meant a guest doing mprotect PROT_NONE -> madvise DONTNEED -> mprotect RW saw stale data on re-grant (Linux detaches the pages and zero-fills lazily). Second, for MAP_SHARED file-backed regions, memset+pread overwrote unsynced in-memory writes with stale file contents; Linux can't discard pages from a shared page-cache mapping at all, and elfuse's CoW-shared model has nothing on disk to refill from. The single pread also lacked EINTR retry and was vulnerable to short reads, unlike the loop already used by sys_mmap for file-backed initial fill.
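The PROT_NONE sequence can be reproduced with a short guest-side sketch, again assuming Linux semantics. The function name is hypothetical; it fills a private anonymous page with a marker byte, revokes access, issues MADV_DONTNEED, restores access, and returns the first byte it then sees.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise under strict -std */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical repro: returns the first byte read after the re-grant,
 * or -1 if any step of the sequence fails. */
int byte_after_dontneed_under_prot_none(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t) page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len); /* marker data that must not survive DONTNEED */

    int result = -1;
    if (mprotect(p, len, PROT_NONE) == 0 &&
        madvise(p, len, MADV_DONTNEED) == 0 &&
        mprotect(p, len, PROT_READ | PROT_WRITE) == 0)
        result = p[0];

    munmap(p, len);
    return result;
}
```

Linux returns 0 here because the page is zero-filled on the next touch; with the old PROT_NONE skip the guest would have read back 0xAA.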

The new madvise_range_mapped helper walks g->regions[] under mmap_lock and is invoked once per advice that Linux validates via madvise_walk_vmas. DONTNEED additionally drops the PROT_NONE skip, skips r->shared file regions, and replaces the pread call with the same EINTR-tolerant loop sys_mmap uses. MADV_FREE checks for anonymous private via flag bits to match the spirit of vma_is_anonymous.
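The coverage walk can be sketched roughly as follows. This is not the code from the commit: the `struct region` layout, the `range_mapped` name, and the assumption that the region array is sorted by start address and non-overlapping are all illustrative, and the locking around the walk is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region record; the real elfuse layout is not shown in this PR. */
struct region {
    uint64_t start; /* inclusive */
    uint64_t end;   /* exclusive */
};

/* Returns 0 if [addr, addr + len) is fully covered by regions[], which is
 * assumed sorted by start and non-overlapping; returns -ENOMEM on any gap,
 * on a range that runs past the last region, or on address wraparound. */
int range_mapped(const struct region *regions, size_t n,
                 uint64_t addr, uint64_t len)
{
    assert(regions != NULL || n == 0);

    uint64_t end = addr + len;
    if (end < addr)
        return -ENOMEM; /* addr + len wrapped around */

    uint64_t cursor = addr; /* lowest address not yet proven mapped */
    for (size_t i = 0; i < n && cursor < end; i++) {
        if (regions[i].end <= cursor)
            continue; /* region lies entirely below the cursor */
        if (regions[i].start > cursor)
            return -ENOMEM; /* hole between cursor and this region */
        cursor = regions[i].end;
    }
    return cursor >= end ? 0 : -ENOMEM;
}
```

In this sketch a zero-length range is trivially covered and returns 0, which is one plausible reading of the length=0 case the tests exercise.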


Summary by cubic

Aligns sys_madvise with Linux by rejecting holes/OOB ranges, fixing MADV_DONTNEED semantics (zero-fill even under PROT_NONE, preserve writable MAP_SHARED, robust file refill), and restricting MADV_FREE to anonymous private. Adds tests to lock in these behaviors and fixes sys_mmap region flags to reliably mark anonymous mappings.

  • Bug Fixes
    • Validate ranges with a new helper; return ENOMEM for unmapped gaps and addresses outside the guest space.
    • MADV_DONTNEED: zero anonymous pages even if PROT_NONE; skip writable MAP_SHARED; reload clean read-only/shared pages from file with an EINTR/short‑read safe loop.
    • MADV_FREE: allow only on anonymous private mappings; return EINVAL for file‑backed or shared (incl. closed‑fd) mappings.
    • Hint advices (NORMAL/RANDOM/SEQUENTIAL/WILLNEED/HUGEPAGE/NOHUGEPAGE/COLD/PAGEOUT) walk mappings and return ENOMEM on holes; otherwise accept silently.
    • sys_mmap: preserve LINUX_MAP_ANONYMOUS in region flags while tracking MAP_SHARED/MAP_PRIVATE.
    • Tests cover OOB ENOMEM, PROT_NONE zero‑fill, writable vs read‑only MAP_SHARED, shared‑anon and file‑backed MADV_FREE rejections (incl. closed‑fd), unknown advice, and length=0.
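An EINTR and short-read safe refill loop of the kind the MADV_DONTNEED item refers to might look like the sketch below. The names `refill_from_file` and `refill_demo` are hypothetical, and zero-filling the tail past end of file is an assumption of this sketch rather than something the PR states.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for pread and mkstemp under strict -std */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fills buf[0..len) from fd starting at off. Retries on EINTR, keeps going
 * after short reads, and zero-fills whatever lies past end of file (an
 * assumption of this sketch). Returns 0 on success or -errno on failure. */
int refill_from_file(int fd, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    char *dst = buf;
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, dst + done, len - done, off + (off_t) done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted before any data: retry */
            return -errno;
        }
        if (n == 0)
            break; /* end of file */
        done += (size_t) n; /* short read: loop for the remainder */
    }
    memset(dst + done, 0, len - done);
    return 0;
}

/* Usage check: a 5-byte temporary file refilled into an 8-byte buffer.
 * Returns 0 if the buffer holds the file bytes followed by zeros. */
int refill_demo(void)
{
    char path[] = "/tmp/refill-demoXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the open descriptor keeps the file alive */

    unsigned char buf[8];
    memset(buf, 0xAA, sizeof buf);

    int rc = -1;
    if (write(fd, "hello", 5) == 5)
        rc = refill_from_file(fd, buf, sizeof buf, 0);
    close(fd);
    if (rc != 0)
        return rc;
    return memcmp(buf, "hello\0\0\0", sizeof buf) == 0 ? 0 : -1;
}
```

Looping on the remaining byte count rather than trusting a single pread is what makes the refill robust against both interruptions and partial transfers.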

Written for commit 06c3dae. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

@jserv jserv merged commit ccdefd0 into main May 4, 2026
4 checks passed
@jserv jserv deleted the madvise branch May 4, 2026 21:17