[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8
[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8bchalios wants to merge 76 commits into
Conversation
88151c3 to
89538bc
Compare
a041675 to
61fdd9d
Compare
d91bdb1 to
61fdd9d
Compare
The commit 7ee43cc ("fix: Support creating file with open_file_nonblock") did modify the file opening utility function by adding `open` option, but it also removed the `read` option from it. This causes an error during metrics and logs file initialization code if the the file is a FIFO and there are no readers already reading from it. This is because `open` returns `ENXIO` when opening a FIFO write-only as described in the man page: ``` ENXIO O_NONBLOCK | O_WRONLY is set, the named file is a FIFO, and no process has the FIFO open for reading. ``` Fix is just a partial revert of the part that changed the file opening logic by re-introducing same `open_file_nonblock` as it was before but with added `create` flag. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
No functional change. Just cleanup. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add note about fix in the firecracker-microvm#5698 PR. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
In GuestMemoryExtension::dump_dirty() and GuestMemorySlot::dump_dirty() there are several unwraps which will cause Firecracker to panic. Instead of panicking simply propagate errors up the call chain to mark the snapshot operation as failed. To make error handling less verbose and more descriptive make GuestMemorySlot::dump_dirty() return MemoryError instead of GuestMemoryError and define new error types for MemoryError. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The comment in test_dump_dirty() before the last snapshot is taken says: // First region pages: [dirty, clean] However, the following code line updates the second page rather than the first one, so in reality the state of the bitmap is [clean, dirty]. Fix the comment. Additionally, the "Dump only the dirty pages." comment further above is misleading. Since we write to all pages of all regions, the Firecracker bitmap (rather than the KVM bitmap) will be all 1s, therefore the first snapshot taken by this function will dump all memory pages. Rename 'dirty_bitmap' to 'kvm_dirty_bitmap' to make it clear which bitmap it refers to and add comments about the state of the Firecracker bitmap too. To make things more obvious for the reader re-configure the state of the KVM dirty bitmap next to the relevant comment, just before we take the second snapshot. Since we are in the neighbourhood, rename 'expected_first_region' to 'expected_file_contents' since it clearly stores the contents of both regions rather than the first one only. No functional change intended. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
It is possible that the dirty bitmap dump_dirty() receives is larger or smaller than the slot size. Return an error in that case. Additionally, the inner loop in that function always iterates over the 0..64 range. Typically the region size won't be a multiple of 64, so we need to make sure that we break after we check the last bit that corresponds to the last page of the region. Extend the test_dump_dirty() test case to exercise the new code paths by supplying wrongly sized bitmaps to the function. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
GuestMemorySlot::dump_dirty() does not advance the file cursor when the pages at the end of the memory region are clean. This causes the next slot to start writing at an incorrect offset and corrupting the contents of the previous region. Fix this by always advancing the cursor at the end of dump_dirty(). As per the original report from EJ Ciramella this only affects VMs with more than one memory slots and it can cause Firecracker or the guest to crash when loading a corrupted snapshot. Fixes: 6c4c1bf ("feat(mem): introduce KVM slots per GuestMemoryRegion") Reported-by: EJ Ciramella <ejc3@meta.com> Suggested-by: EJ Ciramella <ejc3@meta.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_dump_dirty() test currently checks the file contents of a differential snapshot where the last page of both memory regions is dirty. Extend this test to also test the case where the last page of both regions is clean. Additionally, check that the logical size of the resulting file is different than the physical size due to the holes representing clean pages. Finally, make sure that if the KVM dirty bitmap is larger than the region size, the extra bits are ignored. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_diff_snapshot_overlay() case tests differential snapshots on VMs that have a single memory slot since basic_config() uses a 256MiB memory size by default. Parametrize the test so that it's repeated for both 256MiB and 4096MiB sizes. On x86 this will create 2 memory slots and hence test a different scenario. Suggested-by: Riccardo Mancini <mancio@amazon.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Update v1.14.2 CHANGELOG mentioning the fix for firecracker-microvm#5705 Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
This one was only used in devtool, so move it there. No functional change. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
There are several issues with the current implementation which result in a spurious failures/timeouts in a CI. Additionally, the kvm update logic was duplicated in a couple of places. To resolve these issues, refactor the code a bit to move kvm loading/checking logic into one place and implement more robust locking mechanism. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Update version number / CHANGELOG / CREDITS Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Bump aws-lc-rs to 1.16.1 to take the newest aws-lc-sys 0.38.0 which is required for security advisory. (cherry picked from commit f056e65) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously the value of the opt len was not validated which contradicts the spec. Per RFC 9293 (MUST-7), opt len includes the kind and length bytes so the minimum valid value length is 2. Reported-by: Kai Mitsuzawa <kaizawap@amazon.co.jp> (@kaizawa97) (cherry picked from commit fa3e4d7) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Ensure we reject invalid option lengths (cherry picked from commit e0eb5e7) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add new entry for the MMDS TCP option length fix Signed-off-by: Jack Thomson <jackabt@amazon.com>
The virtio-blk spec (section 5.2.6.14) defines VIRTIO_BLK_T_DISCARD (type 11) as a request the guest driver issues to free sectors it no longer needs. To handle it, the VMM needs three things introduced here: - `RequestType::Discard` so the dispatcher can route the request type. - `DiscardWriteZeroes` (16 bytes, ByteValued) matching the layout of a single discard segment that the guest places in the data descriptor. - `DISCARD_SEGMENT_SIZE` constant (16) used in parse() to validate that the data buffer length is a non-zero multiple of the segment size. - `IoErr::InvalidFlags` / `IoErr::InvalidOffset` for per-segment validation errors added in the processing step. Placeholder arms in `finish()` and `process()` keep the build green; they are replaced with the real implementation in the next commit. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add SyncIoError::Discard, SyncFileEngine::discard() and AsyncFileEngine::discard() (synchronous fallback — discard is not latency-critical and IORING_OP_FALLOCATE is not in the io_uring allowlist). Wire up FileEngine::discard() dispatching to both engines. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Fix 2-space indented getrandom entries (should be 4 spaces like all other entries) and remove trailing whitespace from madvise comment. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
VIRTIO_BLK_F_DISCARD issues fallocate(FALLOC_FL_PUNCH_HOLE) on the backing file. Add fallocate to the vmm thread seccomp filter on both x86_64 and aarch64 so the syscall is not blocked at runtime. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add parse() validation: discard data descriptor must be read-only and data_len must be a non-zero multiple of DISCARD_SEGMENT_SIZE (16 bytes). Add process_discard_segments(): validates flags==0, num_sectors>0, and sector bounds per segment, then calls FileEngine::discard() using fallocate(FALLOC_FL_PUNCH_HOLE). Returns VIRTIO_BLK_S_IOERR on any validation or syscall failure; VIRTIO_BLK_S_OK on success. Add discard_count metric incremented on each successful discard. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Populate max_discard_sectors and max_discard_seg in restore() to match what new() produces for the expanded ConfigSpace. No version bump needed — main already carried 9.0.0 → 10.0.0 in this release cycle. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Set VIRTIO_BLK_F_DISCARD in avail_features for all writable block devices. Populate max_discard_sectors (capped to disk size) and max_discard_seg=1 in the config space. Read-only devices advertise neither the feature nor non-zero discard config fields. This is a behavioral change for existing writable block device configurations: guests will now see and may negotiate the discard feature. See docs/api_requests/block-discard.md for host filesystem requirements and recommendations. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Explain how discard works, host requirements, limitations, and snapshot compatibility impact. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Test feature flag advertisement, config space discard fields, and end-to-end discard request processing (success and error paths). Update test_virtio_features and test_virtio_read_config to account for the new VIRTIO_BLK_F_DISCARD bit and config space fields. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests: - test_discard: boot VM with writable ext4 disk, write+delete data, run fstrim, verify rc=0 and discard_count metric increments. - test_discard_not_advertised_for_read_only: verify read-only devices report discard_max_bytes=0 in sysfs (feature not advertised). Both parametrized over Sync and Async IO engines. Also add discard_count to the fcmetrics block_metrics list. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Replace the 8-byte _unused1 placeholder at offset 48 with the virtio-blk
write-zeroes fields and the required trailing padding:
- max_write_zeroes_sectors (u32, offset 48)
- max_write_zeroes_seg (u32, offset 52)
- write_zeroes_may_unmap (u8, offset 56)
- _unused1 (u8; 3, offset 57) — spec's unused1
- _pad (u8; 4, offset 60) — Rust alignment padding;
the virtio-blk spec
ends at byte 60.
Bump the compile-time size assertion to 64 and add per-field
offset_of! assertions guarding every field's position against future
accidental layout drift. The offsets mirror the Linux kernel header
include/uapi/linux/virtio_blk.h::struct virtio_blk_config exactly.
Fields default to zero; no behavioural change. They will be populated
when VIRTIO_BLK_F_WRITE_ZEROES is wired up.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add the RequestType::WriteZeroes variant and From<u32> mapping for VIRTIO_BLK_T_WRITE_ZEROES. Add IoErr::WriteZeroesUnsupported and Status::WriteZeroesUnsupported (parallel to the discard equivalents) that return VIRTIO_BLK_S_UNSUPP silently when the host filesystem doesn't support the operation. Add a write_zeroes_unsupported EOPNOTSUPP cache on DiskProperties (not persisted). Add placeholder arms in PendingRequest::finish() and Request::process() so the code compiles. Both currently return WriteZeroesUnsupported; they will be replaced when the full feature is wired up. Update the test-only From<RequestType> for u32 and request_type_flags() helpers to handle the new variant. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add write_zeroes(offset, len, unmap, req) on FileEngine plus the corresponding SyncFileEngine::write_zeroes() and AsyncFileEngine::push_write_zeroes() implementations. The unmap flag selects the fallocate mode at call time: - unmap=true: FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE (mode 3) - unmap=false: FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE (mode 17) The async path uses io_uring's IORING_OP_FALLOCATE (already in the allowlist from the discard work). The sync path calls libc::fallocate. Add SyncIoError::WriteZeroes and extend BlockIoError::is_eopnotsupp() to also match SyncIoError::WriteZeroes so the EOPNOTSUPP fallback machinery (added later) catches both modes. No seccomp / io_uring restriction changes are required: fallocate is already in the seccomp filter and IORING_OP_FALLOCATE is already allowed via the discard work. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Wire up Request::parse() validation and Request::process() handling
for write-zeroes requests, mirroring the discard path:
- parse(): reject write-only data descriptors and require data_len
to be a non-zero multiple of DISCARD_SEGMENT_SIZE (the
DiscardWriteZeroes struct is shared between both ops).
- parse_write_zeroes_segment(): validate flags (only bit 0,
VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, is permitted), reject
num_sectors=0, check sector range, and return the unmap flag
so process() can pick the fallocate mode.
- process(): short-circuit to VIRTIO_BLK_S_UNSUPP if a previous
EOPNOTSUPP set disk.write_zeroes_unsupported; otherwise dispatch
to FileEngine::write_zeroes().
- the (Ok(_), RequestType::WriteZeroes) finish() arm now increments
the write_zeroes_count metric and returns VIRTIO_BLK_S_OK.
- the EOPNOTSUPP detection in the error path also handles
RequestType::WriteZeroes, setting the cache and returning
Status::WriteZeroesUnsupported (silent UNSUPP).
Add the write_zeroes_count metric and aggregate() line.
The feature flag is not yet advertised — that lands in the next
commit, after the request handling is fully in place.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
…vices Set VIRTIO_BLK_F_WRITE_ZEROES in avail_features for any non-read-only block device, alongside VIRTIO_BLK_F_DISCARD. Populate the max_write_zeroes_sectors, max_write_zeroes_seg, and write_zeroes_may_unmap config fields in both VirtioBlock::new() and the persist::restore() path so the values match what the guest sees on a fresh boot vs after a snapshot restore. write_zeroes_may_unmap=1 lets the guest set the UNMAP flag on individual segments, which we then translate to fallocate's PUNCH_HOLE mode (UNMAP=0 uses ZERO_RANGE). Update the test_virtio_features and test_virtio_read_config expectations to account for the new feature bit and config fields. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add three tests parametrised over Sync and Async file engines:
- test_write_zeroes_feature_and_config: VIRTIO_BLK_F_WRITE_ZEROES
advertised on writable devices; max_write_zeroes_sectors,
max_write_zeroes_seg, and write_zeroes_may_unmap populated as
expected; read-only devices advertise none of these.
- test_write_zeroes: end-to-end happy path (UNMAP=1 → PUNCH_HOLE,
metric incremented, VIRTIO_BLK_S_OK) plus four error cases
(reserved flag bit set, sector range out of bounds, num_sectors=0,
invalid data length).
- test_write_zeroes_unsupported_cached: pre-set the cache flag and
verify subsequent requests return VIRTIO_BLK_S_UNSUPP without
incrementing the success or invalid-request counters.
The UNMAP=0 (ZERO_RANGE) happy path is not asserted in the unit
tests because ZERO_RANGE support is filesystem-dependent (the host's
tmpfs in some kernels returns EOPNOTSUPP); end-to-end UNMAP=0
correctness is left to the integration tests running on ext4. The
unit tests cover wire-flag handling for UNMAP=0 via the
invalid-flags and zero-sectors cases.
Also extend test_request_type_from to assert the new
VIRTIO_BLK_T_WRITE_ZEROES → RequestType::WriteZeroes mapping.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Two tests, parametrised over Sync and Async IO engines:
- test_write_zeroes: boot VM with a writable block device, write
random data to a 1 MiB region of /dev/vdb, run `blkdiscard -z`
(BLKZEROOUT ioctl), then verify:
* sysfs /queue/write_zeroes_max_bytes is non-zero (feature
negotiated)
* the region reads back as zeros after the operation
* no requests failed
- test_write_zeroes_not_advertised_for_read_only: read-only device
reports write_zeroes_max_bytes=0 in sysfs.
The test deliberately does not assert that the write_zeroes_count
metric incremented. The Linux kernel's blkdev_issue_zeroout() may
legitimately fall back to issuing plain zero-page writes
(REQ_OP_WRITE) instead of REQ_OP_WRITE_ZEROES even when the feature
is advertised, depending on internal heuristics. Both paths leave
the device reading as zeros, so a metric assertion would be flaky in
CI without indicating any actual bug. Direct WRITE_ZEROES
request-handling correctness is covered by the unit tests.
Use cmp -n /dev/vdb /dev/zero for the zero check (od collapses
repeated lines with *, which makes naive byte-comparison fragile).
Add write_zeroes_count to host_tools/fcmetrics.py so the metrics
schema validation passes.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add docs/api_requests/block-write-zeroes.md describing: - automatic advertisement on writable devices - UNMAP=0 → FALLOC_FL_ZERO_RANGE (zeros in place) - UNMAP=1 → FALLOC_FL_PUNCH_HOLE (zeros + deallocate) - host filesystem requirements - EOPNOTSUPP fallback (silent VIRTIO_BLK_S_UNSUPP, shared cache) - known limitations Remove the "write_zeroes is not supported" line from block-discard.md now that the feature is implemented. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
|
We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check' |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ab2399e. Configure here.
| let is_eopnotsupp = matches!( | ||
| &cqe_result, | ||
| Err(e) if e.raw_os_error() == Some(-libc::EOPNOTSUPP) | ||
| ); |
There was a problem hiding this comment.
Async CQE EOPNOTSUPP check uses wrong errno sign
High Severity
The async completion path checks e.raw_os_error() == Some(-libc::EOPNOTSUPP) (negative), but the Cqe wrapper almost certainly converts the negative CQE result to a positive errno when constructing io::Error (the standard convention for from_raw_os_error). The sync path in BlockIoError::is_eopnotsupp() correctly checks for positive libc::EOPNOTSUPP. If the sign is wrong, discard/write-zeroes EOPNOTSUPP caching never triggers for the async io_uring engine, and every unsupported request hits the generic error path instead.
Reviewed by Cursor Bugbot for commit ab2399e. Configure here.
Whenever free-page hinting is enabled, also advertise the new VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK feature bit (6). When negotiated, the guest driver waits for the device to signal-used each hint buffer before pushing the just-hinted page onto vb->free_page_list, closing a stale-hint data-loss race where the shrinker could recycle a page back to the buddy allocator before discard_range completed on the host. Guests without kernel support for bit 6 simply do not negotiate it (the driver self-clears the bit if VIRTIO_BALLOON_F_FREE_PAGE_HINT is not also negotiated), so this is forward-compatible with stock guests. No host-side protocol change is required: process_free_page_hinting_queue already calls signal_used_queue once per drain, which serves as the ACK the guest waits on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Adds a guest-side check that the negotiated balloon features in
/sys/bus/virtio/devices/virtioN/features include bit 3 (FREE_PAGE_HINT)
and bit 6 (HINT_WAIT_ON_ACK) when free_page_hinting is enabled.
The test is gated on a new dedicated marker, requires_patched_kernel,
which is registered in tests/pytest.ini and added to the default -m
exclusion filter so the test is auto-skipped by every CI run (regular
and nightly). To run it, replace the 6.1 artifact vmlinux with a build
that carries Jack Thomson's wait-on-ACK patch and invoke:
tools/devtool -y test -- -m requires_patched_kernel \
tests/integration_tests/functional/test_balloon_wait_on_ack.py
If the kernel is not patched, the bit-6 assertion fails with a clear
"did you replace the kernel?" message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Add a subsection under free_page_hinting describing the behaviour of VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK: always advertised alongside FPH, self-cleared by guests without the supporting kernel patch, no separate config knob, and a note on the per-buffer round-trip cost on supported guests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
|
We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check' |
|
@cla-bot check |
|
We require contributors to sign our Contributor License Agreement, and we don't have @ilstam, @ShadowCurse, @JackThomson2, @Manciukic, @zulinx86 on file. You can sign our CLA at https://e2b.dev/docs/cla . Once you've signed, post a comment here that says '@cla-bot check' |
|
The cla-bot has been summoned, and re-checked this pull request! |


WIP
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.