chore(release): promote rc-2026.5.4 #118
Merged
The shared upgrade binary cache stored the extracted binary and, on a cache hit, returned it after only a SHA-256 check against a sibling .meta.json. SHA-256 is not a security control: anyone able to write to the shared cache directory (a co-located process, a shared container volume, a low-privilege foothold on the host) could drop a malicious binary plus a forged matching metadata hash, and the next ant-node instance to upgrade would execute it with no signature verification at all, giving persistent RCE on every co-located node. The ML-DSA-65 signature covers the archive and was only checked on the initial download, never on a cache hit.

Changes:

- Cache the signed *archive + detached signature* instead of the extracted binary. `BinaryCache::get_verified_archive` re-runs ML-DSA-65 verification on every cache hit; the binary is always extracted fresh from the just-verified archive. A tampered archive, tampered or missing signature, or forged metadata fails verification against the pinned release public key, so a poisoned cache entry is rejected and a fresh verified download runs.
- Stage cached files into the caller's process-private temp directory and verify that copy, then extract from the same private path. Closes the verify-vs-extract TOCTOU on the shared cache files: an attacker cannot swap the bytes between when the verifier reads them and when the extractor reads them.
- Size policy before any copy or read. `fs::symlink_metadata` + `file_type().is_file()` rejects symlinks / FIFOs / devices outright; archive size is bounded by `MAX_ARCHIVE_SIZE_BYTES` and the signature must be exactly `SIGNATURE_SIZE` bytes. Otherwise an attacker could plant `cached.archive -> /dev/zero` (stats as 0 bytes) and force unbounded disk fill in the staging dir or OOM in `signature::verify`.
- Cache only after successful extraction. A validly-signed-but-malformed release no longer becomes a shared cache poison pill that every later node downloads, fails to extract, and re-downloads.
- `cache_dir.rs` restricts the shared upgrade cache directory to 0700 on Unix as defence in depth; the ML-DSA gate is the primary control.
- `store_archive` mirrors the same size / file-type / signature checks before persisting, so a poisoned entry cannot be created through the supported path either.

Tests in `src/upgrade/binary_cache.rs` cover the tamper path (SHA-256-forged swap on disk rejected by the signature re-check), the post-hit shared-file swap (private copy unaffected), the symlink-to-`/dev/zero` bypass attempt, oversize archive / wrong-sized signature rejection, and round-trip storage. Production verifies against the pinned `RELEASE_SIGNING_KEY`; tests use a `#[cfg(test)]`-only constructor that injects a generated key without weakening the production trust anchor.

Residual: cache entries are not bound to a specific release version (the ML-DSA signing context is constant across versions), so a same-UID attacker who already has any past validly-signed release can plant it under a newer version's cache key and force a downgrade to that old signed binary. Not RCE (still legitimately-signed bytes) and a same-UID attacker has easier paths anyway; closing it cleanly requires coordinated changes in the release-signing pipeline, ant-keygen, ant-node, and ant-client, and is tracked in the binary_cache module docs.
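The size policy described above can be sketched roughly as follows. This is a minimal illustration, not the code from `binary_cache.rs`: the helper name `check_regular_and_bounded` and the cap value are assumptions made for the example, and only the `fs::symlink_metadata` + `file_type().is_file()` combination comes from the commit message.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical cap for this sketch; the real `MAX_ARCHIVE_SIZE_BYTES`
/// value is not stated in the commit message.
const MAX_ARCHIVE_SIZE_BYTES: u64 = 1024;

/// Hypothetical helper: returns the file length if `path` is a regular file
/// (not a symlink, FIFO, or device) of at most `max_len` bytes.
fn check_regular_and_bounded(path: &Path, max_len: u64) -> io::Result<u64> {
    // `symlink_metadata` does not follow symlinks, so a symlink to
    // /dev/zero is reported as a symlink rather than as its target.
    let meta = fs::symlink_metadata(path)?;
    if !meta.file_type().is_file() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "not a regular file",
        ));
    }
    // Bound the size before any copy or read is attempted.
    if meta.len() > max_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "file exceeds size cap",
        ));
    }
    Ok(meta.len())
}
```

A path-based check like this still leaves a gap between the check and a later open of the same path, so on its own it is only a first filter.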
Review feedback on the upgrade binary cache:

- `meta.json` was read with an unbounded `fs::read_to_string`. An attacker with write access to the shared cache directory could plant the metadata sidecar as a symlink to `/dev/zero` or as a huge file and stall the read into a hang/OOM before the archive/sig hardening ran. The metadata path now goes through the same open-once-and-validate gate as the archive: regular-file check on the opened handle, capped at `MAX_META_BYTES` (4 KiB).
- Archive + signature staging previously did `symlink_metadata` (path) followed by `fs::copy` (path), leaving a small TOCTOU window where an attacker could race-swap the path to a symlink/FIFO/device or an oversized file between the check and the copy. Both files are now opened once via `open_regular_capped`, validated on the resulting `File` handle (size + file-type), and copied into the private staging dir from the open handle (wrapped in `Read::take(len)` as belt-and-braces against a post-open extension). All subsequent operations on those files use the staged private bytes, never the shared path.
- Comment fix: the prior comment claimed `sha256_file` loads the archive into memory in full. It actually streams in 8 KiB chunks; the memory-pressure concern is `signature::verify_from_file*` (FIPS-204 requires the message as a slice). Wording updated.
- Stale error message "Failed to serialize binary cache meta" updated to "Failed to serialize cached archive metadata": the cache now stores archive metadata, not extracted-binary metadata.

Two new tests:

- test_oversized_meta_is_rejected
- test_meta_symlink_to_special_file_is_rejected (Unix-only)

488 lib tests pass; cfd clean.
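The open-once-and-validate pattern from this commit might look roughly like the sketch below. The function name `stage_capped` and its signature are hypothetical (the real `open_regular_capped` signature is not shown here); the sketch only illustrates validating the opened handle and copying through `Read::take(len)`, and it omits the non-blocking open needed to cope with FIFOs.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: open `src` once, validate the opened handle, and
/// copy at most the validated length into `dest` (assumed to live in a
/// process-private staging directory). Returns the number of bytes copied.
fn stage_capped(src: &Path, dest: &Path, max_len: u64) -> io::Result<u64> {
    let file = File::open(src)?;
    // Validate the handle that was actually opened, not the path, so a
    // later swap of the path cannot change what gets copied.
    let meta = file.metadata()?;
    if !meta.is_file() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "not a regular file",
        ));
    }
    let len = meta.len();
    if len > max_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "file exceeds size cap",
        ));
    }
    // `take(len)` stops the copy at the validated length even if the file
    // is extended after the size check.
    let mut limited = file.take(len);
    let mut out = File::create(dest)?;
    io::copy(&mut limited, &mut out)
}
```

Everything after staging would then read `dest`, never `src`, which is the property the commit relies on.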
Close a local DoS on auto-upgrade: a cache-dir attacker could plant a FIFO at ant-node-<ver>.archive (or .sig / .meta.json) and open() for reading would block indefinitely waiting for a writer, hanging the upgrade. open_regular_capped previously only checked file type AFTER the blocking open.

Two-layer defence in open_regular_capped:

- Pre-check via fs::metadata (follows symlinks), reject non-regular files before open(). A symlink-to-regular is still accepted as before; a symlink-to-FIFO/device/socket is rejected.
- On Unix, also open with O_NONBLOCK so a race between the pre-check and open() cannot reopen the FIFO window. Reads on regular files ignore O_NONBLOCK, so this is a no-op for the happy path. Platform-specific constant (0o4000 Linux, 0x0004 macOS/BSD); fallback to no flag on unknown unix-likes.

The existing post-open is_file() check on the file handle remains the TOCTOU-safe final gate.

New regression test test_fifo_cached_archive_does_not_hang plants a real FIFO via mkfifo and asserts return in well under 2s.

14/14 binary_cache tests pass; cfd clean.
Round 2 from adversarial review:

- Replace hand-coded O_NONBLOCK constants with libc::O_NONBLOCK. The previous 0o4000/0x0004 per-OS values were correct on x86_64/aarch64/arm but wrong on Linux/MIPS (0o200) and Linux/SPARC (0x4000), where 0o4000 maps to O_NOATIME. Using the libc constant always picks the right value for the target arch. Add libc as a Unix-only direct dependency (was already transitive).
- Test test_fifo_cached_archive_does_not_hang: replace the mkfifo shell-out with libc::mkfifo so a CI image that drops coreutils cannot silently skip this test. Bump the budget from 2s to 5s to absorb GitHub Actions macOS runner cold-start variance, since the failure modes "O_NONBLOCK wrong on this arch" and "CI runner slow" look identical from the assertion.
- Document the load-bearing invariant on get_verified_archive's private_dir: callers MUST supply a process-private 0o700 dir (apply.rs already does via tempfile + permissions). Without that the reopens-by-path in sha256_file/verify_archive would reopen a TOCTOU window.
- Add a cross-reference comment explaining the intentional asymmetry between store_archive (uses symlink_metadata, rejects symlinks) and open_regular_capped (uses fs::metadata, accepts symlink-to-regular) so a later editor doesn't unify them in the wrong direction.

14/14 binary_cache tests pass, 489/489 lib tests pass, cfd clean.
Switch both Linux release targets from glibc to musl so the published
binaries run on any Linux distribution, including Alpine and other
musl-based systems. Asset filenames are unchanged
(ant-node-cli-linux-{arm64,x64}.tar.gz) so existing auto-upgraders on
deployed nodes continue to find them.
x86_64-unknown-linux-musl now uses `cross` for the musl toolchain
(matching aarch64). musl-static binaries have no dynamic linker
dependency and execute on glibc hosts as well as musl hosts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
musl's default malloc is notably slower than glibc's under concurrent allocation churn, which is the steady-state shape of a DHT-bridged P2P node. Switching the global allocator to mimalloc neutralises that regression for the musl Linux builds, and tends to outperform glibc's allocator as well, so all builds benefit. Applied to both ant-node and ant-devnet binaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Merkle pay-yourself defence verified candidate closeness with an iterative Kademlia *network* lookup (find_closest_nodes_network) on the PUT-handling hot path. That lookup runs up to MAX_ITERATIONS rounds bounded by CLOSENESS_LOOKUP_TIMEOUT (240s) and is the dominant term in slow per-chunk store times; its instability (fresh transient peers pulled in on every call) also contributes to the closeness disagreements that cause outright rejections.

Answer instead from the local routing table (find_closest_nodes_local, a pure in-memory k-bucket read with no network I/O), matching the precedent already used for the close-group responsibility check (find_closest_nodes_local_with_self). Fall back to the network lookup only when the local table is genuinely too sparse to be authoritative (fewer than CLOSENESS_LOOKUP_WIDTH peers near the midpoint). The fallback is gated on local table size, not match outcome, so a forged pool cannot force the expensive 240s path -- an attacker cannot make a victim's local routing table sparse.

check_closeness_match and the single-flight pass-cache wrapper are unchanged. Node-side only, no wire/protocol change, so this is backwards compatible across a mixed-version fleet.

The fallback decision is extracted into a pure const fn (closeness_should_fall_back_to_network) so its CLOSENESS_LOOKUP_WIDTH boundary is unit tested without standing up a P2PNode.

Test results:

- cargo fmt -- --check: clean
- cargo clippy --lib --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used: no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed (incl. new boundary test closeness_falls_back_to_network_only_below_lookup_width)
- e2e test target (--test e2e --features test-utils): compiles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
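The size-only gate in this commit can be pictured as a one-line pure function. This is a sketch under assumptions: the parameter name, the exact signature, and the value given to `CLOSENESS_LOOKUP_WIDTH` are all hypothetical, since the commit message names the function and constant but shows neither.

```rust
/// Hypothetical value for illustration; the commit message names the
/// constant but does not state its value.
const CLOSENESS_LOOKUP_WIDTH: usize = 16;

/// Assumed shape of the extracted decision: fall back to the network
/// lookup only when the local routing table returned fewer peers than
/// the lookup width. The match outcome is deliberately not an input.
const fn closeness_should_fall_back_to_network(local_peer_count: usize) -> bool {
    local_peer_count < CLOSENESS_LOOKUP_WIDTH
}
```

Because the function is `const` and takes only a count, the boundary at `CLOSENESS_LOOKUP_WIDTH` can be tested without constructing any node state.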
Addresses the coordination concern raised in review (dirvine): the Merkle closeness check is a *verification* that must mirror the uploader's pure XOR-distance view, not the reachability re-rank used for storage selection. With saorsa-core's reachability-aware find_closest_nodes_local (WithAutonomi/saorsa-core#121), a re-rank could demote an XOR-close relay-only peer out of the compared window and falsely reject an honest candidate pool that legitimately contains that peer.

Switch the closeness check to find_closest_nodes_local_by_distance, the XOR-only variant added to saorsa-core#121 for exactly this purpose. check_closeness_match (the set-membership helper) is unchanged.

Also rename the local variable network_peers -> closeness_peers for readability (review feedback, grumbach), since it now usually holds local-table results.

The rc-2026.5.4 dependency pins (saorsa-core, ant-protocol) come from the release-cut base commit; this commit only advances Cargo.lock to the rc-2026.5.4 tip so the pin includes the merged #121 (find_closest_nodes_local_by_distance), which the base's release cut predated.

Test results (against the rc-2026.5.4 deps):

- cargo fmt -- --check: clean
- cargo clippy --lib --all-features (-D panic -D unwrap_used -D expect_used): no warnings
- cargo test --lib payment::verifier: 67 passed, 0 failed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
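To make the "pure XOR-distance view" concrete, here is a small standalone sketch of selecting the k closest peers by XOR distance alone. It does not use saorsa-core; the `PeerId` alias, the 32-byte width, and both function names are assumptions for illustration only.

```rust
/// Hypothetical 32-byte peer identifier used only in this sketch.
type PeerId = [u8; 32];

/// Byte-wise XOR of two identifiers. Comparing the results
/// lexicographically is the same as comparing them as big-endian integers.
fn xor_distance(a: &PeerId, b: &PeerId) -> [u8; 32] {
    std::array::from_fn(|i| a[i] ^ b[i])
}

/// Return the `k` peers closest to `target` by XOR distance only, with no
/// reachability or other re-ranking applied.
fn closest_by_distance(peers: &[PeerId], target: &PeerId, k: usize) -> Vec<PeerId> {
    let mut sorted = peers.to_vec();
    sorted.sort_by_key(|p| xor_distance(p, target));
    sorted.truncate(k);
    sorted
}
```

Any ordering that mixes in reachability can push an XOR-close peer out of the first `k` entries, which is the false-rejection case the commit describes.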
fix: answer Merkle closeness check from local routing table
… on mismatch

Two upload-breaking regressions on testnets with a meaningful NAT fraction, both from storer-side closeness verification diverging from the uploader's network-walked peer selection.

Single-node close-group check (introduced in #107): switch off the reachability-reranked find_closest_nodes_local_with_self onto the XOR-only find_closest_nodes_local_by_distance_with_self. The re-rank (saorsa-core #121) demoted XOR-close relay-only / NAT'd peers out of the local top-CLOSE_GROUP_SIZE, dropping 2-3 of the uploader's 7 quoted peers and breaching the >=5 threshold. This mirrors the fix already applied to the Merkle path; it remains a pure local lookup, so no added network cost.

Merkle candidate-pool check (changed in #111): #111 moved the check off the authoritative network lookup onto the local routing table, with the fallback gated on local-table *size*, not match *outcome*. On a real network the local k-bucket sample legitimately diverges from the uploader's network-walked candidates (which include reachable responders from positions 17-32), so honest pools were hard-rejected with no escalation. Keep #111's local fast path (accept on a local match), but escalate to the authoritative network lookup on match *failure* too.

Bound the reopened network-fallback path with a new closeness_fallback_permits semaphore (CLOSENESS_NETWORK_FALLBACK_CONCURRENCY = 16): inflight_closeness already collapses same-pool concurrency, and this caps the distinct-pool case so a forged-pool flood cannot spawn unbounded 240s Kademlia walks -- addressing the DoS rationale #111 gave for the size-only gate.

Requires saorsa-core's find_closest_nodes_local_by_distance_with_self (WithAutonomi/saorsa-core#122) on rc-2026.5.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
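The "local fast path, bounded escalation" flow can be sketched with a small counting-permit type. This is an illustration only: the real `closeness_fallback_permits` is presumably an async semaphore that waits for a permit, whereas this std-only stand-in returns a `Busy` verdict when none is free. `FallbackPermits`, `Permit`, `Verdict`, and `verify_pool` are all hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Value taken from the commit message.
const CLOSENESS_NETWORK_FALLBACK_CONCURRENCY: usize = 16;

/// Hypothetical std-only stand-in for the fallback semaphore.
struct FallbackPermits {
    available: AtomicUsize,
}

/// RAII guard: the permit is returned when the guard is dropped.
struct Permit<'a> {
    owner: &'a FallbackPermits,
}

impl FallbackPermits {
    fn new(n: usize) -> Self {
        Self { available: AtomicUsize::new(n) }
    }

    /// Take one permit if any is free, without blocking.
    fn try_acquire(&self) -> Option<Permit<'_>> {
        self.available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .ok()
            .map(|_| Permit { owner: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        self.owner.available.fetch_add(1, Ordering::Release);
    }
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Accept,
    Reject,
    /// Assumption of this sketch: no permit free, caller retries later.
    Busy,
}

/// Accept on a local match; otherwise escalate to the (expensive) network
/// lookup, but only while holding one of the bounded permits.
fn verify_pool(
    local_match: bool,
    permits: &FallbackPermits,
    network_lookup: impl FnOnce() -> bool,
) -> Verdict {
    if local_match {
        return Verdict::Accept;
    }
    let Some(_permit) = permits.try_acquire() else {
        return Verdict::Busy;
    };
    if network_lookup() { Verdict::Accept } else { Verdict::Reject }
}
```

With `FallbackPermits::new(CLOSENESS_NETWORK_FALLBACK_CONCURRENCY)`, at most 16 distinct-pool escalations run at once in this model, however many mismatching pools arrive.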
Picks up find_closest_nodes_local_by_distance_with_self (WithAutonomi/saorsa-core#122, now merged to rc-2026.5.4) that the single-node close-group verification change depends on. The crate is pinned to `branch = "rc-2026.5.4"`, so this only advances Cargo.lock from 1be7352 to 82bb541; no manifest change. ant-node now compiles against the published branch without a local patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI only triggered for push/pull_request against `main`, so PRs targeting release branches (e.g. rc-2026.5.4) ran no checks. Add `rc-*` to both branch filters.

Note: the pull_request branch filter is evaluated against the PR's base branch, so this only starts firing for rc-targeted PRs once it has landed on the rc-2026.5.4 branch itself (i.e. after this PR merges).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…closeness-verification fix(payment): verify closeness against pure-XOR view; escalate Merkle on mismatch
feat: build Linux releases against musl with mimalloc allocator
CI re-enabled on rc-* branches surfaced a pre-existing doc_markdown lint (clippy 1.95) in binary_cache.rs that fails under -D warnings.
Revert #107 (enforce single-node proof verification)
Pull request overview
This PR promotes the crate to 0.11.5 while also introducing broader runtime, upgrade-cache, payment-verifier, and release workflow changes.
Changes:
- Bumps package/lockfile version and adds `mimalloc`/`libc` dependencies.
- Reworks upgrade caching to store signed archives and re-verify signatures on cache hits.
- Changes single-node payment verification and switches Linux release artifacts to musl builds.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `Cargo.toml` | Bumps version and adds allocator/libc dependencies. |
| `Cargo.lock` | Regenerates dependency lockfile for version/dependency changes. |
| `src/upgrade/cache_dir.rs` | Tightens Unix upgrade cache directory permissions. |
| `src/upgrade/binary_cache.rs` | Replaces cached binaries with signed archive/signature cache validation. |
| `src/upgrade/apply.rs` | Extracts verified cached archives and caches downloaded signed archives. |
| `src/payment/verifier.rs` | Simplifies single-node payment validation and delegates median payment verification. |
| `src/bin/ant-node/main.rs` | Sets mimalloc as global allocator. |
| `src/bin/ant-devnet/main.rs` | Sets mimalloc as global allocator. |
| `.github/workflows/release.yml` | Changes Linux release targets to musl/cross builds. |
| `.github/workflows/ci.yml` | Runs CI on rc-* branches. |
Comment on lines +463 to +464

    Self::validate_peer_bindings(payment)?;
    self.validate_local_recipient(payment)?;

Comment on lines +26 to +30

    # Global allocator. musl's default malloc is significantly slower than
    # glibc's under concurrent allocation churn, which matches the node's
    # steady-state workload. mimalloc neutralises that regression for the
    # musl Linux builds (and tends to beat glibc's allocator too).
    mimalloc = "0.1"
Promotes `rc-2026.5.4` to release version(s): 0.11.5.

- `-rc.*` from `[package].version`
- `Cargo.lock`

Once merged, the release tag will be pushed to fire the publish workflow.