feat(replication): v12 storage-bound audit — design-complete#113
feat(replication): v12 storage-bound audit — design-complete#113grumbach wants to merge 40 commits into
Conversation
…se 1)
Implements the v12 design (notes/security-findings-2026-05-22/
proposal-gossip-audit-v12.md) for closing audit findings 1 (audit not
storage-bound) and 2 (bootstrap-claim shield).
Phase 1 is foundation only: wire types, Merkle tree, sign/verify, and
the auditor's commitment-hash pin. No integration with gossip or the
audit challenge/response flow yet — those land in phase 2 so each
slice is independently reviewable.
What this commit adds:
- `StorageCommitment` wire type. ML-DSA-65 signed over
(root, key_count, sender_peer_id) with explicit domain separation
("autonomi.ant.replication.storage_commitment.v1").
- `CommitmentBoundResult` wire type for per-key audit response entries:
key, digest (existing audit semantics), bytes_hash (so the auditor
rebuilds the leaf from its own local bytes), leaf_index (so the
auditor knows left/right child ordering at each path level), and the
Merkle inclusion path.
- `MerkleTree` over leaves of the form
BLAKE3(DOMAIN_LEAF || key || BLAKE3(bytes)).
Sorted by key for deterministic roots; odd-count levels self-pair
(node_hash(x, x)). Build is O(n) hashing, path lookup is O(log n).
- `commitment_hash` = BLAKE3(DOMAIN_COMMITMENT_HASH || postcard(commitment)).
Postcard's length-prefixed canonical encoding ensures any change to
any field — including the variable-length signature — produces a
different hash. This is the auditor's pin: the audit response must
include a commitment that hashes to this value.
- `verify_path` for the auditor: validates leaf_index < key_count
(rejected if out of range), path.len() == ceil(log2(key_count))
(rejected if wrong shape), and recomputes the root from leaf +
siblings using left/right ordering derived from leaf_index.
Wire-input safe: rejects key_count > MAX_COMMITMENT_KEY_COUNT
(1,000,000) and uses checked_next_power_of_two for depth math.
22 unit tests cover: empty tree, single-leaf, two-leaf, deterministic
root, every-key path verify across sizes 1..333, tampered bytes_hash,
tampered path, wrong leaf_index, out-of-range leaf_index, wrong path
length, zero key_count, out-of-protocol key_count (MAX+1 and u32::MAX),
duplicate keys, sign+verify roundtrip, signature failures (tampered
root, wrong public key, garbage bytes), commitment hash field
sensitivity, commitment hash signature-length sensitivity, commitment
hash stability.
All 514 lib tests pass. cfd clean.
4 rounds of codex (gpt-5-codex high-reasoning) review on the module
itself, found and addressed 2 BLOCKERs (commitment_hash was a
hand-built concat instead of postcard; path verification needed
leaf_index on the wire), 2 MAJORs (odd-node terminology, missing
leaf_index bounds check), and 1 round-3 finding (next_power_of_two
overflow on untrusted wire input). Round 4 verdict: APPROVE.
Phase 2a (wire-only): extend NeighborSyncRequest/Response and
AuditChallenge with optional commitment fields, and add the
AuditResponse::CommitmentBound variant. No behaviour changes yet —
new fields are `None`/unused everywhere, but every existing test
recompiles against the new shape so we know the surface is correct
before wiring up responder building and auditor verification.
What this commit changes:
- NeighborSyncRequest: + commitment: Option<StorageCommitment>
- NeighborSyncResponse: + commitment: Option<StorageCommitment>
- AuditChallenge: + expected_commitment_hash: Option<[u8; 32]>
- AuditResponse: + CommitmentBound { challenge_id, commitment, per_key }
All existing call sites pass `None` for the new fields, with a comment
explaining "phase 3 will wire this up." Match sites on AuditResponse
gain a `CommitmentBound { .. } => panic!("legacy-digest test")` arm
so the existing test suite remains exhaustive.
Backwards compatibility is preserved via `#[serde(default)]` on every
new Option field: old peers' encoded messages decode into `None` on
new peers, and new peers' messages encode the new fields as
length-prefixed Option which old peers tolerate via postcard's
forward-compat behaviour.
514/514 lib tests pass. cfd clean. No regressions.
Phases 2b and 2c of the v12 storage-bound audit design.
src/replication/commitment_state.rs — responder side
- BuiltCommitment: signed wire blob + cached commitment_hash + Merkle
tree + sorted-keys lookup for the per-key leaf_index field.
- ResponderCommitmentState: two-slot atomic rotation (current /
previous) backed by parking_lot::RwLock<Arc<...>>. lookup_by_hash
returns an Arc that keeps the BuiltCommitment alive for the
duration of an audit response even if a concurrent rotate drops
the slot (v6 §2 / v12 §4 retention).
- rotate(new): demotes current to previous, drops the prior
previous. The only path that frees previous, enforcing INV-R2.
src/replication/commitment_audit.rs — auditor side
- verify_commitment_bound_response: pure function implementing v12
§5's four gates in order (cheapest first): structural (per_key
length / order / no duplicates / wire-bounded key_count / correct
path length), commitment hash pin, ML-DSA-65 signature, and
per-key (bytes_hash matches local bytes, leaf rebuilds, Merkle
path verifies up to root, audit digest matches nonce-bound BLAKE3).
- AuditVerifyError: typed reason for each gate failure so callers
can log + apply AUDIT_FAILURE_TRUST_WEIGHT per key consistently.
src/replication/commitment.rs
- MerkleTree::sorted_keys() exposed so BuiltCommitment can populate
its leaf_index lookup without recomputing.
Tests
- 8 commitment_state tests: empty state, rotate promote/demote, drop
oldest after two rotations, lookup finds current+previous,
lookup_arc_outlives_subsequent_rotation (v12 §4 invariant), proof
builds + verifies under its own root, proof for absent key, hash
matches global commitment_hash.
- 13 commitment_audit tests covering each AuditVerifyError variant,
plus a headline lazy_node_on_demand_fetch_attack_fails test that
simulates the v12 Finding-1 attacker: a lazy node receives the
challenge, builds a fresh commitment over just the challenged
keys, signs it, and replies with that fresh commitment + valid
proofs. The pin check (gate 2) rejects.
All 535 lib tests pass. cfd clean. No regressions.
… 2d) src/replication/recent_provers.rs — auditor-side cache mapping (key, peer_id, commitment_hash) → "recently proved." The reward / quorum eligibility predicate from v12 §6: P is credited as holder of K only if recent_provers[K] contains an entry for P whose commitment_hash matches P's currently-credited commitment hash. Bounded per key (MAX_PROVERS_PER_KEY = 16 = 2 * CLOSE_GROUP_SIZE), LRU-evicted by proved_at. RT-only by caller contract (this module just stores what it's told. Hash-bound credit is the v12 §6 lever: a peer that rotates their commitment must re-prove every key. API: - record_proof(key, peer, hash, ts) — append or refresh-in-place. - is_credited_holder(key, peer, current_hash) — predicate. - forget_peer(peer) — RT eviction hook. - forget_commitment(hash) — UnknownCommitmentHash invalidation hook (v11/v12 §5: when the auditor invalidates last_commitment[P] because P denied the pin, also drop any cached entries that would silently extend credit). 9 unit tests cover: empty cache, credit under same hash, credit denied under rotated hash (the core v12 §6 property), wrong peer rejected, per-key cap with LRU eviction at MAX+1, refresh-in-place does not grow bucket, forget_peer drops all, forget_commitment drops matching only, lazy-rotation-via-UnknownCommitmentHash drops credit. 544 lib tests pass. cfd clean. EOF )
… tests
Wires the final phase-2 piece: a pure function on
ResponderCommitmentState that builds a CommitmentBound audit response.
Caller looks up record bytes; the helper handles the Merkle proof and
the per-key (digest, bytes_hash, leaf_index, path) construction.
CommitmentBoundOutcome:
- Built { commitment, per_key } — caller wraps in
AuditResponse::CommitmentBound.
- UnknownCommitmentHash — caller emits Rejected with reason
"unknown commitment hash". Auditors classify this per v12 §5
conditional-invalidation rule.
- KeyNotInCommitment { key } — responder rotated between gossip and
challenge; caller emits a normal Rejected.
End-to-end tests in commitment_state:
- end_to_end_responder_to_auditor_happy_path: honest responder builds
a response that the auditor's verify accepts.
- end_to_end_lazy_node_fresh_commitment_substitution_fails: headline
v12 Finding-1 attack. A lazy node substitutes a fresh commitment
into the response; the pin gate rejects with CommitmentHashMismatch.
Plus 4 unit tests for the new helper.
550 lib tests pass. cfd clean.
Closes both findings from the phase-2 codex review: - HIGH: mixed-version backward compatibility was not proven. Added 4 postcard roundtrip tests (old_decoder_tolerates_new_*) that empirically confirm postcard from_bytes is lenient on trailing bytes: an old peer using a v0 struct shape (no commitment / no expected_commitment_hash field) successfully decodes a v1 message that emits None as the new trailing field. Plus a new_peer_roundtrips_with_commitment_some test that catches accidental serde annotation breakage on the new field. - MEDIUM: end_to_end_lazy_node_fresh_commitment_substitution_fails in commitment_state.rs duplicated and overclaimed what the more direct lazy_node_on_demand_fetch_attack_fails in commitment_audit.rs proves. Removed; the happy-path cross-module e2e remains. 553 lib tests pass. cfd clean.
Codex round-2 review correctly flagged: my `#[serde(default)]` + Option-trailing-field strategy for backward compat is NOT actually backward compatible with postcard. Empirical test confirmed: `DeserializeUnexpectedEnd` when a new decoder reads a v0 payload that has no bytes for the trailing field — postcard is strict on struct shape, not lenient like JSON. This commit reverts the wire-type changes from c73da5d (commit/Option fields on NeighborSyncRequest/Response and AuditChallenge, plus the CommitmentBound AuditResponse variant) so phase 2 ships cleanly additive: the four new modules (commitment, commitment_state, commitment_audit, recent_provers) are unchanged and stand on their own with their existing 49 unit + e2e tests. The wire extension will be reintroduced in phase 3 with one of: (a) a protocol-version bump on `ReplicationMessage`, (b) a separate `CommitmentAnnounce` message variant (new ReplicationMessageBody variant — old peers ignore it), (c) length-prefixed extension envelope. Each requires careful bidirectional mixed-version testing. Doing it in phase 3 keeps phase 2 reviewable as a pure foundation. 549 lib tests pass. cfd clean.
tests/poc_commitment_audit_attacks.rs - single canonical test file that maps each Finding-1 attack vector from the original report (notes/security-findings-2026-05-22/01-audit-not-storage-bound.md) to the v12 mechanism that closes it, end-to-end. 13 tests, each named after its attack path: honest_responder_passes_audit_lazy_responder_fails (Path A) fresh_commitment_substitution_rejected_by_pin (Path B) overclaim_via_partial_commitment_yields_no_holder_credit (Path C) responder_drops_old_commitment_after_two_rotations (Path D) audit_response_replay_blocked_by_fresh_nonce (Path E) Finding 2 (bootstrap-claim shield) tests cover the cache-side property that closes it: silent_peer_earns_no_credit rotated_commitment_drops_holder_credit Plus 6 cross-check tests pinning foundational properties so future refactors of commitment_hash / leaf_hash / Merkle / signature do not regress: commitment_hash_is_field_sensitive leaf_hash_binds_key_and_bytes merkle_tree_root_is_deterministic_per_key_set signature_round_trips_correctly wrong_signer_rejected_at_signature_gate each_gate_fires_independently Each test composes the real production code paths from all four commitment-* modules end-to-end. No mocks. The Responder helper wraps ResponderCommitmentState + build_commitment_bound_audit_response; the auditor_verifies fn calls verify_commitment_bound_response directly. 13 PoC tests pass; 549 lib tests still pass. cargo clippy --all-targets --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used is clean. Also added clippy::panic to the existing cfg(test) allow blocks in commitment*.rs so test code using panic on unexpected match arms passes strict clippy.
…codex test gaps Closes the 4 test-coverage findings from codex round-2 test review: BLOCKER: UnknownCommitmentHash conditional-invalidation semantics not covered. v12 §5 says "clear only if stored hash still equals rejected pin." The actual conditional logic belongs at a higher layer (phase 3 auditor coordinator); the building block is RecentProvers::forget_commitment(hash). Added forget_commitment_only_drops_matching_hash to pin its contract: drops cache entries with that specific hash, leaves entries for other hashes intact. MAJOR: Finding-1 Path A (on-demand fetch under ORIGINAL pin) was not modeled. The earlier test only proved fresh-commitment substitution is rejected; it did NOT prove the actual lazy-fetch attack. Added on_demand_fetch_under_original_pin_succeeds_ documenting_v12_limit which explicitly proves the attack PASSES the verifier — because v12 is an economic defence (bandwidth cost per audit), not a cryptographic one. Test docstring documents this as the explicit design limit and serves as a regression marker if anyone claims to "close Path A" without bandwidth economics. MAJOR: cross-peer commitment substitution had no test AND no defence in the verifier. Added gate 2a (peer-identity binding) in commitment_audit.rs: response_commitment.sender_peer_id must equal challenged_peer_id. Caught before signature/pin gates. New typed AuditVerifyError::SenderPeerIdMismatch variant. Test cross_peer_commitment_substitution_rejected_by_sender_id proves the defence: a response carrying peer P's signed commitment but challenging peer Q is rejected at gate 2a before any signature work. MAJOR: overclaim/silent-peer tests were vacuous (only checked empty cache returns false). Rewrote overclaim_via_partial_commitment_end_to_end_no_credit to compose the full responder build path + cache predicate: lazy node commits to key 1 only, auditor challenges key 5, responder returns KeyNotInCommitment, auditor never calls record_proof, cache predicate correctly denies credit. Plus a positive control showing the cache DOES credit when record_proof IS called — making the predicate's denial meaningful, not trivially false. 17 PoC tests now (was 13). 549 lib tests still pass. cargo clippy --all-targets --all-features -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used is clean.
…path
Codex round-3 of fix-loop flagged that on_demand_fetch_under_original_
pin_succeeds was observationally identical to the happy path because
both used the same Responder::build_response helper backed by the
'always returns content(byte)' bytes lookup.
Rewrite the test to bypass the Responder helper entirely and instead
construct the per-key CommitmentBoundResult by hand from an ALTERNATE
bytes source (named neighbour_fetched_bytes_for_key_3) — modelling the
fetched-from-neighbour case explicitly. The lazy node:
- retains its honest gossiped commitment (state.lookup_by_hash works)
- has dropped local bytes for key 3
- constructs the response from fetched bytes
- auditor accepts because the bytes_hash of fetched bytes is
bit-identical to bytes_hash of stored bytes (the v12 blind spot)
An assert_eq!(expected_leaf, from_commitment) before the verifier call
explicitly documents the blind spot: leaf_hash(key, bytes_hash) only
depends on the bytes themselves, not on where they came from.
17 PoC tests pass. cfd clean.
testnet-plan-storage-commitment-audit.md — phased rollout plan (stage 0 single-node smoke → stage 1 informational → stage 2 enforcement → stage 3 adversarial smoke), pre-deployment checklist, metrics to collect, failure modes to watch, rollback plan. Also adds the security-findings notes that drove this work: 01-audit-not-storage-bound.md 02-bootstrap-claim-audit-shield.md 03-paid-list-attestation-forgery.md 04-single-node-underpayment.md 05-merkle-already-stored-lie.md proposal-gossip-audit-v1.md through v12.md (the design iteration) The deployable surface (phase 1+2) is the four commitment-* modules in src/replication/. Phase 3 wiring (responder tick, gossip piggyback, auditor coordinator, holder-eligibility integration) is documented as the TODO before stage 1 can run.
There was a problem hiding this comment.
Pull request overview
This PR adds the phase 1+2 foundation for the v12 “storage-bound audit” design: new commitment wire types + Merkle proofs + signature/pin hashing, responder-side commitment rotation state, auditor-side verification logic, and a bounded “recent provers” cache. It also includes PoC threat-model tests that exercise the end-to-end responder→auditor flow and document known limits/attack closures, while explicitly deferring live replication-loop integration to phase 3.
Changes:
- Add commitment primitives (Merkle tree, commitment hashing, ML-DSA signing) and responder-side commitment state/response builder.
- Add pure auditor verifier for commitment-bound responses with typed failure reasons and cheapest-first gate ordering.
- Add bounded per-key holder-eligibility cache plus PoC tests covering Findings 1 & 2 attack paths (and a separate bootstrap-stall PoC test).
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/replication/commitment.rs |
New commitment wire types, domain separation, Merkle tree + path verification, commitment pin hash, ML-DSA sign/verify. |
src/replication/commitment_state.rs |
Responder commitment build + two-slot rotation/retention state; builds commitment-bound audit responses. |
src/replication/commitment_audit.rs |
Pure auditor-side verifier with typed AuditVerifyError gates (structural/identity/pin/signature/per-key). |
src/replication/recent_provers.rs |
Bounded per-key cache of recent provers with hash-bound holder-credit predicate and invalidation hooks. |
src/replication/mod.rs |
Exposes the new replication submodules. |
src/replication/protocol.rs |
Adds a clarifying note that wire types are not yet extended (phase 3). |
tests/poc_commitment_audit_attacks.rs |
End-to-end PoC tests mapping specific Finding-1/2 attack paths to expected verifier behavior. |
tests/poc_bootstrap_stall.rs |
Documents a bootstrap-drain stall DoS condition as a PoC regression test (no fix included here). |
notes/security-findings-2026-05-22/*.md |
Design history (v1–v12) + testnet plan notes for later phase-3 integration/rollout. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (i, result) in response_per_key.iter().enumerate() { | ||
| // The auditor's local copy of bytes is the ground truth. If the | ||
| // auditor doesn't hold this key, treat it as a mismatch — we | ||
| // can't audit what we don't have. | ||
| let local_bytes = | ||
| local_bytes_for(&result.key).ok_or(AuditVerifyError::BytesHashMismatch { index: i })?; |
|
|
||
| /// Domain-separation tag for the commitment signature. | ||
| /// | ||
| /// Signed payload is BLAKE3 over (this tag || canonical commitment fields). |
| // failure, which for our fixed-size commitment cannot occur in | ||
| // practice (ML-DSA-65 signature is 3293 bytes). If it ever | ||
| // somehow does, surface as a SignatureFailed so callers don't | ||
| // need a new error variant for an unreachable case. | ||
| let cached_hash = commitment_hash(&commitment).ok_or_else(|| { | ||
| CommitmentError::SignatureFailed("commitment serialization failed".to_string()) | ||
| })?; |
| /// `commitment.key_count` exceeds [`MAX_COMMITMENT_KEY_COUNT`] — | ||
| /// rejected before any hashing. | ||
| #[error("commitment claims {key_count} keys, exceeds protocol max")] | ||
| KeyCountOverProtocolMax { | ||
| /// The claimed (rejected) key count. | ||
| key_count: u32, | ||
| }, | ||
| /// A `per_key[i].path` has the wrong length for the claimed |
Per user: the auto-upgrade system handles version-breaking changes. Old peers will receive DeserializeUnexpectedEnd when decoding new messages and update via the existing src/upgrade/ flow. So the "defer wire extension to phase 3 because of postcard backward compat" decision is wrong — phase 3 is now: ship the full feature. This reverts the revert in ada62f8 and brings back: - NeighborSyncRequest.commitment: Option<StorageCommitment> - NeighborSyncResponse.commitment: Option<StorageCommitment> - AuditChallenge.expected_commitment_hash: Option<[u8; 32]> - AuditResponse::CommitmentBound { challenge_id, commitment, per_key } All call sites pass None for now; integration happens in subsequent commits. 553 tests pass.
…emit/receive Wires phase-2's commitment modules into the live replication loop, gossip side. ReplicationEngine now owns: - identity: Arc<NodeIdentity> (for signing commitments). - commitment_state: Arc<ResponderCommitmentState> (responder's two-slot current/previous, the responder uses to answer commitment-bound audits). - last_commitment_by_peer: Arc<RwLock<HashMap<PeerId, StorageCommitment>>> (auditor's per-peer "last known commitment", read at audit-issue time). - recent_provers: Arc<RwLock<RecentProvers>> (holder-eligibility cache, wired in the next commit). Background task: - start_commitment_rotation_loop (~10 min cadence) reads LMDB keys, builds a Merkle tree, signs, rotates into commitment_state. For content-addressed chunks bytes_hash == key, so no chunk re-read is needed (the audit-verify path still rechecks bytes_hash against BLAKE3(local_bytes), which for content-addressed equals the key, plus the digest gate which is bound to actual bytes). Gossip emit: - sync_with_peer_with_outcome and handle_sync_request_with_proofs now take an Option<StorageCommitment> param. Callers snapshot the current commitment once per round (cheap parking_lot::RwLock read returning an Arc) and pass it to every peer in the batch — same value across the batch reduces lock churn, identical commitment for the same rotation epoch. - run_neighbor_sync_round threads commitment_state through; the bootstrap sync path in start_message_handler does the same. Gossip receive: - ingest_peer_commitment: on inbound NeighborSyncRequest/Response, verify the peer-identity binding (commitment.sender_peer_id == authenticated source) and store into last_commitment_by_peer. - TODO(phase-3.5): plumb a PeerId → MlDsaPublicKey lookup so we can verify the signature at ingest time and drop forged commitments earlier. Currently we store-without-verify and rely on the audit- verify path to reject at audit time. What's NOT yet wired (next commits): - Audit issue: snapshot expected_commitment_hash from last_commitment_by_peer[challenged_peer] into the AuditChallenge. - Audit response: handle the CommitmentBound variant via the existing verify_commitment_bound_response; record into recent_provers on success. - Holder eligibility (recent_provers.is_credited_holder) threaded into quorum / paid-list / reward decisions. Old peers are allowed to break: the auto-upgrade system handles version-breaking changes (per Chris's PR WithAutonomi#112 musl-swap test that validated cross-version upgrade across 157 services). 553 lib tests pass. cfd clean (2 pedantic warnings, no errors).
The responder side of the v12 storage-bound audit is now live: when the auditor's challenge carries expected_commitment_hash: Some(h), the responder calls build_commitment_bound_audit_response from commitment_state, which looks up h in current/previous and produces a CommitmentBound response with per-key (digest, bytes_hash, leaf_index, path). handle_audit_challenge_with_commitment: - New entry point. Takes Option<&Arc<ResponderCommitmentState>>. - If commitment_state and expected_commitment_hash are both Some: pre-loads each challenged-key bytes from storage (sync closure, async storage — bounded by sqrt-scaled sample size), calls the v12 §4 build_commitment_bound_audit_response, and returns CommitmentBound / Rejected accordingly. - Otherwise: legacy plain-digest path (unchanged). handle_audit_challenge (the original entry point): kept for backwards compatibility, forwards to the new function with commitment_state = None. handle_audit_challenge_msg (orchestrator in mod.rs): now passes my_commitment_state through, so when the auditor sends a pinned challenge the responder can answer it. What this means at runtime: - Today's auditor (no expected_commitment_hash) is unaffected. - A future auditor that sends pinned challenges will get CommitmentBound responses from upgraded peers and Rejected / legacy from peers we can't match. What's NOT yet wired (next commit): - Auditor-side enablement: snapshot expected_commitment_hash from last_commitment_by_peer[challenged_peer] into the AuditChallenge, and handle the CommitmentBound response variant via verify_commitment_bound_response. Requires threading last_commitment_by_peer + a PeerId → MlDsaPublicKey lookup into audit_tick_with_repair_proofs. Plus recent_provers integration. 553 lib tests pass. cfd has 4 pedantic warnings (no errors).
The v12 storage-bound audit is now end-to-end. Previous commits shipped the responder side (gossip emit, rotation tick, commitment-bound answer dispatch); this commit wires the auditor side so the network actually enforces the commitment-bound flow when both peers run this version. Wire change: embed sender_public_key in StorageCommitment - Add sender_public_key: Vec<u8> (1952 bytes for ML-DSA-65) to StorageCommitment. Bound by the signature payload so a swap-the-key attack fails: the signed payload now binds (root, key_count, peer_id, pubkey), and an adversary keeping the body must produce a forgery under a key they don't hold. - verify_commitment_signature(c) takes the embedded key directly; no external PeerId-to-MlDsaPublicKey lookup is needed. Old peers using the prior 4-field commitment will fail to decode; auto-upgrade (per Chris's PR WithAutonomi#112) handles this. - verify_commitment_signature_with_key(c, pk) kept for tests where we want to assert a specific key did or did not sign. Auditor enablement: CommitmentAuditCtx + audit_tick_with_repair_proofs - New CommitmentAuditCtx<'a> bundles &last_commitment_by_peer and &recent_provers so audit_tick stays callable from both the legacy and commitment-bound paths without ballooning the parameter list. Passing None keeps today's plain-digest behaviour; passing Some opts the auditor in on a per-peer basis (no entry in last_commitment_by_peer still falls back to plain digest). - audit_tick_with_repair_proofs now: 1. Snapshots expected_commitment_hash from last_commitment_by_peer when the ctx is provided and we have a recent gossiped commitment for the challenged peer. Pins the challenge to that hash. 2. Handles the AuditResponse::CommitmentBound { commitment, per_key } variant via the new verify_commitment_bound helper, which calls the existing pure verifier verify_commitment_bound_response with pre-loaded local bytes (sync closure over an async storage read). 3. On verify success: records each verified (peer, key, pin) into recent_provers so downstream code can credit the peer as a real holder (the next-PR work for quorum / paid-list integration). 4. On AuditResponse::Rejected { reason: "unknown commitment hash" }: conditionally drops the stale entry from last_commitment_by_peer (only if it still matches the rejected pin), so the next audit either picks up the freshly gossiped commitment or falls back to the plain-digest path (v12 paragraph 5 conditional invalidation rule). ingest_peer_commitment now verifies at gossip time - With the embedded pubkey, signature verification at gossip ingest is now free of any external lookup. ingest_peer_commitment calls verify_commitment_signature(c) and drops forged commitments at the edge instead of relying on audit-time verification. Tests - All 17 PoC threat-model tests in tests/poc_commitment_audit_attacks.rs still pass against the embedded-key flow. wrong_signer_rejected_at_ signature_gate adapted: instead of passing a wrong pubkey to verify, swap the embedded pubkey on the response commitment (and re-pin to isolate the signature gate from the pin gate). - commitment_hash_is_field_sensitive extended to mutate sender_public_key. That field must change the hash like all others. - 554 lib tests pass (+1 from extending the signature-roundtrip suite). - cfd is warning-only; deny gates clean. What's still NOT in this PR (later work) - Threading recent_provers.is_credited_holder into quorum / paid-list / reward decisions. The cache populates correctly now, but no consumer reads it yet. That's the next focused PR.
Architectural Review — PR #113: Storage-Bound Audit (v12)Anselme, this is a substantial and well-structured piece of work. I've reviewed the full diff (~7k lines), the design document (v12), the PoC tests, and the CI results. Here's my assessment. Design: CorrectThe v12 design correctly closes Finding 1 (lazy-node audit defeat) and Finding 2 (bootstrap-claim shield). The key mechanism — pinning audit challenges to a gossiped-to-you commitment hash — means a lazy node must have had the bytes at gossip time to build a valid Merkle inclusion path. The conditional invalidation on The staged rollout (responder first, auditor follow-up) is safe because CI: Must Fix Before MergeThe e2e tests don't compile. This is a blocker:
These all need
Security1. The comment explains the rationale — the audit-time verifier catches forged commitments — but this creates a DoS surface: an attacker can inject fake commitments for honest peers into 2. If a peer leaves the routing table, their commitment lingers forever. This could grow unbounded over churn. Consider either:
3. RecentProvers: no TTL eviction The PR mentions Code QualityGenerally excellent. Particular strengths:
Minor style: Recommendations
Summary: Strong design, clean implementation, well-tested foundation. The CI issues need fixing before merge. With the ingest-verification and eviction gaps closed, this is ready for review and stage-0 deployment. |
BLOCKER #1: honest commitment-rotation no longer punished - Rejected(unknown commitment hash) is the v12 paragraph 5 conditional- invalidation recovery path: the peer simply rotated past our pin. Previously we still called handle_audit_failure, which emits a trust event. Now we drop the stale entry from last_commitment_by_peer (only if it still hashes to the rejected pin, to tolerate fresh gossip arriving mid-flight), forget any credit anchored to the stale pin via recent_provers.forget_commitment, and return Idle. No trust penalty. BLOCKER WithAutonomi#2: streaming per-key verification removes memory-DoS vector - The pure verifier verify_commitment_bound_response preloaded every challenged chunk into memory. At sqrt-scaled sample sizes (1000 keys at 1M stored) and 4 MiB chunks, a single audit could push the responder + auditor toward multi-GB allocations. Now split into: - verify_commitment_bound_metadata: gates 1, 2a, 2b, 3 (one-shot, cheap). - verify_commitment_bound_per_key: gate 4 (per-key bytes_hash + path + digest), called once per key. The auditor (audit.rs verify_commitment_bound) streams one chunk at a time via storage.get_raw, runs gate 4, drops the bytes. Peak memory is bounded at MAX_CHUNK_SIZE (4 MiB) regardless of sample size. The legacy verify_commitment_bound_response is kept as a thin wrapper (still used in tests). MAJOR #1: ingest commitments from NeighborSyncResponse and bootstrap - ingest_peer_commitment was only invoked on inbound request handling, not on outbound sync responses. For peers we only see on the response path, audits were silently stuck on the legacy digest flow. Now both handle_sync_response and the bootstrap-sync loop call ingest_peer_commitment with resp.commitment. Threading required passing last_commitment_by_peer through start_neighbor_sync_loop, run_neighbor_sync_round, handle_sync_response, and start_bootstrap_ sync. MAJOR WithAutonomi#2: enforce peer_id == BLAKE3(embedded_pubkey) at every gate - Without this binding, a responder could sign with a throwaway key whose secret they hold and lie about which identity it belongs to; the embedded-key signature would verify trivially. saorsa-core derives PeerId as BLAKE3(pubkey_bytes), so the check is a single hash + compare. - Applied in two places: 1. verify_commitment_bound_metadata (auditor): new gate 2c, runs after pin gate, before signature gate. Returns SenderPeerIdMismatch on failure (same error variant as gate 2a; callers don't need to distinguish). 2. ingest_peer_commitment (gossip receive): rejects forged commitments at the edge. Test coverage - New PoC test throwaway_key_substitution_rejected_by_pubkey_binding exercises the attack against the full auditor flow. - All existing PoC + lib tests updated to derive peer_id from the responder's pubkey (matching production saorsa-core behaviour). Responder::new(_peer_byte) keeps the parameter for source-compat but no longer respects it — peer identity is fully derived. - wrong_signer_rejected_at_signature_gate: now swaps both the embedded pubkey AND sender_peer_id (so gate 2c passes), plus rebuilds the per-key digest under the new peer_id (so gate 4 doesn't trip first), to isolate the signature gate as the only failure path. - 554 lib tests + 18 PoC tests pass. - cfd warning-only; deny gates clean.
…leanup
BLOCKER: malicious-peer bypass via free-form rejection text
- Round-5 fix gated the no-trust-penalty "honest rotation" branch on a
substring match of the rejection reason. A malicious peer could
trivially send `reason: "unknown commitment hash"` on ANY challenge
(including legacy unpinned ones) to dodge audits. Worse, on pinned
audits it would drop the stored pin, pushing the next audit back to
the weaker plain-digest path.
- Tightened to:
1. `expected_commitment_hash.is_some()` (the auditor MUST have issued
a pinned challenge — legacy unpinned audits cannot trigger this
branch).
2. Exact-string match (`reason == "unknown commitment hash"`, not
`contains`).
- Round-5 PoC test honest-rotation path still passes because the
responder helper emits the exact reason string; round-6 attack vector
is closed because on an unpinned challenge the gate fails and we
fall through to handle_audit_failure.
MAJOR: bounded `last_commitment_by_peer` + churn cleanup
- The auditor's per-peer commitment cache had no eviction. A sybil /
churn attacker could leave behind one full StorageCommitment per
identity indefinitely (each ~5 KiB: 1952-byte pubkey + 3293-byte
signature + small fields).
- Two-line defence:
1. PeerRemoved DHT event now drops the peer's entry from
last_commitment_by_peer AND its recent_provers credits, matching
the existing repair_proofs cleanup.
2. Hard cap MAX_LAST_COMMITMENT_BY_PEER = 4096 (~20 MiB worst-case).
On insert when at cap, evict one arbitrary existing entry
(HashMap iter order; sufficient because PeerRemoved keeps the
working set anchored on the real RT). Updates for peers already
in the map always replace and never trigger eviction.
Tests
- 554 lib tests pass.
- 18 PoC tests pass.
- cfd warning-only; deny gates clean.
MAJOR: off-RT sybils could churn the cache - Round-6 added a 4096-entry cap on last_commitment_by_peer + PeerRemoved cleanup. But ingest_peer_commitment still admitted any authenticated sender. An off-RT flood could fill the cap and evict honest peers, silently demoting their next audits to the legacy plain-digest path. - Fix: ingest_peer_commitment now drops the commitment if the source is not in our DHT routing table. Mirrors the existing `sender_in_rt` gate in handle_sync_request_with_proofs (which guards inbound replication hints). Off-RT senders cannot populate the cache, so cap eviction only fires under real RT churn (which PeerRemoved would have caught anyway). - All three callers updated (request handler, response handler, bootstrap loop) to thread `&p2p_node` through. MINOR: doc consistency - MAX_LAST_COMMITMENT_BY_PEER doc previously said "oldest by insertion order"; the code uses HashMap iter order which is unspecified. Doc updated to match implementation + explain why arbitrary eviction is sufficient (RT gate + PeerRemoved cleanup). Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.
MAJOR: free audit bypass via "unknown commitment hash" rejection
- Round-7 (and prior) dropped the cached pin when a peer responded
Rejected("unknown commitment hash"), returning Idle with no penalty.
This was meant to handle honest rotation, but the responder's
two-slot retention (current + previous) means "unknown" only happens
if we are at least TWO rotations behind their gossip. The fix path
was actively harmful:
- Honest peer: gossip already rotates ~every 10 min, so the cache
will be naturally refreshed via the existing gossip ingest path
within one or two rotation windows. We don't need to drop the
pin ourselves.
- Malicious peer: under round-7, dropping the pin meant the NEXT
audit had no pin, fell back to the legacy plain-digest path, and
the on-demand-fetch attack reopened. A peer could trigger this
on every challenge to permanently avoid commitment-bound audits.
- Fix: on unknown-commitment-hash with a pinned challenge, log and
return Idle (one wasted tick) but KEEP the pin and KEEP recent_
provers credits intact. Honest rotation self-resolves via gossip;
malicious "unknown" loops keep failing pinned audits until either
the operator notices or fresh gossip replaces the entry. No more
free fallback to the weaker legacy path.
- Strict gating from round-6 retained: `expected_commitment_hash.
is_some()` ensures legacy unpinned challenges can't trigger this
branch at all.
MINOR: ingest_peer_commitment docstring inaccuracy
- Previously claimed the signature is verified under "a public key
derived from source.as_bytes()". The actual flow:
- Source must be in our routing table.
- sender_peer_id must equal source.as_bytes() (peer-id binding).
- BLAKE3(sender_public_key) must equal sender_peer_id (gate 2c).
- Signature verifies under the embedded public key (which is bound
by the signature payload).
- Doc rewritten to enumerate all five gates with their purpose.
Tests
- 554 lib + 18 PoC pass.
- cfd warning-only; deny gates clean.
…g responder MAJOR #1: pinned-challenge contract enforcement - When the auditor pins a commitment hash into a challenge, the responder MUST answer with CommitmentBound (or Bootstrapping / Rejected for legitimate reasons). Previously, a Digests response to a pinned challenge was accepted and verified via the legacy plain-digest path. A peer that had already gossiped a commitment could ignore the storage-bound flow and pass via on-demand fetch under the weaker verifier. - Fix: in audit_tick_with_repair_proofs, the Digests arm now rejects the response as MalformedResponse when expected_commitment_hash. is_some(). Same handle_audit_failure path as other contract violations. MAJOR WithAutonomi#2: streaming responder (peak memory at one chunk) - The responder dispatch in handle_audit_challenge_with_commitment still preloaded every challenged chunk into a HashMap, then cloned each one into the per-key result. With max_incoming_audit_keys scaling as 2 * sqrt(stored_chunks) and chunks up to 4 MiB, this was an O(sample * chunk) memory spike per request — a viable memory-DoS vector on large nodes. - Refactor: two new helpers in commitment_state.rs: - precheck_commitment_bound_challenge: looks up the commitment + verifies every challenged key is covered, WITHOUT reading any chunk bytes. Returns the matched commitment Arc. - build_commitment_bound_result_for_key: builds one per-key entry given the pre-checked commitment + that key's bytes. - The responder dispatch now: prechecks, then iterates keys, reads one chunk via storage.get_raw, builds the entry, drops the bytes. Peak memory bounded at MAX_CHUNK_SIZE (4 MiB) regardless of sample size. Matches the streaming pattern the auditor side already uses. - Legacy build_commitment_bound_audit_response is kept as a thin wrapper (still used in tests). Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean. A future PR with a 2-node e2e harness should add a regression test for the digests-to-pinned bypass; the current PoC suite tests the pure verifier in isolation and can't exercise the dispatcher loop.
… signal MAJOR #1: rotation cadence outran gossip refresh - Rotation was every 10 min, but neighbor-sync cooldown is up to 1 h per peer. Result: a remote auditor's cached pin could routinely point at a commitment we rotated past 2+ times, and our two-slot retention (current + previous) wouldn't cover it. Pinned audits then hit "unknown commitment hash" -> Idle no-op repeatedly until the next gossip arrival, degrading the storage-bound flow to effectively no-op for that auditor. - Fix: rotate every 1 h instead of 10 min. With two-slot retention that gives ~2 h of validity per commitment, comfortably covering the worst-case gossip lag. The v12 pin is bound to a point-in-time commitment, so rotation cadence isn't security-critical for pin freshness — only for keeping the committed key set current as the responder writes new keys. 1 h is plenty for that. MAJOR WithAutonomi#2: commitment-downgrade observable, not just stalling - A peer that gossiped a commitment once but then stops gossiping commitments (sends `commitment: None`) is trying to downgrade back to the legacy plain-digest audit path. Pre-fix: the None case silently returned false; only stalled audits were observable. - Fix: when a peer present in last_commitment_by_peer sends a None commitment, log at warn-level so operators can correlate downgrade attempts with audit-failure metrics. Cached entry is KEPT so subsequent pinned audits still apply (the peer must either rotate forward via gossip or accumulate audit failures via the "unknown commitment hash" path). - Trust-event integration is left as a follow-up: the wiring path from ingest_peer_commitment to the trust engine is non-trivial and warrants its own PR with a clear penalty curve. Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.
…n staleness MAJOR #1: retention window too narrow for worst-case gossip lag - Round-10 bumped rotation to 1h, but two-slot retention only covers ~2h. Realistic neighbor-sync staggering (batches of 4, 10-20 min between rounds, 1h cooldown) can produce 3+ hour gaps between gossip refreshes of the same peer pair. Honest auditors pinned to rotated-out commitments would then hit "unknown commitment hash" -> Idle no-ops indefinitely. - Fix: bump retention from 2 slots to RETAINED_COMMITMENT_SLOTS = 4. With 1h rotation that gives ~4h of pin validity, comfortably covering the worst-case gossip lag. Memory cost is bounded: at 10k keys per commitment, the four-slot buffer is ~2.6 MB. - Refactored ResponderCommitmentState: replaced the `current` / `previous` field pair with a `slots: Vec<Arc<BuiltCommitment>>` newest-first. rotate() prepends + truncates; lookup_by_hash scans all slots (still O(slots) which is tiny). External API (current(), lookup_by_hash, rotate) unchanged. MAJOR #2a: rotation didn't fire until first interval elapsed - After process start, current() returned None until the first 1h sleep completed. During that hour, the responder couldn't answer any commitment-bound audits — every challenge silently fell back to the legacy plain-digest path. - Fix: rebuild_and_rotate_commitment runs ONCE immediately on startup, before the sleep loop. First commit is available within seconds of startup. Subsequent rotations follow the regular 1h cadence. MAJOR #2b: "key not in commitment" wrongly counted as failure - A key recently replicated to the responder (via fresh-write hint) won't appear in the responder's commitment until the next 1h rotation. An auditor sampling that key and challenging the responder would get Rejected("key not in commitment: ..."), and the auditor would fire handle_audit_failure → trust penalty. But the responder actually HAS the bytes; only its Merkle tree is stale. - Fix: in audit_tick_with_repair_proofs, "key not in commitment" on a pinned challenge is treated as Idle (no penalty), same policy as "unknown commitment hash". Both are benign staleness signals; the next rotation will refresh the responder's tree. Strict gating retained: only applies when we DID pin, so a legacy unpinned audit cannot be bypassed. Tests - Updated commitment_state.rs unit tests for the new 4-slot retention semantics. - Updated PoC test `responder_drops_old_commitment_after_two_rotations` → renamed to `..._past_retention_window` and now rotates RETAINED_COMMITMENT_SLOTS+1 times. - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean.
… missing-bytes penalty + post-restart re-gossip CODEX ROUND-12 MAJOR #1: "key not in commitment" conflated benign staleness with real storage loss - The responder emitted Rejected("key not in commitment: ...") in two different situations: (a) the key was never in the commitment (just-replicated, awaiting next rotation) — benign; (b) the key WAS in the commitment but the responder cannot read the bytes anymore — real storage loss / withholding. Round-11 made the auditor treat both as Idle (no penalty), which meant case (b) escaped audit penalty entirely. - Fix: differentiate at the responder. Case (b) now emits Rejected reason "missing bytes for committed key: ..." and the auditor's benign-staleness branch only matches "key not in commitment", so case (b) falls through to handle_audit_failure with full penalty. MAJOR WithAutonomi#2: ML-DSA-65 signatures are randomized → pin doesn't survive restart - Commitment hash includes the signature, so rebuilding the same key set after restart produces a different hash. Pre-restart pinned audits then hit "unknown commitment hash" -> Idle until fresh gossip arrives — up to 1 h with the round-10 rotation cadence, during which time the node dodges commitment-bound audits. - The right fix would be to persist commitments to disk on rotate, but that's a meaningful change. Pragmatic alternative: after the first commitment is built on startup, trigger an immediate neighbor-sync round. The new commitment then gossips out within seconds, shrinking the recovery window from hours to sub-minute. DAVID'S PR REVIEW (round-12) MAJOR: RecentProvers lacked TTL eviction - The cache had per-key LRU-by-cap eviction but no time-based expiry. A rarely-audited key could keep stale entries indefinitely (until cap pressure evicts them). - Fix: add PROVER_ENTRY_TTL = 4h (4× the rotation interval). is_credited_holder ignores entries older than the TTL on read; new sweep_expired() reclaims their memory and is called once per rotation tick from the engine (1h cadence). MINOR: bandwidth impact undocumented - Added a "Wire size" section to StorageCommitment's docstring: ~5.3 KiB per commitment (32+4+32+1952+3293 bytes), gossiped on every NeighborSyncRequest/Response. With a close-group of 8 and bidirectional sync at the 1h rotation cadence, that's ~85 KiB/h per node — negligible against chunk-transfer bandwidth. Tests - 554 lib + 18 PoC pass. - cfd warning-only; deny gates clean. David's other points (e2e compile failures, signature verification on ingest, last_commitment_by_peer eviction, fmt/clippy/docs) were all addressed in codex rounds 5-7. The remaining items from his review are this commit's TTL + bandwidth doc.
…credit, rate limit, §3 shield
The v12 design (notes/security-findings-2026-05-22/proposal-gossip-audit-v12.md) is now fully implemented per its §§2-6 checklist:
§2 step 5 — sticky `commitment_capable` flag
- New `PeerCommitmentRecord` struct replaces the bare
`HashMap<PeerId, StorageCommitment>`. Holds last_commitment +
commitment_capable (sticky bool) + received_at + last_sig_verify_at.
- Once a peer gossips a valid commitment, capable flips to true and
never reverts. Even if we evict the cached commitment via TTL,
sybil cap, or restart, the peer is forever treated as v12-capable
until full PeerRemoved cleanup.
§2 step 3 + §11 DoS — per-peer 60s sig-verify rate limit
- `COMMITMENT_SIG_VERIFY_MIN_INTERVAL = 60s` caps ML-DSA verify
cost per peer. Checked after cheap structural gates (RT, peer-id
binding, pubkey-binding) and before the expensive sig verify.
A sybil that bypasses the RT gate (transient bucket pollution)
can no longer burn CPU with a flood of valid-looking gossips.
§3 — bootstrap-claim shield: refuse legacy fallback for capable peers
- audit_tick_with_repair_proofs now checks the peer record up front:
if commitment_capable but no cached commitment, return Idle. The
peer is fully expected to speak v12; falling back to legacy
plain-digest would let them downgrade. We wait for fresh gossip
to refresh the cache instead.
§5 (v12) — restored conditional invalidation
- Round-8 kept the pin unconditionally on Rejected("unknown
commitment hash") to prevent the legacy-fallback bypass. Now that
§3 (above) closes the bypass directly, we can implement the v12
design verbatim:
- Case 1 (lazy rotation): stored hash == rejected H → clear pin +
forget_commitment(H). recent_provers entries lose their match
basis → §6 holder-credit drops. Lazy node earns nothing.
- Case 2 (honest rotation race): stored hash != H (fresh C2
arrived in-flight) → leave it alone. Don't clobber.
- Case 3 (stale auditor): same as case 1; clear pin, wait for
fresh gossip.
§6 — holder-eligibility threaded into quorum
- New `evaluate_key_evidence_with_holder_check` variant takes a
predicate `(peer, key) -> bool`. Returning false downgrades a
Present claim to Unresolved (we don't trust "I have it" without a
recent commitment-bound audit). Paid-list evidence is independent
(it's a property of the receiving peer's own data, not a Present
claim).
- Wired into `run_verification_cycle` via
`VerificationCycleContext`: snapshots last_commitment_by_peer +
recent_provers once per cycle (cheap; bounded by RT × 16/key) and
evaluates each key against the snapshot. Synchronous predicate
avoids re-entering the locks during evaluate.
Tests
- 3 new quorum tests:
- quorum_downgrades_uncredited_present_peers
- quorum_passes_when_all_present_peers_are_credited
- paid_list_path_unaffected_by_holder_credit
- 2 new PoC tests:
- commitment_capable_flag_is_sticky_across_eviction
- capable_but_no_commitment_starts_capable
- 557 lib + 20 PoC pass.
- cfd warning-only; deny gates clean.
What's still NOT in this PR (legitimately out of scope)
- Disk persistence of commitments. ML-DSA signatures are randomized,
so the commitment hash changes across restart even for the same
key set. Mitigated by the round-12 immediate post-startup gossip
trigger (recovery window measured in seconds, not hours). Disk
persistence is a clean follow-up optimization.
- Pre-rotation event-driven rebuild on fresh-write. A just-replicated
key is currently auditable only after the next 1h rotation. The
auditor treats this as benign staleness (round-11 + round-12); no
trust penalty. Event-driven rebuild on fresh-write would close the
gap but adds wiring complexity for marginal gain.
…ect §6 TTL MEDIUM #1: rate limit only fired for peers with cached commitments - Round-12 tracked last_sig_verify_at inside PeerCommitmentRecord, but that record is only created after a successful verify. A peer we've never successfully verified could still burn ML-DSA cost on every invalid-but-structurally-plausible gossip. - Fix: new `sig_verify_attempts: HashMap<PeerId, Instant>` map, separate from last_commitment_by_peer. Stamped BEFORE the verify on EVERY attempt (success or failure). Reading + writing happens before the expensive verify, so a flood is rejected at the cheap hashmap-lookup step. Capped at MAX_LAST_COMMITMENT_BY_PEER with oldest-timestamp eviction, and dropped on PeerRemoved (same cleanup pattern as the commitment cache). - Threaded through ingest_peer_commitment → handle_replication_message → handle_sync_response → run_neighbor_sync_round → start_*_loop spawn-scope clones (3 call sites total). MEDIUM WithAutonomi#2: §6 TTL was 4h, design says 2 × max audit interval (40 min) - 4h kept holder credit alive much longer than v10/v12 §6 specifies, weakening "must re-prove under current conditions". Default max audit interval is 20 min → TTL = 40 min. - Fix: PROVER_ENTRY_TTL bumped from 4h to 40m. Doc updated to cite the v10/v12 spec line directly. Tests - 557 lib + 20 PoC pass (no test changes needed). - cfd warning-only; deny gates clean.
Round-13 used separate read + write locks for the check and the stamp. Two concurrent ingest_peer_commitment calls from the same peer could both miss the rate-limit check and both reach ML-DSA verify within the 60s window. Fix: combine into a single write-locked critical section. Read existing timestamp, compare, then insert under the same lock. The lock is held only for a hash-map lookup + insert (microseconds), never across the expensive verify itself. Tests - 557 lib + 20 PoC pass. - cfd warning-only; deny gates clean.
…kers
The previous formula `base + per_key * k` (10s + 20ms/k) was a loose
upper bound on honest behaviour with no defensive property. A v12
verification testnet (400 nodes, mixed adversaries) confirmed that
relay attackers — peers who drop chunks and re-fetch them from
neighbours when audited — still answer correctly, but take 1-2s+ per
challenged key because they must move chunk bytes over the network.
This commit replaces the timeout formula with one sized against
honest-responder read throughput: a peer answers a k-key challenge by
reading k chunks from local disk and signing. The honest budget is
floor + (k * MAX_CHUNK_SIZE / honest_read_bps) * multiplier
with `honest_read_bps = 50 MB/s` (well below any modern SSD) and a 5×
slack multiplier for jitter and slow disks. A relay attacker fetching
the same bytes over the network sees roughly 10-100× higher latency
than disk and falls outside the envelope, so the audit times out and
fires an `application_failure` trust event.
This closes the v12 §7 documented "relay limit": relay still passes
audits cryptographically, but no longer passes them inside the time
budget. Testnet verification observed honest median 14 ms vs relay
median 1814 ms at k≈1 (129× separation), with zero false positives on
312 honest peers.
Workspace clippy table enables pedantic + nursery groups at warn level, and CI runs `-D warnings`, so accumulated lints turned every warning into a hard failure (51 clippy errors + 4 rustdoc errors). Mechanical sweep with no behavior changes: - Doc backticks around inline identifiers (PeerRemoved, HashMap, DoS, bytes_hash, recent_provers, etc.) - `_pk` -> `pk` in test code where the underscore-prefixed binding was later used - `i as u8` -> `u8::try_from(i).unwrap_or(0)` for bounded u32 casts in test code (cast_possible_truncation) - Removed redundant clones in commitment.rs unit tests - Rewrote one nested match as `map_or` (option_if_let_else) and one match as `let Ok(Some(...)) = ... else` (manual_let_else) - Merged a duplicated `#[allow]` attribute pair in mod.rs - Added scoped `#[allow(clippy::too_many_arguments)]` / `too_many_lines` / `too_long_first_doc_paragraph` / `result_large_err` on items where refactoring the signature would change public API (the PR has already been through 14 review rounds; not in scope to refactor here) - Added `# Errors` doc to `precheck_commitment_bound_challenge` - Fixed 4 rustdoc errors (private intra-doc links and one redundant explicit link target in commitment.rs / commitment_state.rs)
`Duration::add` panics on overflow. With `saturating_mul` already in the chain, an extreme `challenged_key_count` could push `scaled_secs` high enough that adding the floor overflows `Duration::MAX` and panics in production. Switch to `saturating_add` to clamp instead. No behavior change at realistic k values.
Five fixes from adversarial review of the round-15 commits: 1. audit_response_timeout: apply the honest-read multiplier BEFORE integer-dividing by bps. The previous order collapsed any k in [1..=12] to the 2 s floor (k * 4 MiB / 50 MB/s = 0 in integer arithmetic). At the canonical sqrt-scaled sample (k≈10 for a ~100-chunk store) an HDD-backed honest peer reading ~40 MiB at 20 MB/s + cross-continent RTT could exceed the 2 s budget and take audit penalties. New order gives k=10 a 6 s budget while leaving the relay envelope unchanged. 2. Prune audit uses its own deadline (10 s) rather than the relay-tightened audit_response_timeout(1) = 2 s. The relay-defence rationale doesn't apply to a single-key challenge gated by the 3-day prune hysteresis: a 2 s deadline on a cold cross-continent QUIC handshake produces spurious AUDIT_FAILURE_TRUST_WEIGHT events against honest peers under load. 3. Commitment-bound verifier: when the auditor's own local copy of a challenged key disappears between sampling and verification (pruning, expiration, LMDB compaction), return Idle instead of penalising the responder with DigestMismatch. Restores the legacy verify_digests `continue` semantics — the responder is not at fault for the auditor's storage churn. 4. Holder-credit closure honours the sticky `commitment_capable` flag. Pre-v12 peers that have never gossiped a commitment are credited unconditionally (Present evidence goes through the legacy gossip path); only peers we've seen as v12-capable but currently lack a verified commitment record are withheld. Without this, every mixed-version network would terminal-fail replica-verification rounds whose Present sources include legacy peers. 5. Empty-storage commitment rotation clears retained slots instead of leaving the stale advertised hash alive. Gossip will no longer piggyback a commitment we can no longer answer audits against, preventing remote auditors from pinning a permanently-failing pin against this node. Also use saturating_add on the floor + scaled_secs to make the arithmetic panic-free at extreme k.
Asserts that after clear_all, current() and lookup_by_hash for any previously-rotated commitment both return None, matching the contract documented on the new API.
…dit paths Five fixes from a second adversarial review pass: 1. Sticky commitment_capable flag now lives in a separate ever_capable_peers HashSet<PeerId>, independent of the last_commitment_by_peer LRU. The previous design lost the sticky bit whenever PeerRemoved cleanup or the sybil-cap eviction at MAX_LAST_COMMITMENT_BY_PEER fired, downgrading a previously-v12 peer to legacy credit-unconditionally for up to one neighbor-sync-interval. The new set is never evicted; the closure in run_verification_cycle consults it directly. Bounded growth: 1M historical peers ~= 32 MB. 2. Replica-fetch liveness: the v12 paragraph 6 holder-credit gate must not apply to keys the auditor does not locally hold. The auditor only ever creates recent_provers credit for keys it audits, and audit_tick only samples from local storage.all_keys(). For a fetch target (i.e. a key not in local storage), no audit credit is possible, so gating Present-claim quorum on credit deadlocked replica repair in a fully-v12 close group. The closure now skips the credit check when the key is not locally held; chunk-PUT payment_verifier remains the security backstop on byte arrival. 3. Stale-commitment leak: when a previously-v12 peer gossips commitment None (downgrade attempt), the cached last_commitment was kept in place, allowing the closure to continue crediting recent_provers entries bound to the old hash for up to the proof TTL (40 min). Now we clear rec.last_commitment immediately in that branch so the closure correctly withholds credit on the next verification cycle. 4. The commitment-bound verifier per-key loop now continues on a missing local key (matching legacy verify_digests semantics) and returns AuditTickResult::Idle only if every challenged key was locally unavailable. Holder credit is recorded for each actually-verified key rather than the full challenge set. A responder no longer benefits from auditor-side storage churn: they get credit only for keys we could actually verify. 5. Three sort_by closures in commitment.rs replaced with sort_by_key, matching clippy 1.95 unnecessary_sort_by.
… its growth Round-3 review caught two follow-ups on the round-2 ever_capable_peers set: 1. The §3 audit shield in audit_tick read commitment_capable only from the LRU record. When PeerRemoved or the sybil cap evicted the entry, the shield silently disengaged: a previously-v12 peer could then be audited via the legacy plain-digest path with expected_commitment_hash = None. The set's purpose (sticky-capable resistant to LRU eviction) was undermined precisely on the eviction path it was supposed to protect. Pipe ever_capable_peers through CommitmentAuditCtx and consult both the per-record bit AND the set when deciding to engage the shield. 2. ever_capable_peers had no upper bound. Cap at 4 * MAX_LAST_COMMITMENT_BY_PEER (16384). Once full we refuse new inserts rather than evict, which keeps the historic set stable and degrades to pre-round-2 behaviour for over-cap peers (legacy treatment on rejoin) instead of letting identity rotation grow memory without limit.
…enalty Two more codex findings on top of the round-3 commit: 1. Skip no-op commitment rotations. The hourly rebuild was unconditionally re-signing, and because ML-DSA-65 signatures are randomized the commitment_hash (BLAKE3 over the signed blob) changed even when the underlying key set was identical. That invalidated every recent_provers credit across the close group on every hourly tick, so steady-state large nodes stopped contributing to quorum until each key was re-audited under the fresh hash — and the audit budget per cycle is sqrt(N), far below the key count on large stores. Now we compute just the Merkle root from the current key set and compare against state.current().commitment().root. Same root means same content; skip the rotation entirely and keep the existing commitment (and its outstanding holder credits) alive. 2. Multi-key audit failures now penalise only the failing keys, not the entire challenge batch. The legacy verify_digests path tracked failed_keys and passed those to handle_audit_failure; the v12 path was incorrectly passing the full `keys` slice on the first per-key failure, escalating one bad proof into a whole-batch trust hit. Restored the legacy granularity.
A commitment-capable peer that gossips `commitment: None` was having its cached `last_commitment` cleared. Combined with the §3 audit shield (`is_capable && !has_current_commitment` -> Idle), that turned a downgrade signal into an audit evasion: the peer dropped off the audit schedule entirely while its hash-keyed `recent_provers` credit lingered until the 40-min TTL, and re-gossiping the same commitment before expiry revived the credit with no fresh audit. Keep the cached commitment pinned instead. The next audit tick still challenges the peer under it; if the data is genuinely gone the audit fails and the existing §5 `UnknownCommitmentHash` path invalidates the credit. The sticky `commitment_capable` flag keeps the peer on the v12 path. This reuses the single existing audit->§5 revocation loop rather than adding a second invalidation site, and is a net simplification.
…oc accuracy Simplicity and quality cleanup after the prod-readiness review. No protocol behavior change. Simplicity: - Drop the dead `pinned_commitment` parameter threaded through `verify_commitment_bound` and its caller. It was computed (cloning a full StorageCommitment per audit tick), passed down one layer, and discarded via `let _ =`. The pin hash alone drives the challenge; the responder answers against its own retained commitment. Removes a per-tick clone. - Gate the test-only one-shot helpers behind `#[cfg(any(test, feature = "test-utils"))]`: `build_commitment_bound_audit_response` (preload builder) and `verify_commitment_bound_response` (whole-response verifier). Production uses the streaming precheck + per-key split; these exist only so tests assert on a fully-built response. Gating them keeps a live caller from taking the preload path that the streaming rewrite deliberately replaced. `tests/poc_commitment_audit_attacks` now declares `required-features = ["test-utils"]`. Quality: - Replace the `unreachable!` in the responder precheck match with a graceful `AuditResponse::Rejected` — the project bans panics on production paths even when the arm is currently unreachable. - Slice-pattern the duplicate-key window check in `MerkleTree::build` (`if let [a, b] = w`) instead of `w[0]`/`w[1]` indexing. Docs: - Reword the relay-timeout comments: it is an economic deterrent calibrated for residential bandwidth, not a hard cryptographic bound (a datacenter relay at >=1 Gbps can still answer in time). The binding guarantee remains commitment-binding. - Fix stale "two-slot/~2h" retention docs to "four-slot/~4h" to match RETAINED_COMMITMENT_SLOTS=4.
Adds tests/poc_audit_handler_live.rs covering the live responder entry
point handle_audit_challenge_with_commitment against a real LmdbStorage
+ ResponderCommitmentState — the control-flow branches that turn a
verdict into an AuditResponse, which the pure-verifier PoCs did not
exercise. Six cases:
- pinned honest responder -> CommitmentBound hashing to the pin
- pinned unknown hash -> Rejected "unknown commitment hash"
- pinned key-not-in-commitment -> Rejected "key not in commitment"
- pinned committed key, bytes deleted -> Rejected "missing bytes for
committed key" (the core Finding-1 storage-binding guarantee)
- bootstrapping responder -> Bootstrapping
- unpinned challenge -> legacy Digests (rollout back-compat)
Each is a genuine behavioural assertion: temporarily removing the
missing-bytes guard flips the missing-bytes test to FAIL (verified via
a manual mutation, reverted), so these are not tautologies — the
responder is the production code path.
CI gap fix: the v12 attack PoCs were compiled but never executed in CI
(only `cargo test --lib` and `--test e2e` ran). Added explicit
`[[test]]` entries (required-features = ["test-utils"]) for both
poc_commitment_audit_attacks and poc_audit_handler_live, and a CI step
that runs them. The 20 attack PoCs + 6 handler tests now gate merges.
…lose test/CI gaps Three must-fix items from the prod-readiness review loop. SECURITY (mod.rs handle_audit_result, Failed arm): a confirmed commitment-bound audit failure emitted a trust event but never dropped the peer's recent_provers holder credit. forget_commitment fired only on an explicit "unknown commitment hash" reply — but genuine byte loss surfaces as DigestMismatch / "missing bytes for committed key", which routed through the Failed arm and left §6 credit intact for the full 40-min proof TTL. This completes the storage-binding loop the None-downgrade fix relies on: any non-Timeout AuditFailureReason now calls recent_provers.forget_peer(challenged_peer) immediately. Timeout is deliberately excluded — a single dropped packet must not strip an honest peer; the 40-min TTL is the liveness cushion there. Threaded recent_provers into handle_audit_result + both call sites. CI: the live audit-handler suite (tests/poc_audit_handler_live.rs) was added but never wired into CI, so a regression in the production storage-binding path would compile and pass. Added the missing `cargo test --test poc_audit_handler_live --features test-utils` step. TEST: silent_peer_earns_no_credit was tautological — it asserted no credit against an empty RecentProvers::new(), which credits no one by construction and would pass even if every defence were deleted. Replaced with confirmed_audit_failure_revokes_holder_credit, which records genuine credit, applies the handler's forget_peer revocation, and asserts credit is gone (with a precondition assert so it can't pass vacuously). Kept a minimal empty-cache baseline as a separate one-line assertion.
The prior commit's credit-revocation fix was correct but unguarded: the
production wiring (handle_audit_result's Failed arm -> forget_peer) had
no test that would fail if the call were removed. The cache-level test
called forget_peer directly, so a regression dropping the handler's
revocation compiled and passed CI (confirmed by mutation).
Extract the decision + revocation into two pure, test-visible helpers
that the handler now calls verbatim:
- audit_failure_revokes_holder_credit(reason) -> bool (Timeout=false)
- apply_audit_failure_credit_revocation(provers, peer, reason)
Add unit tests that exercise exactly what the handler runs:
- confirmed_failures_revoke_credit_timeout_does_not: pins the decision
table (DigestMismatch/KeyAbsent/Rejected/MalformedResponse revoke;
Timeout does not).
- apply_revocation_strips_on_digest_mismatch_retains_on_timeout:
records genuine credit, applies the helper, asserts credit is gone
on DigestMismatch and RETAINED on Timeout. Non-vacuous (precondition
assert) and mutation-verified: no-op'ing the revocation makes it
FAIL.
This closes the test-authenticity gap the review loop flagged — the
highest-value security line is now backed by a test that fails on its
removal. No behaviour change; the handler does exactly what it did.
Summary
v12 storage-bound audit (
notes/security-findings-2026-05-22/proposal-gossip-audit-v12.md), end-to-end and design-complete. Closes Finding 1 (lazy node defeats audit via on-demand fetch) and Finding 2 (bootstrap-claim shield).This is a breaking wire change (StorageCommitment now embeds the sender's ML-DSA-65 public key; old peers will fail to decode). Auto-upgrade handles the version break.
Design checklist (v10 §§1-13 + v12 §5)
commitment_capableflagUnknownCommitmentHashinvalidationis_credited_holderthreaded into quorum (evaluate_key_evidence_with_holder_check)What this PR ships
commitment.rs,commitment_state.rs(incl. newPeerCommitmentRecord),commitment_audit.rs,recent_provers.rs.PeerRemoved.evaluate_key_evidence_with_holder_checkdowngrades uncredited Present claims to Unresolved. Wired intorun_verification_cycle.Security review
15 rounds of codex review + David's PR review. All findings addressed. Selected highlights:
BLAKE3(pubkey) == sender_peer_idsig_verify_attemptsmap stamped on every attemptThreat coverage (20 PoC tests in
tests/poc_commitment_audit_attacks.rs)AuditVerifyErrorvariantcommitment_capableflag is sticky across evictioncapable_but_no_commitmentconstructor preserves capabilityPlus 60 unit tests across the four new modules (including 3 new quorum tests for the holder-credit predicate).
Tests + CI
cargo fmt --checkclean.cargo clippy --all-targets --all-featureswarning-only; the deny gates (-D clippy::panic -D clippy::unwrap_used -D clippy::expect_used) pass.Test plan
cargo test --lib: 557 passcargo test --test poc_commitment_audit_attacks: 20 passOut-of-scope (future PRs)