Skip to content

QUIC rewrite + rendezvous + relay + mTLS + 16 audit fixes#3

Open
cdcseacave wants to merge 26 commits into
developfrom
quic
Open

QUIC rewrite + rendezvous + relay + mTLS + 16 audit fixes#3
cdcseacave wants to merge 26 commits into
developfrom
quic

Conversation

@cdcseacave
Copy link
Copy Markdown
Owner

Summary

Lands the four-phase QUIC rewrite plus the 16-finding security/robustness audit on develop.

  • Phase 0 — replace TCP + sliding window with QUIC + TLS 1.3 (cert-pinned). Per-chunk uni streams; per-chunk CRC / ACK / window code paths removed.
  • Phase 1 — new p2p-rendezvous crate + rendezvousd binary; --rendezvous / --code CLI flags; STUN-based NAT classification and UDP hole punching via race_connect_and_accept.
  • Phase 2rendezvousd --relay-bind UDP forwarder for symmetric-NAT pairs; QUIC TLS still terminates end-to-end (the relay sees ciphertext only).
  • Phase 3 — GUI Pair-with-code mode (off-thread establishment via Command::perform), nat-test --rendezvous self-loop punch.
  • Audit — 16 findings (4 Critical, 6 High, 6 Medium): drained QUIC streams, u64 chunk indices, bounds-checked chunk_index, hard receiver SHA-256, path-traversal sanitizer, mutual TLS with cross_check_fingerprint enforcement, address-validated accept_from, STUN tx-id check, rendezvous concurrency cap + TCP-sourced public IP, relay slot pre-binding + 65 KiB recv buffer + off-hot-path idle eviction, typed Error::Disconnected framing.
  • Docs — README, DESIGN, TODO, CHANGELOG and per-crate AGENTS.md (including a new p2p-rendezvous/AGENTS.md) all updated.

Per the project's "no backwards compatibility on redesigns" rule, the four phases and the audit fixes change wire formats and call sites in place — no compat shims.

Test plan

  • cargo build --workspace clean
  • cargo clippy --workspace --all-targets -- -D warnings clean
  • cargo test --workspace green (65 p2p-core unit + 7 p2p-rendezvous unit + 2 p2p-gui unit + 3 integration + 3 doc tests)
  • Local smoke: two p2p-transfer instances on loopback exchange a multi-MB file; both SHA-256s match
  • Cross-NAT smoke: two clients pair through a self-hosted rendezvousd (direct punch)
  • Relay smoke: 100 MB transfer through rendezvousd --relay-bind with both clients forced into relay mode
  • GUI smoke: Pair-with-code tab establishes a session and the UI stays responsive during the wait

🤖 Generated with Claude Code

cdcseacave and others added 7 commits May 22, 2026 23:40
Single QUIC transport on a single UDP socket per endpoint, with per-chunk
unidirectional streams instead of the sliding window. Mandatory TLS 1.3
with per-device Ed25519 + self-signed cert pinned by SHA-256 fingerprint
(generated and persisted to <config_dir>/p2p-transfer/identity.{key,cert}
on first run; published in LAN beacons and via --peer-fingerprint).

Removes: tcp.rs, window.rs (sliding window), per-chunk CRC32 + crc32fast,
ChunkAck / AckStatus / ChunkMessage, ENCRYPTION + WINDOWED capability
bits, --window-size + --max-retries CLI flags, the legacy blocking
nat.rs (collapsed into traversal/stun.rs). PROTOCOL_VERSION bumped to 2;
equality check only — no v1 compat code.

Verified: cargo test --features full and cargo clippy --all-targets
--all-features -- -D warnings green; LAN loopback transfer of a 2 MB
file is byte-identical SHA-256 at ~80 MB/s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New workspace crate p2p-rendezvous with a pairing-by-code MessagePack
protocol (tiny: register-and-await on TCP) and a rendezvousd binary
operators can self-host. p2p-core::traversal::establish_via_rendezvous
binds a UDP socket, runs STUN on it (the same socket quinn will own),
registers the public endpoint + cert fingerprint + device id with the
rendezvous under a short shared code, and on peer match races
QuicEndpoint::connect against accept on the punched socket — QUIC
Initial packets themselves do the punching.

Symmetric NAT is detected up front via stun::classify_nat (two STUN
servers, compare mapped ports) and surfaces Error::HolePunchFailed
without an attempt. New tests/traversal_loopback_test.rs covers the
rendezvous + race_connect_and_accept primitives end-to-end on localhost.

CLI gains --rendezvous <host:port> and --code <code> on send/receive
that take precedence over --peer and --discover when present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`rendezvousd --relay-bind <addr> --max-relay-mbps <n>` runs a tiny UDP
packet forwarder alongside the rendezvous. When STUN spots symmetric
NAT (or --force-relay is set), the registrant asks for relay mode and
the rendezvous returns Message::RelayMatch with a fresh 16-byte session
token + the relay address + the peer fingerprint. Each peer sends a
RelayHello so the relay records its source address, then runs a normal
QUIC handshake with the relay's address as the apparent peer endpoint.
The relay forwards UDP packets verbatim — QUIC TLS still terminates
end-to-end so the relay only sees ciphertext.

New: p2p_rendezvous::relay::{Relay, RelayHello}, Message::RelayMatch,
RegisterRequest.want_relay (defaults to false for back-compat),
register_full + MatchOutcome enum, --force-relay CLI flag, --relay-bind
and --max-relay-mbps on rendezvousd, tests/relay_loopback_test.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es (Phase 3)

GUI Connection tab gains a third mode "Pair with code (cross-NAT)" with
rendezvous-server + shared-code inputs and a Generate button. The
existing Connect mode now also exposes the --peer-fingerprint field
needed for direct mode. Session establishment runs inside Command::perform
and only the resulting P2PSession is wrapped in Arc<tokio::Mutex<...>>,
so the iced message loop stays responsive during a multi-second
rendezvous wait.

CLI `nat-test --rendezvous <host[:port]>` runs a real self-loop punch
test: spawns two local peers, registers both at the rendezvous with a
fresh code, races a QUIC handshake between them, and reports direct /
relay / failed plus latency.

Two underlying bugs surfaced while validating end-to-end:

  1. The race_connect_and_accept tokio::select! could pick different
     directions on each peer, leaving them on mismatched connections.
     Replaced with a deterministic role split keyed on device_id —
     smaller device_id runs connect(), larger runs accept(). This also
     fixes the "two processes sharing the same persistent identity
     deadlock both as Responder" case.

  2. quinn's open_bi is local-only — the responder's accept_bi only
     unblocks once the initiator writes to the stream. Have
     open_control_initiator write the 4-byte PROTOCOL_MAGIC immediately
     after open_bi, and open_control_responder consume it after
     accept_bi. The control stream is now visible to both sides without
     waiting for the application's first send_message.

Verified: cargo clippy --features full --all-targets -- -D warnings
clean; cargo test --features full all green; manual end-to-end of
`send <file> --rendezvous --code --force-relay` ↔
`receive --rendezvous --code --force-relay` produces a byte-identical
SHA-256 file at ~55 MB/s through a localhost relay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Data integrity (C1, C3, H4, M4, C4):
 - Drain QUIC uni-streams via stream.stopped() so the last chunk
   isn't lost when the sender closes the connection.
 - Widen chunk indices (ChunkReader.total_chunks, read_chunk,
   fold_chunk, ChunkWriter.write_chunk) to u64; large files no
   longer truncate at the 2^32-chunk boundary.
 - Bounds-check the wire-supplied chunk_index against total_chunks
   in receive_file.
 - Receiver SHA mismatch is now a hard Error::Verification, not a
   silent warn.

Path sanitization (M3): receiver and scan_folder both run a
sanitize_relative_path that rejects absolute paths, .., ., drive
and root components.

mTLS (H1): server now requires a client cert via a custom
AcceptAnyClientCert verifier; client presents its cert in
with_client_auth_cert. cross_check_fingerprint rejects missing
observations, closing the responder-side HELLO bypass.

NAT traversal (C2, H5, M1):
 - race_connect_and_accept launches both connect and accept on
   both sides; the larger-device-id peer staggers its connect by
   50ms to avoid Initial-packet collisions.
 - accept_from loops on accept and drops connections whose
   source address doesn't match the rendezvous-supplied peer.
 - STUN query rejects responses whose transaction id doesn't
   match the request.

Rendezvous + relay (H6, M6, H3, H2, M5):
 - Server::bind_with takes a max_concurrent cap (default 1024)
   enforced via tokio::sync::Semaphore.
 - Server rewrites the RegisterRequest IP to the TCP peer's IP
   (keeping the user-supplied UDP port), blocking traffic
   reflection through forged public_endpoint.
 - Relay recv buffer up to 65 KiB; warns on full-buffer reads.
 - Relay slot binding is now fingerprint-keyed lookup;
   reserve_session refuses identical fingerprints.
 - Idle-session eviction moved to a 30s background task off the
   per-packet forward path.

Typed EOF (M2): framing::read_message maps UnexpectedEof on the
magic read to Error::Disconnected; frame-interior short reads
become Error::Protocol. Drops the string-matching arm in
session.rs::run_event_loop.

Per the project's no-backwards-compat rule, no compat shims
were added: wire formats and call sites changed in place. New
tests cover sanitization, STUN tx-id, IP rewriting, concurrency
cap, cross-check, peer-address matching, and typed EOF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README: highlights now cover mTLS, rendezvous + hole punching, relay
  fallback, u64 chunk indices, path sanitization, hard SHA verification;
  add Security Model section.
- DESIGN: mTLS identity model (both ends pin), updated module map with
  traversal/punch.rs + p2p-rendezvous, expanded NAT-traversal phases
  with anti-reflection IP rewrite + STUN tx-id validation + relay slot
  pre-binding; new "Security & robustness guarantees" section
  enumerating the 16 invariants the data path enforces.
- TODO: mark "Security & robustness hardening" done (2026-05-23) with
  bullet summary of all 16 findings.
- CHANGELOG: add dated "Fixed - 2026-05-23 - Security & robustness
  audit (16 findings)" entry grouped by Data integrity / Security /
  Robustness, listing each C/H/M id alongside the fix.
- Root AGENTS: workspace now lists 4 member crates + 3 integration
  tests; new gotchas covering u64 chunk indices, sanitize_relative_path,
  mTLS, accept_from source validation, server-side public-IP rewrite,
  no-backwards-compat rule.
- p2p-core AGENTS: mTLS layer description, traversal/punch.rs entry,
  receiver SHA mismatch + path sanitization + STUN tx-id sections.
- p2p-cli AGENTS: SessionParams includes --rendezvous/--code/
  --force-relay, drop the stale --window-size / --max-retries
  mentions, document nat-test self-loop mode.
- p2p-gui AGENTS: ConnectionMode triplet (Listen/Connect/Rendezvous),
  Command::perform pattern for off-thread establishment.
- p2p-rendezvous AGENTS: new file - server (concurrency cap +
  public-IP rewrite), relay (slot pre-binding + off-hot-path eviction),
  client, rendezvousd binary, wire protocol.
- CONTRIBUTING + .github/copilot-instructions: project structure tree
  refreshed (p2p-rendezvous, traversal/punch.rs, integration tests,
  test counts no longer claim exact numbers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 23, 2026 09:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR lands a major transport redesign and pairing stack for the project: replacing the legacy TCP + sliding-window protocol with QUIC (TLS 1.3, mutual cert auth + fingerprint pinning), adding rendezvous + relay infrastructure for cross-NAT pairing, and incorporating a set of security/robustness audit fixes across the core library, CLI, and GUI.

Changes:

  • Introduces QUIC transport + mTLS identity model (persistent device identity, fingerprint pinning, updated handshake/protocol framing).
  • Adds cross-NAT pairing via new p2p-rendezvous crate (rendezvousd server + optional UDP relay) and wires it into CLI/GUI flows.
  • Updates discovery beacons, tests, and docs; removes legacy TCP and sliding-window modules.

Reviewed changes

Copilot reviewed 61 out of 61 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/traversal_loopback_test.rs New loopback rendezvous + punch smoke test.
tests/relay_loopback_test.rs New loopback rendezvous + relay + QUIC handshake smoke test.
tests/integration_test.rs Updates workspace integration test to QUIC + P2PSession handshake.
p2p-rendezvous/src/protocol.rs Defines rendezvous wire protocol messages and types.
p2p-rendezvous/src/lib.rs Crate entry + private framing helpers and re-exports.
p2p-rendezvous/src/client.rs Implements rendezvous client registration and match handling.
p2p-rendezvous/src/bin/rendezvousd.rs Adds rendezvousd binary (server + optional relay).
p2p-rendezvous/Cargo.toml Adds new workspace crate manifest.
p2p-rendezvous/AGENTS.md Adds crate-specific developer guidance.
p2p-gui/src/views/settings.rs Removes legacy window-size UI input.
p2p-gui/src/views/connection.rs Adds rendezvous pairing UI and peer fingerprint input for direct connect.
p2p-gui/src/state.rs Adds rendezvous mode state + removes window-size setting.
p2p-gui/src/operations.rs Implements rendezvous pairing + fingerprint parsing + updated establish calls.
p2p-gui/src/message.rs Adds messages for rendezvous + fingerprint input; removes window-size message.
p2p-gui/Cargo.toml Adds hex dependency for fingerprint decoding.
p2p-gui/AGENTS.md Adds GUI-specific developer guidance and pairing mode notes.
p2p-core/src/window.rs Removes legacy sliding-window implementation.
p2p-core/src/verification.rs Drops CRC32 and keeps file-level SHA-256 verification with improved messages.
p2p-core/src/traversal/stun.rs Adds async STUN query + NAT classification on borrowed UDP socket.
p2p-core/src/traversal/punch.rs Adds QUIC-based race connect/accept punch with address validation.
p2p-core/src/traversal/mod.rs Adds traversal orchestrator for STUN + rendezvous + punch/relay.
p2p-core/src/tls.rs Adds rustls QUIC TLS config (pinning verifier + mTLS server config).
p2p-core/src/reconnect.rs Removes legacy transient-error helper in favor of Error::is_recoverable.
p2p-core/src/protocol.rs Updates control-plane protocol for QUIC era; adds cert fingerprints; removes chunk ACK/CRC/window fields.
p2p-core/src/network/udp.rs Updates discovery beacons to include cert fingerprint.
p2p-core/src/network/tcp.rs Removes legacy TCP transport.
p2p-core/src/network/mod.rs Switches network module to QUIC-only transport.
p2p-core/src/network/framing.rs Improves framing error typing (Disconnected vs Protocol).
p2p-core/src/lib.rs Updates exports/constants for QUIC era (protocol v2, ALPN, default rendezvous port).
p2p-core/src/known_peers.rs Adds TOFU known-peers fingerprint store.
p2p-core/src/identity.rs Adds persistent per-device identity (Ed25519 + self-signed cert).
p2p-core/src/handshake.rs Updates handshake for QUIC + fingerprint cross-check under mTLS.
p2p-core/src/error.rs Adds QUIC/TLS/rendezvous traversal error variants + retry classification.
p2p-core/src/discovery.rs Threads cert fingerprint into discovery manager/service.
p2p-core/Cargo.toml Updates dependencies for QUIC/rustls/rcgen + adds rendezvous crate.
p2p-core/AGENTS.md Adds crate-specific developer guidance for the QUIC rewrite.
p2p-cli/src/send.rs Updates send flow to QUIC identity, rendezvous path, and fingerprint pinning.
p2p-cli/src/resume.rs Updates resume to require peer fingerprint and connect via QUIC.
p2p-cli/src/rendezvous.rs Adds CLI helper for --rendezvous/--code establishment.
p2p-cli/src/receive.rs Updates receive flow to QUIC identity, rendezvous path, and fingerprint pinning.
p2p-cli/src/nat_test.rs Updates NAT test to STUN-based classification + optional self-loop rendezvous punch.
p2p-cli/src/lib.rs Wires updated nat-test and resume args; adds rendezvous module.
p2p-cli/src/discover.rs Updates discover output to include peer fingerprint.
p2p-cli/src/cli.rs Adds session flags for fingerprint + rendezvous pairing; removes window-size/max-retries transfer flags.
p2p-cli/Cargo.toml Adds deps for rendezvous + hex parsing.
p2p-cli/AGENTS.md Adds CLI-specific developer guidance reflecting QUIC + rendezvous changes.
CONTRIBUTING.md Updates repo layout documentation for new crates/modules.
CHANGELOG.md Adds detailed unreleased changelog entry for QUIC rewrite + rendezvous/relay + audit fixes.
Cargo.toml Adds p2p-rendezvous to workspace and dev-deps for tests.
.github/copilot-instructions.md Updates project guidance to reflect QUIC/rendezvous architecture and flags.

Comment thread p2p-core/src/traversal/stun.rs
Comment thread p2p-cli/src/rendezvous.rs Outdated
Comment thread p2p-gui/src/operations.rs
Comment thread p2p-cli/src/nat_test.rs Outdated
Comment thread p2p-cli/src/cli.rs Outdated
Comment thread p2p-core/src/network/framing.rs Outdated
cdcseacave and others added 9 commits May 23, 2026 13:03
`run_cli_sync` only printed the "GUI not available" hint when
`cli.command.is_none()`. An explicit `p2p-transfer gui` fell through
to the async dispatcher, which hit `unreachable!()` and panicked.
Both code paths now match on `None | Some(Commands::Gui)` and exit
with the same hint, keeping the dispatcher's `unreachable!()` arm
truly unreachable.

Found during PR smoke testing on a workspace build that compiled
p2p-cli without the gui feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- STUN query loops until QUERY_TIMEOUT, dropping packets whose source
  isn't the queried server or whose tx-id doesn't match (instead of
  failing on the first mismatched packet — fixes stale-response
  false negatives when the socket is shared).
- IPv6-safe rendezvous host parsing: new `p2p_core::with_default_port`
  helper handles bare/bracketed IPv6, hostname:port, and bare IP forms.
  Replaces three `contains(':')` checks in p2p-cli rendezvous/nat_test
  and p2p-gui operations.
- `read_message` distinguishes between 0-byte EOF before any magic
  byte (Disconnected — clean between-frames close) and partial-magic
  truncation (Protocol — mid-frame).
- `--code` CLI help text no longer references the nonexistent
  `pair --new` subcommand.
- Apply `cargo fmt --all` across the workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rust 1.95 (current CI stable) adds `clippy::unnecessary_sort_by`, which
flags the `sort_by(|a, b| b.x.cmp(&a.x))` pattern in two history files
and breaks `cargo clippy -- -D warnings` on all three test runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Code Coverage job fails when codecov.io returns 429 (rate limit
for tokenless uploads). Switch fail_ci_if_error to false so an upload
failure does not gate the PR — coverage is informational, and the
codecov report still appears whenever the upload succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- DEFAULT_CHUNK_SIZE 64 KB → 1 MiB (CLI flag --chunk-size and
  TransferConfig::chunk_size_kb defaults follow). Under QUIC the
  chunk is no longer the ACK unit (packets are), so chunk size now
  only affects per-chunk overhead (stream setup, progress events,
  SHA-256 segmentation) and resume granularity. Larger chunks
  amortize that overhead ~16x with negligible cost on resume.
- quinn TransportConfig flow-control windows: stream_receive_window
  = 8 MiB, receive_window / send_window = 64 MiB. quinn's defaults
  (~1.25 MB / ~12.5 MB) stall on high-BDP links (e.g. gigabit at
  30 ms RTT ≈ 3.75 MB BDP — just at the connection-window limit).
- Drop the now-dead `max_chunks_in_flight`, `chunk_timeout_ms`, and
  `max_chunk_retries` fields from `TransferConfig` — they were
  declared but never read after the Phase 0 sliding-window removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-testing the CLI surface found four real product issues and two
related minor wins. All fixes are TDD: a failing test was added first,
then the minimal change to make it green.

ISSUE 4 — ReconnectConfig::default retried forever
  max_attempts: 0 (unlimited) meant a sender that lost its peer would
  back off exponentially to 180s and never surface the "resume with: ..."
  hint that the error branch was already wired to print.
  - Default is now max_attempts: 5 (3+6+12+24+48s ≈ 1.5min total).
  - New flag --max-reconnect-attempts on Send and Resume; pass 0 to opt
    back into infinite retries.
  - p2p-core/src/reconnect.rs: test_default_caps_at_5_attempts

ISSUE 1 — resume CLI rejected single-file transfers
  resume.rs guarded `if !path.exists() || !path.is_dir()`, so the
  README's documented `--path ./bigfile.bin` always errored with
  "Folder path does not exist or is not a directory". send_path itself
  has no folder requirement.
  - Dropped the is_dir() half; renamed log + help text.
  - p2p-cli/src/resume.rs: accepts_file_path + rejects_nonexistent_path

ISSUE 3 — no way to run two CLI processes on one host
  Identity::load_or_generate was hardcoded to dirs::config_dir(); on
  Windows that resolves via SHGetKnownFolderPath and ignores $APPDATA,
  so two child processes always shared the on-disk identity. The relay
  then correctly refused them as same-fingerprint impostors — blocking
  live-CLI loopback testing of the relay path.
  - Collapsed load_or_generate / load_or_generate_in into a single
    load_or_generate(dir: Option<&Path>); None preserves the default.
  - New global --identity-dir flag threaded through send/receive/
    discover/resume; GUI calls pass None.
  - p2p-core/src/identity.rs: distinct_dirs_yield_distinct_fingerprints

ISSUE 2 — `history` was never populated by the CLI
  TransferHistory::add_record was only called from p2p-gui/operations.rs;
  CLI's send/receive paths never recorded, so `p2p-transfer history`
  always reported "No transfer history found".
  - New helper p2p_core::history::record_transfer(record, path?) that
    handles load-or-empty + append + persist in one call.
  - Wired into send.rs (single-shot) and receive.rs (per-iter loop;
    replaces run_event_loop in CLI only — GUI still uses it).
  - p2p-core/src/history.rs: record_transfer_appends_and_persists

MINOR 1 — history concurrent writes clobbered each other
  load → push → save races between co-located sender/receiver: a 16-task
  stress test kept only 1 of 16 records.
  - record_transfer now does the read-modify-write under an OS-level
    advisory exclusive lock via fs2; runs on a blocking task so it
    doesn't stall the async runtime.
  - p2p-core/src/history.rs: record_transfer_concurrent_appends_dont_clobber

MINOR 2 — `history` output was hidden by `-v warn`
  handle_history used info! for every display line, so anything below
  info verbosity printed nothing despite a populated history file.
  - Display lines moved to println! (stdout, unconditional). Logging-
    style noise removed (emojis replaced with plain ASCII tags).

Test deltas:
  baseline:    90 tests pass
  after fixes: 95 tests pass (5 new TDD tests, 0 regressions)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three iterations of an end-to-end Bash harness that drives the built
binaries against each documented CLI surface:

  stress.sh    — broad sweep (T0-T10): direct send/receive, folder,
                 throttle, discover, nat-test (STUN + rendezvous self-loop),
                 rendezvous-mediated transfer, relay path, resume + history
  stress_v2.sh — re-runs the v1 failures with corrected verbosity and
                 per-process identity dirs (uncovered the relay same-
                 fingerprint refusal and the resume-CLI is_dir bug)
  stress_v3.sh — only the previously-failing tests, with corrected
                 expectations (24 MB file for the 2-s burst window,
                 receiver-kill so the sender persists state, etc.)

These are the harnesses used to find the four bugs fixed in 59cf3ca.
Run from repo root:
  bash smoke/src/stress.sh

Logs land in target/tmp/stress*/<test>-{send,recv,rvz}.log for forensics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`FileTransferSession::send_file` skips chunks already present in the
caller-supplied `completed_chunks` bitmap — so on a resume from N of M
chunks done, the sender opens (M - N) unidirectional streams, not M.
`receive_file` looped `while received < total_chunks` and unconditionally
awaited M streams, so the receiver waited forever for the N streams the
sender intentionally never opened.

The deadlock was masked by the Issue 1 guard (resume CLI rejected file
paths outright). Now that single-file resume works, the underlying
chunk-count drift surfaces — `smoke/src/stress_v4.sh::T10b` reproduced it.

Fix: `receive_file` now takes a separate `streams_to_receive` parameter.
The caller (`FolderTransferSession::receive_folder`) computes it from
`TransferInfo.resume_from.completed_chunks.len()`, which the sender
already populates on every transfer. The existing `total_chunks`
parameter stays as the bounds check on incoming `chunk_index` values.

TDD:
  RED  p2p-core/src/transfer_file.rs::tests::resume_with_skipped_chunks_does_not_deadlock
       — loopback pair, sender supplies `completed_chunks=[0,1]`, receiver
         times out at 5 s on the buggy code.
  GREEN passes in 70 ms.

Verified live: stress_v4.sh T10b now PASSES end-to-end (sha256 match
after interrupt + resume of a single-file transfer).

Test deltas: 95 → 96 (one new test, zero regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on stress_v1-v3 with the capabilities that landed in the recent
fixes:
  - --identity-dir per process to drive a real CLI relay loopback (T9)
  - --max-reconnect-attempts bound to make T10 finite (T10a/T10b)
  - resume with a file path (T10b — previously unreachable)
  - history populated by the CLI itself + readable at -v warn (T11)
  - concurrent CLI processes appending to history (T12, 4-pair parallel)

Final run: 21/21 PASS on the current branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 69 out of 69 changed files in this pull request and generated 5 comments.

Comment thread p2p-cli/src/receive.rs Outdated
Comment thread p2p-core/src/traversal/mod.rs
Comment thread p2p-core/src/protocol.rs
Comment thread p2p-core/AGENTS.md Outdated
Comment thread p2p-core/src/known_peers.rs
cdcseacave and others added 8 commits May 24, 2026 21:15
…e chunk_size source

PR-1 of the post-review remediation plan. Addresses five review findings:

* 1.3 — DEFAULT_CHUNK_SIZE is now the single source of truth. ConfigMessage::default
  and AppSettings::default both derived 64 KiB; sender CLI defaulted to 1 MiB so any
  session where the 64-KiB-default side won the handshake silently downgraded chunk
  size, neutralising the f07aae4 perf retune. Added regression test
  default_chunk_size_tests::config_message_default_matches_default_chunk_size.

* 1.4 — history.rs no longer wipes the user's transfer log when a single bad byte
  appears in history.json. unwrap_or_default() on parse failure is replaced with a
  rename-aside-and-quarantine flow: corrupt bytes land in history.json.corrupt-<ts>
  and a fresh history is written. Regression test added.

* 1.5 — history.rs append is now atomic and durable. Old code did seek(0)+set_len(0)
  +write_all in place with no sync_all, so a crash mid-write left an empty file and
  a clean exit could still lose the last record from page cache. New flow writes to
  history.json.tmp, fsyncs, then atomically renames; the OS-level exclusive lock is
  held on a sibling .lock file (stable identity across the rename of the data file).

* 5.1 — TransferRecord::{new,complete,interrupt,fail} no longer panic on a
  pre-1970 SystemTime. unwrap() is replaced with unwrap_or(0); a bad clock now
  degrades the timestamp instead of crashing the receive loop.

* 5.2 — fs2::FileExt::unlock failure after a successful write+fsync+rename is now
  a warn! rather than an Err return — the record is already durable; reporting Err
  would mislead callers about the outcome.

AGENTS.md cleanup: stale "DEFAULT_CHUNK_SIZE = 65536" references replaced with
1 MiB plus a "single source of truth" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-2 of the post-review remediation plan. Addresses Wave 1 items 1.1 and 1.2.

* 1.1 — Multi-file folder resume no longer deadlocks when the sender skips
  files marked complete in a prior session. Before the fix the sender opened
  zero streams (and sent no FileChecksum) for indices in state.completed_files
  while the receiver iterated every entry in transfer_info.items and blocked
  in accept_uni() forever waiting for streams that would never arrive. New
  protocol field TransferInfo.completed_files carries the sender's skip set
  so the receiver can fast-forward in lock-step. Regression test
  multi_file_resume_with_completed_files_does_not_deadlock added with a 5 s
  timeout so a future re-introduction surfaces as a failed test, not a hang.

* 1.2 — FolderTransferState now persists the negotiated ConfigMessage
  (chunk_size, compression_enabled, compression_level, adaptive_compression,
  bandwidth_limit) and exposes to_config_message() for resume. resume.rs
  uses that instead of ConfigMessage::default(); without this the .partial
  on disk (laid out under the original chunk_size) and the resumed session's
  ChunkWriter offsets would disagree, silently corrupting the file. The
  FolderTransferState::new signature now takes &ConfigMessage explicitly so
  the chunk_size source-of-truth is clear at every call site. Regression
  test state_remembers_negotiated_chunk_size_across_serde_roundtrip added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c, DoS cap

PR-3 of the post-review remediation plan. Addresses Wave 1 items 1.6, 1.7, 1.8
and Wave 4 item 4.1.

* 1.6 — receive_file now tracks distinct chunk_indices in a HashSet rather
  than counting raw stream arrivals. A buggy or hostile sender that re-opens
  the same chunk_index used to satisfy the count while a real chunk went
  missing (silent SHA-256 mismatch at end-of-transfer); the duplicate is now
  dropped with a warn and the loop continues until streams_to_receive
  DISTINCT indices have arrived. Regression test added.

* 1.7 — ChunkWriter::new now takes expected_file_size and truncates an
  over-long .partial back to that length on open. Handles chunk_size drift
  between sessions and external truncation that left stale trailing bytes.
  receive_file passes total_chunks * chunk_size as the expected size, plus
  an up-front sanity check that streams_to_receive <= total_chunks.

* 1.8 — write_chunk now calls sync_data (not just flush) before returning,
  so the bytes are durable BEFORE chunk_complete_callback fires and the
  resume state advertises the chunk as complete. Without this a crash
  between the write and finalize()'s sync_all would leave the resume
  metadata lying about which chunks are on disk.

* 4.1 — receive_folder validates every peer-supplied file size against
  MAX_TRANSFER_FILE_SIZE (1 TiB) before opening a single stream. Without
  this a hostile peer could advertise an enormous size, computing a huge
  total_chunks and pinning the receiver in accept_uni() forever
  (KEEPALIVE_INTERVAL keeps the QUIC idle timeout from saving us).
  Unit test validate_file_size_rejects_absurd_values added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…summary, re-accept

PR-4 of the post-review remediation plan. Addresses Wave 2 (all four
receive.rs regressions from commit 59cf3ca) plus the related Wave 3 item 3.2.

* 2.1 — session.receive_to now takes an accept_decision callback. The CLI
  passes |info| accept_or_prompt(auto_accept, info), which prints the
  incoming transfer's filenames + size to stderr and reads y/N from stdin
  when --auto-accept is false. On Reject the receiver sends Message::Cancel
  and the sender returns Err(Cancelled) without opening any chunk streams.
  The GUI variant rejects when auto-accept is off (modal dialog can be
  wired later). Regression test
  receive_folder_reject_sends_cancel_and_returns_empty_summary added.

* 2.2 — receive loop no longer matches Error::Network in the disconnect arm.
  That variant wraps tokio::fs::Error via #[from] std::io::Error, so disk
  failures (ENOSPC, EACCES on the output dir, broken symlink) used to be
  swallowed as if the peer cleanly hung up. They now propagate, hit the
  catch-all Err arm, get recorded as TransferStatus::Failed in history, and
  return a non-zero exit code.

* 2.3 — receive_to / receive_folder return TransferSummary { root_name,
  files, bytes } populated from TransferInfo.items. The CLI records those
  filenames in history instead of the --output directory path. Regression
  test receive_folder_accept_returns_summary_with_per_file_list added.

* 2.4 — P2PSession::reaccept() runs endpoint.accept() + handshake on the
  existing endpoint, so the receive CLI keeps listening on the same --port
  across peer disconnects instead of exiting. The send CLI's
  --max-reconnect-attempts no longer races against a vanished listener.
  Sender also handles a receiver that closes the connection before sending
  Ready as Error::Cancelled (graceful reject), not Error::Network.

* 3.2 (bundled) — session.send_path returns TransferSummary, so the send
  CLI's history record carries every file in a folder transfer instead of
  just the folder name. Falls back to base_name for the single-file case.

Removed Session::run_event_loop in favour of an inlined loop per surface
(receive CLI, GUI). The two surfaces have different accept-policy and UI
needs; a one-size-fits-all helper would obscure rather than help.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…flag

PR-5 of the post-review remediation plan. Addresses Wave 3 items 3.1, 3.3,
and 3.4. (3.2 already shipped in PR-4.)

* 3.1 — derive_base_name() in the new p2p-cli::util module replaces the
  `path.file_name().unwrap()` in send.rs that panicked on `p2p-transfer
  send .` (or `..`, or any path ending in a separator). The helper falls
  back to canonicalize() when file_name() returns None, and to the path's
  own display form for filesystem roots. Unit tests
  derive_base_name_handles_dot_and_dotdot and
  derive_base_name_handles_plain_file_name added.

* 3.3 — replaced the false "Transfer interrupted again. State has been
  saved." log in resume.rs with truthful "State persisted up to the most
  recent file boundary." The send_path layer persists state on every file
  completion (and on error/retry), so mid-file chunk progress lost on
  Ctrl+C is now correctly communicated to the user. The resume hint also
  echoes the --state-dir if one was supplied so copy-paste resume works.

* 3.4 — new --state-dir flag on both `send` and `resume` subcommands.
  resolve_state_file() builds the canonical "transfer_<id>.json" path
  under the chosen directory (auto-created on demand) or falls back to
  the historical CWD-relative default. Resolves the "users who cd-ed
  between failure and resume got cryptic 'State file not found'" issue
  from the review. Unit tests
  resolve_state_file_honours_explicit_state_dir,
  resolve_state_file_defaults_to_cwd_when_state_dir_absent, and the
  integration-shaped finds_state_file_via_state_dir_flag_from_unrelated_cwd
  added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…; atomic known_peers flush; cargo fmt

Addresses Copilot review #4353138790:

- protocol.rs: doc the actual `[u64 LE index | u8 flags | payload]`
  wire format and `FLAG_COMPRESSED = 0x01` semantics (not just
  `config.compression_enabled`).
- traversal/mod.rs: doc now describes symmetric-NAT path as
  `want_relay = true` + Relay outcome (not the early
  `Error::HolePunchFailed` it claimed before).
- known_peers.rs: `flush` now writes to `*.tmp`, `fsync`s, then
  renames — matches the module-doc crash-safety claim.
- workspace-wide `cargo fmt` to unblock the Linux CI check that
  fail-cancelled Windows and macOS.

The "auto_accept is silently ignored" finding was already fixed in
27c8191; the "DEFAULT_CHUNK_SIZE = 65536" finding in
p2p-core/AGENTS.md was already fixed in a follow-up — both verified
against current HEAD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds scripts/deploy.py — a stdlib-only Python 3 installer for Ubuntu 24+
with install / uninstall / clean-build subcommands. Each step checks
current state before acting (dpkg -s for apt deps, SHA256 compare for
the binary, byte compare for the systemd unit) so the service is only
restarted when something actually changed and re-runs don't drop
in-flight pairings. `--prune-build` and `clean-build` reclaim disk by
wiping <dest>/target/ after a successful deploy; a later install just
rebuilds from scratch. README and AGENTS notes updated.
* progress: reset elapsed/ETA on the 0→real total_bytes transition so
  bytes/sec doesn't include the interactive y/N wait.
* framing: map peer-close io::Error kinds (ConnectionAborted, etc.) to
  Error::Disconnected so the receive loop sees a graceful EOF instead
  of dying with "Network error: connection lost".
* cli: extract a single establish_session() helper in rendezvous.rs and
  funnel send/receive/resume through it — kills the duplicated
  `if is_rendezvous_mode { ... } else { ... }` blocks across handlers.
* receive: on disconnect, branch on the original pairing mode.
  Rendezvous-paired sessions re-pair through the rendezvous (the QUIC
  role is decided by a UUID compare post-pair, so reaccept() is
  structurally wrong half the time). Direct-mode sessions still use
  reaccept() to keep the user's --port stable.
* resume: flatten SessionParams into the Resume CLI command so resume
  works cross-NAT via --rendezvous/--code. Drops the old --to flag in
  favour of --peer (matches send/receive). No back-compat shim.
* test: tests/rendezvous_disconnect_resume_test.rs drives both fixes
  end-to-end — receiver paired via rendezvous survives the first
  sender disconnecting and accepts a second sender that arrives via
  handle_resume() with --rendezvous/--code.

Doc updates for the new resume flags in README.md and AGENTS.md.
… perf

Sweep across the workspace pruning code that no longer earns its keep
and tightening the per-chunk hot path. Bumps PROTOCOL_VERSION to 3 +
ALPN to p2pf/3 since the wire format loses the `capabilities` field
from HelloMessage and DiscoveryBeacon (no compat shim, per project policy).

Removed:
- p2p-core: orphan transfer.rs stub, unused config.rs (TCP-era rot),
  Capabilities struct + intersect/has_compression negotiation, ConnectionRole
  enum (collapsed into initiator_target Option), P2PSession::establish
  string dispatch (replaced by typed connect/accept + parse_peer_addr /
  discover_one_peer helpers), is_alive/file_size stubs and dead
  FileTransferSession fields.
- p2p-gui: styles.rs palette, 10+ never-emitted Message variants
  (StartReceive, ReceiveComplete/Failed, ListenerWaiting/Active,
  TransferStarted/InProgress/Completed/Error, ProgressUpdate, RefreshHistory,
  AdaptiveCompressionToggled) plus cascading dead state fields, fixed
  setup_receive `bytes_received = 0` TODO to thread the real cumulative
  byte count through run_gui_receive_loop.
- Cargo: unused deps (log, env_logger, anyhow at root and core, indicatif/
  local-ip-address/futures in core, bogus [package.metadata] uuid block,
  log from gui), unused profile.release-small.
- smoke: collapsed stress.sh/_v2/_v3 iterations down to the v4 version
  (renamed to stress.sh).
- docs: merged still-relevant content from .github/copilot-instructions.md
  (test pipeline, refactor/feature/bugfix workflow) into AGENTS.md and
  removed the copilot file.

Hot-path perf (transfer_file.rs):
- completed_chunks slice → HashSet for O(1) skip lookup on resume.
- chunk header packed into one 9-byte write_all (was three calls).
- uncompressed receive path no longer allocates payload.to_vec().
- ChunkWriter::finalize re-read buffer 64 KiB → 1 MiB.
- Message::TransferInfo(Box<TransferInfo>) so enum stops paying max-variant
  cost on every recv.
- Iterative VecDeque-based folder walk (was Box::pin per directory).
- Hoisted chunk_count(file_size, chunk_size) helper (3 copies → 1).
- Collapsed display_transfer_stats into a single info!-per-line flow.

Architectural:
- FolderTransferState embeds ConfigMessage directly instead of mirroring
  five scalars and rebuilding via to_config_message().

cargo build --release / test --all (119 pass, one pre-existing parallel-
port flake in rendezvous_disconnect_resume_test) / clippy --all-targets
--all-features -D warnings / fmt --check all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cdcseacave cdcseacave requested a review from Copilot May 26, 2026 08:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@cdcseacave
Copy link
Copy Markdown
Owner Author

@claude review

… test

The rendezvous resume test (and any local-dev pairing through a
loopback `rendezvousd`) was hitting Google STUN servers and getting
back a NAT-mapped UDP port that the rendezvous then paired with a
`127.0.0.1` source IP. The resulting punch target — `127.0.0.1` plus
a STUN-mapped port — never reached the local socket, so the QUIC
handshake timed out after 30s. macOS CI surfaced this every run;
Linux coverage saw it intermittently under tarpaulin.

- `establish_via_rendezvous`: when the rendezvous IP is loopback,
  skip STUN entirely and use the bound socket address as the public
  endpoint. The rendezvous still stamps the TCP source IP, so the
  paired target becomes `127.0.0.1:<actual-local-port>` — i.e. the
  peer's real socket.

- `rendezvous_disconnect_resume_test`: now that STUN no longer takes
  3–6s per side, the receiver/sender register race is exposed.
  Sleep 200ms after `spawn_receiver` so the receiver lands in the
  rendezvous waiter slot first, and 500ms before phase 2 so the
  receiver's post-disconnect re-register completes before the resume
  sender arrives.

Test runtime drops from ~30s timeout to <1s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants