bigbio · ypriverol · May 26, 2026 · May 23, 2026 · May 23, 2026 · May 24, 2026
diff --git a/BUG_REVIEW.md b/BUG_REVIEW.md
@@ -0,0 +1,72 @@
+# msgf-rust bug review (2026-05-23)
+
+Branch: `review/bug-hunt` (from `master` @ 18360a3d)
+
+Systematic review of the Rust MS-GF+ port: static analysis of critical paths,
+full `cargo test --release --workspace`, and targeted code reading.
+
+## Fixed in this branch
+
+| ID | Severity | Location | Issue | Fix |
+|---|---|---|---|---|
+| B1 | **Critical** | `msgf-rust.rs` `send_chunks` | Bench cap (`--max-spectra N`) truncated the final partial chunk to zero when `total == N` (e.g. N=100 with chunk size 5000 → empty output). | Removed erroneous tail `truncate` block; loop already stops at cap. |
+| B2 | **High** | `msgf-rust.rs` param routing | Activation auto-detect was gated on `instrument == low-res`, so `--fragmentation auto --instrument QExactive` on mzML skipped peek and resolved to CID params for HCD data. | Gate auto-route on `fragmentation == auto` + mzML extension only. |
+| B3 | **High** | `msgf-rust.rs` TSV write | `write_tsv(..., is_mgf=true)` always emitted MGF layout (extra `Title` column) even for mzML inputs. | Pass `!is_mzml`. |
+| B4 | **High** | `match_engine.rs` GF | SpecE GF graph used `start_offset == 0` for protein N-term instead of `cand.is_protein_n_term`, breaking Met-cleaved N-termini at offset 1. | Use `cand.is_protein_n_term` / `is_protein_c_term`. |
+| B5 | **Medium** | `tsv.rs` | `IsotopeError` column hardcoded to 0 while PIN writes `psm.isotope_offset`. | Thread isotope offset from PSM. |
+| B6 | **Medium** | `msgf-rust.rs` CLI | Inverted `--charge-min/--charge-max` or isotope ranges produced empty ranges with no error. | Validate at startup and return clear error. |
+| B7 | **High** | `match_engine.rs` dedup | Dedup used bare sequence + pin score; merged mod variants incorrectly. | Mod-aware pepSeq key + `rank_score`. |
+| B8 | **Medium** | `match_engine.rs` dedup | HashMap survivor order was nondeterministic. | `BTreeMap` + best-`rank_score` survivor rule. |
+
+## Open — not fixed (documented for follow-up)
+
+| ID | Severity | Location | Issue |
+|---|---|---|---|
+| B9 | **Low** | `sa_walk.rs` | Test-only SA walk helper does not enforce `max_missed_cleavages`; production search uses `candidate_gen::enumerate_candidates`, which does. |
+| B10 | **High** | `mzml.rs` `Iterator::next` | First per-spectrum parse error sets `done=true` and aborts the entire file; remaining spectra are silently skipped. |
+| B11 | **Low** | `sa_walk.rs` Met pass | Dedupes Met-cleaved peptides on residue bytes only, collapsing distinct C-terminal contexts. |
+
+## Known test failures (pre-existing, CI-skipped)
+
+These fail on `master` without the 7 CI skip flags; tracked as parity/min_peaks regressions:
+
+- `match_engine_smoke::known_peptide_appears_in_top_n`
+- `match_engine_smoke::charge_missing_spectrum_uses_per_charge_scored_spec`
+- `match_engine_smoke::spectrum_without_charge_tries_charge_range`
+- Maven fixture loads, thread-determinism test (see `.github/workflows/ci.yml`)
+
+## Verification
+
+```bash
+cargo test --release --workspace -- \
+  --skip charge_missing_spectrum_uses_per_charge_scored_spec \
+  --skip spectrum_without_charge_tries_charge_range \
+  --skip known_peptide_appears_in_top_n \
+  --skip read_bsa_canno_text_format \
+  --skip read_tryp_pig_bov_revcat_csarr_cnlcp \
+  --skip tryp_pig_bov_revcat_full_set_loads \
+  --skip match_spectra_output_invariant_across_thread_counts
+```
+
+## Performance (dedup pass)
+
+- PepSeq dedup keys use integer mod units + `Arc` cache per candidate (avoids repeated string formatting).
+- Per-charge `TopNQueue` map uses `FxHashMap<u8, _>` (typically 1–3 charges per spectrum).
+
+## Documentation review (2026-05-24)
+
+Fixes applied on this branch:
+
+| Issue | Location | Fix |
+|---|---|---|
+| PIN column count said "28" | `README.md` | Corrected to 36 (default charge 2–3) + EdgeScore note |
+| Auto-detect described "first spectrum" only | `README.md` | First 64 MS2 histogram; `--instrument` does not gate peek |
+| Auto-detect required `--instrument low-res` | `DOCS.md` §4 | Matches code: only `--fragmentation auto` + mzML |
+| TSV `IsotopeError` documented as always 0 | `DOCS.md` §3b | Updated after B5 fix |
+| Broken `known-divergences.md` links | `README.md`, `DOCS.md` §8d | Legacy file removed in iter39; point to §8d / tests |
+| Inverted charge/isotope ranges undocumented | `DOCS.md` §1 | Startup validation documented |
+
+**Still stale (not fixed here):**
+
+- `benchmark/ci/README.md` — references Java Maven workflow; no Rust benchmark workflow in `.github/workflows/` yet.
+- `.claude/CLAUDE.md` — Java-tree context; accurate on `java-legacy` branch only.
diff --git a/README.md b/README.md
@@ -66,7 +66,7 @@ msgf-rust \
 
 This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to `out.pin`, and prints per-phase timings to stderr. Feed `out.pin` directly into Percolator (Docker or native) to compute q-values.
 
-A row in `out.pin` is one peptide–spectrum match with 28 columns: `SpecId`, `Label`, `ScanNr`, charge one-hot encoding, then features like `RawScore`, `lnSpecEValue`, `DeNovoScore`, ion-current ratios, peptide-length stats, etc. Full column reference: `DOCS.md` §3a.
+A row in `out.pin` is one peptide–spectrum match. With the default charge range (2–3), each row has **36 tab-separated columns**: 35 Java-parity Percolator features plus Rust-only `EdgeScore` (inserted before `Peptide`). Charge one-hot columns scale with `[--charge-min, --charge-max]`. Full column reference: `DOCS.md` §3a.
 
 ## Common workflows
 
@@ -131,11 +131,11 @@ Run `msgf-rust --help` for the auto-generated help with full descriptions.
 
 ## Auto-detection
 
-For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled `.param` file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from `(fragmentation, instrument, protocol)`. MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit `--fragmentation` / `--instrument` flags). Full resolution table: `DOCS.md` §4.
+For mzML inputs with `--fragmentation auto` (the default), msgf-rust peeks the first 64 MS2 spectra, histograms activation methods and analyzer types, and selects a bundled `.param` file from the dominant values. The `--instrument` CLI flag is **not** required for this path — instrument class is read from the mzML when possible. `--protocol` from the CLI is still applied when resolving the bundled model. MGF files have no activation metadata, so they use flag-based resolution (defaulting to `HCD_QExactive_Tryp.param`). Full resolution table: `DOCS.md` §4.
 
 ## Parity vs Java MS-GF+
 
-PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from the deconvolution-implementation difference (`known-divergences.md` item #3, kept on the development branch). None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.
+PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from deconvolution-implementation differences. None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.
 
 ## Citation
 

diff --git a/benchmark/ci/README.md b/benchmark/ci/README.md
@@ -1,6 +1,8 @@
 # CI benchmark (PXD001819)
 
-- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`)
+> **Note:** This scaffold targets the Java MS-GF+ tree (`mvn`, mzIdentML metrics). The Rust port (`msgf-rust`) uses `.github/workflows/ci.yml` for tests but does not yet wire this benchmark harness. See [`benchmark/README.md`](../README.md) for scope.
+
+- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`) — Java branch only
 - **Run locally:** `mvn -B package -DskipTests && bash benchmark/ci/PXD001819/run_ci.sh`
 - **Compare:** `python3 benchmark/ci/PXD001819/compare_metrics.py benchmark/results/PXD001819/ci/ci_metrics.txt benchmark/ci/PXD001819/baseline.tsv`
 - **Self-test:** `python3 -m unittest benchmark.ci.PXD001819.test_compare_metrics`

diff --git a/crates/search/src/match_engine.rs b/crates/search/src/match_engine.rs
@@ -3,6 +3,7 @@
 use std::collections::{BTreeMap, HashMap};
 use std::hash::Hasher;
 use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::Arc;
 
 // GF failure-mode diagnostics (2026-05-19). Module-level atomics
 // incremented per-bin from compute_spec_e_values_for_spectrum and
@@ -716,13 +717,9 @@ fn compute_spec_e_values_for_spectrum(
         let mut any_c = false;
         for psm in queue.iter_psms() {
             let cand = &candidates[psm.primary_candidate_idx() as usize];
-            if let Some(prot) = search_index.protein_at(cand.protein_index) {
-                let start = cand.start_offset_in_protein;
-                let pep_len = cand.peptide.length();
-                if start == 0 { any_n = true; }
-                if start + pep_len >= prot.sequence.len() { any_c = true; }
-                if any_n && any_c { break; }
-            }
+            if cand.is_protein_n_term { any_n = true; }
+            if cand.is_protein_c_term { any_c = true; }
+            if any_n && any_c { break; }
         }
         (any_n, any_c)
     };
@@ -1402,51 +1399,200 @@ mod feature_tests {
     }
 }
 
-/// Pre-merge dedup pass (R-2.2): collapse PSMs that share the same
-/// (peptide_residue, rounded_score) key into a single entry, aggregating
-/// their `candidate_idxs` into a unified Vec. Mirrors Java's
-/// `DBScanner.java:719-733` `pepSeqMap` dedup.
-///
-/// Called by the per-spectrum loop after the per-candidate scoring loop,
-/// before per-charge GF compute (so SpecE is computed on the deduped set).
-///
-/// Inputs:
-/// - `psms`: drained from a per-charge `TopNQueue` via `drain_into_vec`
-/// - `candidates`: the search's enumerated candidate slice; used to resolve
-///   each PSM's peptide residue sequence for the dedup key
-///
-/// Returns: deduped `Vec<PsmMatch>`. The caller re-pushes these into the
-/// per-charge queue via `queue.push()` for each entry.
+/// Pre-merge dedup pass (R-2.2): collapse PSMs sharing the same Java
+/// `(pepSeq, score)` key before per-charge GF compute.
 pub(crate) fn dedup_pepseq_score(
     psms: Vec<PsmMatch>,
     candidates: &[Candidate],
 ) -> Vec<PsmMatch> {
-    use std::collections::HashMap;
+    use std::collections::btree_map::Entry;
+    use std::collections::BTreeMap;
 
-    // Key: (peptide_residue_bytes, rounded_score_i32)
-    // The residue sequence is the unmodified bare AA string, matching Java's
-    // `m.getPepSeq()` used as the dedup key (DBScanner.java:721).
-    let mut groups: HashMap<(Vec<u8>, i32), PsmMatch> = HashMap::new();
+    let mut pep_key_cache: FxHashMap<u32, Arc<PepDedupKey>> = FxHashMap::default();
+    let mut groups: BTreeMap<DedupMapKey, PsmMatch> = BTreeMap::new();
 
     for psm in psms {
-        let cand = &candidates[psm.primary_candidate_idx() as usize];
-        let pep_residues: Vec<u8> = cand.peptide.residues.iter().map(|aa| aa.residue).collect();
-        let score_rounded = psm.score.round() as i32;
-        let key = (pep_residues, score_rounded);
-
-        groups
-            .entry(key)
-            .and_modify(|existing| {
-                // Aggregate this PSM's indices into the surviving entry.
-                // Avoid duplicates if the same idx somehow appears twice.
-                for &idx in &psm.candidate_idxs {
-                    if !existing.candidate_idxs.contains(&idx) {
-                        existing.candidate_idxs.push(idx);
-                    }
+        let primary = psm.primary_candidate_idx();
+        let pep_key = pep_key_cache
+            .entry(primary)
+            .or_insert_with(|| Arc::new(PepDedupKey::from_peptide(&candidates[primary as usize].peptide)))
+            .clone();
+        let key = DedupMapKey {
+            pep: pep_key,
+            score: psm.rank_score.round() as i32,
+        };
+
+        match groups.entry(key) {
+            Entry::Vacant(slot) => {
+                slot.insert(psm);
+            }
+            Entry::Occupied(mut slot) => {
+                let existing = slot.get_mut();
+                merge_unique_candidate_idxs(&mut existing.candidate_idxs, &psm.candidate_idxs);
+                if psm.rank_score > existing.rank_score {
+                    let merged_idxs = std::mem::take(&mut existing.candidate_idxs);
+                    let mut survivor = psm;
+                    merge_unique_candidate_idxs(&mut survivor.candidate_idxs, &merged_idxs);
+                    *existing = survivor;
                 }
-            })
-            .or_insert(psm);
+            }
+        }
     }
 
-    groups.into_values().collect()
+    let mut out = Vec::with_capacity(groups.len());
+    out.extend(groups.into_values());
+    out
+}
+
+/// Mod-aware dedup key: bare residues plus per-position mod mass (1e-5 Da units).
+/// Matches Java pepSeq semantics without string formatting on the hot path.
+#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
+struct PepDedupKey {
+    residues: Vec<u8>,
+    mod_units: Vec<i32>,
+}
+
+impl PepDedupKey {
+    fn from_peptide(peptide: &model::Peptide) -> Self {
+        let len = peptide.residues.len();
+        let mut residues = Vec::with_capacity(len);
+        let mut mod_units = Vec::with_capacity(len);
+        for aa in &peptide.residues {
+            residues.push(aa.residue);
+            mod_units.push(
+                aa.mod_
+                    .as_ref()
+                    .map(|m| (m.mass_delta * 100_000.0).round() as i32)
+                    .unwrap_or(0),
+            );
+        }
+        Self { residues, mod_units }
+    }
+}
+
+#[derive(Clone, PartialEq, Eq)]
+struct DedupMapKey {
+    pep: Arc<PepDedupKey>,
+    score: i32,
+}
+
+impl PartialOrd for DedupMapKey {
+    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl Ord for DedupMapKey {
+    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
+        self.pep
+            .residues
+            .cmp(&other.pep.residues)
+            .then_with(|| self.pep.mod_units.cmp(&other.pep.mod_units))
+            .then(self.score.cmp(&other.score))
+    }
+}
+
+fn merge_unique_candidate_idxs(into: &mut Vec<u32>, from: &[u32]) {
+    for &idx in from {
+        if !into.contains(&idx) {
+            into.push(idx);
+        }
+    }
+}
+
+#[cfg(test)]
+mod dedup_tests {
+    use super::*;
+    use std::sync::Arc;
+    use model::amino_acid::AminoAcid;
+    use model::modification::{ModLocation, Modification};
+    use model::peptide::Peptide;
+    use model::ResidueSpec;
+    use crate::psm::PsmMatch;
+
+    fn seq_peptide(bytes: &[u8]) -> Peptide {
+        let residues: Vec<AminoAcid> = bytes
+            .iter()
+            .filter_map(|&b| AminoAcid::standard(b))
+            .collect();
+        Peptide::new(residues, b'R', b'K')
+    }
+
+    fn cand_with_peptide(peptide: Peptide) -> Candidate {
+        Candidate {
+            peptide,
+            protein_index: 0,
+            start_offset_in_protein: 0,
+            is_decoy: false,
+            is_protein_n_term: false,
+            is_protein_c_term: false,
+        }
+    }
+
+    fn psm(primary: u32, rank: f32, pin: f32) -> PsmMatch {
+        PsmMatch {
+            spectrum_idx: 0,
+            candidate_idxs: vec![primary],
+            charge_used: 2,
+            mass_error_ppm: 0.0,
+            score: pin,
+            rank_score: rank,
+            edge_score: (rank - pin) as i32,
+            spec_e_value: 1.0,
+            de_novo_score: 0,
+            activation_method: None,
+            e_value: 1.0,
+            features: Default::default(),
+            isotope_offset: 0,
+        }
+    }
+
+    #[test]
+    fn dedup_uses_rank_score_not_pin_score() {
+        let pep = seq_peptide(b"PEPTK");
+        let cands = vec![cand_with_peptide(pep.clone())];
+        let psms = vec![
+            psm(0, 100.4, 99.6),
+            psm(0, 120.0, 99.6),
+        ];
+        let out = dedup_pepseq_score(psms, &cands);
+        assert_eq!(out.len(), 2, "different rank_score keys must not merge");
+    }
+
+    #[test]
+    fn dedup_distinguishes_mod_state() {
+        let mut ox = seq_peptide(b"PEPMK");
+        ox.residues[3].mod_ = Some(Arc::new(Modification {
+            name: "Ox".into(),
+            mass_delta: 15.99491,
+            residue: ResidueSpec::Specific(b'M'),
+            location: ModLocation::Anywhere,
+            fixed: false,
+            accession: None,
+        }));
+        let cands = vec![
+            cand_with_peptide(seq_peptide(b"PEPMK")),
+            cand_with_peptide(ox),
+        ];
+        let psms = vec![
+            psm(0, 50.0, 50.0),
+            psm(1, 50.0, 50.0),
+        ];
+        let out = dedup_pepseq_score(psms, &cands);
+        assert_eq!(out.len(), 2, "mod-aware pepSeq keys must differ");
+    }
+
+    #[test]
+    fn dedup_keeps_highest_rank_score_survivor() {
+        let pep = seq_peptide(b"PEPTK");
+        let cands = vec![cand_with_peptide(pep)];
+        // Same rounded score bucket (60) but different float rank_score — merge to best.
+        let psms = vec![
+            psm(0, 59.6, 50.0),
+            psm(0, 60.4, 50.0),
+        ];
+        let out = dedup_pepseq_score(psms, &cands);
+        assert_eq!(out.len(), 1);
+        assert!((out[0].rank_score - 60.4).abs() < f32::EPSILON);
+    }
 }