Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions BUG_REVIEW.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# msgf-rust bug review (2026-05-23)

Branch: `review/bug-hunt` (from `master` @ 18360a3d)

Systematic review of the Rust MS-GF+ port: static analysis of critical paths,
full `cargo test --release --workspace`, and targeted code reading.

## Fixed in this branch

| ID | Severity | Location | Issue | Fix |
|---|---|---|---|---|
| B1 | **Critical** | `msgf-rust.rs` `send_chunks` | Bench cap (`--max-spectra N`) truncated the final partial chunk to zero when `total == N` (e.g. N=100 with chunk size 5000 → empty output). | Removed erroneous tail `truncate` block; loop already stops at cap. |
| B2 | **High** | `msgf-rust.rs` param routing | Activation auto-detect was gated on `instrument == low-res`, so `--fragmentation auto --instrument QExactive` on mzML skipped peek and resolved to CID params for HCD data. | Gate auto-route on `fragmentation == auto` + mzML extension only. |
| B3 | **High** | `msgf-rust.rs` TSV write | `write_tsv(..., is_mgf=true)` always emitted MGF layout (extra `Title` column) even for mzML inputs. | Pass `!is_mzml`. |
| B4 | **High** | `match_engine.rs` GF | SpecE GF graph used `start_offset == 0` for protein N-term instead of `cand.is_protein_n_term`, breaking Met-cleaved N-termini at offset 1. | Use `cand.is_protein_n_term` / `is_protein_c_term`. |
| B5 | **Medium** | `tsv.rs` | `IsotopeError` column hardcoded to 0 while PIN writes `psm.isotope_offset`. | Thread isotope offset from PSM. |
| B6 | **Medium** | `msgf-rust.rs` CLI | Inverted `--charge-min/--charge-max` or isotope ranges produced empty ranges with no error. | Validate at startup and return clear error. |
| B7 | **High** | `match_engine.rs` dedup | Dedup used bare sequence + pin score; merged mod variants incorrectly. | Mod-aware pepSeq key + `rank_score`. |
| B8 | **Medium** | `match_engine.rs` dedup | HashMap survivor order was nondeterministic. | `BTreeMap` + best-`rank_score` survivor rule. |

## Open — not fixed (documented for follow-up)

| ID | Severity | Location | Issue |
|---|---|---|---|
| B9 | **Low** | `sa_walk.rs` | Test-only SA walk helper does not enforce `max_missed_cleavages`; production search uses `candidate_gen::enumerate_candidates`, which does. |
| B10 | **High** | `mzml.rs` `Iterator::next` | First per-spectrum parse error sets `done=true` and aborts the entire file; remaining spectra are silently skipped. |
| B11 | **Low** | `sa_walk.rs` Met pass | Dedupes Met-cleaved peptides on residue bytes only, collapsing distinct C-terminal contexts. |

## Known test failures (pre-existing, CI-skipped)

These fail on `master` without the 7 CI skip flags; tracked as parity/min_peaks regressions:

- `match_engine_smoke::known_peptide_appears_in_top_n`
- `match_engine_smoke::charge_missing_spectrum_uses_per_charge_scored_spec`
- `match_engine_smoke::spectrum_without_charge_tries_charge_range`
- Maven fixture loads, thread-determinism test (see `.github/workflows/ci.yml`)

## Verification

```bash
cargo test --release --workspace -- \
--skip charge_missing_spectrum_uses_per_charge_scored_spec \
--skip spectrum_without_charge_tries_charge_range \
--skip known_peptide_appears_in_top_n \
--skip read_bsa_canno_text_format \
--skip read_tryp_pig_bov_revcat_csarr_cnlcp \
--skip tryp_pig_bov_revcat_full_set_loads \
--skip match_spectra_output_invariant_across_thread_counts
```

## Performance (dedup pass)

- PepSeq dedup keys use integer mod units + `Arc` cache per candidate (avoids repeated string formatting).
- Per-charge `TopNQueue` map uses `FxHashMap<u8, _>` (typically 1–3 charges per spectrum).

## Documentation review (2026-05-24)

Fixes applied on this branch:

| Issue | Location | Fix |
|---|---|---|
| PIN column count said "28" | `README.md` | Corrected to 36 (default charge 2–3) + EdgeScore note |
| Auto-detect described "first spectrum" only | `README.md` | First 64 MS2 histogram; `--instrument` does not gate peek |
| Auto-detect required `--instrument low-res` | `DOCS.md` §4 | Matches code: only `--fragmentation auto` + mzML |
| TSV `IsotopeError` documented as always 0 | `DOCS.md` §3b | Updated after B5 fix |
| Broken `known-divergences.md` links | `README.md`, `DOCS.md` §8d | Legacy file removed in iter39; point to §8d / tests |
| Inverted charge/isotope ranges undocumented | `DOCS.md` §1 | Startup validation documented |

**Still stale (not fixed here):**

- `benchmark/ci/README.md` — references Java Maven workflow; no Rust benchmark workflow in `.github/workflows/` yet.
- `.claude/CLAUDE.md` — Java-tree context; accurate on `java-legacy` branch only.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ msgf-rust \

This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to `out.pin`, and prints per-phase timings to stderr. Feed `out.pin` directly into Percolator (Docker or native) to compute q-values.

A row in `out.pin` is one peptide–spectrum match with 28 columns: `SpecId`, `Label`, `ScanNr`, charge one-hot encoding, then features like `RawScore`, `lnSpecEValue`, `DeNovoScore`, ion-current ratios, peptide-length stats, etc. Full column reference: `DOCS.md` §3a.
A row in `out.pin` is one peptide–spectrum match. With the default charge range (2–3), each row has **36 tab-separated columns**: 35 Java-parity Percolator features plus Rust-only `EdgeScore` (inserted before `Peptide`). Charge one-hot columns scale with `[--charge-min, --charge-max]`. Full column reference: `DOCS.md` §3a.

## Common workflows

Expand Down Expand Up @@ -131,11 +131,11 @@ Run `msgf-rust --help` for the auto-generated help with full descriptions.

## Auto-detection

For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled `.param` file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from `(fragmentation, instrument, protocol)`. MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit `--fragmentation` / `--instrument` flags). Full resolution table: `DOCS.md` §4.
For mzML inputs with `--fragmentation auto` (the default), msgf-rust peeks the first 64 MS2 spectra, histograms activation methods and analyzer types, and selects a bundled `.param` file from the dominant values. The `--instrument` CLI flag is **not** required for this path — instrument class is read from the mzML when possible. `--protocol` from the CLI is still applied when resolving the bundled model. MGF files have no activation metadata, so they use flag-based resolution (defaulting to `HCD_QExactive_Tryp.param`). Full resolution table: `DOCS.md` §4.

## Parity vs Java MS-GF+

PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from the deconvolution-implementation difference (`known-divergences.md` item #3, kept on the development branch). None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.
PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from deconvolution-implementation differences. None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.

## Citation

Expand Down
4 changes: 3 additions & 1 deletion benchmark/ci/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# CI benchmark (PXD001819)

- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`)
> **Note:** This scaffold targets the Java MS-GF+ tree (`mvn`, mzIdentML metrics). The Rust port (`msgf-rust`) uses `.github/workflows/ci.yml` for tests but does not yet wire this benchmark harness. See [`benchmark/README.md`](../README.md) for scope.

- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`) — Java branch only
- **Run locally:** `mvn -B package -DskipTests && bash benchmark/ci/PXD001819/run_ci.sh`
- **Compare:** `python3 benchmark/ci/PXD001819/compare_metrics.py benchmark/results/PXD001819/ci/ci_metrics.txt benchmark/ci/PXD001819/baseline.tsv`
- **Self-test:** `python3 -m unittest benchmark.ci.PXD001819.test_compare_metrics`
Expand Down
234 changes: 190 additions & 44 deletions crates/search/src/match_engine.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
use std::collections::{BTreeMap, HashMap};
use std::hash::Hasher;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// GF failure-mode diagnostics (2026-05-19). Module-level atomics
// incremented per-bin from compute_spec_e_values_for_spectrum and
Expand Down Expand Up @@ -716,13 +717,9 @@ fn compute_spec_e_values_for_spectrum(
let mut any_c = false;
for psm in queue.iter_psms() {
let cand = &candidates[psm.primary_candidate_idx() as usize];
if let Some(prot) = search_index.protein_at(cand.protein_index) {
let start = cand.start_offset_in_protein;
let pep_len = cand.peptide.length();
if start == 0 { any_n = true; }
if start + pep_len >= prot.sequence.len() { any_c = true; }
if any_n && any_c { break; }
}
if cand.is_protein_n_term { any_n = true; }
if cand.is_protein_c_term { any_c = true; }
if any_n && any_c { break; }
}
(any_n, any_c)
};
Expand Down Expand Up @@ -1402,51 +1399,200 @@ mod feature_tests {
}
}

/// Pre-merge dedup pass (R-2.2): collapse PSMs that share the same
/// (peptide_residue, rounded_score) key into a single entry, aggregating
/// their `candidate_idxs` into a unified Vec. Mirrors Java's
/// `DBScanner.java:719-733` `pepSeqMap` dedup.
///
/// Called by the per-spectrum loop after the per-candidate scoring loop,
/// before per-charge GF compute (so SpecE is computed on the deduped set).
///
/// Inputs:
/// - `psms`: drained from a per-charge `TopNQueue` via `drain_into_vec`
/// - `candidates`: the search's enumerated candidate slice; used to resolve
/// each PSM's peptide residue sequence for the dedup key
///
/// Returns: deduped `Vec<PsmMatch>`. The caller re-pushes these into the
/// per-charge queue via `queue.push()` for each entry.
/// Pre-merge dedup pass (R-2.2): collapse PSMs sharing the same Java
/// `(pepSeq, score)` key before per-charge GF compute.
pub(crate) fn dedup_pepseq_score(
psms: Vec<PsmMatch>,
candidates: &[Candidate],
) -> Vec<PsmMatch> {
use std::collections::HashMap;
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

// Key: (peptide_residue_bytes, rounded_score_i32)
// The residue sequence is the unmodified bare AA string, matching Java's
// `m.getPepSeq()` used as the dedup key (DBScanner.java:721).
let mut groups: HashMap<(Vec<u8>, i32), PsmMatch> = HashMap::new();
let mut pep_key_cache: FxHashMap<u32, Arc<PepDedupKey>> = FxHashMap::default();
let mut groups: BTreeMap<DedupMapKey, PsmMatch> = BTreeMap::new();

for psm in psms {
let cand = &candidates[psm.primary_candidate_idx() as usize];
let pep_residues: Vec<u8> = cand.peptide.residues.iter().map(|aa| aa.residue).collect();
let score_rounded = psm.score.round() as i32;
let key = (pep_residues, score_rounded);

groups
.entry(key)
.and_modify(|existing| {
// Aggregate this PSM's indices into the surviving entry.
// Avoid duplicates if the same idx somehow appears twice.
for &idx in &psm.candidate_idxs {
if !existing.candidate_idxs.contains(&idx) {
existing.candidate_idxs.push(idx);
}
let primary = psm.primary_candidate_idx();
let pep_key = pep_key_cache
.entry(primary)
.or_insert_with(|| Arc::new(PepDedupKey::from_peptide(&candidates[primary as usize].peptide)))
.clone();
let key = DedupMapKey {
pep: pep_key,
score: psm.rank_score.round() as i32,
};

match groups.entry(key) {
Entry::Vacant(slot) => {
slot.insert(psm);
}
Entry::Occupied(mut slot) => {
let existing = slot.get_mut();
merge_unique_candidate_idxs(&mut existing.candidate_idxs, &psm.candidate_idxs);
if psm.rank_score > existing.rank_score {
let merged_idxs = std::mem::take(&mut existing.candidate_idxs);
let mut survivor = psm;
merge_unique_candidate_idxs(&mut survivor.candidate_idxs, &merged_idxs);
*existing = survivor;
}
})
.or_insert(psm);
}
}
}

groups.into_values().collect()
let mut out = Vec::with_capacity(groups.len());
out.extend(groups.into_values());
out
}

/// Mod-aware dedup key: bare residues plus per-position mod mass (1e-5 Da units).
/// Matches Java pepSeq semantics without string formatting on the hot path.
#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct PepDedupKey {
residues: Vec<u8>,
mod_units: Vec<i32>,
}

impl PepDedupKey {
fn from_peptide(peptide: &model::Peptide) -> Self {
let len = peptide.residues.len();
let mut residues = Vec::with_capacity(len);
let mut mod_units = Vec::with_capacity(len);
for aa in &peptide.residues {
residues.push(aa.residue);
mod_units.push(
aa.mod_
.as_ref()
.map(|m| (m.mass_delta * 100_000.0).round() as i32)
.unwrap_or(0),
);
}
Self { residues, mod_units }
}
}

#[derive(Clone, PartialEq, Eq)]
struct DedupMapKey {
pep: Arc<PepDedupKey>,
score: i32,
}

impl PartialOrd for DedupMapKey {
fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
Some(self.cmp(other))
}
}

impl Ord for DedupMapKey {
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
self.pep
.residues
.cmp(&other.pep.residues)
.then_with(|| self.pep.mod_units.cmp(&other.pep.mod_units))
.then(self.score.cmp(&other.score))
}
}

fn merge_unique_candidate_idxs(into: &mut Vec<u32>, from: &[u32]) {
for &idx in from {
if !into.contains(&idx) {
into.push(idx);
}
}
}

#[cfg(test)]
mod dedup_tests {
use super::*;
use std::sync::Arc;
use model::amino_acid::AminoAcid;
use model::modification::{ModLocation, Modification};
use model::peptide::Peptide;
use model::ResidueSpec;
use crate::psm::PsmMatch;

fn seq_peptide(bytes: &[u8]) -> Peptide {
let residues: Vec<AminoAcid> = bytes
.iter()
.filter_map(|&b| AminoAcid::standard(b))
.collect();
Peptide::new(residues, b'R', b'K')
}

fn cand_with_peptide(peptide: Peptide) -> Candidate {
Candidate {
peptide,
protein_index: 0,
start_offset_in_protein: 0,
is_decoy: false,
is_protein_n_term: false,
is_protein_c_term: false,
}
}

fn psm(primary: u32, rank: f32, pin: f32) -> PsmMatch {
PsmMatch {
spectrum_idx: 0,
candidate_idxs: vec![primary],
charge_used: 2,
mass_error_ppm: 0.0,
score: pin,
rank_score: rank,
edge_score: (rank - pin) as i32,
spec_e_value: 1.0,
de_novo_score: 0,
activation_method: None,
e_value: 1.0,
features: Default::default(),
isotope_offset: 0,
}
}

#[test]
fn dedup_uses_rank_score_not_pin_score() {
let pep = seq_peptide(b"PEPTK");
let cands = vec![cand_with_peptide(pep.clone())];
let psms = vec![
psm(0, 100.4, 99.6),
psm(0, 120.0, 99.6),
];
let out = dedup_pepseq_score(psms, &cands);
assert_eq!(out.len(), 2, "different rank_score keys must not merge");
}

#[test]
fn dedup_distinguishes_mod_state() {
let mut ox = seq_peptide(b"PEPMK");
ox.residues[3].mod_ = Some(Arc::new(Modification {
name: "Ox".into(),
mass_delta: 15.99491,
residue: ResidueSpec::Specific(b'M'),
location: ModLocation::Anywhere,
fixed: false,
accession: None,
}));
let cands = vec![
cand_with_peptide(seq_peptide(b"PEPMK")),
cand_with_peptide(ox),
];
let psms = vec![
psm(0, 50.0, 50.0),
psm(1, 50.0, 50.0),
];
let out = dedup_pepseq_score(psms, &cands);
assert_eq!(out.len(), 2, "mod-aware pepSeq keys must differ");
}

#[test]
fn dedup_keeps_highest_rank_score_survivor() {
let pep = seq_peptide(b"PEPTK");
let cands = vec![cand_with_peptide(pep)];
// Same rounded score bucket (60) but different float rank_score — merge to best.
let psms = vec![
psm(0, 59.6, 50.0),
psm(0, 60.4, 50.0),
];
let out = dedup_pepseq_score(psms, &cands);
assert_eq!(out.len(), 1);
assert!((out[0].rank_score - 60.4).abs() < f32::EPSILON);
}
}
Loading
Loading