Summary
The reverse-translation worker job (reverse_translate_variants_for_score_set in mavedb-api) records benign, non-reverse-translatable variants — deletions, insertions, dup, delins, frameshift, extensions — as FAILED cross-level-translation annotations. These variants have no synonymous equivalence class by definition, so producing no candidates is the correct outcome, not an error. Reclassify them as SKIPPED by inspecting the protein consequence's edit type up front, rather than reacting to the translation engine's error strings.
Problem
Reverse translation only has meaning for substitutions (missense / synonymous / nonsense), where multiple codons encode the same protein consequence. For any other consequence there is nothing to enumerate:
- A single-residue deletion (p.Xxxdel) is an accepted protein form, but it enumerates to nothing → the engine returns no candidates → surfaces as TranslationError("Reverse translation returned no candidate DNA variants").
- A multi-residue / length-changing consequence (e.g. a net length-changing delins or frameshift) is not a form the engine supports → surfaces as TranslationError("Could not parse HGVS protein string. Expected forms like 'p.Arg175His', 'p.Arg175del', or 'NP_000537.3:p.Arg175His'.").
Both currently increment the failed counter and write AnnotationStatus.FAILED annotations. This is observed on genomic-source score sets whose assay-level HGVS are genomic deletions/delins (e.g. saturation-genome-editing-style data), where many input rows are indels. Concrete observed inputs:
- NC_000016.10:g.67621590_67621592del → "Reverse translation returned no candidate DNA variants"
- NC_000016.10:g.67621589_67621591del → "Could not parse HGVS protein string. …"
- NC_000016.10:g.67637685_67637689delinsCA → length-changing delins → "Could not parse HGVS protein string. …"
Two consequences:
- Status counts are misleading — a large share of a perfectly healthy run is reported as failures.
- There is a real tail case behind the job's own guard, if translated == 0 and failed > 0: return JobExecutionOutcome.failed(...). A score set composed entirely of non-substitution variants would fail the whole job purely due to misclassification, even though nothing is wrong. No such all-indel score sets exist today; this is noted for awareness and can be handled case-by-case if one appears.
Classification must key off the protein consequence, not the input variant's DNA spelling: a delins whose net effect is a single amino-acid swap collapses to a simple missense and is reverse-translatable, while a delins that nets an in-frame length change is not. The deciding signal is the consequence's edit type.
Proposed behavior
- After the g→c→p (or c→p) collapse, classify the protein consequence's edit type before dispatching to the reverse-translation engine.
- If the consequence is in the substitution family (missense, synonymous, nonsense), reverse-translate as today.
- If it is anything else (del / ins / dup / delins / fs / ext / any ranged form), do not call the engine; emit a typed "no equivalence class" outcome that the worker maps to AnnotationStatus.SKIPPED, with a clear reason such as "non-substitution protein consequence has no synonymous equivalence class".
- Because only substitution consequences reach the engine, any engine error returned afterward is unambiguously a genuine failure and should remain FAILED.
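The flow above can be sketched roughly as follows. This is a minimal illustration, not the mavedb-api implementation: the names Translated, Skipped, Failed, is_substitution and reverse_translate are hypothetical, the engine is passed in as a plain callable, and the classifier is the regex-based fallback rather than the parser-backed approach.

```python
import re
from dataclasses import dataclass
from typing import Callable, Union

# Hypothetical fallback classifier: matches only single-residue substitutions
# such as p.Arg175His, p.Arg175=, p.Arg175Ter, optionally with an accession
# prefix or predicted-consequence parentheses. It is deliberately lenient
# about unbalanced parentheses.
_SUBSTITUTION = re.compile(
    r"^(?:[^:\s]+:)?p\.\(?"       # optional accession, then "p." and "("
    r"(?:[A-Z][a-z]{2}|\*)"       # reference residue (three-letter or *)
    r"\d+"                        # position
    r"(?:[A-Z][a-z]{2}|\*|=)"     # alternate residue, stop, or "="
    r"\)?$"
)

NO_EQUIVALENCE_CLASS = (
    "non-substitution protein consequence has no synonymous equivalence class"
)


@dataclass(frozen=True)
class Translated:
    candidates: list


@dataclass(frozen=True)
class Skipped:  # hypothetical typed third outcome
    consequence: str
    reason: str


@dataclass(frozen=True)
class Failed:
    consequence: str
    message: str


Outcome = Union[Translated, Skipped, Failed]


def is_substitution(p_hgvs: str) -> bool:
    return _SUBSTITUTION.match(p_hgvs) is not None


def reverse_translate(p_hgvs: str, engine: Callable[[str], list]) -> Outcome:
    # Classify first, so non-substitutions never reach the engine.
    if not is_substitution(p_hgvs):
        return Skipped(p_hgvs, NO_EQUIVALENCE_CLASS)
    try:
        candidates = engine(p_hgvs)
    except Exception as exc:  # any engine error is now a genuine failure
        return Failed(p_hgvs, str(exc))
    if not candidates:
        return Failed(p_hgvs, "Reverse translation returned no candidate DNA variants")
    return Translated(candidates)
```

With this shape, a p.Xxxdel consequence yields Skipped without touching the engine, while a missense for which the engine returns nothing still yields Failed.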
Acceptance criteria
- A genomic deletion input that collapses to p.Xxxdel is recorded as SKIPPED (not FAILED), with a reason indicating no equivalence class.
- A length-changing delins / frameshift / multi-residue consequence is recorded as SKIPPED (not FAILED).
- A delins (or other non-substitution DNA form) whose net protein consequence is a simple missense is still reverse-translated normally and produces candidate alleles.
- Missense, synonymous, and nonsense substitution inputs continue to reverse-translate and create alleles exactly as before (including WT-codon members under WtCodonMode.ALL).
- A substitution consequence that genuinely yields no candidates still reports FAILED (the skip path must not swallow real substitution failures).
- Classification does not depend on matching the translation engine's free-text error messages.
- Status counts returned by the job (translated / failed / skipped) reflect the new classification, and the translated == 0 and failed > 0 guard is no longer tripped by non-substitution-only inputs that previously counted as failures.
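The effect of the last criterion on the job-level guard can be illustrated with a small stand-alone sketch. The enum here is a stand-in: only FAILED and SKIPPED are named in this issue, and the SUCCESS member and the job_fails helper are assumptions made for the example.

```python
from collections import Counter
from enum import Enum


class AnnotationStatus(str, Enum):
    # Stand-in for the real enum; SUCCESS is a hypothetical member name.
    SUCCESS = "success"
    SKIPPED = "skipped"
    FAILED = "failed"


def job_fails(statuses: list) -> bool:
    """Mirror the guard: fail the job if nothing translated and something failed."""
    counts = Counter(statuses)
    translated = counts[AnnotationStatus.SUCCESS]
    failed = counts[AnnotationStatus.FAILED]
    return translated == 0 and failed > 0


# An all-indel score set under the current and the proposed classification.
all_indel_before = [AnnotationStatus.FAILED] * 3
all_indel_after = [AnnotationStatus.SKIPPED] * 3
```

Under the current classification the all-indel run trips the guard; once those rows are recorded as SKIPPED, it does not.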
Implementation notes
- The classification belongs at the consequence-resolution step in the variant-annotation translation core (where inputs are collapsed to a ProteinConsequence in construct_equivalent_variants / _resolve_consequence), so non-substitution consequences short-circuit before the subprocess call.
- Introduce a typed third outcome distinct from TranslationResult and TranslationError (e.g. a "skipped / not translatable" result carrying the input and a reason). Today the API is binary, which is what forces the brittle error-string interpretation in the worker.
- For robustly determining the edit type, prefer parsing the protein consequence with the HGVS parser and switching on the parsed edit (substitution vs AADel / AADelIns / AAFs / AAExt / dup / ranged). The translation core is intentionally port-based and does not import hgvs directly, so route the parse through the existing hgvs-backed coordinate adapter (worker side) or expose it via a small port method, rather than importing hgvs into the core. A tight deterministic classifier over the fixed protein-consequence grammar is an acceptable fallback if a parser cannot be placed cleanly, but is slightly less robust against exotic forms.
- Update the worker (reverse_translate_variants_for_score_set) to map the new typed skip outcome to AnnotationStatus.SKIPPED with the carried reason, alongside the existing transcript-resolution skip handling.
- Add regression tests: a deletion consequence and a length-changing delins/frameshift consequence are classified SKIPPED; a delins-input-that-is-really-a-missense reverse-translates; a missense with no candidates remains FAILED.
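One way the "small port method" option might look is sketched below. All names are hypothetical (ProteinEditClassifier, StubClassifier, should_call_engine); the real port and adapter interfaces in the translation core may differ. The point is only that the core depends on a protocol, so an hgvs-backed implementation can live on the worker side and a stub can be used in tests.

```python
from typing import Protocol


class ProteinEditClassifier(Protocol):
    """Hypothetical port: the core asks about the edit type without importing hgvs."""

    def is_substitution(self, p_hgvs: str) -> bool: ...


class StubClassifier:
    """Test double standing in for an hgvs-backed adapter on the worker side."""

    def __init__(self, substitutions: set) -> None:
        self._substitutions = substitutions

    def is_substitution(self, p_hgvs: str) -> bool:
        return p_hgvs in self._substitutions


def should_call_engine(p_hgvs: str, classifier: ProteinEditClassifier) -> bool:
    # The core only sees the port; which parser sits behind it is an adapter concern.
    return classifier.is_substitution(p_hgvs)
```

A regression test can then inject a stub that treats p.Arg175His as a substitution and confirm that p.Arg175del never reaches the engine.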