Move taxonomy_id to TargetGene; populate via mapping for accession-based targets #697

@bencap

Background

Taxonomy currently lives on TargetSequence, but organism is a property of the gene target, not of how its sequence is represented. Accession-based targets have no taxonomy representation at all, despite every accession-based target being implicitly Homo sapiens by virtue of CDOT's current scope. This makes it impossible to query or filter accession-based score sets by organism in a structured way.

Proposed Changes

Move taxonomy_id to TargetGene

taxonomy_id moves from target_sequences to target_genes, applying uniformly to both sequence and accession types.

For accession-based targets, taxonomy is populated by the mapping job via CDOT lookup. It is never user-supplied. While CDOT is human-only, this will always resolve to Homo sapiens, but the design is forward-compatible with potential future multi-organism support for accession-based targets.
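A minimal sketch of how a mapping step might set taxonomy on an accession-based target. The `TargetGene` dataclass, `populate_taxonomy_from_mapping`, and `human_only_lookup` are hypothetical names for illustration, and the lookup is a placeholder rather than the real CDOT client API. The accession is only an example value; 9606 is the NCBI Taxonomy ID for Homo sapiens.

```python
from dataclasses import dataclass
from typing import Callable, Optional

HOMO_SAPIENS_TAX_ID = 9606  # NCBI Taxonomy ID for Homo sapiens


@dataclass
class TargetGene:
    """Hypothetical stand-in for a target_genes row."""
    name: str
    accession: Optional[str] = None
    taxonomy_id: Optional[int] = None


def populate_taxonomy_from_mapping(
    gene: TargetGene,
    lookup_taxonomy: Callable[[str], Optional[int]],
) -> TargetGene:
    """Set taxonomy_id on an accession-based target from a lookup result.

    Sequence-based targets (no accession) are left untouched because their
    taxonomy is user-supplied. A failed lookup leaves taxonomy_id as None.
    """
    if gene.accession is None:
        return gene
    resolved = lookup_taxonomy(gene.accession)
    if resolved is not None:
        gene.taxonomy_id = resolved
    return gene


def human_only_lookup(accession: str) -> Optional[int]:
    """Placeholder for a CDOT-backed lookup: every known accession is human."""
    known = {"NM_000546.6"}  # illustrative accession only
    return HOMO_SAPIENS_TAX_ID if accession in known else None


mapped = populate_taxonomy_from_mapping(
    TargetGene("TP53", accession="NM_000546.6"), human_only_lookup
)
unmapped = populate_taxonomy_from_mapping(
    TargetGene("unknown", accession="XX_000000.0"), human_only_lookup
)
```

Passing the lookup in as a callable keeps the sketch independent of any particular CDOT client and would let a future multi-organism lookup be swapped in without changing the mapping step.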

For sequence-based targets, taxonomy remains user-supplied but moves up to TargetGeneCreate rather than being nested inside TargetSequenceCreate.
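The proposed input shape could be sketched roughly as follows, using plain dataclasses in place of the project's actual view models. The field names (`taxonomy_id`, `target_sequence`, `target_accession`) and the choice to reject, rather than silently ignore, a user-supplied taxonomy on accession-based targets are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetSequenceCreate:
    # No taxonomy field here any more: it moves up to TargetGeneCreate.
    sequence: str
    sequence_type: str  # e.g. "dna" or "protein"


@dataclass
class TargetAccessionCreate:
    accession: str


@dataclass
class TargetGeneCreate:
    name: str
    taxonomy_id: Optional[int] = None
    target_sequence: Optional[TargetSequenceCreate] = None
    target_accession: Optional[TargetAccessionCreate] = None

    def __post_init__(self) -> None:
        has_seq = self.target_sequence is not None
        has_acc = self.target_accession is not None
        if has_seq == has_acc:
            raise ValueError(
                "exactly one of target_sequence or target_accession is required"
            )
        if has_seq and self.taxonomy_id is None:
            raise ValueError("taxonomy_id is required for sequence-based targets")
        if has_acc and self.taxonomy_id is not None:
            # Assumption: user-supplied taxonomy is rejected, not ignored.
            raise ValueError(
                "taxonomy_id is derived by mapping for accession-based targets"
            )


sequence_target = TargetGeneCreate(
    name="example",
    taxonomy_id=9606,
    target_sequence=TargetSequenceCreate(sequence="ATGC", sequence_type="dna"),
)
accession_target = TargetGeneCreate(
    name="example",
    target_accession=TargetAccessionCreate(accession="NM_000546.6"),
)
```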

Preserve non-breaking response shapes

The view model serialization layer absorbs the storage change so existing clients are unaffected:

  • Sequence-based GET response (unchanged): target_gene.target_sequence.taxonomy is preserved by populating taxonomy in SavedTargetSequence from TargetGene.taxonomy during serialization rather than from the sequence row directly.
  • Accession-based GET response (additive): target_gene.taxonomy is a new field populated after mapping. Clients that do not know about it are unaffected.

target_gene.taxonomy is intentionally not added to sequence-based responses at this time to avoid taxonomy appearing in two places in the same response. Normalizing the response shape across both types is a separate future change.
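The two response shapes described above might look something like this in a serializer. `serialize_target_gene` and the dictionary keys other than `target_sequence` and `taxonomy` are hypothetical; the real view models will differ.

```python
from typing import Any, Dict, Optional


def serialize_target_gene(
    name: str,
    taxonomy: Optional[Dict[str, Any]],
    sequence: Optional[str] = None,
    accession: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a response body from a gene whose taxonomy is stored on the gene."""
    body: Dict[str, Any] = {"name": name}
    if sequence is not None:
        # Sequence-based: taxonomy stays nested under target_sequence, but is
        # read from the gene; no top-level taxonomy key is emitted.
        body["target_sequence"] = {"sequence": sequence, "taxonomy": taxonomy}
    else:
        # Accession-based: taxonomy is a new, additive top-level field that is
        # None until mapping has succeeded.
        body["target_accession"] = {"accession": accession}
        body["taxonomy"] = taxonomy
    return body


human = {"tax_id": 9606, "organism_name": "Homo sapiens"}
sequence_response = serialize_target_gene("example", human, sequence="ATGC")
accession_response = serialize_target_gene("example", None, accession="NM_000546.6")
```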

Breaking Changes

| Model | Change | Breaking? |
| --- | --- | --- |
| TargetSequenceCreate | Remove taxonomy field | Yes (input) |
| TargetGeneCreate | Add taxonomy field (sequence-based targets) | Yes (input) |
| TargetSequence response | taxonomy stays in place (serialized from TargetGene) | No (unchanged) |
| TargetGene response | Add taxonomy for accession-based targets | No (additive only) |

When Taxonomy Is Null vs. Populated

For sequence-based targets, taxonomy is user-supplied at creation time and will always be populated. A null value indicates a data integrity problem.

For accession-based targets, taxonomy is derived by the mapping job and will be null until mapping completes successfully. This is an expected transient state. Consumers of the API should treat a null taxonomy on an accession-based target as "mapping has not yet run or has not yet succeeded" rather than as an absent or unknown organism. A null value on a published accession-based target is therefore valid: it means only that the mapping job has not yet run, or has not yet succeeded, for that score set.
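A client could encode these rules in a small helper along the following lines. `TaxonomyState` and `classify_taxonomy` are hypothetical names, not part of the API.

```python
from enum import Enum
from typing import Optional


class TaxonomyState(Enum):
    POPULATED = "populated"
    PENDING_MAPPING = "pending_mapping"  # expected for accession-based targets
    INTEGRITY_ERROR = "integrity_error"  # null on a sequence-based target


def classify_taxonomy(
    is_accession_based: bool, taxonomy_id: Optional[int]
) -> TaxonomyState:
    """Interpret a taxonomy value according to the target type."""
    if taxonomy_id is not None:
        return TaxonomyState.POPULATED
    if is_accession_based:
        # Mapping has not yet run or has not yet succeeded.
        return TaxonomyState.PENDING_MAPPING
    # Sequence-based targets must always carry a user-supplied taxonomy.
    return TaxonomyState.INTEGRITY_ERROR
```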

Migration Notes

  • taxonomy_id column moves from target_sequences to target_genes; data migration required
  • Existing accession-based score sets will have null taxonomy until their mapping jobs are re-run or a backfill migration runs a CDOT lookup for each accession
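The data migration for the first bullet could be sketched as below, using an in-memory SQLite database so the example is self-contained. The schema is an assumption: in particular, the `target_sequence_id` foreign key on `target_genes` and the `accession` column are illustrative, and the production migration would run in the project's own migration tooling against its real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE target_sequences (
        id INTEGER PRIMARY KEY,
        sequence TEXT,
        taxonomy_id INTEGER
    );
    CREATE TABLE target_genes (
        id INTEGER PRIMARY KEY,
        name TEXT,
        target_sequence_id INTEGER REFERENCES target_sequences(id),
        accession TEXT
    );
    INSERT INTO target_sequences VALUES (1, 'ATGC', 9606), (2, 'ATGG', 10090);
    INSERT INTO target_genes VALUES
        (1, 'geneA', 1, NULL),
        (2, 'geneB', 2, NULL),
        (3, 'geneC', NULL, 'NM_000546.6');
    """
)


def migrate_taxonomy(connection: sqlite3.Connection) -> None:
    """Copy taxonomy_id from each sequence row up to its owning gene row."""
    connection.execute("ALTER TABLE target_genes ADD COLUMN taxonomy_id INTEGER")
    connection.execute(
        """
        UPDATE target_genes
        SET taxonomy_id = (
            SELECT ts.taxonomy_id
            FROM target_sequences ts
            WHERE ts.id = target_genes.target_sequence_id
        )
        WHERE target_sequence_id IS NOT NULL
        """
    )
    # Dropping target_sequences.taxonomy_id would follow once the copy is verified.
    connection.commit()


migrate_taxonomy(conn)
rows = conn.execute(
    "SELECT name, taxonomy_id FROM target_genes ORDER BY id"
).fetchall()
```

In this sketch the accession-based row (`geneC`) keeps a null `taxonomy_id` after the migration, matching the second bullet: it stays null until mapping is re-run or a backfill performs the lookup.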

Labels

  • app: backend (Task implementation touches the backend)
  • app: database (Task implementation requires database changes)
  • app: frontend (Task implementation touches the frontend)
  • app: mapper (Task implementation touches the mapper)
  • app: worker (Task implementation touches the worker)
  • type: enhancement (Enhancement to an existing feature)
  • type: maintenance (Maintaining this project)
