Description
Background
Taxonomy currently lives on TargetSequence, but organism is a property of the gene target, not of how its sequence is represented. Accession-based targets have no taxonomy representation at all, despite every accession-based target being implicitly Homo sapiens by virtue of CDOT's current scope. This makes it impossible to query or filter accession-based score sets by organism in a structured way.
Proposed Changes
Move taxonomy_id to TargetGene
taxonomy_id moves from target_sequences to target_genes, applying uniformly to both sequence and accession types.
For accession-based targets, taxonomy is populated by the mapping job via CDOT lookup. It is never user-supplied. While CDOT is human-only, this will always resolve to Homo sapiens, but the design is forward-compatible with potential future multi-organism support in accession-based targets.
For sequence-based targets, taxonomy remains user-supplied but moves up to TargetGeneCreate rather than being nested inside TargetSequenceCreate.
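A minimal sketch of the proposed input shape, using plain dataclasses rather than the project's actual view models. The field names other than `taxonomy`, the integer taxonomy representation, and the validation rules are assumptions made for illustration; in particular, rejecting a user-supplied taxonomy on accession-based targets is one possible way to enforce "never user-supplied", not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TargetSequenceCreate:
    # After the proposed change, taxonomy is no longer nested here.
    sequence_type: str  # hypothetical field
    sequence: str       # hypothetical field


@dataclass
class TargetAccessionCreate:
    accession: str  # hypothetical field


@dataclass
class TargetGeneCreate:
    name: str
    target_sequence: Optional[TargetSequenceCreate] = None
    target_accession: Optional[TargetAccessionCreate] = None
    # Taxonomy now lives on the gene. An integer ID is used here only for
    # brevity; the real field shape is not specified in this proposal.
    taxonomy: Optional[int] = None

    def __post_init__(self) -> None:
        has_sequence = self.target_sequence is not None
        has_accession = self.target_accession is not None
        if has_sequence == has_accession:
            raise ValueError("exactly one of target_sequence or target_accession is required")
        if has_sequence and self.taxonomy is None:
            raise ValueError("sequence-based targets must supply a taxonomy")
        if has_accession and self.taxonomy is not None:
            # Assumed rule: taxonomy is derived by the mapping job, never supplied.
            raise ValueError("taxonomy is not user-supplied for accession-based targets")
```

With this shape, a sequence-based request carries `taxonomy` next to `target_sequence`, while an accession-based request omits it and leaves it `None` until mapping fills it in.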
Preserve non-breaking response shapes
The view model serialization layer absorbs the storage change so existing clients are unaffected:
- Sequence-based GET response (unchanged): `target_gene.target_sequence.taxonomy` is preserved by populating `taxonomy` in `SavedTargetSequence` from `TargetGene.taxonomy` during serialization rather than from the sequence row directly.
- Accession-based GET response (additive): `target_gene.taxonomy` is a new field populated after mapping. Clients that do not know about it are unaffected.
target_gene.taxonomy is intentionally not added to sequence-based responses at this time to avoid taxonomy appearing in two places in the same response. Normalizing the response shape across both types is a separate future change.
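The response rules above could be sketched roughly as follows. Plain dicts stand in for the ORM rows and view models, and `serialize_target_gene` and the dict keys other than `taxonomy` are hypothetical names; this illustrates where taxonomy lands in each response shape and is not the actual serialization code.

```python
from typing import Any, Dict


def serialize_target_gene(gene: Dict[str, Any]) -> Dict[str, Any]:
    """Build a response dict from a gene record that stores taxonomy itself."""
    out: Dict[str, Any] = {"name": gene["name"]}
    taxonomy = gene.get("taxonomy")

    if gene.get("target_sequence") is not None:
        # Sequence-based: keep taxonomy nested under target_sequence, sourced
        # from the gene rather than from the sequence row.
        sequence = dict(gene["target_sequence"])
        sequence["taxonomy"] = taxonomy
        out["target_sequence"] = sequence
        # Deliberately no top-level taxonomy here, to avoid duplication.
    else:
        # Accession-based: taxonomy is a new top-level field, None until mapped.
        out["target_accession"] = dict(gene["target_accession"])
        out["taxonomy"] = taxonomy

    return out
```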
Breaking Changes
| | Change | Breaking? |
|---|---|---|
| `TargetSequenceCreate` | Remove `taxonomy` field | input |
| `TargetGeneCreate` | Add `taxonomy` field (sequence-based targets) | input |
| `TargetSequence` response | `taxonomy` stays in place (serialized from `TargetGene`) | no change |
| `TargetGene` response | Add `taxonomy` for accession-based targets | additive only |
When Taxonomy Is Null vs. Populated
For sequence-based targets, taxonomy is user-supplied at creation time and will always be populated. A null value indicates a data integrity problem.
For accession-based targets, taxonomy is derived by the mapping job and will be null until mapping completes successfully. This is an expected transient state. Consumers of the API should treat a null taxonomy on an accession-based target as "mapping has not yet run or has not yet succeeded" rather than as an absent or unknown organism. A null value on a published accession-based target is valid and simply means the mapping job has not yet run, or has not yet succeeded, for that score set.
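For API consumers, the interpretation rules above might be captured in a small helper like the one below. The function name and the returned labels are hypothetical; only the three-way distinction comes from this section.

```python
from typing import Optional


def describe_taxonomy_state(is_accession_based: bool, taxonomy: Optional[str]) -> str:
    """Classify a target's taxonomy value according to the rules in this section."""
    if taxonomy is not None:
        return "populated"
    if is_accession_based:
        # Expected transient state: mapping has not run or has not succeeded.
        return "pending_mapping"
    # Sequence-based targets always supply taxonomy at creation time.
    return "integrity_error"
```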
Migration Notes
- `taxonomy_id` column moves from `target_sequences` to `target_genes`; data migration required
- Existing accession-based score sets will have `null` taxonomy until their mapping jobs are re-run or a backfill migration runs a CDOT lookup for each accession
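The column move could look something like the sketch below, shown against an in-memory SQLite database so it runs standalone. The `target_sequence_id` and `accession_id` columns on `target_genes` are assumed here for illustration, and the real migration would be written with the project's own migration tooling against its production database. Dropping the old column and the CDOT backfill are left out.

```python
import sqlite3


def move_taxonomy_to_genes(conn: sqlite3.Connection) -> None:
    """Add target_genes.taxonomy_id and copy values up from target_sequences."""
    conn.execute("ALTER TABLE target_genes ADD COLUMN taxonomy_id INTEGER")
    conn.execute(
        """
        UPDATE target_genes
        SET taxonomy_id = (
            SELECT ts.taxonomy_id
            FROM target_sequences AS ts
            WHERE ts.id = target_genes.target_sequence_id
        )
        WHERE target_sequence_id IS NOT NULL
        """
    )
    conn.commit()


# Toy schema and data; the linking columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE target_sequences (id INTEGER PRIMARY KEY, taxonomy_id INTEGER);
    CREATE TABLE target_genes (
        id INTEGER PRIMARY KEY,
        target_sequence_id INTEGER REFERENCES target_sequences(id),
        accession_id INTEGER
    );
    INSERT INTO target_sequences (id, taxonomy_id) VALUES (1, 9606), (2, 10090);
    INSERT INTO target_genes (id, target_sequence_id, accession_id)
        VALUES (10, 1, NULL), (11, 2, NULL), (12, NULL, 7);
    """
)

move_taxonomy_to_genes(conn)
rows = conn.execute("SELECT id, taxonomy_id FROM target_genes ORDER BY id").fetchall()
```

After the copy, the two sequence-based genes carry the taxonomy of their sequence rows, while the accession-based gene (id 12) stays null until a mapping re-run or backfill populates it.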