obs column silently shadows var gene expression when key exists in both #621

@timtreis


Environment: spatialdata-plot 0.3.4.dev (main, commit 5cfedc7), Python 3.13


Problem

When the same key exists in both table.obs.columns and table.var_names, the obs value silently wins. Users who intend to color by gene expression (from the X matrix via var_names) get obs data instead, with no warning or other indication that anything unexpected has happened.

The root cause is an elif in spatialdata's _get_table_origins:

if value_key in element.obs.columns:
    origins.append(_ValueOrigin(origin="obs", ...))
elif value_key in element.var_names:   # ← skipped when obs matches
    origins.append(_ValueOrigin(origin="var", ...))

Because elif is used, finding the key in obs entirely prevents the var check. The spatialdata-plot layer (utils.py:1074–1078) handles the multi-origin case with a descriptive ValueError, but it never gets the chance because only one origin (obs) is returned.
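The shadowing can be reproduced without spatialdata installed. The sketch below is a simplified stand-in, not the library's code: `find_origins_elif` is a hypothetical function that mirrors only the if/elif branch structure quoted above and returns plain strings instead of `_ValueOrigin` objects.

```python
def find_origins_elif(value_key, obs_columns, var_names):
    """Hypothetical stand-in that mirrors the if/elif branch structure."""
    origins = []
    if value_key in obs_columns:
        origins.append("obs")
    elif value_key in var_names:  # never evaluated once obs has matched
        origins.append("var")
    return origins


obs_columns = ["instance_id", "region", "GeneA"]
var_names = ["GeneA", "GeneB"]

# "GeneA" exists in both places, but only the obs origin is reported.
print(find_origins_elif("GeneA", obs_columns, var_names))
# "GeneB" exists only in var_names, so the var branch is reached.
print(find_origins_elif("GeneB", obs_columns, var_names))
```

Since the caller only ever sees a single origin for "GeneA", it has no way to tell that a second match was skipped.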

This is particularly dangerous when:

  • obs[gene] stores a pre-computed aggregate or a value from a different assay
  • the X column for gene (looked up via var_names) holds the per-cell expression the user wants to visualize

Minimal reproducible example

import matplotlib; matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np, pandas as pd, anndata as ad
import dask; dask.config.set({"dataframe.query-planning": False})
import spatialdata as sd
from spatialdata.models import PointsModel, TableModel
import spatialdata_plot

pts = PointsModel.parse(pd.DataFrame({"x": [1., 2., 3., 4.], "y": [1., 2., 3., 4.]}))

obs = pd.DataFrame({
    "instance_id": [0, 1, 2, 3],
    "region": ["pts"] * 4,
    "GeneA": [0.9, 0.8, 0.7, 0.6],   # obs: summary/aggregate values, all similar
})
obs.index = obs.index.astype(str)

# var GeneA expression has a very different range: [1.0, 0.8, 0.3, 0.1]
X = np.array([[1.0, 0.5], [0.8, 0.2], [0.3, 0.9], [0.1, 0.7]])
adata = ad.AnnData(X=X, obs=obs, var=pd.DataFrame(index=["GeneA", "GeneB"]))
table = TableModel.parse(adata, region=["pts"], region_key="region", instance_key="instance_id")
sdata = sd.SpatialData(points={"pts": pts}, tables={"t": table})

# User expects gene expression from var — but gets obs values
sdata.pl.render_points("pts", color="GeneA", table_name="t").pl.show()
# No error, no warning — silently uses obs GeneA [0.9, 0.8, 0.7, 0.6]
# instead of var GeneA expression [1.0, 0.8, 0.3, 0.1]

Expected behaviour

When a key exists in both obs and var_names, either:

  • A UserWarning is emitted explaining that obs is being used and var is being shadowed, with a hint on how to disambiguate
  • Or: a ValueError is raised asking the user to specify which source they want

Actual behaviour

No warning is emitted. The plot uses the obs["GeneA"] values [0.9, 0.8, 0.7, 0.6], whereas the user intended the expression values [1.0, 0.8, 0.3, 0.1] taken from X via var_names.


Fix sketch

In _get_table_origins (upstream spatialdata), change the elif to a second, independent if for the var check, so that both origins are appended when the key matches obs and var_names. The spatialdata-plot layer at utils.py:1074–1078 already handles multiple origins with a descriptive ValueError that explains the ambiguity and asks the user to resolve it; with both origins returned, that code path would finally be reached.
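A minimal sketch of this idea, under stated assumptions: `find_origins` and `resolve_origin` are hypothetical names, origins are plain strings rather than `_ValueOrigin` objects, and the ValueError here only imitates the kind of check described for the spatialdata-plot layer; it is not the code at utils.py:1074–1078.

```python
def find_origins(value_key, obs_columns, var_names):
    """Hypothetical lookup using two independent checks instead of if/elif."""
    origins = []
    if value_key in obs_columns:
        origins.append("obs")
    if value_key in var_names:  # now evaluated even when obs matched
        origins.append("var")
    return origins


def resolve_origin(value_key, obs_columns, var_names):
    """Hypothetical caller: refuse to guess when more than one origin matches."""
    origins = find_origins(value_key, obs_columns, var_names)
    if len(origins) > 1:
        raise ValueError(
            f"'{value_key}' is ambiguous: found in {' and '.join(origins)}. "
            "Rename one of them or specify the intended source."
        )
    # Returning None for "not found" is an arbitrary choice for this sketch.
    return origins[0] if origins else None


obs_columns = ["instance_id", "region", "GeneA"]
var_names = ["GeneA", "GeneB"]

try:
    resolve_origin("GeneA", obs_columns, var_names)
except ValueError as err:
    print(f"ValueError: {err}")
```

Keys that match only one source, such as "GeneB" here, resolve as before; only the colliding key is rejected.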

Alternatively, if obs-first priority is the intended behaviour, emit a UserWarning at the spatialdata-plot layer when the value was found in obs but would also match var_names.
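The warning variant could look roughly like the following. This is a sketch only: `warn_if_shadowed` is a hypothetical helper, the message wording is illustrative, and where such a check would live in spatialdata-plot is left open.

```python
import warnings


def warn_if_shadowed(value_key, obs_columns, var_names):
    """Hypothetical helper: warn when an obs match hides a var_names match.

    Returns True if a warning was emitted, False otherwise.
    """
    if value_key in obs_columns and value_key in var_names:
        warnings.warn(
            f"'{value_key}' exists in both table.obs and table.var_names; "
            "using the obs column. Rename one of them to plot the other.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


obs_columns = ["instance_id", "region", "GeneA"]
var_names = ["GeneA", "GeneB"]

# Capture the warning so this example runs the same under any warning filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_shadowed("GeneA", obs_columns, var_names)

print(caught[0].message)
```

This keeps existing plots working while making the obs-first choice visible to the user.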


Triage tier: Tier 3
