perf: single-pass _sparse_nanmean (#1894) by Lawson-Darrow · Pull Request #4141 · scverse/scanpy

Lawson-Darrow · 2026-06-01T15:10:04Z

Closes #1894.

_sparse_nanmean copied the matrix twice and did a sparse data[isnan] = 0 set-index plus eliminate_zeros. This replaces it with single-pass numba kernels over the compressed buffers: one reduces within each compressed slot (per row for CSR), the other scatters across slots for the other axis. Both axes and CSR/CSC are handled. Semantics are unchanged: implicit zeros count as observed values, only stored NaNs are excluded, matching np.nanmean on the dense array.

Benchmark (20k x 3k, 5% dense, with NaNs): axis=1 ~88x faster, axis=0 ~9.5x faster.

Existing test_sparse_nanmean (both axes) still passes; added CSC regression tests. Used numba per the issue's suggestion, happy to adjust if you'd prefer.

_sparse_nanmean copied the matrix twice and did a sparse set-index + eliminate_zeros. Replace with single-pass numba kernels over the compressed buffers (one for within-slot reduction, one for the scatter across slots), handling both axes and CSR/CSC. Implicit zeros still count as observed; only stored NaNs are excluded, matching np.nanmean on the dense array. Benchmarked on a 20k x 3k 5%-dense matrix: ~88x faster for axis=1, ~9.5x for axis=0. Adds CSC regression tests.

Lawson-Darrow · 2026-06-02T00:01:40Z

The failing pre and low-vers checks here are unrelated to this change. The 8 failures in the pre (3.14) job are all in tests/test_read_10x.py and tests/test_aggregated.py, caused by new scipy pre-release DeprecationWarnings (the spmatrix-to-sparray migration and the block_diag interface change) being promoted to errors by the warnings-as-errors config. Every score_genes / _sparse_nanmean test passes in that same job. The low-vers (3.12) job did not actually fail a test, it was cancelled by fail-fast when the pre job failed. Happy to rebase once the pre env is fixed on main.

Zethson · 2026-06-05T09:44:49Z

Could you please share your benchmarks so that we can reproduce them? Also note that we have a benchmark setup in scanpy but one thing after another.

Lawson-Darrow · 2026-06-05T17:55:53Z

Of course!

Reproducible numbers below comparing the pre-PR implementation against the new single pass kernel on a 20000x3000 matrix at 5% density with NaNs in the stored data (the shape from the description) for both axes and both layouts.

Before timing the script asserts old and new return the same result (nan-aware) on this input, so the numbers below are the same computation, just faster. I also separately checked parity against np.nanmean on edge cases (all-NaN rows/cols, integer dtype, empty axes, duplicate stored entries).

Env: Python 3.12, Windows 11, 13th gen Intel (Raptor Lake), scipy 1.17.1, numpy 2.4.6, numba 0.65.1. Median of 7 runs, JIT warm up excluded.

layout	axis	old	new	speedup
CSR	0	~28 ms	~3.1 ms	~9x
CSR	1	~28 ms	~0.15 ms	~180x
CSC	0	~27 ms	~0.28 ms	~95x
CSC	1	~28 ms	~4 ms	~6-9x

The within slot reduction (CSR axis=1, CSC axis=0) is the big win. The cross slot scatter axis is ~6-9x. The new times on the fast paths are sub-millisecond so those exact multipliers are noisy run to run but the order of magnitude is stable.

import statistics, time
import numpy as np, scipy.sparse as sp
from scanpy.tools._score_genes import _sparse_nanmean as new

def old(x, axis):  # pre-PR implementation
    z = x.copy(); z.data = np.isnan(z.data); z.eliminate_zeros()
    n = z.shape[axis] - z.sum(axis)
    y = x.copy(); y.data[np.isnan(y.data)] = 0; y.eliminate_zeros()
    return y.sum(axis, dtype="float64") / n

def median_ms(fn, X, axis, repeats=7):
    ts = []
    for _ in range(repeats):
        t0 = time.perf_counter(); fn(X, axis); ts.append(time.perf_counter() - t0)
    return statistics.median(ts) * 1e3

rng = np.random.default_rng(0)
base = sp.random(20_000, 3_000, density=0.05, format="csr", random_state=0)
base.data[rng.random(base.data.size) < 0.05] = np.nan
for fmt in ("csr", "csc"):
    X = base.asformat(fmt)
    for axis in (0, 1):
        assert np.allclose(np.asarray(old(X, axis)).ravel(),
                           np.asarray(new(X, axis)).ravel(), equal_nan=True)
        new(X, axis)  # warm up JIT (excluded)
        t_old, t_new = median_ms(old, X, axis), median_ms(new, X, axis)
        print(f"{fmt} axis={axis}: old={t_old:.2f}ms new={t_new:.2f}ms {t_old/t_new:.1f}x")

One thing on the suite, I didn't find an existing asv case covering _sparse_nanmean or score_genes in benchmarks/. Happy to add one (helper-level and/or end-to-end score_genes) as a follow-up if you'd like it in the harness!

Lawson-Darrow added 2 commits June 1, 2026 11:07

docs: add release note for scverse#4141

1be798e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: single-pass _sparse_nanmean (#1894)#4141

perf: single-pass _sparse_nanmean (#1894)#4141
Lawson-Darrow wants to merge 2 commits into
scverse:mainfrom
Lawson-Darrow:perf/sparse-nanmean-single-pass-1894

Lawson-Darrow commented Jun 1, 2026

Uh oh!

Lawson-Darrow commented Jun 2, 2026

Uh oh!

Zethson commented Jun 5, 2026

Uh oh!

Lawson-Darrow commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Lawson-Darrow commented Jun 1, 2026

Uh oh!

Lawson-Darrow commented Jun 2, 2026

Uh oh!

Zethson commented Jun 5, 2026

Uh oh!

Lawson-Darrow commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants