Skip to content

perf: single-pass _sparse_nanmean (#1894)#4141

Open
Lawson-Darrow wants to merge 2 commits into
scverse:mainfrom
Lawson-Darrow:perf/sparse-nanmean-single-pass-1894
Open

perf: single-pass _sparse_nanmean (#1894)#4141
Lawson-Darrow wants to merge 2 commits into
scverse:mainfrom
Lawson-Darrow:perf/sparse-nanmean-single-pass-1894

Conversation

@Lawson-Darrow
Copy link
Copy Markdown

Closes #1894.

_sparse_nanmean copied the matrix twice and did a sparse data[isnan] = 0 set-index plus eliminate_zeros. This replaces it with single-pass numba kernels over the compressed buffers: one reduces within each compressed slot (per row for CSR), the other scatters across slots for the other axis. Both axes and CSR/CSC are handled. Semantics are unchanged: implicit zeros count as observed values, only stored NaNs are excluded, matching np.nanmean on the dense array.

Benchmark (20k x 3k, 5% dense, with NaNs): axis=1 ~88x faster, axis=0 ~9.5x faster.

Existing test_sparse_nanmean (both axes) still passes; added CSC regression tests. Used numba per the issue's suggestion, happy to adjust if you'd prefer.

_sparse_nanmean copied the matrix twice and did a sparse set-index +
eliminate_zeros. Replace with single-pass numba kernels over the compressed
buffers (one for within-slot reduction, one for the scatter across slots),
handling both axes and CSR/CSC. Implicit zeros still count as observed; only
stored NaNs are excluded, matching np.nanmean on the dense array.

Benchmarked on a 20k x 3k 5%-dense matrix: ~88x faster for axis=1, ~9.5x for
axis=0. Adds CSC regression tests.
@Lawson-Darrow
Copy link
Copy Markdown
Author

The failing pre and low-vers checks here are unrelated to this change. The 8 failures in the pre (3.14) job are all in tests/test_read_10x.py and tests/test_aggregated.py, caused by new scipy pre-release DeprecationWarnings (the spmatrix-to-sparray migration and the block_diag interface change) being promoted to errors by the warnings-as-errors config. Every score_genes / _sparse_nanmean test passes in that same job. The low-vers (3.12) job did not actually fail a test, it was cancelled by fail-fast when the pre job failed. Happy to rebase once the pre env is fixed on main.

@Zethson
Copy link
Copy Markdown
Member

Zethson commented Jun 5, 2026

Could you please share your benchmarks so that we can reproduce them? Also note that we have a benchmark setup in scanpy but one thing after another.

@Lawson-Darrow
Copy link
Copy Markdown
Author

Of course!

Reproducible numbers below comparing the pre-PR implementation against the new single pass kernel on a 20000x3000 matrix at 5% density with NaNs in the stored data (the shape from the description) for both axes and both layouts.

Before timing the script asserts old and new return the same result (nan-aware) on this input, so the numbers below are the same computation, just faster. I also separately checked parity against np.nanmean on edge cases (all-NaN rows/cols, integer dtype, empty axes, duplicate stored entries).

Env: Python 3.12, Windows 11, 13th gen Intel (Raptor Lake), scipy 1.17.1, numpy 2.4.6, numba 0.65.1. Median of 7 runs, JIT warm up excluded.

layout axis old new speedup
CSR 0 ~28 ms ~3.1 ms ~9x
CSR 1 ~28 ms ~0.15 ms ~180x
CSC 0 ~27 ms ~0.28 ms ~95x
CSC 1 ~28 ms ~4 ms ~6-9x

The within slot reduction (CSR axis=1, CSC axis=0) is the big win. The cross slot scatter axis is ~6-9x. The new times on the fast paths are sub-millisecond so those exact multipliers are noisy run to run but the order of magnitude is stable.

import statistics, time
import numpy as np, scipy.sparse as sp
from scanpy.tools._score_genes import _sparse_nanmean as new

def old(x, axis):  # pre-PR implementation
    z = x.copy(); z.data = np.isnan(z.data); z.eliminate_zeros()
    n = z.shape[axis] - z.sum(axis)
    y = x.copy(); y.data[np.isnan(y.data)] = 0; y.eliminate_zeros()
    return y.sum(axis, dtype="float64") / n

def median_ms(fn, X, axis, repeats=7):
    ts = []
    for _ in range(repeats):
        t0 = time.perf_counter(); fn(X, axis); ts.append(time.perf_counter() - t0)
    return statistics.median(ts) * 1e3

rng = np.random.default_rng(0)
base = sp.random(20_000, 3_000, density=0.05, format="csr", random_state=0)
base.data[rng.random(base.data.size) < 0.05] = np.nan
for fmt in ("csr", "csc"):
    X = base.asformat(fmt)
    for axis in (0, 1):
        assert np.allclose(np.asarray(old(X, axis)).ravel(),
                           np.asarray(new(X, axis)).ravel(), equal_nan=True)
        new(X, axis)  # warm up JIT (excluded)
        t_old, t_new = median_ms(old, X, axis), median_ms(new, X, axis)
        print(f"{fmt} axis={axis}: old={t_old:.2f}ms new={t_new:.2f}ms {t_old/t_new:.1f}x")

One thing on the suite, I didn't find an existing asv case covering _sparse_nanmean or score_genes in benchmarks/. Happy to add one (helper-level and/or end-to-end score_genes) as a follow-up if you'd like it in the harness!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_sparse_nanmean is inefficient

2 participants