perf: single-pass _sparse_nanmean (#1894)#4141
Conversation
_sparse_nanmean copied the matrix twice and did a sparse set-index + eliminate_zeros. Replace with single-pass numba kernels over the compressed buffers (one for within-slot reduction, one for the scatter across slots), handling both axes and CSR/CSC. Implicit zeros still count as observed; only stored NaNs are excluded, matching np.nanmean on the dense array. Benchmarked on a 20k x 3k 5%-dense matrix: ~88x faster for axis=1, ~9.5x for axis=0. Adds CSC regression tests.
|
The failing pre and low-vers checks here are unrelated to this change. The 8 failures in the pre (3.14) job are all in tests/test_read_10x.py and tests/test_aggregated.py, caused by new scipy pre-release DeprecationWarnings (the spmatrix-to-sparray migration and the block_diag interface change) being promoted to errors by the warnings-as-errors config. Every score_genes / _sparse_nanmean test passes in that same job. The low-vers (3.12) job did not actually fail a test, it was cancelled by fail-fast when the pre job failed. Happy to rebase once the pre env is fixed on main. |
|
Could you please share your benchmarks so that we can reproduce them? Also note that we have a benchmark setup in scanpy but one thing after another. |
|
Of course! Reproducible numbers below comparing the pre-PR implementation against the new single pass kernel on a 20000x3000 matrix at 5% density with NaNs in the stored data (the shape from the description) for both axes and both layouts. Before timing the script asserts old and new return the same result (nan-aware) on this input, so the numbers below are the same computation, just faster. I also separately checked parity against Env: Python 3.12, Windows 11, 13th gen Intel (Raptor Lake), scipy 1.17.1, numpy 2.4.6, numba 0.65.1. Median of 7 runs, JIT warm up excluded.
The within slot reduction (CSR axis=1, CSC axis=0) is the big win. The cross slot scatter axis is ~6-9x. The new times on the fast paths are sub-millisecond so those exact multipliers are noisy run to run but the order of magnitude is stable. import statistics, time
import numpy as np, scipy.sparse as sp
from scanpy.tools._score_genes import _sparse_nanmean as new
def old(x, axis): # pre-PR implementation
z = x.copy(); z.data = np.isnan(z.data); z.eliminate_zeros()
n = z.shape[axis] - z.sum(axis)
y = x.copy(); y.data[np.isnan(y.data)] = 0; y.eliminate_zeros()
return y.sum(axis, dtype="float64") / n
def median_ms(fn, X, axis, repeats=7):
ts = []
for _ in range(repeats):
t0 = time.perf_counter(); fn(X, axis); ts.append(time.perf_counter() - t0)
return statistics.median(ts) * 1e3
rng = np.random.default_rng(0)
base = sp.random(20_000, 3_000, density=0.05, format="csr", random_state=0)
base.data[rng.random(base.data.size) < 0.05] = np.nan
for fmt in ("csr", "csc"):
X = base.asformat(fmt)
for axis in (0, 1):
assert np.allclose(np.asarray(old(X, axis)).ravel(),
np.asarray(new(X, axis)).ravel(), equal_nan=True)
new(X, axis) # warm up JIT (excluded)
t_old, t_new = median_ms(old, X, axis), median_ms(new, X, axis)
print(f"{fmt} axis={axis}: old={t_old:.2f}ms new={t_new:.2f}ms {t_old/t_new:.1f}x")One thing on the suite, I didn't find an existing asv case covering |
Closes #1894.
_sparse_nanmeancopied the matrix twice and did a sparsedata[isnan] = 0set-index pluseliminate_zeros. This replaces it with single-pass numba kernels over the compressed buffers: one reduces within each compressed slot (per row for CSR), the other scatters across slots for the other axis. Both axes and CSR/CSC are handled. Semantics are unchanged: implicit zeros count as observed values, only stored NaNs are excluded, matchingnp.nanmeanon the dense array.Benchmark (20k x 3k, 5% dense, with NaNs): axis=1 ~88x faster, axis=0 ~9.5x faster.
Existing
test_sparse_nanmean(both axes) still passes; added CSC regression tests. Used numba per the issue's suggestion, happy to adjust if you'd prefer.