Cut head_tail_breaks and box_plot dask re-scans #1213

Merged
brendancol merged 1 commit into master from perf/classify-head-tail-and-box-plot
Apr 16, 2026

@brendancol
Contributor

Summary

  • _run_dask_head_tail_breaks: persist data_clean once, track the running mask count across iterations, and fuse the mean and head-count reductions into a single dask.compute() call per iteration. Cuts per-iteration graph traversals from 3 to 1 and eliminates the re-read on every loop pass.
  • _run_dask_box_plot (new) and _run_dask_cupy_box_plot: replace data_clean[da.isfinite(data_clean)] (which forces compute_chunk_sizes) with the same seeded _generate_sample_indices sampler that natural_breaks and quantile already use. Percentiles are then computed on the finite portion of the sample in numpy.
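
The head_tail_breaks change can be sketched roughly as follows. This is a minimal illustration, not the code in xrspatial: the function name `head_tail_means`, the `max_iter` guard, and the 40% head-fraction stopping rule are assumptions made for the example. It shows the three ideas from the first bullet: persist the cleaned array once, carry the mask count forward instead of recomputing it, and request the mean and the head count in a single `dask.compute()` call.

```python
import dask
import dask.array as da
import numpy as np


def head_tail_means(data, max_iter=64, head_limit=0.4):
    """Return the successive means of a head/tail breaks style loop.

    Hypothetical helper for illustration; not the xrspatial implementation.
    """
    flat = data.ravel()
    # Persist once so every loop pass reuses the materialized chunks
    # instead of rebuilding and re-reading the source graph.
    data_clean = da.where(da.isfinite(flat), flat, np.nan).persist()

    mask = ~da.isnan(data_clean)
    count = int(mask.sum().compute())  # single up-front reduction

    means = []
    for _ in range(max_iter):
        if count == 0:
            break
        mean = da.nanmean(da.where(mask, data_clean, np.nan))
        head_count = (data_clean > mean).sum()
        # Both reductions are evaluated in one shared graph traversal.
        mean_v, head_n = dask.compute(mean, head_count)
        mean_v, head_n = float(mean_v), int(head_n)
        means.append(mean_v)
        # Assumed stopping rule: head is tiny or no longer a minority.
        if head_n <= 1 or head_n / count > head_limit:
            break
        # Rebuild the mask from the concrete mean so the graph stays
        # shallow, and carry the count forward instead of recomputing it.
        mask = data_clean > mean_v
        count = head_n
    return means
```

Because each new mean is taken over values above the previous mean, `data_clean > mean_v` selects the same elements as intersecting with the previous mask, which is why the sketch can drop the old mask on each pass.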

Motivation

Static analysis flagged three HIGH-severity patterns on the dask backends of classify:

  1. _run_dask_head_tail_breaks ran .compute() inside a while loop for the mean, new-mask count, and total-mask count — 3 full graph traversals per iteration, N+1 iterations typical.
  2. _run_box_plot(..., module=da) used boolean fancy indexing on a dask array, which triggers compute_chunk_sizes() and performs an extra full scan before da.percentile runs.
  3. _run_dask_cupy_box_plot had the same pattern plus a full map_blocks(cupy.asnumpy) over the dataset before sampling.
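
The cost behind items 2 and 3 is easy to reproduce on a toy array. The snippet below is only an illustration with made-up data: boolean indexing on a dask array produces a result whose chunk sizes are unknown until the mask has been evaluated, and `compute_chunk_sizes()` is the step that runs that evaluation.

```python
import dask.array as da
import numpy as np

# Toy data: two chunks of three values, each holding one non-finite entry.
x = da.from_array(np.array([1.0, np.nan, 3.0, np.inf, 5.0, 6.0]), chunks=3)

# Boolean indexing: the output length depends on the data, so dask
# cannot know the chunk sizes without evaluating the mask.
finite = x[da.isfinite(x)]
unknown_shape = finite.shape  # length is reported as nan

# Resolving the sizes evaluates the mask over every chunk.
finite.compute_chunk_sizes()
known_chunks = finite.chunks  # sizes are now concrete integers
```

Indexing with a precomputed integer array, as the sampler does, avoids this because the output length is known up front.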

Benchmark

head_tail_breaks dask path on a 256×256 gamma-distributed float64 array, chunks=64:

| Backend | Metric | Before | After | Ratio | Verdict |
|---|---|---|---|---|---|
| dask+numpy | wall_ms (med) | 912 | 339 | 0.37 | IMPROVED |

box_plot dask path on 512×512, chunks=128:

| Backend | Metric | After | Verdict |
|---|---|---|---|
| dask+numpy | wall_ms (med) | 57 | OK (no baseline; old path scaled with full-raster scan before percentile) |

Test plan

  • pytest xrspatial/tests/test_classify.py — 85 tests pass
  • Manual smoke: head_tail_breaks dask output has the same bin count as numpy path on the same seed
  • Manual smoke: box_plot dask output uses sampled quantiles; verify output classes match numpy path within sampling tolerance

Notes

Sample size for the box_plot dask path is capped at 200,000 elements (or the full dataset if smaller). This matches the pattern used by natural_breaks and keeps the percentile computation O(sample) rather than O(dataset).
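
A capped, seeded sampler of the kind described here might look like the sketch below. The function `capped_sample_indices`, its default seed, and the use of `numpy.random.default_rng` are assumptions for illustration; the actual `_generate_sample_indices` helper may differ in signature and in how it draws indices.

```python
import numpy as np


def capped_sample_indices(num_data, max_samples=200_000, seed=0):
    """Return sorted, reproducible flat indices, capped at max_samples.

    Hypothetical stand-in for the sampler referenced in the notes above.
    """
    num_sample = min(max_samples, num_data)
    rng = np.random.default_rng(seed)
    # Draw without replacement so no element is counted twice, then sort
    # so the gather walks the array in storage order.
    idx = rng.choice(num_data, size=num_sample, replace=False)
    return np.sort(idx)


small = capped_sample_indices(10)         # dataset smaller than the cap
large = capped_sample_indices(1_000_000)  # dataset larger than the cap
```

With a fixed seed the same indices come back on every call, so repeated classifications of the same raster stay reproducible.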

head_tail_breaks (dask) called .compute() three times per iteration of
its while-loop (mean, new-mask count, old-mask count) and rebuilt the
same data_clean graph every time. For N iterations that was 3N+1 full
graph traversals. Persist data_clean once, track the running mask count
across iterations, and fuse the mean+head-count reductions into a single
dask.compute() per iteration. Wall time drops from ~910 ms to ~340 ms
on 256x256 chunks=64.

box_plot (dask and dask+cupy) did data_clean[da.isfinite(data_clean)]
which is boolean fancy indexing on a dask array. That forces
compute_chunk_sizes, materializing a full scan just to know the output
chunk layout before percentile can run. Swap in the same seeded
_generate_sample_indices sampler that natural_breaks/quantile already
use: gather 200k indices on the dask array, compute the sample and the
global nanmax in one dask.compute() call, and take percentiles on the
finite portion of the sample in numpy.
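
The box_plot flow described in the commit message could be sketched along these lines. The function name `sampled_quartiles_and_max`, the seed, and the return layout are assumptions; how the resulting values are turned into box_plot bins is left out.

```python
import dask
import dask.array as da
import numpy as np


def sampled_quartiles_and_max(data, max_samples=200_000, seed=0):
    """Return (q1, median, q3, max) from a seeded sample of a dask array.

    Hypothetical illustration; not the xrspatial implementation.
    """
    flat = data.ravel()
    clean = da.where(da.isfinite(flat), flat, np.nan)
    n = int(flat.size)

    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(n, size=min(max_samples, n), replace=False))

    # Integer indexing has a known output length, so no chunk-size scan
    # is needed; the sample and the global max come back from one call.
    sample, max_v = dask.compute(clean[idx], da.nanmax(clean))

    # Percentiles run in numpy on the finite part of the sample only.
    finite = sample[np.isfinite(sample)]
    if finite.size == 0:
        raise ValueError("sample contains no finite values")
    q1, q2, q3 = np.percentile(finite, [25, 50, 75])
    return float(q1), float(q2), float(q3), float(max_v)
```

The maximum is taken over the full array rather than the sample, since a sample can miss the largest value.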
@github-actions github-actions bot added the performance PR touches performance-sensitive code label Apr 16, 2026
@brendancol brendancol merged commit 7fa9e04 into master Apr 16, 2026
11 checks passed