perf: optimize KMeans centroid recomputation with thread-local parallel accumulators #6370

Open
hushengquan wants to merge 1 commit into lance-format:main from hushengquan:optimize-kmeans

@hushengquan
Contributor

What

Optimizes the centroid recomputation step of KMeansAlgoFloat::to_kmeans.

Closes #6369

Changes

Replaced the old parallel-scan approach (each of P cores scans all N data points, filtering by centroid range) with a thread-local parallel accumulation pattern:

  1. par_chunks splits data into P chunks (one per rayon thread)
  2. Each thread accumulates into a private centroid buffer — zero write contention
  3. reduce merges all thread-local buffers into the final centroids
  4. Centroid normalization remains parallel over k clusters
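
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it uses `std::thread::scope` from the standard library in place of rayon's `par_chunks` and `reduce` so that it compiles without external crates, and the function name `recompute_centroids`, its parameters, and the choice to leave empty clusters at zero are all hypothetical.

```rust
use std::thread;

/// Hypothetical helper: recompute `k` centroids of dimension `dim` from
/// row-major `data`, where `membership[i]` is the cluster id of point `i`.
fn recompute_centroids(
    data: &[f32],
    membership: &[usize],
    dim: usize,
    k: usize,
    num_threads: usize,
) -> Vec<f32> {
    let n = membership.len();
    assert!(dim > 0);
    assert_eq!(data.len(), n * dim);
    let num_threads = num_threads.max(1);
    // Step 1: split the points into roughly one chunk per worker.
    let points_per_chunk = ((n + num_threads - 1) / num_threads).max(1);

    // Step 2: each worker scans only its own chunk and accumulates into a
    // private (sums, counts) buffer, so no writes are shared between threads.
    let partials: Vec<(Vec<f32>, Vec<usize>)> = thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(points_per_chunk * dim)
            .zip(membership.chunks(points_per_chunk))
            .map(|(points, ids)| {
                scope.spawn(move || {
                    let mut sums = vec![0.0f32; k * dim];
                    let mut counts = vec![0usize; k];
                    for (point, &c) in points.chunks_exact(dim).zip(ids) {
                        counts[c] += 1;
                        let row = &mut sums[c * dim..(c + 1) * dim];
                        for (acc, &x) in row.iter_mut().zip(point) {
                            *acc += x;
                        }
                    }
                    (sums, counts)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Step 3: merge the private buffers; this costs O(k * dim) per worker.
    let mut sums = vec![0.0f32; k * dim];
    let mut counts = vec![0usize; k];
    for (part_sums, part_counts) in &partials {
        for (a, b) in sums.iter_mut().zip(part_sums) {
            *a += *b;
        }
        for (a, b) in counts.iter_mut().zip(part_counts) {
            *a += *b;
        }
    }

    // Step 4: normalize each centroid by its point count. This sketch does it
    // sequentially and leaves empty clusters at zero (an assumption).
    for (centroid, &count) in sums.chunks_exact_mut(dim).zip(&counts) {
        if count > 0 {
            for v in centroid {
                *v /= count as f32;
            }
        }
    }
    sums
}
```

Calling `recompute_centroids(&data, &membership, dim, k, p)` returns a flat `k * dim` vector of centroids. Each point is read once regardless of the thread count, which is the property the steps above rely on.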

Complexity comparison

| | Old | New |
|---|---|---|
| Data reads | O(N × P) | O(N) |
| Merge overhead | | O(k × dim × P) |
| Write contention | None (disjoint slices) | None (private buffers) |
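
To make the trade-off in the comparison concrete, the asymptotic terms can be evaluated for specific sizes. The snippet below is only a rough cost model with constant factors dropped: the `Cost` struct and `cost` function are hypothetical names, and the sizes in the usage note are illustrative, not the benchmark configurations.

```rust
/// Hypothetical cost model: the asymptotic terms from the comparison,
/// evaluated with all constant factors dropped.
#[derive(Debug, PartialEq)]
struct Cost {
    old_reads: u64, // old approach: every one of P cores scans all N points
    new_reads: u64, // new approach: each point is read once
    new_merge: u64, // new approach: P private buffers of k * dim values merged
}

fn cost(n: u64, k: u64, dim: u64, p: u64) -> Cost {
    Cost {
        old_reads: n * p,
        new_reads: n,
        new_merge: k * dim * p,
    }
}
```

For example, `cost(1_000, 10, 4, 8)` gives 8,000 reads for the old approach against 1,000 reads plus 320 merge additions for the new one. The merge term grows with k × dim × P and is independent of N.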

Benchmarks (release, Apple M-series)

| Config | Old | New | Speedup |
|---|---|---|---|
| N=131K, k=256, dim=128 | 3.53 ms | 0.48 ms | 7.4x |
| N=524K, k=1024, dim=256 | 9.98 ms | 3.88 ms | 2.6x |
| N=2M, k=4096, dim=256 | 45.95 ms | 24.87 ms | 1.9x |

@codecov

codecov bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Development

Successfully merging this pull request may close these issues.

KMeans to_kmeans centroid recomputation is suboptimal — each core redundantly scans all data