make sort32 fast by 39ali · Pull Request #327 · sparkjsdev/spark

39ali · 2026-04-29T08:55:21Z

This tries to improve the performance of sort32. On average it's 30-40% faster.

Things that changed:

Pass 2 no longer re-reads keys[]. Scratch now stores a packed u64 of (inverted_key << 32 | original_index), and pass 2 reads the high 16 bits directly from scratch with kv >> 48, making it a sequential scan
histogram and scatter are now branchless to help LLVM vectorize the loop
manually unrolled histogram and both scatter passes to 8-wide
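
The packed layout described in the first item can be sketched as follows. This is a minimal illustration, not the PR's code: the helper names `pack`, `high_bucket`, and `original_index` are hypothetical, and only the bit layout (key in the high 32 bits, index in the low 32 bits, `kv >> 48` for the pass-2 bucket) comes from the description above.

```rust
/// Pack an inverted sort key and the splat's original index into one u64.
/// The key occupies the high 32 bits, the index the low 32 bits.
#[inline]
fn pack(inverted_key: u32, index: u32) -> u64 {
    ((inverted_key as u64) << 32) | index as u64
}

/// High 16 bits of the key: the bucket used by the second radix pass.
/// Shifting by 48 leaves exactly 16 bits, so the result is at most 0xFFFF.
#[inline]
fn high_bucket(kv: u64) -> usize {
    (kv >> 48) as usize
}

/// Original index, recovered by truncating to the low 32 bits.
#[inline]
fn original_index(kv: u64) -> u32 {
    kv as u32
}
```

Because each scratch entry carries its own key bits, pass 2 can walk scratch front to back instead of looking up keys[index] for every entry.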

mrxz · 2026-04-29T14:26:17Z

-/// Two‑pass radix sort (base 2¹⁶) of 32‑bit float bit‑patterns,
-/// descending order (largest keys first). Mirrors the JS `sort32Splats`.
+#[inline(always)]
+unsafe fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {


Is there a specific reason this is marked unsafe? It compiles just fine without.

I ran many experiments with SIMD, which didn't make it even marginally faster, so I removed it for simplicity's sake but forgot to remove the unsafe. Will clean it up.
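
For reference, an exclusive prefix sum with the signature shown in the diff needs no `unsafe` at all. The body below is a sketch written for illustration, not the PR's implementation, and the meaning of the return value (the grand total of all counts) is an assumption.

```rust
/// Replace each bucket count with the sum of all counts before it.
/// Returns the total of all counts (assumed meaning of the return value).
fn prefix_sum_exclusive(buckets: &mut [u32]) -> u32 {
    let mut running = 0u32;
    for slot in buckets.iter_mut() {
        let count = *slot;
        *slot = running; // offset where this bucket starts
        running += count;
    }
    running
}
```

After the call, `buckets[b]` holds the output position of the first element of bucket `b`.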

mrxz · 2026-04-29T14:50:14Z

Awesome work, gave it a try and can confirm that it improves sorting performance. In my limited testing I saw ~20% reduction in sorting time (~25% faster).

manually unrolled histogram and both scatter passes to 8-wide

Without this change the performance gain seems to be roughly the same, or at least I didn't observe any significant difference. The majority of the benefit seems to come from making it branchless.

39ali · 2026-04-29T17:36:25Z

@mrxz I squeezed out a bit more performance (up to about 1 ms) by removing more branches from hot loops. What you noticed seems about right: it will differ from one wasm engine to another and from one architecture to another (especially with cache sizes), so it's hard to give a solid number, but it'll still be a bump in performance.

dmarcos · 2026-05-14T21:39:16Z

Can you remove the changes in the dist directory?

asundqui · 2026-05-15T01:01:30Z

@39ali great work! This looks like a cool win indeed, thanks for the work. Could you remove the build in the dist/ folder from your branch? Then we can merge it in.

mrxz · 2026-05-15T10:07:30Z

Could the macros used for unrolling the loops also be used for the body of remainder loops? Both should be identical, so if we could avoid the duplication we avoid the risk of it ever getting out of sync.
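
One way the suggestion could look, as a sketch only: the function name `histogram_low16` and the macro name `tally` are hypothetical, and the PR's actual macros may be shaped differently. The per-element work is defined once and expanded in both the 8-wide body and the remainder loop.

```rust
/// Count keys per low-16-bit bucket. `buckets` must hold 65536 entries.
fn histogram_low16(keys: &[u32], buckets: &mut [u32]) {
    assert_eq!(buckets.len(), 1 << 16);

    // Single definition of the per-element work, shared by both loops below.
    macro_rules! tally {
        ($key:expr) => {
            buckets[($key & 0xFFFF) as usize] += 1
        };
    }

    // Main loop: eight elements per iteration.
    let mut chunks = keys.chunks_exact(8);
    for c in &mut chunks {
        tally!(c[0]);
        tally!(c[1]);
        tally!(c[2]);
        tally!(c[3]);
        tally!(c[4]);
        tally!(c[5]);
        tally!(c[6]);
        tally!(c[7]);
    }

    // Remainder loop: the same macro, so the two bodies cannot drift apart.
    for &key in chunks.remainder() {
        tally!(key);
    }
}
```

With this shape, any change to the bucket computation is made in one place and applies to both loops.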

dmarcos · 2026-05-15T22:04:05Z

@39ali any chance you could implement @mrxz's suggestions? Thanks so much for the contribution

39ali · 2026-05-26T16:07:13Z

I'll implement the changes

39ali · 2026-06-02T05:26:43Z

@mrxz @dmarcos @asundqui done!

asundqui · 2026-06-06T01:06:50Z

@39ali really great work here! I've done some benchmarking on your method, and I'm actually getting 2x - 4x speedups in sorting from this. I'm truly shocked that this was possible! This will have a great impact on Spark's sorting performance. On a 10M splat scene on my M3 it goes from 250 ms to 60 ms or so. It's possible that the speedup is not as great on other environments, such as @mrxz was reporting 25% speedups on his system.

I went through and carefully separated the optimizations and measured them:

I found that approx 65% of the gain could be explained by storing (key, index) as a packed u64 in scratch, which turned the next loop from a random gather into a sequential read. So much better cache performance as a result.
The next 20% came from doing unsafe { unchecked... } array accesses where the compiler couldn't be sure that it would always be in bounds, so it has to check it every iteration. I went through the logic and it looks like it should always be in bounds.
The final 15% or so came from unrolling the loops. I had thought the unrolling couldn't possibly help because of branch prediction + cpu instruction reordering but it all helps!
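
The first two points can be illustrated with a single radix pass over the packed scratch buffer. This is a sketch under assumptions, not the merged code: the function name `order_by_high16` is hypothetical, and it builds its own histogram so that the safety argument for the unchecked accesses is self-contained.

```rust
/// Order packed (inverted_key << 32 | index) entries by the high 16 bits of
/// the key and return the original indices in that order (stable).
fn order_by_high16(scratch: &[u64]) -> Vec<u32> {
    // Counts and offsets are u32, so the input length must fit in a u32.
    assert!(scratch.len() <= u32::MAX as usize);

    // Histogram: a sequential read of scratch, no gather from a key array.
    let mut offsets = vec![0u32; 1 << 16];
    for &kv in scratch {
        offsets[(kv >> 48) as usize] += 1;
    }

    // Exclusive prefix sum turns counts into starting offsets.
    let mut running = 0u32;
    for slot in offsets.iter_mut() {
        let count = *slot;
        *slot = running;
        running += count;
    }

    let mut out = vec![0u32; scratch.len()];
    for &kv in scratch {
        let bucket = (kv >> 48) as usize;
        // SAFETY: `bucket` has only 16 significant bits, so it is below
        // offsets.len() == 65536. Each offset starts at the number of entries
        // in earlier buckets and is bumped once per entry of its own bucket,
        // so it is always below scratch.len() == out.len() when used.
        unsafe {
            let slot = offsets.get_unchecked_mut(bucket);
            *out.get_unchecked_mut(*slot as usize) = kv as u32;
            *slot += 1;
        }
    }
    out
}
```

The write into `out` is the access the compiler cannot prove in bounds on its own, since the index comes from a value computed at run time. The invariant in the comment is what makes the unchecked access sound.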

It did seem like there was one error though: the second branchless loop seems problematic... I think writing to the array and only advancing the pointer if it's "valid" could overwrite things. So I removed it. I don't think it does very much for the performance anyway.

Finally I reverted some unnecessary changes to make it closer to @mrxz 's original formulation. I think we should merge this in @dmarcos , @mrxz ! WDYT? This should really help with #225 .

Interestingly, because the sorting is so much faster, it sort of exposes the next bottleneck more: uploading the ordering frequently to the GPU can cause stuttering sometimes when the counts get large. Now this happens more often!

mrxz · 2026-06-08T14:42:18Z

On a 10M splat scene on my M3 it goes from 250 ms to 60 ms or so. It's possible that the speedup is not as great on other environments, such as @mrxz was reporting 25% speedups on his system.

There is bound to be some variability between setups, but that's quite the discrepancy between the measurements. My guess would be that besides the gains of being branchless and more cache friendly it somehow avoids a slow-path on the M3? Regardless, it's a net positive, even the comparatively modest speed-up I've measured is a very welcome improvement.

It did seem like there was one error though: the second branchless loop seems problematic... I think writing to the array and only advancing the pointer if it's "valid" could overwrite things. So I removed it. I don't think it does very much for the performance anyway.

The overwriting was intentional AFAICT. Since the 'invalid' entries aren't tallied you don't want to count them when scattering either. The corollary is that by only advancing after writing a valid entry, you guarantee that the final write to the array at a given position is a valid entry. The only way this goes wrong is once a bucket has been fully scattered, as then it'll point into the start of the next bucket. The inverted Infinity value, however, has 0xFFFF as its low bytes, always placing it in the final bucket which can safely overflow up to max_splats.
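
The argument above can be sketched as a single low-16-bit pass. This is one reading of the discussion, not the code that was removed from the PR: the function name `scatter_low16_branchless` is hypothetical, and the sentinel is assumed to be the bitwise-inverted bit pattern of +Infinity.

```rust
/// Assumed sentinel for skipped splats: bitwise-inverted bits of +Infinity.
/// Its value is 0x807F_FFFF, so its low 16 bits are 0xFFFF (the last bucket).
const INVALID: u32 = !0x7F80_0000;

/// Scatter keys by their low 16 bits without branching on validity.
/// Returns only the valid entries, packed as (key << 32 | index).
fn scatter_low16_branchless(keys: &[u32]) -> Vec<u64> {
    assert!(keys.len() <= u32::MAX as usize);

    // Histogram: invalid entries add 0, so they are never tallied.
    let mut offsets = vec![0u32; 1 << 16];
    for &key in keys {
        offsets[(key & 0xFFFF) as usize] += (key != INVALID) as u32;
    }

    // Exclusive prefix sum; `valid` ends up as the number of valid entries.
    let mut valid = 0u32;
    for slot in offsets.iter_mut() {
        let count = *slot;
        *slot = valid;
        valid += count;
    }

    // Scatter: always write, advance only for valid entries. An invalid
    // entry lands in the last bucket's next free slot and is either
    // overwritten by a later valid entry or left in the tail past `valid`.
    let mut scratch = vec![0u64; keys.len()];
    for (index, &key) in keys.iter().enumerate() {
        let slot = &mut offsets[(key & 0xFFFF) as usize];
        scratch[*slot as usize] = ((key as u64) << 32) | index as u64;
        *slot += (key != INVALID) as u32;
    }

    scratch.truncate(valid as usize);
    scratch
}
```

The scheme relies on invalid entries only ever mapping to the final bucket. If a sentinel could fall into an earlier bucket that is already full, the unconditional write would clobber the first slot of the following bucket, which is the failure mode described above.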

That said, I do think the explicit check is more readable and some quick testing doesn't show a meaningful performance difference on my end between the two.

I think we should merge this in @dmarcos , @mrxz ! WDYT? This should really help with #225 .

I don't see any reason not to merge this.

Regarding #225, it will most definitely reduce the time a stale ordering is visible on screen. Whether or not this will be enough is the question. Similar to the pesky FOUC in web design, no matter how short it is, it will remain an "issue". Not sure what mkkellogg did differently, but at the very least it degraded more gracefully. It does have logic to detect large camera changes and queue partial sorts, though even before a sort update arrives, the rendered frame somehow looks less bad (subjectively).

dmarcos · 2026-06-08T17:28:30Z

Thanks everyone. Good stuff!

mrxz reviewed Apr 29, 2026

39ali force-pushed the sort32-fast branch 2 times, most recently from 8c1efb4 to 77253d1 Compare April 29, 2026 17:42

mrxz mentioned this pull request May 22, 2026

There is a noticeable difference in speed: Spark's orbitControl is much slower than MKK's GaussianSplats3D. #225

Open

make sort32 faster

34fca05

39ali force-pushed the sort32-fast branch from 77253d1 to 34fca05 Compare June 2, 2026 05:18

39ali added 2 commits June 2, 2026 08:18

remove more branches in wasm hot loops

d76eba7

cleanup

a72213b

Minimize changes on branch to original sort.rs while retaining all …

1d2fff0

…the essential optimizations. Removed second branchless optimization. Added comments on why `unsafe` accesses are okay.

dmarcos merged commit 6d24120 into sparkjsdev:main Jun 8, 2026
2 checks passed

mrxz mentioned this pull request Jun 10, 2026

splats display wrong side when moving out of screen #368

Open

4 participants
