cuda: ~20% improvement to GB10 prefill throughput #110

Open

cv wants to merge 3 commits into antirez:main from cv:cuda-gb10-prefill-optimizations

Conversation


@cv cv commented May 12, 2026

Summary

This PR improves CUDA prefill throughput on NVIDIA GB10 / DGX Spark for the DeepSeek V4 Flash q2 imatrix model.

Changes:

  • allocate CUDA graph/session tensors with cudaMalloc() instead of managed memory
  • add a backend ds4_gpu_tensor_fill_f32() hook so CUDA can initialize tensors device-side while Metal keeps existing host-visible behavior
  • use cudaMemsetAsync() for compressor state_kv zero fills
  • make the slower MoE down block16 subvariant opt-in via DS4_CUDA_MOE_DOWN_BLOCK16; the faster non-block16 tile16 path is now the default
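The block16 opt-in in the last item boils down to an environment-variable gate around the kernel dispatch. A minimal sketch of that gate (the helper name is hypothetical; the actual ds4 dispatch code differs):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch: pick the MoE down-projection subvariant.
 * The faster non-block16 tile16 path is the default; the block16
 * subvariant is only used when DS4_CUDA_MOE_DOWN_BLOCK16 is set
 * to a non-empty value in the environment. */
static bool moe_down_use_block16(void) {
    const char *v = getenv("DS4_CUDA_MOE_DOWN_BLOCK16");
    return v != NULL && v[0] != '\0';
}
```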

Results

Benchmark command:

```shell
./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 2048 \
  --step-incr 2048 \
  --gen-tokens 128
```

Fresh runs on DGX Spark / GB10 (CUDA_ARCH=sm_121), comparing this branch against current origin/main:

| branch      | prefill t/s | generation t/s |
|-------------|-------------|----------------|
| this PR     | 405.89      | 13.98          |
| this PR     | 404.00      | 13.98          |
| origin/main | 328.95      | 13.78          |
| origin/main | 329.70      | 13.87          |

That is roughly a 23% prefill improvement on this workload, with generation throughput essentially unchanged.

A longer-context spot check at 8192 tokens also improved:

| branch      | prefill t/s | generation t/s |
|-------------|-------------|----------------|
| this PR     | 369.90      | 13.58          |
| origin/main | 314.22      | 13.32          |

Correctness / validation

Built the branch and ran the CUDA smoke regression:

```shell
make clean
make -s CUDA_ARCH=sm_121
DS4_CUDA_TOPK_REGRESSION_SEC=10 make -s CUDA_ARCH=sm_121 cuda-regression
```

Output:

```
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.004s
cuda long-context regression: OK
```

I also compared a short greedy CUDA logprob dump against origin/main; selected token IDs and top selected logits matched exactly for the tested prompt.

cv added 3 commits May 12, 2026 16:24
CUDA graph/session tensors are only accessed through backend read/write/copy/fill APIs. Allocate them with cudaMalloc instead of managed memory and add a backend fill_f32 hook so CUDA can initialize tensors device-side while Metal preserves its existing host-visible behavior.
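The allocation change in this commit can be sketched roughly as follows (function and buffer names are illustrative, not the actual ds4 code):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: device-only allocation for tensors that are
// never dereferenced from the host, only reached through the backend
// read/write/copy/fill APIs. cudaMallocManaged returns host-visible
// unified memory, which can incur page migrations on GB10;
// cudaMalloc keeps the buffer purely device-resident.
static cudaError_t alloc_tensor_f32(float **out, size_t n_floats) {
    // Before: cudaMallocManaged((void **)out, n_floats * sizeof(float));
    return cudaMalloc((void **)out, n_floats * sizeof(float));
}
```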
Zero compressor state_kv buffers with cudaMemsetAsync instead of launching the generic fill_f32 kernel. The default stream preserves ordering while avoiding a small fill kernel in prefill/replay state setup.
Keep the tile16/row2048 MoE down path enabled for batched prefill, but only use the block16 subvariant when DS4_CUDA_MOE_DOWN_BLOCK16 is set. On GB10 the non-block16 variant was consistently faster while preserving the old path as an explicit diagnostic option.
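The state_kv zeroing change above amounts to replacing a generic fill-kernel launch with a runtime memset enqueued on the same stream; a minimal sketch (buffer name and size are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: zero a device-resident state_kv buffer.
// Before: launch a generic fill_f32 kernel with value 0.0f.
// After: let the runtime do it; cudaMemsetAsync on the same stream
// preserves ordering with surrounding kernels while skipping a
// small kernel launch during prefill/replay state setup.
static cudaError_t zero_state_kv(float *d_state_kv, size_t n_floats,
                                 cudaStream_t stream) {
    // cudaMemsetAsync writes bytes; the all-zero bit pattern is
    // exactly 0.0f in IEEE-754, so this matches a fill with 0.0f
    // (this shortcut only works for zero).
    return cudaMemsetAsync(d_state_kv, 0,
                           n_floats * sizeof(float), stream);
}
```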
@cv cv changed the title cuda: improve GB10 prefill throughput cuda: ~20% improvement to GB10 prefill throughput May 13, 2026