cuda: ~20% improvement to GB10 prefill throughput #110

Open

cv wants to merge 3 commits into antirez:main from cv:cuda-gb10-prefill-optimizations

Conversation


@cv cv commented May 12, 2026

Summary

This PR improves CUDA prefill throughput on NVIDIA GB10 / DGX Spark for the DeepSeek V4 Flash q2 imatrix model.

Changes:

  • allocate CUDA graph/session tensors with cudaMalloc() instead of managed memory
  • add a backend ds4_gpu_tensor_fill_f32() hook so CUDA can initialize tensors device-side while Metal keeps existing host-visible behavior
  • use cudaMemsetAsync() for compressor state_kv zero fills
  • make the slower MoE down block16 subvariant opt-in via DS4_CUDA_MOE_DOWN_BLOCK16; the faster non-block16 tile16 path is now the default
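The block16 opt-in in the last item boils down to an environment-variable gate around the kernel dispatch. A minimal sketch of that gate (the helper name is hypothetical; the actual ds4 dispatch code differs):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch: pick the MoE down-projection subvariant.
 * The faster non-block16 tile16 path is the default; the block16
 * subvariant is only used when DS4_CUDA_MOE_DOWN_BLOCK16 is set
 * to a non-empty value in the environment. */
static bool moe_down_use_block16(void) {
    const char *v = getenv("DS4_CUDA_MOE_DOWN_BLOCK16");
    return v != NULL && v[0] != '\0';
}
```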

Results

Benchmark command:

```shell
./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 2048 \
  --step-incr 2048 \
  --gen-tokens 128
```

Fresh runs on DGX Spark / GB10 (CUDA_ARCH=sm_121), comparing this branch against current origin/main:

| branch      | prefill t/s | generation t/s |
|-------------|-------------|----------------|
| this PR     | 405.89      | 13.98          |
| this PR     | 404.00      | 13.98          |
| origin/main | 328.95      | 13.78          |
| origin/main | 329.70      | 13.87          |

That is roughly a 23% prefill improvement on this workload, with generation throughput essentially unchanged.

A longer-context spot check at 8192 tokens also improved:

| branch      | prefill t/s | generation t/s |
|-------------|-------------|----------------|
| this PR     | 369.90      | 13.58          |
| origin/main | 314.22      | 13.32          |

Correctness / validation

Built the branch and ran the CUDA smoke regression:

```shell
make clean
make -s CUDA_ARCH=sm_121
DS4_CUDA_TOPK_REGRESSION_SEC=10 make -s CUDA_ARCH=sm_121 cuda-regression
```

Output:

```
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.004s
cuda long-context regression: OK
```

I also compared a short greedy CUDA logprob dump against origin/main; selected token IDs and top selected logits matched exactly for the tested prompt.

cv added 3 commits May 12, 2026 16:24
CUDA graph/session tensors are only accessed through backend read/write/copy/fill APIs. Allocate them with cudaMalloc instead of managed memory and add a backend fill_f32 hook so CUDA can initialize tensors device-side while Metal preserves its existing host-visible behavior.
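The allocation change in this commit can be sketched roughly as follows (function and buffer names are illustrative, not the actual ds4 code):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: device-only allocation for tensors that are
// never dereferenced from the host, only reached through the backend
// read/write/copy/fill APIs. cudaMallocManaged returns host-visible
// unified memory, which can incur page migrations on GB10;
// cudaMalloc keeps the buffer purely device-resident.
static cudaError_t alloc_tensor_f32(float **out, size_t n_floats) {
    // Before: cudaMallocManaged((void **)out, n_floats * sizeof(float));
    return cudaMalloc((void **)out, n_floats * sizeof(float));
}
```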
Zero compressor state_kv buffers with cudaMemsetAsync instead of launching the generic fill_f32 kernel. The default stream preserves ordering while avoiding a small fill kernel in prefill/replay state setup.
Keep the tile16/row2048 MoE down path enabled for batched prefill, but only use the block16 subvariant when DS4_CUDA_MOE_DOWN_BLOCK16 is set. On GB10 the non-block16 variant was consistently faster while preserving the old path as an explicit diagnostic option.
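The state_kv zeroing change above amounts to replacing a generic fill-kernel launch with a runtime memset enqueued on the same stream; a minimal sketch (buffer name and size are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: zero a device-resident state_kv buffer.
// Before: launch a generic fill_f32 kernel with value 0.0f.
// After: let the runtime do it; cudaMemsetAsync on the same stream
// preserves ordering with surrounding kernels while skipping a
// small kernel launch during prefill/replay state setup.
static cudaError_t zero_state_kv(float *d_state_kv, size_t n_floats,
                                 cudaStream_t stream) {
    // cudaMemsetAsync writes bytes; the all-zero bit pattern is
    // exactly 0.0f in IEEE-754, so this matches a fill with 0.0f
    // (this shortcut only works for zero).
    return cudaMemsetAsync(d_state_kv, 0,
                           n_floats * sizeof(float), stream);
}
```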
@cv cv changed the title cuda: improve GB10 prefill throughput cuda: ~20% improvement to GB10 prefill throughput May 13, 2026