cuda: ~20% improvement to GB10 prefill throughput#110
Open
cv wants to merge 3 commits into
Conversation
CUDA graph/session tensors are only accessed through backend read/write/copy/fill APIs. Allocate them with cudaMalloc instead of managed memory and add a backend fill_f32 hook so CUDA can initialize tensors device-side while Metal preserves its existing host-visible behavior.
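Since `cudaMalloc` memory is not host-visible, initialization has to happen on-device. A minimal sketch of what such a fill hook could look like (the function and kernel names here are illustrative; the actual backend API in this PR may differ):

```cuda
#include <cuda_runtime.h>

// Simple elementwise fill: each thread writes one float.
__global__ void fill_f32_kernel(float *dst, float value, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = value;
}

// Backend hook sketch: initialize a device-resident tensor without
// any host staging buffer or managed-memory page migration.
void ds4_gpu_tensor_fill_f32(float *dev_ptr, float value, size_t n,
                             cudaStream_t stream) {
    const int threads = 256;
    const int blocks  = (int)((n + threads - 1) / threads);
    fill_f32_kernel<<<blocks, threads, 0, stream>>>(dev_ptr, value, n);
}
```

On Metal the hook can keep writing through host-visible memory, so only the CUDA backend needs the kernel launch.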
Zero compressor state_kv buffers with cudaMemsetAsync instead of launching the generic fill_f32 kernel. The default stream preserves ordering while avoiding a small fill kernel in prefill/replay state setup.
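For all-zero fills specifically, the driver-provided memset is cheaper than a generic fill kernel, and the byte pattern `0` is also a valid all-zero f32 fill. A sketch of the substitution, assuming an illustrative buffer name and element count:

```cuda
#include <cuda_runtime.h>

// Zero a compressor state_kv buffer. Issuing on the default stream (0)
// preserves ordering with prior work on that stream, so no extra
// synchronization is needed before prefill/replay state setup.
void zero_state_kv(float *state_kv, size_t n_elems) {
    cudaMemsetAsync(state_kv, 0, n_elems * sizeof(float), /*stream=*/0);
}
```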
Keep the tile16/row2048 MoE down path enabled for batched prefill, but only use the block16 subvariant when DS4_CUDA_MOE_DOWN_BLOCK16 is set. On GB10 the non-block16 variant was consistently faster; the old path is preserved as an explicit diagnostic option.
Summary
This PR improves CUDA prefill throughput on NVIDIA GB10 / DGX Spark for the DeepSeek V4 Flash q2 imatrix model.
Changes:
- `cudaMalloc()` instead of managed memory for graph/session tensors
- `ds4_gpu_tensor_fill_f32()` hook so CUDA can initialize tensors device-side while Metal keeps existing host-visible behavior
- `cudaMemsetAsync()` for compressor `state_kv` zero fills
- `block16` subvariant opt-in via `DS4_CUDA_MOE_DOWN_BLOCK16`; the faster non-block16 tile16 path is now the default

Results
Benchmark command:
Fresh runs on DGX Spark / GB10 (`CUDA_ARCH=sm_121`), comparing this branch against current `origin/main`:

That is roughly a 23% prefill improvement on this workload, with generation roughly neutral.
A longer-context spot check at 8192 tokens also improved:
Correctness / validation
Built and ran CUDA smoke regression:
Output:
I also compared a short greedy CUDA logprob dump against `origin/main`; selected token IDs and top selected logits matched exactly for the tested prompt.