Conversation

@SageMoore (Contributor) commented Nov 9, 2025

Purpose

This PR contains two overhead-reducing optimizations to EPLB.

The first is a refactor of the balance_packing function in vllm/distributed/eplb/rebalance_algo.py to use numpy arrays instead of torch tensors. This significantly lowers the CPU overhead of an EPLB rebalance: on an 8xH100 machine running deepseek-ai/DeepSeek-V2-Lite with DP8, the overhead of rebalance_experts drops from ~80ms to ~10ms. Because this function requires transferring the EPLB statistics from GPU to CPU, the extra CPU time shows up as a GPU bubble.
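
To illustrate the kind of work balance_packing does, here is a minimal numpy sketch of greedy balanced packing (assign items to equal-size packs, heaviest first, always to the least-loaded pack with room). This is a hypothetical sketch: the real function in vllm/distributed/eplb/rebalance_algo.py has a different signature and handles the full EPLB case; the name balanced_packing_np and its shape are assumptions.

```python
import numpy as np

def balanced_packing_np(weights: np.ndarray, num_packs: int) -> np.ndarray:
    """Greedily assign items to packs, heaviest first, keeping pack
    loads balanced while giving every pack the same item count.

    Hypothetical sketch only; not the actual vLLM implementation.
    """
    num_items = weights.shape[0]
    assert num_items % num_packs == 0
    capacity = num_items // num_packs

    order = np.argsort(-weights)                    # heaviest items first
    pack_index = np.empty(num_items, dtype=np.int64)
    pack_load = np.zeros(num_packs)
    pack_count = np.zeros(num_packs, dtype=np.int64)

    for item in order:
        # among packs that still have room, pick the least-loaded one
        open_packs = np.flatnonzero(pack_count < capacity)
        chosen = open_packs[np.argmin(pack_load[open_packs])]
        pack_index[item] = chosen
        pack_load[chosen] += weights[item]
        pack_count[chosen] += 1
    return pack_index
```

Keeping this loop in numpy avoids the per-element overhead of small torch tensor ops (dispatcher and autograd bookkeeping on scalar-sized tensors), which is where the ~80ms→~10ms CPU saving plausibly comes from.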

The second is a change to rebalance_expert_weights_inplace to copy new/old_global_expert_indices to the CPU once before running the shuffle_layer loop, instead of copying them in each iteration. This cuts out about 1.5ms of GPU time per layer.
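
The pattern is simply hoisting the device-to-host transfer out of the loop. A minimal sketch, where shuffle_layer is a stand-in for the real per-layer shuffle and all names are assumptions:

```python
import torch

def rebalance_layers(old_idx: torch.Tensor, new_idx: torch.Tensor, shuffle_layer) -> None:
    """Hoist the GPU->CPU copy out of the per-layer loop.

    Before: each iteration called .cpu() on a per-layer slice, paying
    the device-to-host transfer latency once per layer.
    After: transfer the full index tensors once, then slice the CPU copies.
    Hypothetical sketch; not the actual vLLM implementation.
    """
    old_cpu = old_idx.cpu()   # single bulk D2H transfer
    new_cpu = new_idx.cpu()
    for layer in range(old_cpu.shape[0]):
        shuffle_layer(old_cpu[layer], new_cpu[layer])
```

Since the index tensors are small, one bulk copy costs about the same as one per-layer copy, so the saving scales with the number of layers.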

Together, these changes reduce the GPU time of an EPLB rebalance from 420ms to 260ms.

Test Plan

Here's the CUDA stream trace before this change when running EPLB:

[Screenshot: CUDA stream trace, before the change (2025-11-09)]

And here it is after:

[Screenshot: CUDA stream trace, after the change (2025-11-09)]

Note the total runtime difference at the top.

End to End Benchmarks

Commands

VLLM_ALL2ALL_BACKEND=deepep_low_latency g8 vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --max-num-seqs 512 --trust-remote-code --data-parallel-size 8 --enable-expert-parallel --disable-log-requests --enable-eplb --eplb-config '{"window_size":100,"step_interval":100}' 
vllm bench serve --model deepseek-ai/DeepSeek-V2-Lite --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --trust-remote-code

Before

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  41.70
Total input tokens:                      236142
Total generated tokens:                  207186
Request throughput (req/s):              23.98
Output token throughput (tok/s):         4968.59
Peak output token throughput (tok/s):    16587.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          10631.59
---------------Time to First Token----------------
Mean TTFT (ms):                          7725.04
Median TTFT (ms):                        7277.67
P99 TTFT (ms):                           11804.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          185.52
Median TPOT (ms):                        69.74
P99 TPOT (ms):                           1558.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.23
Median ITL (ms):                         32.52
P99 ITL (ms):                            551.66
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  40.62
Total input tokens:                      236142
Total generated tokens:                  208910
Request throughput (req/s):              24.62
Output token throughput (tok/s):         5143.07
Peak output token throughput (tok/s):    16984.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          10956.55
---------------Time to First Token----------------
Mean TTFT (ms):                          7726.39
Median TTFT (ms):                        7357.05
P99 TTFT (ms):                           12068.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          185.49
Median TPOT (ms):                        65.78
P99 TPOT (ms):                           1555.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.80
Median ITL (ms):                         32.46
P99 ITL (ms):                            292.02
==================================================

This PR gives a ~45% reduction in P99 ITL (551.66ms → 292.02ms), which represents the worst case for EPLB overhead.

Test Result

Server command

 VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-eplb --eplb-config '{"window_size":100,"step_interval":100}' 

lm-eval result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3633|±  |0.0278|
|     |       |strict-match    |     5|exact_match|↑  |0.3600|±  |0.0278|

Signed-off-by: Sage Moore <sage@neuralmagic.com>
@SageMoore SageMoore changed the title Refactor the EPLB packing function to use numpy arrays instead of torch tensors [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB Nov 10, 2025
@SageMoore SageMoore marked this pull request as ready for review November 10, 2025 17:00
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed eplb labels Nov 10, 2025
@abmfy (Member) left a comment:
LGTM, thanks for the contribution!

@abmfy (Member) commented Nov 11, 2025

Just curious, why does using NumPy arrays make this faster?

@heheda12345 heheda12345 merged commit 798c7be into vllm-project:main Nov 11, 2025
69 checks passed
@SageMoore SageMoore deleted the sage/eplb-fixes branch November 11, 2025 14:57
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
…nsfers in EPLB (vllm-project#28369)

Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…nsfers in EPLB (vllm-project#28369)

Signed-off-by: Sage Moore <sage@neuralmagic.com>