Conversation

@SageMoore (Contributor) commented Nov 9, 2025

Purpose

This PR contains two overhead-reducing optimizations to EPLB.

The first is a refactor of the balance_packing function in vllm/distributed/eplb/rebalance_algo.py to use numpy arrays instead of torch tensors. This significantly lowers the CPU overhead of an EPLB rebalance: on an 8xH100 machine running deepseek-ai/DeepSeek-V2-Lite with DP8, the overhead of rebalance_experts drops from ~80ms to ~10ms. Because this function requires transferring the EPLB statistics from GPU to CPU, the extra CPU time shows up as a GPU bubble.
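
To illustrate the kind of work balance_packing does, here is a minimal numpy sketch of greedy balanced packing (assign items to equal-size packs, heaviest first, always to the least-loaded pack with room). This is a hypothetical sketch: the real function in vllm/distributed/eplb/rebalance_algo.py has a different signature and handles the full EPLB case; the name balanced_packing_np and its shape are assumptions.

```python
import numpy as np

def balanced_packing_np(weights: np.ndarray, num_packs: int) -> np.ndarray:
    """Greedily assign items to packs, heaviest first, keeping pack
    loads balanced while giving every pack the same item count.

    Hypothetical sketch only; not the actual vLLM implementation.
    """
    num_items = weights.shape[0]
    assert num_items % num_packs == 0
    capacity = num_items // num_packs

    order = np.argsort(-weights)                    # heaviest items first
    pack_index = np.empty(num_items, dtype=np.int64)
    pack_load = np.zeros(num_packs)
    pack_count = np.zeros(num_packs, dtype=np.int64)

    for item in order:
        # among packs that still have room, pick the least-loaded one
        open_packs = np.flatnonzero(pack_count < capacity)
        chosen = open_packs[np.argmin(pack_load[open_packs])]
        pack_index[item] = chosen
        pack_load[chosen] += weights[item]
        pack_count[chosen] += 1
    return pack_index
```

Keeping this loop in numpy avoids the per-element overhead of small torch tensor ops (dispatcher and autograd bookkeeping on scalar-sized tensors), which is where the ~80ms→~10ms CPU saving plausibly comes from.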

The second is a change to rebalance_expert_weights_inplace to copy new/old_global_expert_indices to the CPU once before running the shuffle_layer loop, instead of copying them in each iteration. This cuts out about 1.5ms of GPU time per layer.
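
The pattern is simply hoisting the device-to-host transfer out of the loop. A minimal sketch, where shuffle_layer is a stand-in for the real per-layer shuffle and all names are assumptions:

```python
import torch

def rebalance_layers(old_idx: torch.Tensor, new_idx: torch.Tensor, shuffle_layer) -> None:
    """Hoist the GPU->CPU copy out of the per-layer loop.

    Before: each iteration called .cpu() on a per-layer slice, paying
    the device-to-host transfer latency once per layer.
    After: transfer the full index tensors once, then slice the CPU copies.
    Hypothetical sketch; not the actual vLLM implementation.
    """
    old_cpu = old_idx.cpu()   # single bulk D2H transfer
    new_cpu = new_idx.cpu()
    for layer in range(old_cpu.shape[0]):
        shuffle_layer(old_cpu[layer], new_cpu[layer])
```

Since the index tensors are small, one bulk copy costs about the same as one per-layer copy, so the saving scales with the number of layers.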

Together, these changes reduce the GPU time of an EPLB rebalance from 420ms to 260ms.

Test Plan

Here's the CUDA stream trace before this change when running EPLB:

[Screenshot: CUDA stream trace, before the change (2025-11-09)]

And here it is after:

[Screenshot: CUDA stream trace, after the change (2025-11-09)]

Note the total runtime difference at the top.

End to End Benchmarks

Commands

VLLM_ALL2ALL_BACKEND=deepep_low_latency g8 vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --max-num-seqs 512 --trust-remote-code --data-parallel-size 8 --enable-expert-parallel --disable-log-requests --enable-eplb --eplb-config '{"window_size":100,"step_interval":100}' 
vllm bench serve --model deepseek-ai/DeepSeek-V2-Lite --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --trust-remote-code

Before

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  41.70
Total input tokens:                      236142
Total generated tokens:                  207186
Request throughput (req/s):              23.98
Output token throughput (tok/s):         4968.59
Peak output token throughput (tok/s):    16587.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          10631.59
---------------Time to First Token----------------
Mean TTFT (ms):                          7725.04
Median TTFT (ms):                        7277.67
P99 TTFT (ms):                           11804.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          185.52
Median TPOT (ms):                        69.74
P99 TPOT (ms):                           1558.07
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.23
Median ITL (ms):                         32.52
P99 ITL (ms):                            551.66
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  40.62
Total input tokens:                      236142
Total generated tokens:                  208910
Request throughput (req/s):              24.62
Output token throughput (tok/s):         5143.07
Peak output token throughput (tok/s):    16984.00
Peak concurrent requests:                1000.00
Total Token throughput (tok/s):          10956.55
---------------Time to First Token----------------
Mean TTFT (ms):                          7726.39
Median TTFT (ms):                        7357.05
P99 TTFT (ms):                           12068.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          185.49
Median TPOT (ms):                        65.78
P99 TPOT (ms):                           1555.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.80
Median ITL (ms):                         32.46
P99 ITL (ms):                            292.02
==================================================

This PR gives a ~45% reduction in P99 ITL (551.66ms → 292.02ms), which represents the worst case for EPLB overhead.

Test Result

Server command

 VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --trust-remote-code --data-parallel-size 2 --enable-expert-parallel --enable-eplb --eplb-config '{"window_size":100,"step_interval":100}' 

lm-eval result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3633|±  |0.0278|
|     |       |strict-match    |     5|exact_match|↑  |0.3600|±  |0.0278|

Signed-off-by: Sage Moore <sage@neuralmagic.com>
@SageMoore SageMoore changed the title Refactor the EPLB packing function to use numpy arrays instead of torch tensors [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB Nov 10, 2025
@SageMoore SageMoore marked this pull request as ready for review November 10, 2025 17:00
@mgoin mgoin added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed eplb labels Nov 10, 2025
@abmfy (Member) left a comment:
LGTM, thanks for the contribution!

@abmfy (Member) commented Nov 11, 2025

Just curious, why does using NumPy arrays make this faster?

@heheda12345 heheda12345 merged commit 798c7be into vllm-project:main Nov 11, 2025
69 checks passed
@SageMoore SageMoore deleted the sage/eplb-fixes branch November 11, 2025 14:57
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
…nsfers in EPLB (vllm-project#28369)

Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…nsfers in EPLB (vllm-project#28369)

Signed-off-by: Sage Moore <sage@neuralmagic.com>