[EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB #28369
Purpose
This PR contains two overhead-reducing optimizations to EPLB.

The first is a refactor of the `balance_packing` function in `vllm/distributed/eplb/rebalance_algo.py` to use NumPy arrays instead of torch tensors. This significantly lowers the CPU overhead of an EPLB rebalance: on an 8xH100 machine running `deepseek-ai/DeepSeek-V2-Lite` with DP8, the overhead of `rebalance_experts` drops from ~80ms to ~10ms. Because this function requires transferring the EPLB statistics from GPU to CPU, that CPU-side overhead shows up as a GPU bubble.

The second is a change to `rebalance_expert_weights_inplace` to copy `new_global_expert_indices` and `old_global_expert_indices` to CPU once before running the `shuffle_layer` loop, instead of copying them in each iteration. This cuts out about 1.5ms of GPU time per layer.

Together, these changes reduce the GPU time of an EPLB rebalance from 420ms to 260ms.
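To illustrate why NumPy helps here: the packing step is a tight scalar loop over experts, where per-element torch tensor indexing on CPU carries far more dispatch overhead than NumPy indexing. The sketch below is a minimal greedy longest-processing-time packer in the spirit of `balance_packing` (the function name, signature, and capacity rule here are illustrative, not the exact vLLM implementation):

```python
import numpy as np

def balance_packing_np(weights: np.ndarray, num_packs: int) -> np.ndarray:
    """Greedy LPT packing: assign each item (heaviest first) to the
    currently lightest pack that still has capacity.

    Returns an array mapping each item to its pack index.
    Assumes len(weights) is divisible by num_packs.
    """
    num_items = weights.shape[0]
    capacity = num_items // num_packs  # max items per pack
    order = np.argsort(-weights)      # indices of items, heaviest first
    pack_load = np.zeros(num_packs)
    pack_count = np.zeros(num_packs, dtype=np.int64)
    assignment = np.empty(num_items, dtype=np.int64)
    for item in order:
        # Among packs with remaining capacity, pick the lightest.
        open_packs = np.flatnonzero(pack_count < capacity)
        best = open_packs[np.argmin(pack_load[open_packs])]
        assignment[item] = best
        pack_load[best] += weights[item]
        pack_count[best] += 1
    return assignment
```

Because the loop body only touches small NumPy arrays and Python ints, it avoids the per-call overhead of torch's CPU tensor ops while producing the same balanced assignment.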
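The pattern behind the second change can be shown with a small stand-in for a GPU tensor that counts device-to-host transfers (the class and variable names below are illustrative; in the real code the transfer is a torch `.cpu()` copy of the expert-index tensors):

```python
class FakeDeviceTensor:
    """Stand-in for a GPU-resident tensor; counts device-to-host copies."""
    transfer_count = 0

    def __init__(self, data):
        self.data = data

    def cpu(self):
        # Each call simulates one blocking GPU-to-CPU transfer.
        FakeDeviceTensor.transfer_count += 1
        return list(self.data)

num_layers = 4
indices = FakeDeviceTensor(list(range(num_layers)))

# Before: copying inside the loop triggers one transfer per layer.
FakeDeviceTensor.transfer_count = 0
for layer in range(num_layers):
    row = indices.cpu()[layer]
per_layer_transfers = FakeDeviceTensor.transfer_count

# After: copy once up front, then index the CPU copy in the loop.
FakeDeviceTensor.transfer_count = 0
indices_cpu = indices.cpu()
for layer in range(num_layers):
    row = indices_cpu[layer]
single_transfer = FakeDeviceTensor.transfer_count
```

Hoisting the copy turns `num_layers` synchronizing transfers into one, which is where the per-layer GPU-time saving comes from.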
Test Plan
Here's the CUDA stream before this change when running EPLB:

*(profiler trace screenshot)*

And here it is after:

*(profiler trace screenshot)*

Note the total runtime difference at the top.
End to End Benchmarks
Commands
Before
After
This PR gives a ~45% speedup in P99 ITL (inter-token latency), which represents the worst case for EPLB overhead.
Test Result
Server command
lm eval result