Is your feature request related to a problem? Please describe.
I am trying to reproduce the published InferenceX DeepSeek-R1 H100 1P1D disaggregated SGLang results, especially the max-dep cases with:
- prefill: TP16 / EP1 / DP attention off / 1 worker
- decode: TP16 / EP16 / DP attention on / 1 worker
- PD disaggregation enabled
- MTP/speculative decoding enabled
- closed-loop fixed-concurrency benchmark for 1k/1k, 1k/8k, and 8k/1k
Our reproduction broadly matches the published setup, but its decode side is significantly faster than the published InferenceX numbers. This changes how the closed-loop benchmark behaves, especially for 8k/1k, where faster decode feeds new requests back into prefill more quickly and amplifies TTFT queueing.
For example, in the 1k/1k max-dep case at concurrency 64:
| Metric | InferenceX published result | Our reproduction |
| --- | --- | --- |
| Output throughput | 1564.8 tok/s | 2639.0 to 2939.8 tok/s, depending on IB/HCA exposure |
| Mean TPOT | 36.99 ms | 19.35 to 21.60 ms |
| Mean TTFT | 1662.6 ms | 1217.0 to 1383.3 ms |
In 8k/1k max-dep at concurrency 64, the faster local decode changes how in-flight requests are split between prefill and decode under closed-loop load. In our run, the decode phase is much shorter than in the published result, while TTFT is much larger. When we switch to an open-loop arrival rate close to the rate implied by the closed-loop run, TTFT comes much closer to the published value. This suggests that environment/runtime differences change decode speed, and that closed-loop feedback then amplifies the observed TTFT difference.
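The "rate implied by the closed-loop run" can be estimated with Little's law: at a fixed concurrency, the request rate is roughly the number of in-flight requests divided by the mean per-request latency. The sketch below is only an illustration. The function name is hypothetical, per-request latency is approximated as TTFT plus one TPOT per remaining output token, "1k" is assumed to mean exactly 1024 output tokens (ignoring `random_range_ratio`), and pairing the lowest reproduced TTFT with the lowest reproduced TPOT is also an assumption. It uses the 1k/1k concurrency-64 figures quoted above, since those are the only numbers given here.

```python
def implied_request_rate(concurrency: int, mean_ttft_ms: float,
                         mean_tpot_ms: float, output_tokens: int) -> float:
    """Little's law: request rate = in-flight requests / mean latency per request."""
    # Approximate end-to-end latency: time to first token, then one TPOT
    # for each of the remaining (output_tokens - 1) tokens.
    latency_s = (mean_ttft_ms + (output_tokens - 1) * mean_tpot_ms) / 1000.0
    return concurrency / latency_s


if __name__ == "__main__":
    OUTPUT_TOKENS = 1024  # assumption: "1k" output means exactly 1024 tokens
    runs = {
        # name: (mean TTFT in ms, mean TPOT in ms) at concurrency 64, 1k/1k
        "published": (1662.6, 36.99),
        "reproduction (fastest)": (1217.0, 19.35),  # assumed pairing of range ends
    }
    for name, (ttft_ms, tpot_ms) in runs.items():
        rate = implied_request_rate(64, ttft_ms, tpot_ms, OUTPUT_TOKENS)
        print(f"{name}: {rate:.2f} req/s, "
              f"about {rate * OUTPUT_TOKENS:.0f} output tok/s")
```

Under these assumptions the published run corresponds to roughly 1.6 requests per second, and the faster reproduction to nearly twice that. An open-loop comparison would drive both deployments at the same fixed arrival rate, so that faster decode no longer increases the load on prefill.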
We are not claiming this is a correctness bug. The current evidence suggests this is likely a reproducibility/artifact gap: the public benchmark result does not include enough runtime and network information to determine whether our H100 2P2D deployment is truly aligned with the original InferenceX environment.
Describe the solution you'd like
Could you publish or attach the official DeepSeek-R1 H100 2P2D artifacts for the published max-dep runs, similar to the artifacts already available for DeepSeek-V4 runs in GitHub Actions?
The most useful artifacts would be:
- Aggregated and raw benchmark outputs:
  - `agg_*.json`
  - raw `results_concurrency_*.json`, if available
- Server and worker logs:
  - frontend logs
  - prefill worker logs
  - decode worker logs
  - generated `srtctl` / Slurm commands or rendered recipes
- Per-node hardware and network inventory:
  - active IB/HCA device list, for example `ibdev2netdev`, `ibstat`, `ibv_devices`, or equivalent
  - `nvidia-smi topo -m`
  - `nvidia-smi -q` or at least GPU clocks, power limits, and MIG state
  - NCCL / UCX / NIXL related environment variables, such as `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `UCX_NET_DEVICES`, `NIXL_*`, etc.
  - whether all HCAs were exposed to the container and which interfaces were actually used by NCCL/NIXL
- Runtime version information:
  - InferenceX commit SHA
  - srt-slurm commit SHA
  - SGLang, Dynamo, CUDA, NCCL, NIXL versions
  - container image tag and digest
- Benchmark driver details:
  - exact `sa-bench` command
  - `random_range_ratio`
  - `num_prompts_mult`
  - `num_warmup_mult`
  - whether the published result is closed-loop only, and whether an open-loop comparison was run
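To make the per-node inventory request concrete, here is a minimal sketch of how it could be captured as one JSON file per node. It is not part of InferenceX or srt-slurm: the function names and output layout are hypothetical, and the only external tools it calls are the ones named in the list above, each skipped if it is not installed.

```python
import json
import os
import shutil
import socket
import subprocess

# Tools named in the inventory list; a missing tool is recorded, not fatal.
COMMANDS = {
    "ibdev2netdev": ["ibdev2netdev"],
    "ibstat": ["ibstat"],
    "ibv_devices": ["ibv_devices"],
    "nvidia_smi_topo": ["nvidia-smi", "topo", "-m"],
    "nvidia_smi_q": ["nvidia-smi", "-q"],
}
# Environment variable prefixes that affect NCCL / UCX / NIXL device selection.
ENV_PREFIXES = ("NCCL_", "UCX_", "NIXL_")


def run_command(argv, timeout_s=60):
    """Run one inventory command and return its output as a dict."""
    if shutil.which(argv[0]) is None:
        return {"available": False}
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return {"available": True, "error": str(exc)}
    return {"available": True, "returncode": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}


def collect_inventory(environ=None):
    """Collect hostname, relevant environment variables, and command output."""
    environ = os.environ if environ is None else environ
    return {
        "hostname": socket.gethostname(),
        "env": {k: v for k, v in sorted(environ.items())
                if k.startswith(ENV_PREFIXES)},
        "commands": {name: run_command(argv)
                     for name, argv in COMMANDS.items()},
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory(), indent=2))
```

For the output to be useful it would need to run inside the same container and environment as the prefill and decode workers, since HCA visibility and the NCCL/UCX/NIXL variables can differ between the host and the container.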
For reference, existing DeepSeek-V4 GitHub Actions runs already expose useful artifacts such as:
- `bmk_dsv4_...`
- `server_logs_dsv4_...`
- `gpu_metrics_dsv4_...`
Example public run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26191083562/attempts/11
That run includes artifacts with names like:
- `server_logs_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`
- `gpu_metrics_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`
- `bmk_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`
Having the analogous DeepSeek-R1 H100 2P2D artifacts would make it much easier to determine whether the decode-speed difference is due to:
- network/HCA exposure,
- NCCL/NIXL device selection,
- GPU clocks or power limits,
- SGLang/Dynamo/runtime version differences,
- CUDA graph or speculative decoding behavior,
- or a benchmark workload/closed-loop interpretation difference.
Describe alternatives you've considered
We tried varying the number of exposed IB/HCA devices locally, including 4, 6, and 8 HCA configurations. The 8-HCA setup improves throughput systematically, but 4-HCA and 6-HCA profiles are often close, so HCA count alone does not fully explain the gap.
We also profiled prefill and decode with Nsight Systems. The profiles suggest that decode effective batch size, CUDA graph usage, MTP acceptance, NCCL collective behavior, and prefill/KV-transfer pressure all matter. However, without the original run logs and network/runtime inventory, we cannot tell which differences are expected environment differences and which indicate a deployment mismatch.
Additional context
The main observation is that our reproduction can be faster on decode than the published InferenceX result. In a closed-loop benchmark, each completed request is immediately replaced by a new one, so faster decode causes new long-prefill requests to be issued sooner. This shifts more in-flight requests toward prefill and makes TTFT look worse in prefill-heavy cases. This is why we think publishing the official run artifacts would help the community reproduce and interpret the results more accurately, rather than treating this as a simple pass/fail benchmark mismatch.