[Reproducibility] Request DeepSeek-R1 H100 2P2D run artifacts and network inventory #1580

@yizyyy

Is your feature request related to a problem? Please describe.

I am trying to reproduce the published InferenceX DeepSeek-R1 H100 1P1D disaggregated SGLang results, especially the max-dep cases with:

  • prefill: TP16 / EP1 / DP attention off / 1 worker
  • decode: TP16 / EP16 / DP attention on / 1 worker
  • PD disaggregation enabled
  • MTP/speculative decoding enabled
  • closed-loop fixed-concurrency benchmark for 1k/1k, 1k/8k, and 8k/1k

Our setup is broadly aligned with the published configuration, but the decode side appears significantly faster than the published InferenceX numbers. This changes how the closed-loop benchmark behaves, especially for 8k/1k, where faster decode feeds new requests back into prefill more quickly and amplifies TTFT queueing.

For example, in the 1k/1k max-dep case at concurrency 64:

| Metric | InferenceX published result | Our reproduction |
|---|---|---|
| Output throughput | 1564.8 tok/s | 2639.0 to 2939.8 tok/s, depending on IB/HCA exposure |
| Mean TPOT | 36.99 ms | 19.35 to 21.60 ms |
| Mean TTFT | 1662.6 ms | 1217.0 to 1383.3 ms |

In 8k/1k max-dep at concurrency 64, the faster local decode changes the request distribution under closed-loop load. In our run, the local decode phase is much shorter, while TTFT becomes much larger. When we switch to an open-loop arrival rate around the rate implied by the closed-loop run, the TTFT becomes much closer to the published value. This suggests the discrepancy may be caused by environment/runtime differences that change decode speed, and then by closed-loop feedback amplifying the observed TTFT difference.
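For clarity, this is roughly how we derive the "rate implied by the closed-loop run": a minimal sketch using Little's law (request rate = in-flight requests / mean request latency). The function name is ours, and it assumes every request generates a fixed number of output tokens (1024 here), which is only an approximation when `random_range_ratio` varies the lengths.

```python
def implied_request_rate(concurrency, mean_ttft_ms, mean_tpot_ms, output_tokens):
    """Approximate steady-state request rate (req/s) of a closed-loop run.

    Little's law: rate = concurrency / mean end-to-end latency.
    Latency is modelled as TTFT plus one TPOT per remaining output token.
    """
    latency_s = (mean_ttft_ms + (output_tokens - 1) * mean_tpot_ms) / 1000.0
    return concurrency / latency_s


# 1k/1k max-dep at concurrency 64, using the mean values from the table above.
# output_tokens=1024 is an assumption, not a value taken from the published run.
published = implied_request_rate(64, 1662.6, 36.99, 1024)
ours_fast = implied_request_rate(64, 1217.0, 19.35, 1024)

print(f"published: {published:.2f} req/s")
print(f"ours:      {ours_fast:.2f} req/s")
```

Feeding the resulting rate to an open-loop driver removes the feedback path from decode speed to arrival rate, which is why the open-loop TTFT is a fairer comparison point.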

We are not claiming this is a correctness bug. The current evidence suggests this is likely a reproducibility/artifact gap: the public benchmark result does not include enough runtime and network information to determine whether our H100 2P2D deployment is truly aligned with the original InferenceX environment.

Describe the solution you'd like

Could you publish or attach the official DeepSeek-R1 H100 2P2D artifacts for the published max-dep runs, similar to the artifacts already available for DeepSeek-V4 runs in GitHub Actions?

The most useful artifacts would be:

  1. Aggregated and raw benchmark outputs:

    • agg_*.json
    • raw results_concurrency_*.json, if available
  2. Server and worker logs:

    • frontend logs
    • prefill worker logs
    • decode worker logs
    • generated srtctl / Slurm commands or rendered recipes
  3. Per-node hardware and network inventory:

    • active IB/HCA device list, for example ibdev2netdev, ibstat, ibv_devices, or equivalent
    • nvidia-smi topo -m
    • nvidia-smi -q or at least GPU clocks, power limits, and MIG state
    • NCCL / UCX / NIXL related environment variables, such as NCCL_IB_HCA, NCCL_SOCKET_IFNAME, UCX_NET_DEVICES, NIXL_*, etc.
    • whether all HCAs were exposed to the container and which interfaces were actually used by NCCL/NIXL
  4. Runtime version information:

    • InferenceX commit SHA
    • srt-slurm commit SHA
    • SGLang, Dynamo, CUDA, NCCL, NIXL versions
    • container image tag and digest
  5. Benchmark driver details:

    • exact sa-bench command
    • random_range_ratio
    • num_prompts_mult
    • num_warmup_mult
    • whether the published result is closed-loop only, and whether an open-loop comparison was run
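To make the per-node inventory request concrete, here is a sketch of the kind of collection script we have in mind. It only runs the tools named in item 3 and filters environment variables by prefix; the function names and the JSON layout are our own assumptions, not an existing InferenceX or srt-slurm feature.

```python
import json
import os
import shutil
import subprocess

# Tools named in the inventory list; any that are not installed are skipped.
INVENTORY_COMMANDS = [
    ["ibdev2netdev"],
    ["ibstat"],
    ["ibv_devices"],
    ["nvidia-smi", "topo", "-m"],
    ["nvidia-smi", "-q"],
]
ENV_PREFIXES = ("NCCL_", "UCX_", "NIXL_")


def select_env(environ, prefixes=ENV_PREFIXES):
    """Keep only the variables whose names start with one of the prefixes."""
    return {k: v for k, v in sorted(environ.items()) if k.startswith(prefixes)}


def run_command(argv, timeout_s=60):
    """Return combined stdout/stderr, or None if the tool is not on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed: {exc}"
    return result.stdout + result.stderr


def collect_inventory(environ=None):
    """Gather environment variables and command outputs for one node."""
    environ = os.environ if environ is None else environ
    return {
        "env": select_env(environ),
        "commands": {" ".join(argv): run_command(argv) for argv in INVENTORY_COMMANDS},
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory(), indent=2))
```

Running this once per node inside the worker container, and attaching the output next to the server logs, would cover most of item 3. It does not show which interfaces NCCL/NIXL actually selected at runtime; that still needs the worker logs.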

For reference, existing DeepSeek-V4 GitHub Actions runs already expose useful artifacts such as:

  • bmk_dsv4_...
  • server_logs_dsv4_...
  • gpu_metrics_dsv4_...

Example public run:

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26191083562/attempts/11

That run includes artifacts with names like:

  • server_logs_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11
  • gpu_metrics_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11
  • bmk_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11

Having the analogous DeepSeek-R1 H100 2P2D artifacts would make it much easier to determine whether the decode-speed difference is due to:

  • network/HCA exposure,
  • NCCL/NIXL device selection,
  • GPU clocks or power limits,
  • SGLang/Dynamo/runtime version differences,
  • CUDA graph or speculative decoding behavior,
  • or a benchmark workload/closed-loop interpretation difference.

Describe alternatives you've considered

We tried varying the number of exposed IB/HCA devices locally, including 4, 6, and 8 HCA configurations. The 8-HCA setup improves throughput systematically, but 4-HCA and 6-HCA profiles are often close, so HCA count alone does not fully explain the gap.

We also profiled prefill and decode with Nsight Systems. The profiles suggest that decode effective batch size, CUDA graph usage, MTP acceptance, NCCL collective behavior, and prefill/KV-transfer pressure all matter. However, without the original run logs and network/runtime inventory, we cannot tell which differences are expected environment differences and which indicate a deployment mismatch.

Additional context

The main observation is that our reproduction can be faster on decode than the published InferenceX result. In closed-loop benchmarks, this can make the system send new long-prefill requests faster, shifting more in-flight requests toward prefill and making TTFT look worse in heavy-prefill cases. This is why we think publishing the official run artifacts would help the community reproduce and interpret the results more accurately, rather than treating this as a simple pass/fail benchmark mismatch.
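A toy model illustrates the feedback effect. It is not a description of the actual SGLang or Dynamo scheduler: it assumes a single FIFO prefill server with a fixed per-request prefill time, a decode phase of fixed duration with no contention, and clients that resubmit immediately on completion. All names and numbers below are hypothetical.

```python
import heapq


def simulate_closed_loop(concurrency, prefill_s, decode_s, num_requests):
    """Return the TTFT (seconds) of each request, in service order."""
    # Time at which each client submits its next request; all start at t=0.
    ready = [0.0] * concurrency  # a list of equal values is already a valid heap
    server_free_at = 0.0
    ttfts = []
    for _ in range(num_requests):
        arrival = heapq.heappop(ready)           # earliest pending request (FIFO)
        start = max(arrival, server_free_at)     # wait if prefill is busy
        server_free_at = start + prefill_s       # first token is ready here
        ttfts.append(server_free_at - arrival)
        # The client decodes, then immediately submits its next request.
        heapq.heappush(ready, server_free_at + decode_s)
    return ttfts


def steady_state_ttft(concurrency, prefill_s, decode_s, rounds=20):
    """Mean TTFT over the last `concurrency` requests of a long run."""
    ttfts = simulate_closed_loop(concurrency, prefill_s, decode_s, concurrency * rounds)
    tail = ttfts[-concurrency:]
    return sum(tail) / len(tail)


# Hypothetical numbers: 64 clients, 0.5 s prefill, slow vs. fast decode.
for decode_s in (40.0, 20.0):
    print(f"decode={decode_s:>4} s  steady-state TTFT={steady_state_ttft(64, 0.5, decode_s):.2f} s")
```

In this simplified model, once decode is short enough that prefill becomes the bottleneck, every second saved on decode shows up as an extra second of TTFT queueing, even though prefill itself is unchanged. That is the pattern we believe we are seeing in the 8k/1k case.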
