[Reproducibility] Request DeepSeek-R1 H100 2P2D run artifacts and network inventory #1580

**Is your feature request related to a problem? Please describe.**

We are trying to reproduce the published InferenceX DeepSeek-R1 H100 2P2D disaggregated SGLang results, especially the `max-dep` cases with:

- prefill: TP16 / EP1 / DP attention off / 1 worker
- decode: TP16 / EP16 / DP attention on / 1 worker
- PD disaggregation enabled
- MTP/speculative decoding enabled
- closed-loop fixed-concurrency benchmark for 1k/1k, 1k/8k, and 8k/1k

Our setup matches the published configuration, but the decode side appears significantly faster than the published InferenceX numbers. This changes how the closed-loop benchmark behaves, especially for 8k/1k, where faster decode feeds new requests back into prefill sooner and amplifies TTFT queueing.

For example, in the 1k/1k max-dep case at concurrency 64:

| Metric | InferenceX published result | Our reproduction |
| --- | ---: | ---: |
| Output throughput | 1564.8 tok/s | 2639.0 to 2939.8 tok/s depending on IB/HCA exposure |
| Mean TPOT | 36.99 ms | 19.35 to 21.60 ms |
| Mean TTFT | 1662.6 ms | 1217.0 to 1383.3 ms |

In 8k/1k max-dep at concurrency 64, the faster local decode changes the request distribution under closed-loop load: our decode phase is much shorter than the published one, while our TTFT is much larger. When we instead drive the system open-loop at roughly the arrival rate implied by the closed-loop run, TTFT comes much closer to the published value. This suggests the discrepancy starts with environment/runtime differences that change decode speed, and that closed-loop feedback then amplifies the observed TTFT difference.
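The "rate implied by the closed-loop run" can be estimated with Little's law: in steady state, request rate is roughly concurrency divided by mean end-to-end latency. Below is a minimal sketch using the 1k/1k concurrency-64 figures from the table above, since those are the only numbers quoted in this issue; the same arithmetic applies to 8k/1k. It assumes exactly 1024 output tokens per request and approximates latency as TTFT + TPOT × (output tokens − 1). The function name and the pairing of the lowest TTFT with the lowest TPOT are our own illustrative choices, not something taken from `sa-bench`.

```python
def implied_request_rate(concurrency: int, ttft_ms: float, tpot_ms: float,
                         output_tokens: int = 1024) -> float:
    """Estimate the steady-state request rate (req/s) of a closed-loop run.

    Little's law: rate = concurrency / mean end-to-end latency.
    Latency is approximated as TTFT + TPOT * (output_tokens - 1).
    output_tokens=1024 is an assumption for the "1k" output length.
    """
    latency_s = (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000.0
    return concurrency / latency_s


# Published 1k/1k max-dep figures at concurrency 64 (from the table above).
published = implied_request_rate(64, ttft_ms=1662.6, tpot_ms=36.99)

# Fast end of our reproduction range. Pairing the lowest TTFT with the
# lowest TPOT is an illustrative assumption, not a measured pair.
reproduced = implied_request_rate(64, ttft_ms=1217.0, tpot_ms=19.35)

print(f"published:  {published:.2f} req/s")
print(f"reproduced: {reproduced:.2f} req/s")
```

Under these assumptions the implied rate comes out near 1.6 req/s for the published run and about 3.0 req/s for the fast end of our range, so the same nominal concurrency corresponds to very different offered loads.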

We are not claiming this is a correctness bug. The evidence so far points to a reproducibility/artifact gap: the public benchmark result does not include enough runtime and network information to determine whether our H100 2P2D deployment matches the original InferenceX environment.

**Describe the solution you'd like**

Could you publish or attach the official DeepSeek-R1 H100 2P2D artifacts for the published max-dep runs, similar to the artifacts already available for DeepSeek-V4 runs in GitHub Actions?

The most useful artifacts would be:

1. Aggregated and raw benchmark outputs:
   - `agg_*.json`
   - raw `results_concurrency_*.json`, if available

2. Server and worker logs:
   - frontend logs
   - prefill worker logs
   - decode worker logs
   - generated `srtctl` / Slurm commands or rendered recipes

3. Per-node hardware and network inventory:
   - active IB/HCA device list, for example `ibdev2netdev`, `ibstat`, `ibv_devices`, or equivalent
   - `nvidia-smi topo -m`
   - `nvidia-smi -q` or at least GPU clocks, power limits, and MIG state
   - NCCL/UCX/NIXL-related environment variables, such as `NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `UCX_NET_DEVICES`, and `NIXL_*`
   - whether all HCAs were exposed to the container and which interfaces were actually used by NCCL/NIXL

4. Runtime version information:
   - InferenceX commit SHA
   - srt-slurm commit SHA
   - SGLang, Dynamo, CUDA, NCCL, NIXL versions
   - container image tag and digest

5. Benchmark driver details:
   - exact `sa-bench` command
   - `random_range_ratio`
   - `num_prompts_mult`
   - `num_warmup_mult`
   - whether the published result is closed-loop only, and whether an open-loop comparison was run
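To make item 3 concrete, here is a rough sketch of a per-node collector we could run on our side and that would produce a comparable inventory on the reference cluster. It only invokes the commands named above and skips any that are not installed. The script layout, the JSON output shape, and the choice of environment-variable prefixes are our assumptions, not part of InferenceX or `srtctl`.

```python
import json
import os
import shutil
import socket
import subprocess

# Commands named in the list above; each runs only if the tool is on PATH.
COMMANDS = {
    "ibdev2netdev": ["ibdev2netdev"],
    "ibstat": ["ibstat"],
    "ibv_devices": ["ibv_devices"],
    "nvidia_smi_topo": ["nvidia-smi", "topo", "-m"],
    "nvidia_smi_q": ["nvidia-smi", "-q"],
}

# Assumption: these prefixes cover the variables that affect transport choice.
ENV_PREFIXES = ("NCCL_", "UCX_", "NIXL_")


def collect_env(environ):
    """Return only the environment variables whose names match ENV_PREFIXES."""
    return {k: v for k, v in sorted(environ.items()) if k.startswith(ENV_PREFIXES)}


def run_command(argv, timeout_s=60):
    """Run a command and return its output, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed: {exc}"
    # Keep stderr only when the command reports an error.
    return proc.stdout if proc.returncode == 0 else proc.stdout + proc.stderr


def collect_inventory(environ=os.environ):
    """Gather hostname, filtered environment, and command outputs for one node."""
    return {
        "hostname": socket.gethostname(),
        "env": collect_env(environ),
        "commands": {name: run_command(argv) for name, argv in COMMANDS.items()},
    }


if __name__ == "__main__":
    print(json.dumps(collect_inventory(), indent=2))
```

Running this inside the serving container on every node, rather than on the host, would also show which HCAs are actually visible to the workers.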

For reference, existing DeepSeek-V4 GitHub Actions runs already expose useful artifacts such as:

- `bmk_dsv4_...`
- `server_logs_dsv4_...`
- `gpu_metrics_dsv4_...`

Example public run:

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26191083562/attempts/11

That run includes artifacts with names like:

- `server_logs_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`
- `gpu_metrics_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`
- `bmk_dsv4_1k1k_fp8_vllm_tp8-ep1-dpafalse_disagg-false_spec-mtp_conc64_h200-dgxc-slurm_11`

Having the analogous DeepSeek-R1 H100 2P2D artifacts would make it much easier to determine whether the decode-speed difference is due to:

- network/HCA exposure,
- NCCL/NIXL device selection,
- GPU clocks or power limits,
- SGLang/Dynamo/runtime version differences,
- CUDA graph or speculative decoding behavior,
- or a benchmark workload/closed-loop interpretation difference.



**Describe alternatives you've considered**

We tried varying the number of exposed IB/HCA devices locally, using 4, 6, and 8 HCA configurations. The 8-HCA setup consistently improves throughput, but the 4-HCA and 6-HCA results are often close to each other, so HCA count alone does not fully explain the gap.

We also profiled prefill and decode with Nsight Systems. The profiles suggest that decode effective batch size, CUDA graph usage, MTP acceptance, NCCL collective behavior, and prefill/KV-transfer pressure all matter. However, without the original run logs and network/runtime inventory, we cannot tell which differences are expected environment differences and which indicate a deployment mismatch.

**Additional context**

The main observation is that our reproduction can be faster on decode than the published InferenceX result. In closed-loop benchmarks, this can make the system send new long-prefill requests faster, shifting more in-flight requests toward prefill and making TTFT look worse in heavy-prefill cases. This is why we think publishing the official run artifacts would help the community reproduce and interpret the results more accurately, rather than treating this as a simple pass/fail benchmark mismatch.
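The feedback effect can be illustrated with a deliberately simple toy model. This is a sketch under strong assumptions of our own: prefill is a single FIFO server handling one request at a time, decode has no contention, service times are deterministic, and every client resubmits immediately. Real prefill batching and decode batch-size effects are ignored, and the numbers are hypothetical, chosen only to show the trend.

```python
def closed_loop_ttft(clients: int, prefill_s: float, decode_s: float) -> float:
    """Steady-state TTFT (seconds) in a deterministic toy closed-loop model.

    Assumptions (ours, not measured):
      - one FIFO prefill server, prefill_s seconds per request
      - decode takes decode_s seconds per request with no contention
      - each client submits a new request as soon as its previous one ends

    Each client's cycle time is max(clients * prefill_s, prefill_s + decode_s):
    either prefill is saturated, or the client is limited by its own latency.
    TTFT is the part of the cycle not spent decoding.
    """
    cycle_s = max(clients * prefill_s, prefill_s + decode_s)
    return cycle_s - decode_s


# Hypothetical values: same prefill cost, decode twice as fast in the second case.
slow_decode = closed_loop_ttft(clients=64, prefill_s=0.5, decode_s=40.0)
fast_decode = closed_loop_ttft(clients=64, prefill_s=0.5, decode_s=20.0)

print(f"TTFT with slow decode: {slow_decode:.1f} s")
print(f"TTFT with fast decode: {fast_decode:.1f} s")
```

In this model, halving decode time pushes the prefill server into saturation, and TTFT grows from the bare prefill time to a queueing-dominated value even though nothing about prefill itself changed.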

