[AMD] Add DeepSeek-V4-Pro FP4 MI355X SGLang MTP recipe by Oseltamivir · Pull Request #1631 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-05-31T20:05:45Z

Summary

Adds the MTP speculative-decoding sibling of dsv4-fp4-mi355x-sglang, the SGLang counterpart to the vLLM MTP recipe in #1630.

Follows sgl-project/sglang#26383 ("[AMD][DSV4] DSV4 MTP graph + sparse triton attn optimizations", merged 2026-05-27, deaba74) — the SGLang analog of vllm#43385. That PR fixes the ROCm HIP-radix backend's per-step draft out_cache_loc slicing under CUDA graph (the bug behind the false-EOS / truncated-generation symptom in sgl#20404) and validates GSM8K 0.950 with MTP on.

Changes

benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh — mirrors the base MI355X SGLang recipe (compressed attention backend, DSv4 env block, deepseekv4 parsers, chat template) and adds:
- EAGLE spec flags, DP-attn-conditional: (steps 2, topk 1, draft 3) for the DP-attention path (sgl#26383's accuracy config), (3,1,4) for the TP-only low-conc path (matching dsr1_fp4_mi355x_mtp.sh).
- The two #26383 MTP env knobs (SGLANG_OPT_USE_TRITON_FUSED_MHC, SGLANG_OPT_C4_SPARSE_TOPK=512).
- --dsv4 chat encoding on the benchmark (DSv4-Pro tokenizer has no jinja template; --use-chat-template would crash).
.github/configs/amd-master.yaml — dsv4-fp4-mi355x-sglang-mtp entry mirroring the base search space (DP-attn on/off, both 1k1k and 8k1k) with spec-decoding: mtp.
perf-changelog.yaml — sweep trigger entry.
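
For reviewers, the DP-attention-conditional flag selection listed above can be sketched roughly as follows. This is a minimal illustration and not the script's actual contents: the function name `select_spec_flags` and its `true`/`false` argument are hypothetical, and only the two parameter triples come from this PR description. The flag names are SGLang's standard speculative-decoding server arguments.

```bash
#!/usr/bin/env bash
# Sketch only: pick EAGLE speculative-decoding parameters from the DP-attention setting.
# select_spec_flags and its argument convention are hypothetical names for illustration.
select_spec_flags() {
  local dp_attention="${1:-false}"
  local steps topk draft
  if [ "$dp_attention" = "true" ]; then
    # DP-attention path: (steps 2, topk 1, draft 3), the accuracy config from sgl#26383
    steps=2; topk=1; draft=3
  else
    # TP-only low-concurrency path: (3, 1, 4), matching dsr1_fp4_mi355x_mtp.sh
    steps=3; topk=1; draft=4
  fi
  printf '%s' "--speculative-algorithm EAGLE --speculative-num-steps $steps --speculative-eagle-topk $topk --speculative-num-draft-tokens $draft"
}
```

Calling `select_spec_flags true` yields the DP-attention triple, and any other argument falls back to the TP-only triple, so a missing or misspelled setting degrades to the low-concurrency configuration rather than failing.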

⚠️ Image dependency (reviewer note)

MTP correctness requires the sglang build in the image to carry #26383. The pinned image rocm/sgl-dev:rocm720-mi35x-**f96ac98**-20260527-DSv4 is the newest amd/deepseek_v4 build, but its source SHA f96ac98 is the same as the May 26 build — #26383 merged to main (deaba74), and these images track the amd/deepseek_v4 branch, so it's unconfirmed from the tag alone that the fix is present.

Mitigation: the matrix runs lm-eval on the high-concurrency points, so the first sweep validates GSM8K before any throughput number is trusted. If the eval gate shows the #20404-class regression, bump the image tag to a post-#26383 amd/deepseek_v4 build.
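
The eval gate described above amounts to a numeric threshold check on the GSM8K score before throughput results are accepted. A minimal sketch, assuming the score has already been extracted from the lm-eval output as a decimal string: the function name `gsm8k_gate` and the default threshold of 0.90 are hypothetical (the only figure given in this PR is the 0.950 that sgl#26383 reports with MTP on).

```bash
#!/usr/bin/env bash
# Sketch only: succeed when a GSM8K score meets a threshold.
# gsm8k_gate and the 0.90 default are illustrative assumptions, not repo values.
gsm8k_gate() {
  local score="$1"
  local threshold="${2:-0.90}"
  # awk handles the floating-point comparison; exit status 0 means the gate passed.
  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !((s + 0) >= (t + 0)) }'
}
```

A truncated-generation regression of the #20404 class would be expected to drag the score well below such a threshold, at which point the throughput numbers from that sweep should be discarded.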

Notes

No sweep label applied yet — add sweep-enabled / full-sweep-enabled to run benchmarks.
gfx950 caveat: #26383's CI ran on MI325 (gfx942), so MI355X is enabled via is_hip() but not upstream-validated — another reason the eval gate matters here.

Note

Medium Risk
New speculative-decoding serving recipe and sweep config with a hard dependency on a not-yet-available container image; throughput numbers are untrusted until GSM8K eval on high-conc points passes, but changes are benchmark-only (no prod auth/data paths).

Overview
Adds an MTP/EAGLE speculative-decoding benchmark path for DeepSeek-V4-Pro FP4 on MI355X SGLang, alongside the existing non-MTP dsv4-fp4-mi355x-sglang config.

A new dsv4_fp4_mi355x_sglang_mtp.sh script mirrors the base DSv4 FP4 serving stack and layers EAGLE flags with DP-attention–dependent chains (2/1/3 under DP-attn, 3/1/4 for TP-only low conc, aligned with dsr1_fp4_mi355x_mtp.sh), sgl#26383 MTP env knobs, and --dsv4 on throughput so MTP runs use DSv4 chat framing instead of raw tokens. Optional lm-eval still runs when RUN_EVAL is set on high-concurrency matrix points.

.github/configs/amd-master.yaml gains dsv4-fp4-mi355x-sglang-mtp: same 1k/1k and 8k/1k search space as the base entry (TP8, DP-attn on/off, conc ranges) with spec-decoding: mtp routing to the new script. The config is pinned to rocm/sgl-dev:...-f96ac98-20260527-DSv4 and documented as blocked until a -DSv4 image includes sgl-project/sglang#26383 (DSv4-Pro load needs deep_gemm; MTP correctness needs the ROCm MTP graph fix).

perf-changelog.yaml records the new config key for sweep triggering (draft PR #1631).

Reviewed by Cursor Bugbot for commit 19f0921.

github-actions · 2026-05-31T20:05:52Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

MTP speculative-decoding sibling of dsv4-fp4-mi355x-sglang, per sgl-project/sglang#26383 ([AMD][DSV4] DSV4 MTP graph + sparse triton attn optimizations, merged 2026-05-27), which fixes the ROCm HIP-radix MTP CUDA-graph bug and validates GSM8K 0.950 with MTP on.

- benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh: mirrors the base MI355X SGLang recipe and adds EAGLE spec flags (DP-attn (2,1,3) per the PR's accuracy config; TP-only (3,1,4)), the two #26383 MTP env knobs, and --dsv4 chat encoding on the benchmark.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-sglang-mtp entry mirroring the base search space (dp-attn on/off) with spec-decoding: mtp, pinned to the newest amd/deepseek_v4 image (SHA f96ac98).

MTP correctness depends on the image carrying #26383; the matrix lm-eval gates the first sweep (cf. sgl issue #20404).

github-actions · 2026-05-31T21:40:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26723126211
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26723126211

The -DSv4 branch image (f96ac98-20260527) predates sgl#26383 and crashes during MTP CUDA-graph capture (kv_score dtype mismatch in compress_state, canary run 26723126211). #26383 merged to sglang main, and the -DSv4 branch builds ended on 2026-05-27, so re-pin to the newest mainline ROCm nightly (v0.5.12.post1-rocm720-mi35x-20260531) where the fix lives. The base STP entry stays on f96ac98 until the nightly is confirmed to serve DSv4-Pro FP4; the matrix lm-eval gates correctness.

github-actions · 2026-06-01T00:19:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26727984372
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26727984372

…lineage The newest mainline nightly (v0.5.12.post1-rocm720-mi35x-20260531) carries sgl#26383 but omits deep_gemm, so DSv4-Pro weight load fails in _setup_fp8_wo_a_scales (run 26727984372). The -DSv4 builds bundle deep_gemm but ended at f96ac98 (2026-05-27), pre-#26383, which crashes at MTP graph capture (run 26723126211). No published image has both, so re-pin to the latest -DSv4 build (correct lineage) and park PR until a -DSv4 image carrying #26383 lands.

Oseltamivir · 2026-06-01T00:21:48Z

Parked as draft — blocked on a SGLang image

Validated the recipe against both available image lineages on MI355X; neither can run DSv4-Pro MTP today, for two different reasons:

| Image | Loads DSv4-Pro? | Has sgl#26383? | Failure |
| --- | --- | --- | --- |
| `rocm/sgl-dev:…-DSv4` (newest `f96ac98`, May 27) | ✅ bundles `deep_gemm` | ❌ pre-#26383 | MTP CUDA-graph capture crash: `kv_score` dtype mismatch in `compress_state` (run 26723126211) |
| `rocm/sgl-dev:v0.5.12.post1-…-20260531` (mainline nightly) | ❌ no `deep_gemm` | ✅ | DSv4-Pro weight load fails: `_setup_fp8_wo_a_scales → from deep_gemm import …` (run 26727984372) |

The -DSv4 builds are the only lineage that bundles deep_gemm (required for DSv4-Pro FP8 weight-scale setup), and they stopped at f96ac98 on May 27 — which predates #26383's MTP graph fix. No flag works around either failure (deep_gemm is a compiled kernel lib; the dtype crash is in image code).

Unblock: bump the image to the first rocm/sgl-dev:…-DSv4 build that carries #26383, then flip back to ready + re-add full-sweep-enabled. Recipe is pinned to the latest -DSv4 build so that's a one-line change. Removed the sweep label and converted to draft so it stops consuming MI355X nodes until then.
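
When a candidate image appears, the `deep_gemm` half of the requirement can be probed before spending a sweep on it. A minimal sketch, assuming the image exposes `python3` on its path: the function name `probe_deep_gemm` is hypothetical, and the caller supplies whatever command prefix enters the image (for example `docker run --rm <image>`). It does not check for #26383, which still needs the eval gate or an inspection of the image's source SHA.

```bash
#!/usr/bin/env bash
# Sketch only: report whether deep_gemm is importable in a given environment.
# probe_deep_gemm is a hypothetical helper; "$@" is the command prefix that
# enters the image under test, supplied by the caller.
probe_deep_gemm() {
  if "$@" python3 -c 'import deep_gemm' >/dev/null 2>&1; then
    echo "deep_gemm: present"
  else
    echo "deep_gemm: missing"
    return 1
  fi
}
```

A non-zero return here corresponds to the weight-load failure seen on the mainline nightly, so the image can be rejected without launching the server.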

claude · 2026-06-01T00:23:04Z

claude · 2026-06-01T00:25:36Z

+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/ \
+    --dsv4


🟡 WARNING: AGENTS.md line 56 requires all *_mtp.sh scripts to pass --use-chat-template to run_benchmark_serving. This script uses --dsv4 instead.

Why it matters: The documented invariant exists so reviewers can mechanically verify MTP scripts use chat-formatted inputs (which EAGLE acceptance rates depend on). Deviating — even for a good reason — means future audits that grep for --use-chat-template across *_mtp.sh files will flag this as non-compliant.

Mitigation: The justification in lines 182-187 is solid (DSv4-Pro tokenizer has no jinja chat_template, so --use-chat-template crashes; --dsv4 provides equivalent chat framing via encoding_dsv4.py). This is a genuine technical exception, not an oversight. Consider adding a one-line note to AGENTS.md documenting the DSv4 exception so future tooling/audits don't re-flag it.
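
If the DSv4 exception is documented, a mechanical audit could accept either flag. The following is a rough sketch of such a check, not existing repo tooling: the function name `audit_mtp_scripts` is hypothetical, and treating `--dsv4` as compliant is an assumption that would need to be recorded in AGENTS.md first.

```bash
#!/usr/bin/env bash
# Sketch only: flag *_mtp.sh scripts that pass neither --use-chat-template nor --dsv4.
# audit_mtp_scripts is a hypothetical helper; accepting --dsv4 is an assumed policy.
audit_mtp_scripts() {
  local dir="$1"
  local status=0
  local script
  for script in "$dir"/*_mtp.sh; do
    # Skip the unexpanded glob when no scripts match.
    [ -e "$script" ] || continue
    # -e is required because the pattern itself begins with "--".
    if ! grep -qE -e '--(use-chat-template|dsv4)([[:space:]]|$)' "$script"; then
      echo "non-compliant: $script"
      status=1
    fi
  done
  return "$status"
}
```

Running `audit_mtp_scripts benchmarks/single_node` would then list only scripts that use neither form of chat framing and return non-zero if any are found.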

Oseltamivir requested a review from a team May 31, 2026 20:05

Oseltamivir requested review from billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 31, 2026 20:05

github-project-automation Bot added this to InferenceMAX Board May 31, 2026

Oseltamivir requested a review from 1am9trash as a code owner May 31, 2026 20:05

Oseltamivir force-pushed the add-dsv4-fp4-mi355x-sglang-mtp branch from 28b8d15 to f392b74 Compare May 31, 2026 20:06

Oseltamivir added the full-sweep-enabled label May 31, 2026

claude Bot reviewed May 31, 2026

Comment thread .github/configs/amd-master.yaml

Oseltamivir removed the full-sweep-enabled label Jun 1, 2026

Oseltamivir marked this pull request as draft June 1, 2026 00:21

Oseltamivir marked this pull request as ready for review June 1, 2026 00:22

Oseltamivir marked this pull request as draft June 1, 2026 00:22

claude Bot reviewed Jun 1, 2026

Oseltamivir wants to merge 3 commits into main from add-dsv4-fp4-mi355x-sglang-mtp