[AMD] Add DeepSeek-V4-Pro FP4 MI355X SGLang MTP recipe#1631

Draft
Oseltamivir wants to merge 3 commits into main from add-dsv4-fp4-mi355x-sglang-mtp

Oseltamivir (Collaborator) commented May 31, 2026

Summary

Adds the MTP speculative-decoding sibling of dsv4-fp4-mi355x-sglang, the SGLang counterpart to the vLLM MTP recipe in #1630.

Follows sgl-project/sglang#26383 ("[AMD][DSV4] DSV4 MTP graph + sparse triton attn optimizations", merged 2026-05-27, deaba74) — the SGLang analog of vllm#43385. That PR fixes the ROCm HIP-radix backend's per-step draft out_cache_loc slicing under CUDA graph (the bug behind the false-EOS / truncated-generation symptom in sgl#20404) and validates GSM8K 0.950 with MTP on.

Changes

  • benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh — mirrors the base MI355X SGLang recipe (compressed attention backend, DSv4 env block, deepseekv4 parsers, chat template) and adds:
    • EAGLE spec flags, DP-attn-conditional: (steps 2, topk 1, draft 3) for the DP-attention path (sgl#26383's accuracy config), (3,1,4) for the TP-only low-conc path (matching dsr1_fp4_mi355x_mtp.sh).
    • The two #26383 MTP env knobs (SGLANG_OPT_USE_TRITON_FUSED_MHC, SGLANG_OPT_C4_SPARSE_TOPK=512).
    • --dsv4 chat encoding on the benchmark (DSv4-Pro tokenizer has no jinja template; --use-chat-template would crash).
  • .github/configs/amd-master.yaml: dsv4-fp4-mi355x-sglang-mtp entry mirroring the base search space (DP-attn on/off, both 1k1k and 8k1k) with spec-decoding: mtp.
  • perf-changelog.yaml — sweep trigger entry.
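
The DP-attn-conditional chain described above can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the helper names `select_spec_chain` and `build_spec_args` and the `true`/`false` toggle are hypothetical, and the `--speculative-*` flag names are the ones SGLang's server exposes for EAGLE, assumed here to be what the script passes.

```shell
# Echo "steps topk draft" for the given DP-attention setting.
# Argument is a hypothetical "true"/"false" toggle.
select_spec_chain() {
  if [ "$1" = "true" ]; then
    echo "2 1 3"   # DP-attention path (sgl#26383's accuracy config)
  else
    echo "3 1 4"   # TP-only low-concurrency path
  fi
}

# Turn the chain into server flags (flag names assumed from SGLang's CLI).
build_spec_args() {
  # Intentionally unquoted: split "steps topk draft" into $1 $2 $3.
  set -- $(select_spec_chain "$1")
  echo "--speculative-algorithm EAGLE --speculative-num-steps $1 --speculative-eagle-topk $2 --speculative-num-draft-tokens $3"
}

build_spec_args true
build_spec_args false
```

Keeping the two tuples in one function makes it easy to see at review time which path gets which chain.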

⚠️ Image dependency (reviewer note)

MTP correctness requires the sglang build in the image to carry #26383. The pinned image rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4 is the newest amd/deepseek_v4 build, but its source SHA f96ac98 is the same as the May 26 build. #26383 merged to main (deaba74), and these images track the amd/deepseek_v4 branch, so it's unconfirmed from the tag alone that the fix is present.
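
Reading the source SHA and build date out of a tag can be made explicit with a small helper. This is a sketch only: `image_tag_field` is a hypothetical name, and the dash-separated layout (ROCm version, platform, SHA, date, lineage) is inferred from the single pinned tag above rather than from any documented scheme.

```shell
# Print the Nth dash-separated field of an image tag.
# Assumed layout: <rocm>-<platform>-<sha>-<date>-<lineage>
image_tag_field() {
  # ${1##*:} drops everything up to the last ':' (the repository part).
  printf '%s\n' "${1##*:}" | cut -d- -f"$2"
}

IMAGE="rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4"
echo "source sha: $(image_tag_field "$IMAGE" 3)"
echo "build date: $(image_tag_field "$IMAGE" 4)"
```

An identical SHA across two build dates only shows that the source did not move between builds; it cannot show whether #26383 is present, which is why the eval gate is still needed.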

Mitigation: the matrix runs lm-eval on the high-concurrency points, so the first sweep validates GSM8K before any throughput number is trusted. If the eval gate shows the #20404-class regression, bump the image tag to a post-#26383 amd/deepseek_v4 build.
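
The gate logic amounts to a threshold check on the eval score before throughput results are accepted. A minimal sketch, assuming the GSM8K score has already been extracted as a decimal; `gsm8k_gate` and the 0.90 threshold are hypothetical, and 0.950 is the figure #26383 reports upstream, not a result from this PR.

```shell
# Return 0 (pass) when score >= threshold, 1 otherwise.
# awk handles the floating-point comparison that plain sh cannot.
gsm8k_gate() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

SCORE="0.950"      # upstream-reported value, used here only as sample input
THRESHOLD="0.90"   # hypothetical cutoff; pick one suited to the matrix

if gsm8k_gate "$SCORE" "$THRESHOLD"; then
  echo "eval gate passed: throughput numbers can be considered"
else
  echo "eval gate failed: suspect the #20404-class regression, bump the image"
fi
```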

Notes

  • No sweep label applied yet — add sweep-enabled / full-sweep-enabled to run benchmarks.
  • gfx950 caveat: #26383's CI ran on MI325 (gfx942), so MI355X is enabled via is_hip() but not upstream-validated — another reason the eval gate matters here.

Note

Medium Risk
New speculative-decoding serving recipe and sweep config with a hard dependency on a not-yet-available container image; throughput numbers are untrusted until GSM8K eval on high-conc points passes, but changes are benchmark-only (no prod auth/data paths).

Overview
Adds an MTP/EAGLE speculative-decoding benchmark path for DeepSeek-V4-Pro FP4 on MI355X SGLang, alongside the existing non-MTP dsv4-fp4-mi355x-sglang config.

A new dsv4_fp4_mi355x_sglang_mtp.sh script mirrors the base DSv4 FP4 serving stack and layers EAGLE flags with DP-attention–dependent chains (2/1/3 under DP-attn, 3/1/4 for TP-only low conc, aligned with dsr1_fp4_mi355x_mtp.sh), sgl#26383 MTP env knobs, and --dsv4 on throughput so MTP runs use DSv4 chat framing instead of raw tokens. Optional lm-eval still runs when RUN_EVAL is set on high-concurrency matrix points.

.github/configs/amd-master.yaml gains dsv4-fp4-mi355x-sglang-mtp: same 1k/1k and 8k/1k search space as the base entry (TP8, DP-attn on/off, conc ranges) with spec-decoding: mtp routing to the new script. The config is pinned to rocm/sgl-dev:...-f96ac98-20260527-DSv4 and documented as blocked until a -DSv4 image includes sgl-project/sglang#26383 (DSv4-Pro load needs deep_gemm; MTP correctness needs the ROCm MTP graph fix).

perf-changelog.yaml records the new config key for sweep triggering (draft PR #1631).

Reviewed by Cursor Bugbot for commit 19f0921.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

MTP speculative-decoding sibling of dsv4-fp4-mi355x-sglang, per
sgl-project/sglang#26383 ([AMD][DSV4] DSV4 MTP graph + sparse triton attn
optimizations, merged 2026-05-27), which fixes the ROCm HIP-radix MTP
CUDA-graph bug and validates GSM8K 0.950 with MTP on.

- benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh: mirrors the base
  MI355X SGLang recipe and adds EAGLE spec flags (DP-attn (2,1,3) per the
  PR's accuracy config; TP-only (3,1,4)), the two #26383 MTP env knobs, and
  --dsv4 chat encoding on the benchmark.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-sglang-mtp entry mirroring
  the base search space (dp-attn on/off) with spec-decoding: mtp, pinned to
  the newest amd/deepseek_v4 image (SHA f96ac98).

MTP correctness depends on the image carrying #26383; the matrix lm-eval
gates the first sweep (cf. sgl issue #20404).
@Oseltamivir Oseltamivir force-pushed the add-dsv4-fp4-mi355x-sglang-mtp branch from 28b8d15 to f392b74 Compare May 31, 2026 20:06
Comment thread .github/configs/amd-master.yaml
@github-actions
Contributor

The -DSv4 branch image (f96ac98-20260527) predates sgl#26383 and crashes
during MTP CUDA-graph capture (kv_score dtype mismatch in compress_state,
canary run 26723126211). #26383 merged to sglang main, and the -DSv4 branch
builds ended on 2026-05-27, so re-pin to the newest mainline ROCm nightly
(v0.5.12.post1-rocm720-mi35x-20260531) where the fix lives. The base STP entry
stays on f96ac98 until the nightly is confirmed to serve DSv4-Pro FP4; the
matrix lm-eval gates correctness.
@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

…lineage

The newest mainline nightly (v0.5.12.post1-rocm720-mi35x-20260531) carries
sgl#26383 but omits deep_gemm, so DSv4-Pro weight load fails in
_setup_fp8_wo_a_scales (run 26727984372). The -DSv4 builds bundle deep_gemm but
ended at f96ac98 (2026-05-27), pre-#26383, which crashes at MTP graph capture
(run 26723126211). No published image has both, so re-pin to the latest -DSv4
build (correct lineage) and park PR until a -DSv4 image carrying #26383 lands.
@Oseltamivir Oseltamivir marked this pull request as draft June 1, 2026 00:21
@Oseltamivir
Collaborator Author

Parked as draft — blocked on a SGLang image

Validated the recipe against both available image lineages on MI355X; neither can run DSv4-Pro MTP today, for two different reasons:

| Image | Loads DSv4-Pro? | Has sgl#26383? | Failure |
|---|---|---|---|
| rocm/sgl-dev:…-DSv4 (newest f96ac98, May 27) | ✅ bundles deep_gemm | ❌ pre-#26383 | MTP CUDA-graph capture crash: kv_score dtype mismatch in compress_state (run 26723126211) |
| rocm/sgl-dev:v0.5.12.post1-…-20260531 (mainline nightly) | ❌ no deep_gemm | ✅ | DSv4-Pro weight load fails: _setup_fp8_wo_a_scales → from deep_gemm import … (run 26727984372) |

The -DSv4 builds are the only lineage that bundles deep_gemm (required for DSv4-Pro FP8 weight-scale setup), and they stopped at f96ac98 on May 27 — which predates #26383's MTP graph fix. No flag works around either failure (deep_gemm is a compiled kernel lib; the dtype crash is in image code).

Unblock: bump the image to the first rocm/sgl-dev:…-DSv4 build that carries #26383, then flip back to ready + re-add full-sweep-enabled. Recipe is pinned to the latest -DSv4 build so that's a one-line change. Removed the sweep label and converted to draft so it stops consuming MI355X nodes until then.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 1, 2026 00:22
@Oseltamivir Oseltamivir marked this pull request as draft June 1, 2026 00:22
@claude
Contributor

claude Bot commented Jun 1, 2026

Claude finished @Oseltamivir's task in 2m 37s

Review of PR #1631

  • Read CLAUDE.md / AGENTS.md and repo guidelines
  • Read changed files and gather context
  • Validate master config and perf-changelog
  • Validate benchmark script code style
  • Validate MTP requirements
  • Post review findings

Summary

No blocking issues. The recipe is well-structured, mirrors the base dsv4-fp4-mi355x-sglang config correctly, and the image-blocking situation is thoroughly documented with the PR appropriately parked as a draft.

Previous review finding resolved: The perf-changelog.yaml entry (flagged as missing in the earlier review) is now present and correctly appended at the end of the file.

Checklist:

  • amd-master.yaml + perf-changelog.yaml both modified — ✅
  • perf-changelog.yaml entry appended at END (line 3345+) — ✅
  • Container image rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4 is publicly accessible Docker Hub format — ✅
  • Server launch command properly formatted with line continuations — ✅
  • Expert parallelism conditional on EP_SIZE — ✅
  • Model prefix dsv4 matches existing configs — ✅

One warning posted inline: --dsv4 used instead of --use-chat-template (AGENTS.md line 56 requirement). The technical justification is sound (DSv4-Pro tokenizer lacks a jinja template, so --use-chat-template would crash), but it deviates from the documented invariant that lets automated audits verify MTP scripts use chat-formatted inputs.

--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--dsv4

🟡 WARNING: AGENTS.md line 56 requires all *_mtp.sh scripts to pass --use-chat-template to run_benchmark_serving. This script uses --dsv4 instead.

Why it matters: The documented invariant exists so reviewers can mechanically verify MTP scripts use chat-formatted inputs (which EAGLE acceptance rates depend on). Deviating — even for a good reason — means future audits that grep for --use-chat-template across *_mtp.sh files will flag this as non-compliant.

Mitigation: The justification in lines 182-187 is solid (DSv4-Pro tokenizer has no jinja chat_template, so --use-chat-template crashes; --dsv4 provides equivalent chat framing via encoding_dsv4.py). This is a genuine technical exception, not an oversight. Consider adding a one-line note to AGENTS.md documenting the DSv4 exception so future tooling/audits don't re-flag it.
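
An audit that tolerates the DSv4 exception could accept either flag rather than grepping for `--use-chat-template` alone. This is a sketch under assumptions: `mtp_uses_chat_framing` is a hypothetical helper, not existing repo tooling, and it checks only for the presence of the flag text, not where in the script it appears.

```shell
# Pass (exit 0) if a script requests chat-formatted inputs via either flag.
# -e is required because the pattern itself starts with "--".
mtp_uses_chat_framing() {
  grep -Eq -e '--use-chat-template|--dsv4' "$1"
}

# Example sweep over MTP scripts; the glob mirrors the *_mtp.sh convention.
for script in benchmarks/single_node/*_mtp.sh; do
  [ -f "$script" ] || continue
  mtp_uses_chat_framing "$script" || echo "non-compliant: $script"
done
```

Encoding the exception in the check itself keeps the invariant mechanically verifiable while documenting why `--dsv4` is acceptable.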
