From f392b748852954d3d6b273cf21396d08941e01a3 Mon Sep 17 00:00:00 2001
From: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Date: Sun, 31 May 2026 13:05:34 -0700
Subject: [PATCH 1/3] [AMD] Add DeepSeek-V4-Pro FP4 MI355X SGLang MTP recipe

MTP speculative-decoding sibling of dsv4-fp4-mi355x-sglang, per
sgl-project/sglang#26383 ([AMD][DSV4] DSV4 MTP graph + sparse triton attn
optimizations, merged 2026-05-27), which fixes the ROCm HIP-radix MTP
CUDA-graph bug and validates GSM8K 0.950 with MTP on.

- benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh: mirrors the base
  MI355X SGLang recipe and adds EAGLE spec flags (DP-attn (2,1,3) per the
  PR's accuracy config; TP-only (3,1,4)), the two #26383 MTP env knobs,
  and --dsv4 chat encoding on the benchmark.
- .github/configs/amd-master.yaml: dsv4-fp4-mi355x-sglang-mtp entry
  mirroring the base search space (dp-attn on/off) with spec-decoding: mtp,
  pinned to the newest amd/deepseek_v4 image (SHA f96ac98).

MTP correctness depends on the image carrying #26383; the matrix lm-eval
gates the first sweep (cf. sgl issue #20404).
---
 .github/configs/amd-master.yaml               |  29 +++
 .../single_node/dsv4_fp4_mi355x_sglang_mtp.sh | 206 ++++++++++++++++++
 perf-changelog.yaml                           |   6 +
 3 files changed, 241 insertions(+)
 create mode 100755 benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 0b7336fb7..78abfd3e5 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2166,6 +2166,35 @@ dsv4-fp4-mi355x-sglang:
       - { tp: 8, dp-attn: true, conc-start: 64, conc-end: 2048 }
       - { tp: 8, dp-attn: false, conc-start: 1, conc-end: 32 }
 
+# MTP variant of dsv4-fp4-mi355x-sglang. Mirrors the base search space and adds
+# spec-decoding: mtp, which routes to dsv4_fp4_mi355x_sglang_mtp.sh (EAGLE
+# speculative decoding), per sgl-project/sglang#26383 ([AMD][DSV4] DSV4 MTP
+# graph + sparse triton attn optimizations, merged 2026-05-27). That PR fixes
+# the ROCm HIP-radix MTP CUDA-graph bug (the false-EOS symptom in sgl #20404)
+# and validates GSM8K 0.950 with MTP on. Image pins the newest amd/deepseek_v4
+# build (SHA f96ac98); MTP correctness depends on the build carrying #26383, so
+# the matrix lm-eval (RUN_EVAL on the high-conc points) gates the first sweep.
+dsv4-fp4-mi355x-sglang-mtp:
+  image: rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4
+  model: deepseek-ai/DeepSeek-V4-Pro
+  model-prefix: dsv4
+  runner: mi355x
+  precision: fp4
+  framework: sglang
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 8, dp-attn: true, conc-start: 64, conc-end: 2048, spec-decoding: mtp }
+      - { tp: 8, dp-attn: false, conc-start: 1, conc-end: 32, spec-decoding: mtp }
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - { tp: 8, dp-attn: true, conc-start: 64, conc-end: 2048, spec-decoding: mtp }
+      - { tp: 8, dp-attn: false, conc-start: 1, conc-end: 32, spec-decoding: mtp }
+
 # DSv4 on MI355X via vLLM, using the official vllm/vllm-openai-rocm
 # nightly image. DSv4 base ROCm support (vllm-project/vllm#40871) merged
 # on 2026-05-05, so any nightly built after that includes the
diff --git a/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
new file mode 100755
index 000000000..87cc92d21
--- /dev/null
+++ b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
@@ -0,0 +1,206 @@
+#!/usr/bin/env bash
+
+# DeepSeek-V4-Pro on MI355X via SGLang — MTP variant of dsv4_fp4_mi355x_sglang.sh.
+# Adds EAGLE/MTP speculative decoding per sgl-project/sglang#26383
+# ([AMD][DSV4] DSV4 MTP graph + sparse triton attn optimizations, merged
+# 2026-05-27, commit deaba74), which fixes the ROCm HIP-radix backend's
+# per-step draft out_cache_loc slicing under CUDA graph (the bug behind the
+# false-EOS / truncated-generation symptom in sgl issue #20404) and validates
+# GSM8K 0.950 with MTP on. The EAGLE chain follows that PR's accuracy config
+# for the DP-attention path (steps=2, topk=1, draft=3); the TP-only
+# low-concurrency path uses the (3,1,4) chain shared with dsr1_fp4_mi355x_mtp.sh.
+#
+# IMPORTANT (image dependency): MTP correctness requires the sglang build in
+# the image to carry #26383. The amd/deepseek_v4 branch images are tagged by
+# SHA, so a build predating the fix will silently regress (cf. #20404). The
+# matrix runs lm-eval on the high-concurrency points (RUN_EVAL), so the first
+# sweep validates GSM8K before any throughput number is trusted; bump the image
+# tag in amd-master.yaml if the eval gate fails.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    DP_ATTENTION \
+    EP_SIZE \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    MAX_MODEL_LEN
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+# sglang ships in the image at the SHA encoded in the image tag (built
+# from the amd/deepseek_v4 branch in sgl-project/sglang). To bump sglang,
+# bump the image tag in .github/configs/amd-master.yaml.
+
+# Transformers in the container doesn't recognize the `deepseek_v4` model_type.
+# PR #23608's fallback in hf_transformers_utils.get_config tries to handle this
+# by writing a patched config to /tmp, but in practice isn't catching the error
+# in this image. Patch the cached config.json directly instead: set model_type
+# to `deepseek_v3` so AutoConfig.from_pretrained succeeds, and keep
+# architectures=['DeepseekV4ForCausalLM'] so SGLang dispatches to its native
+# DSv4 model class (python/sglang/srt/models/deepseek_v4.py).
+python3 << PYEOF
+import json
+from huggingface_hub import hf_hub_download
+path = hf_hub_download(repo_id="$MODEL", filename="config.json")
+with open(path) as f:
+    config = json.load(f)
+if config.get("model_type") == "deepseek_v4":
+    config["model_type"] = "deepseek_v3"
+    with open(path, "w") as f:
+        json.dump(config, f, indent=2)
+    print(f"Patched {path}: model_type deepseek_v4 -> deepseek_v3")
+else:
+    print(f"No patch needed: model_type is {config.get('model_type')!r}")
+PYEOF
+
+# DSv4 FP4-experts path. Tracks the env block in python/run_dsv4.sh on the
+# amd/deepseek_v4 branch (HEAD's active block is FP8; we override the two
+# FP4-specific flags below):
+#   SGLANG_DSV4_FP4_EXPERTS=True   -> route experts through the FP4 kernels
+#   SGLANG_FORCE_TRITON_MOE_FP8=0  -> dispatch MoE through aiter and apply
+#                                     the swiglu_limit clamp in the triton
+#                                     MoE fallback path.
+export SGLANG_REASONING_EFFORT=max
+export SGLANG_OPT_USE_FUSED_COMPRESS=true
+export SGLANG_OPT_USE_OLD_COMPRESSOR=false
+export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
+export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
+export SGLANG_OPT_USE_FUSED_HASH_TOPK=true
+export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false
+export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
+export SGLANG_OPT_USE_TILELANG_MHC_POST=false
+export SGLANG_OPT_USE_AITER_MHC_PRE=true
+export SGLANG_OPT_USE_AITER_MHC_POST=true
+export SGLANG_ENABLE_THINKING=1
+export SGLANG_USE_AITER=1
+export SGLANG_USE_ROCM700A=1
+export SGLANG_TOPK_TRANSFORM_512_TORCH=0
+export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
+export SGLANG_DSV4_FP4_EXPERTS=True
+export SGLANG_OPT_DPSK_V4_RADIX=1
+export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
+export SGLANG_OPT_USE_FUSED_STORE_CACHE=true
+export SGLANG_FORCE_TRITON_MOE_FP8=0
+export SGLANG_HACK_FLASHMLA_BACKEND=triton
+export SGLANG_OPT_USE_TILELANG_INDEXER=true
+export SGLANG_OPT_USE_TRITON_SWA_PREPARE=true
+export AITER_BF16_FP8_MOE_BOUND=0
+export SGLANG_OPT_FUSE_WQA_WKV=true
+export SGLANG_OPT_USE_FUSED_PAGED_COMPRESS=true
+export SGLANG_OPT_USE_MULTI_STREAM_OVERLAP=0
+
+# MTP-specific knobs landed alongside the graph fix in sgl#26383:
+#   SGLANG_OPT_USE_TRITON_FUSED_MHC -> fused Triton mhc_post_pre for low conc
+#                                      (defaults True in post-#26383 images;
+#                                      set explicitly so the recipe is auditable)
+#   SGLANG_OPT_C4_SPARSE_TOPK       -> sparse-attention top-k used in the PR's
+#                                      DSv4 MTP accuracy run
+export SGLANG_OPT_USE_TRITON_FUSED_MHC=1
+export SGLANG_OPT_C4_SPARSE_TOPK=512
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
+fi
+# Start GPU monitoring (power, temperature, clocks every second)
+start_gpu_monitor
+
+PARALLEL_ARGS=(
+    --tensor-parallel-size "$TP"
+)
+# EAGLE chain is selected by DP_ATTENTION. The DP-attention path mirrors the
+# sgl#26383 DSv4 ROCm accuracy config (steps=2, topk=1, draft=3); the TP-only
+# low-concurrency fallback uses the longer (3,1,4) chain that low batch sizes
+# benefit from, matching dsr1_fp4_mi355x_mtp.sh.
+SPEC_FLAGS=(
+    --speculative-algorithm EAGLE
+    --speculative-num-steps 3
+    --speculative-eagle-topk 1
+    --speculative-num-draft-tokens 4
+)
+if [ "${DP_ATTENTION}" = "true" ]; then
+    PARALLEL_ARGS+=(
+        --dp "$TP"
+        --enable-dp-attention
+        --enable-prefill-delayer
+    )
+    SPEC_FLAGS=(
+        --speculative-algorithm EAGLE
+        --speculative-num-steps 2
+        --speculative-eagle-topk 1
+        --speculative-num-draft-tokens 3
+    )
+fi
+if [ "${EP_SIZE:-1}" -gt 1 ]; then
+    PARALLEL_ARGS+=(--ep-size "$EP_SIZE")
+fi
+
+set -x
+python3 -m sglang.launch_server \
+    --model-path $MODEL \
+    --host=0.0.0.0 \
+    --port $PORT \
+    "${PARALLEL_ARGS[@]}" \
+    "${SPEC_FLAGS[@]}" \
+    --trust-remote-code \
+    --disable-radix-cache \
+    --attention-backend compressed \
+    --max-running-requests ${CONC} \
+    --mem-fraction-static 0.90 \
+    --swa-full-tokens-ratio 0.15 \
+    --page-size 256 \
+    --context-length $MAX_MODEL_LEN \
+    --chunked-prefill-size 8192 \
+    --disable-shared-experts-fusion \
+    --tool-call-parser deepseekv4 \
+    --reasoning-parser deepseek-v4 \
+    --chat-template "$(dirname "$0")/chat_templates/deepseek_v4_thinking.jinja" \
+    --watchdog-timeout 1800 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+
+SERVER_PID=$!
+
+# Wait for server to be ready
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
+
+# --dsv4 routes prompts through encoding_dsv4.py, emitting the
+# ... framing DeepSeek-V4-Pro expects. EAGLE/MTP
+# acceptance silently regresses on raw random tokens, so MTP benchmarks must
+# use chat-formatted inputs (AGENTS.md). The DSv4-Pro tokenizer ships without a
+# jinja chat_template, so plain --use-chat-template would crash; --dsv4 handles
+# the framing directly.
+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend vllm \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts "$((CONC * 10))" \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/ \
+    --dsv4
+
+# After throughput, run evaluation only if RUN_EVAL is true
+if [ "${RUN_EVAL}" = "true" ]; then
+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+# Stop GPU monitoring
+stop_gpu_monitor
+set +x
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 5b7d56cd1..822e0e4ac 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3342,3 +3342,9 @@
   description:
     - "Update vLLM ROCm image from v0.21.0 to v0.22.0"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1616
+
+- config-keys:
+    - dsv4-fp4-mi355x-sglang-mtp
+  description:
+    - "Add MTP/EAGLE speculative-decoding sibling for dsv4-fp4-mi355x-sglang (model: deepseek-ai/DeepSeek-V4-Pro) on rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4, per sgl-project/sglang#26383"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1631

From 8f10c8ff623e2e980ddb6708f45e86e5ca3d0d6d Mon Sep 17 00:00:00 2001
From: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Date: Sun, 31 May 2026 16:47:10 -0700
Subject: [PATCH 2/3] [AMD] Re-pin dsv4 mi355x sglang MTP to newest mainline
 nightly

The -DSv4 branch image (f96ac98-20260527) predates sgl#26383 and crashes
during MTP CUDA-graph capture (kv_score dtype mismatch in compress_state,
canary run 26723126211). #26383 merged to sglang main, and the -DSv4 branch
builds ended on 2026-05-27, so re-pin to the newest mainline ROCm nightly
(v0.5.12.post1-rocm720-mi35x-20260531) where the fix lives. The base STP
entry stays on f96ac98 until the nightly is confirmed to serve DSv4-Pro FP4;
the matrix lm-eval gates correctness.
---
 .github/configs/amd-master.yaml               | 18 ++++++++++++------
 .../single_node/dsv4_fp4_mi355x_sglang_mtp.sh | 13 +++++++------
 perf-changelog.yaml                           |  2 +-
 3 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index 78abfd3e5..a44076a06 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2169,13 +2169,19 @@ dsv4-fp4-mi355x-sglang:
 # MTP variant of dsv4-fp4-mi355x-sglang. Mirrors the base search space and adds
 # spec-decoding: mtp, which routes to dsv4_fp4_mi355x_sglang_mtp.sh (EAGLE
 # speculative decoding), per sgl-project/sglang#26383 ([AMD][DSV4] DSV4 MTP
-# graph + sparse triton attn optimizations, merged 2026-05-27). That PR fixes
-# the ROCm HIP-radix MTP CUDA-graph bug (the false-EOS symptom in sgl #20404)
-# and validates GSM8K 0.950 with MTP on. Image pins the newest amd/deepseek_v4
-# build (SHA f96ac98); MTP correctness depends on the build carrying #26383, so
-# the matrix lm-eval (RUN_EVAL on the high-conc points) gates the first sweep.
+# graph + sparse triton attn optimizations, merged to main 2026-05-27). That PR
+# fixes the ROCm HIP-radix MTP CUDA-graph bug (the false-EOS symptom in sgl
+# #20404) and validates GSM8K 0.950 with MTP on.
+#
+# Image: the amd/deepseek_v4 -DSv4 branch builds ended at f96ac98 (2026-05-27),
+# which predates #26383 and crashes during MTP graph capture (kv_score dtype
+# mismatch in compress_state, run 26723126211). #26383 merged to main, so this
+# pins the newest mainline ROCm nightly (v0.5.12.post1-rocm720-mi35x-20260531)
+# where the fix lives. Lineage differs from the base STP entry (still on
+# f96ac98) until the nightly is confirmed to serve DSv4-Pro FP4 cleanly; the
+# matrix lm-eval (RUN_EVAL on the high-conc points) gates correctness.
 dsv4-fp4-mi355x-sglang-mtp:
-  image: rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4
+  image: rocm/sgl-dev:v0.5.12.post1-rocm720-mi35x-20260531
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: mi355x
diff --git a/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
index 87cc92d21..73f181b7a 100755
--- a/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
+++ b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
@@ -10,12 +10,13 @@
 # for the DP-attention path (steps=2, topk=1, draft=3); the TP-only
 # low-concurrency path uses the (3,1,4) chain shared with dsr1_fp4_mi355x_mtp.sh.
 #
-# IMPORTANT (image dependency): MTP correctness requires the sglang build in
-# the image to carry #26383. The amd/deepseek_v4 branch images are tagged by
-# SHA, so a build predating the fix will silently regress (cf. #20404). The
-# matrix runs lm-eval on the high-concurrency points (RUN_EVAL), so the first
-# sweep validates GSM8K before any throughput number is trusted; bump the image
-# tag in amd-master.yaml if the eval gate fails.
+# IMPORTANT (image dependency): MTP requires the sglang build to carry #26383.
+# The amd/deepseek_v4 -DSv4 branch builds ended at f96ac98 (2026-05-27), which
+# predates the fix and hard-crashes during MTP CUDA-graph capture (kv_score
+# dtype mismatch in compress_state, run 26723126211). #26383 merged to main, so
+# amd-master.yaml pins the newest mainline ROCm nightly instead. The matrix runs
+# lm-eval on the high-concurrency points (RUN_EVAL), so the first sweep validates
+# GSM8K before any throughput number is trusted; bump the image if it regresses.

 source "$(dirname "$0")/../benchmark_lib.sh"

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 822e0e4ac..77959d2e1 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3346,5 +3346,5 @@
 - config-keys:
     - dsv4-fp4-mi355x-sglang-mtp
   description:
-    - "Add MTP/EAGLE speculative-decoding sibling for dsv4-fp4-mi355x-sglang (model: deepseek-ai/DeepSeek-V4-Pro) on rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4, per sgl-project/sglang#26383"
+    - "Add MTP/EAGLE speculative-decoding sibling for dsv4-fp4-mi355x-sglang (model: deepseek-ai/DeepSeek-V4-Pro) on rocm/sgl-dev:v0.5.12.post1-rocm720-mi35x-20260531 (mainline nightly carrying sgl#26383 DSv4 MTP graph fix), per sgl-project/sglang#26383"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1631

From 19f0921802b1c8ed135c2ee55a8b98799406d657 Mon Sep 17 00:00:00 2001
From: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Date: Sun, 31 May 2026 17:21:38 -0700
Subject: [PATCH 3/3] [AMD] Park dsv4 mi355x sglang MTP: blocked on image,
 revert to -DSv4 lineage

The newest mainline nightly (v0.5.12.post1-rocm720-mi35x-20260531) carries
sgl#26383 but omits deep_gemm, so DSv4-Pro weight load fails in
_setup_fp8_wo_a_scales (run 26727984372). The -DSv4 builds bundle deep_gemm
but ended at f96ac98 (2026-05-27), pre-#26383, which crashes at MTP graph
capture (run 26723126211). No published image has both, so re-pin to the
latest -DSv4 build (correct lineage) and park PR until a -DSv4 image
carrying #26383 lands.
---
 .github/configs/amd-master.yaml               | 17 +++++++++--------
 .../single_node/dsv4_fp4_mi355x_sglang_mtp.sh | 16 +++++++++-------
 perf-changelog.yaml                           |  2 +-
 3 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index a44076a06..4d6c246b6 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2173,15 +2173,16 @@ dsv4-fp4-mi355x-sglang:
 # fixes the ROCm HIP-radix MTP CUDA-graph bug (the false-EOS symptom in sgl
 # #20404) and validates GSM8K 0.950 with MTP on.
 #
-# Image: the amd/deepseek_v4 -DSv4 branch builds ended at f96ac98 (2026-05-27),
-# which predates #26383 and crashes during MTP graph capture (kv_score dtype
-# mismatch in compress_state, run 26723126211). #26383 merged to main, so this
-# pins the newest mainline ROCm nightly (v0.5.12.post1-rocm720-mi35x-20260531)
-# where the fix lives. Lineage differs from the base STP entry (still on
-# f96ac98) until the nightly is confirmed to serve DSv4-Pro FP4 cleanly; the
-# matrix lm-eval (RUN_EVAL on the high-conc points) gates correctness.
+# BLOCKED on an image: no published build both loads DSv4-Pro AND has #26383.
+#   - rocm/sgl-dev:...-DSv4 builds are the only lineage bundling deep_gemm (so
+#     the only ones that load DSv4-Pro), but they ended at f96ac98 (2026-05-27),
+#     pre-#26383 -> MTP graph capture crashes (kv_score dtype, run 26723126211).
+#   - mainline v0.5.12.post1 ROCm nightlies carry #26383 but omit deep_gemm, so
+#     DSv4-Pro weight load fails in _setup_fp8_wo_a_scales (run 26727984372).
+# Pinned to the latest -DSv4 build as the correct lineage placeholder. Unblock by
+# bumping to the first -DSv4 image that carries #26383. PR #1631 parked as draft.
 dsv4-fp4-mi355x-sglang-mtp:
-  image: rocm/sgl-dev:v0.5.12.post1-rocm720-mi35x-20260531
+  image: rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4
   model: deepseek-ai/DeepSeek-V4-Pro
   model-prefix: dsv4
   runner: mi355x
diff --git a/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
index 73f181b7a..a95c27fbb 100755
--- a/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
+++ b/benchmarks/single_node/dsv4_fp4_mi355x_sglang_mtp.sh
@@ -10,13 +10,15 @@
 # for the DP-attention path (steps=2, topk=1, draft=3); the TP-only
 # low-concurrency path uses the (3,1,4) chain shared with dsr1_fp4_mi355x_mtp.sh.
 #
-# IMPORTANT (image dependency): MTP requires the sglang build to carry #26383.
-# The amd/deepseek_v4 -DSv4 branch builds ended at f96ac98 (2026-05-27), which
-# predates the fix and hard-crashes during MTP CUDA-graph capture (kv_score
-# dtype mismatch in compress_state, run 26723126211). #26383 merged to main, so
-# amd-master.yaml pins the newest mainline ROCm nightly instead. The matrix runs
-# lm-eval on the high-concurrency points (RUN_EVAL), so the first sweep validates
-# GSM8K before any throughput number is trusted; bump the image if it regresses.
+# IMPORTANT (blocked on image): MTP needs a build that BOTH loads DSv4-Pro AND
+# carries sgl#26383. None exists yet:
+#   - rocm/sgl-dev:...-DSv4 builds are the only lineage bundling deep_gemm (so
+#     the only ones that load DSv4-Pro), but ended at f96ac98 (2026-05-27,
+#     pre-#26383) and crash at MTP graph capture (kv_score dtype, run 26723126211).
+#   - mainline v0.5.12.post1 nightlies carry #26383 but omit deep_gemm and fail
+#     DSv4-Pro weight load in _setup_fp8_wo_a_scales (run 26727984372).
+# amd-master.yaml pins the latest -DSv4 build; bump to the first -DSv4 image with
+# #26383 to unblock. RUN_EVAL on the high-conc points then gates accuracy.

 source "$(dirname "$0")/../benchmark_lib.sh"

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 77959d2e1..b0412a384 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3346,5 +3346,5 @@
 - config-keys:
     - dsv4-fp4-mi355x-sglang-mtp
   description:
-    - "Add MTP/EAGLE speculative-decoding sibling for dsv4-fp4-mi355x-sglang (model: deepseek-ai/DeepSeek-V4-Pro) on rocm/sgl-dev:v0.5.12.post1-rocm720-mi35x-20260531 (mainline nightly carrying sgl#26383 DSv4 MTP graph fix), per sgl-project/sglang#26383"
+    - "Add MTP/EAGLE speculative-decoding sibling for dsv4-fp4-mi355x-sglang (model: deepseek-ai/DeepSeek-V4-Pro) on rocm/sgl-dev:rocm720-mi35x-f96ac98-20260527-DSv4, per sgl-project/sglang#26383. Blocked pending a -DSv4 image that carries #26383 (PR draft)."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1631
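
Reviewer note (illustrative, not part of the series): the EAGLE chain selection that PATCH 1/3 adds to dsv4_fp4_mi355x_sglang_mtp.sh can be restated as a small standalone function, which may make the two chains easier to eyeball once an image unblocks the recipe. This is only a sketch: the name `select_spec_flags` is hypothetical and does not exist in the repository, and the flag names and values are copied from the recipe, not checked against any SGLang build.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption: not in benchmark_lib.sh or the recipe).
# Prints, one flag per line, the EAGLE arguments the recipe would pass for a
# given DP_ATTENTION value.
select_spec_flags() {
    local dp_attention="${1:-false}"
    # TP-only low-concurrency chain from the recipe: (steps, topk, draft) = (3, 1, 4)
    local steps=3 topk=1 draft=4
    if [ "$dp_attention" = "true" ]; then
        # DP-attention chain from the recipe: (2, 1, 3)
        steps=2
        draft=3
    fi
    printf '%s\n' \
        "--speculative-algorithm EAGLE" \
        "--speculative-num-steps $steps" \
        "--speculative-eagle-topk $topk" \
        "--speculative-num-draft-tokens $draft"
}

echo "DP-attention chain:"
select_spec_flags true
echo "TP-only chain:"
select_spec_flags false
```

In both chains the draft-token count is steps + 1 with topk fixed at 1, so a future change to one value without the other would stand out in this form.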