MXFP4 MoE tuning harness: legality filter, measurement, ledger + validated baseline (#708) #736
Draft
jhinpan wants to merge 3 commits into
…t guardrail (ROCm#708)

Measurement + verification infrastructure for tuning the MXFP4 (per-1x32 fp4) MoE 2-stage GEMM on gfx950/MI350X, toward ROCm#708 (low MFU at large shapes, long latency at small tokens). Infrastructure only -- no production kernel logic changes.

- kernels/moe_tuning.py: pre-compile legality filter for stage1/stage2 tile configs (LDS footprint, divisibility, MX-FP4 floors); mirrors builder LDS sizing (stage1 full lds_stride vs stage2 fp4-halved).
- kernels/moe_tuning_spec.py: locked spec constants + win/no-regression predicates (win margins, regime-aware band, token grid, MFU denominator, metric formula).
- scripts/moe_tuning_harness.py: provenance-complete measurement harness (verified clock pinning, idle check, faithful timed-loop median+p95) + fail-closed candidate sweep CLI (illegal/unmeasured configs recorded as rejections).
- scripts/moe_tuning_ledger.py: attempt ledger + full-coverage Pareto comparator with a single claimable_win gate (coverage + no-regression + win + AOT/correctness hard gate) and integrity scans (duplicate / replay / supersede-link).
- scripts/aiter_strict_point.py: strict AOT-checked model-correct aiter e2e + correctness guardrail (logits_diff <= 0.01).
- scripts/sync_aiter_flydsl_kernels.sh: overlay FlyDSL MoE kernels onto aiter's vendored copies for the e2e guardrail.
- docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv: docs + a validated locked a4w4 baseline reference table.
- Host-side unit tests (no GPU required): 94 passed, 4 skipped (committed-ledger scans skip without a ledger). black + ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
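To make the pre-compile legality filter concrete, here is a minimal sketch of the three checks the commit names (LDS footprint, divisibility, MX-FP4 floors). It is not the code in kernels/moe_tuning.py: `TileConfig`, `lds_bytes`, `rejection_reasons`, and the LDS budget passed in are all hypothetical, and the footprint model (stage1 counts a full row stride, stage2 counts a halved stride for packed fp4) is only an assumed reading of "stage1 full lds_stride vs stage2 fp4-halved".

```python
from dataclasses import dataclass

MX_BLOCK = 32  # per-1x32 microscale block: one scale per 32 fp4 values


@dataclass(frozen=True)
class TileConfig:  # hypothetical name, for illustration only
    stage: int  # 1 or 2
    tile_m: int
    tile_n: int
    tile_k: int


def lds_bytes(cfg: TileConfig) -> int:
    # Assumption: stage1 stages a full-stride row (1 byte per element),
    # stage2 stages packed fp4 (2 elements per byte), so its stride is halved.
    row_stride = cfg.tile_k if cfg.stage == 1 else cfg.tile_k // 2
    return cfg.tile_m * row_stride


def rejection_reasons(cfg: TileConfig, n: int, k: int, lds_budget_bytes: int) -> list:
    """Return every reason the config is illegal; an empty list means legal."""
    if cfg.stage not in (1, 2) or min(cfg.tile_m, cfg.tile_n, cfg.tile_k) <= 0:
        return ["malformed config"]
    reasons = []
    if cfg.tile_k < MX_BLOCK or cfg.tile_k % MX_BLOCK:
        reasons.append(f"tile_k={cfg.tile_k} is not a positive multiple of {MX_BLOCK}")
    if n % cfg.tile_n:
        reasons.append(f"N={n} is not divisible by tile_n={cfg.tile_n}")
    if k % cfg.tile_k:
        reasons.append(f"K={k} is not divisible by tile_k={cfg.tile_k}")
    if lds_bytes(cfg) > lds_budget_bytes:
        reasons.append(f"LDS footprint {lds_bytes(cfg)} B exceeds {lds_budget_bytes} B")
    return reasons


# Illustrative shapes and budget only; none of these numbers come from the PR.
ok = rejection_reasons(TileConfig(1, 64, 128, 256), n=4096, k=7168, lds_budget_bytes=64 * 1024)
bad = rejection_reasons(TileConfig(2, 64, 128, 48), n=4096, k=7168, lds_budget_bytes=64 * 1024)
```

Returning the full list of reasons, rather than a bare boolean, is one way a sweep could record illegal configs as machine-readable rejections instead of skipping them.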
…Cm#708)

- tests/test_common.py + tests/kernels/test_moe_gemm.py: capture and print the per-iteration timed-loop p95 alongside the median for MoE stage1/stage2 (additive observability; no kernel logic change).
- scripts/run_benchmark.sh: add the ROCm#708 MXFP4 MoE target shapes (DeepSeek V3, Kimi K2, GPT-OSS a4w4; plus a8w4 rows) bracketing the small-token latency and large-shape MFU regimes; document the model->shape mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
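For readers unfamiliar with reporting p95 alongside the median, the following is a minimal host-side sketch of summarizing per-iteration timings. It does not reproduce tests/test_common.py: `time_loop` and `summarize` are hypothetical names, the p95 uses the nearest-rank definition (the PR does not say which definition it uses), and real GPU timing would additionally need device synchronization or device events, which are not shown.

```python
import statistics
import time


def time_loop(fn, iters: int, warmup: int = 3) -> list:
    """Time each call of fn separately and return per-iteration microseconds."""
    for _ in range(warmup):
        fn()  # warm-up iterations are run but not recorded
    samples_us = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        fn()
        samples_us.append((time.perf_counter_ns() - start) / 1_000)
    return samples_us


def summarize(samples_us: list) -> dict:
    """Median and nearest-rank p95 of a non-empty list of timings."""
    ordered = sorted(samples_us)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return {
        "median_us": statistics.median(ordered),
        "p95_us": ordered[rank - 1],
        "iters": n,
    }


stats = summarize(time_loop(lambda: sum(range(1000)), iters=50))
```

The median is robust to occasional slow iterations, while the p95 exposes them, so a large gap between the two suggests a noisy measurement.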
Summary
Measurement + verification infrastructure for tuning the MXFP4 (per-1×32 microscale fp4) MoE 2-stage GEMM on gfx950 / MI350X, toward #708 (low MFU at large shapes, long latency at small tokens).
Refs #708.
What's here
- kernels/moe_tuning.py: pre-compile legality filter for stage1/stage2 tile configs (LDS footprint, divisibility, MX-FP4 floors); mirrors the builders' real LDS sizing (stage1 full stride vs stage2 fp4-halved).
- kernels/moe_tuning_spec.py: locked spec constants + win / no-regression predicates (win margins, regime-aware band, token grid, MFU denominator, metric formula).
- scripts/moe_tuning_harness.py: provenance-complete measurement harness (verified clock pinning, idle-GPU check, faithful timed-loop median+p95) + a fail-closed candidate-sweep CLI (illegal / unmeasured configs are recorded as machine-readable rejections, never silently skipped).
- scripts/moe_tuning_ledger.py: attempt ledger + full-coverage Pareto comparator with a single claimable_win gate (coverage + no kernel-path/e2e regression + a real win + AOT/correctness hard gate) and integrity scans (duplicate / replayable-command / supersede-link).
- scripts/aiter_strict_point.py: strict, AOT-checked, model-correct aiter fused-MoE e2e + correctness guardrail (logits_diff <= 0.01).
- scripts/sync_aiter_flydsl_kernels.sh: overlay current FlyDSL MoE kernels onto aiter's vendored copies so the e2e guardrail runs against the sources being tuned.
- docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv: docs and a validated locked a4w4 baseline reference table.
- run_benchmark.sh gains the #708 target shapes (DeepSeek V3 / Kimi K2 / GPT-OSS).

No production kernel logic changes: the two kernels/ files are new tuning-support modules; the kernel test-harness edits are additive p95 observability only.

Scope notes
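As a note on how a single claimable_win gate can combine the four conditions listed above (coverage, no regression, a real win, and the AOT/correctness hard gate), here is a minimal sketch. It is not the comparator in scripts/moe_tuning_ledger.py: the function signature, the per-token latency dictionaries, and the `win_margin` / `regress_band` parameters are hypothetical, and a single flat band stands in for the regime-aware band. Only the logits_diff <= 0.01 tolerance is taken from this PR.

```python
def claimable_win(
    baseline_us: dict,
    candidate_us: dict,
    token_grid: list,
    *,
    win_margin: float,
    regress_band: float,
    aot_ok: bool,
    logits_diff: float,
    logits_tol: float = 0.01,  # guardrail stated in the PR: logits_diff <= 0.01
) -> bool:
    """Hypothetical gate; latencies are positive microseconds keyed by token count."""
    # Hard gate: an AOT or correctness failure disqualifies regardless of speed.
    if not aot_ok or logits_diff > logits_tol:
        return False
    # Coverage: every grid point must be measured on both sides.
    if any(t not in baseline_us or t not in candidate_us for t in token_grid):
        return False
    ratios = [candidate_us[t] / baseline_us[t] for t in token_grid]
    # No regression: no point may be slower than baseline beyond the band.
    no_regression = all(r <= 1.0 + regress_band for r in ratios)
    # Real win: at least one point must beat baseline by the win margin.
    real_win = any(r <= 1.0 - win_margin for r in ratios)
    return no_regression and real_win


# Illustrative numbers only; they are not measurements from this PR.
baseline = {1: 10.0, 64: 20.0}
candidate = {1: 9.0, 64: 20.2}
verdict = claimable_win(
    baseline, candidate, [1, 64],
    win_margin=0.05, regress_band=0.02, aot_ok=True, logits_diff=0.004,
)
```

Evaluating the hard gate and coverage before any ratio is computed keeps the gate fail-closed: a missing measurement can never be counted as a win.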
Testing
black + ruff clean on the added/changed Python.
cc @coderfeli (assigned task / #708 reporter)
🤖 Generated with Claude Code