MXFP4 MoE tuning harness: legality filter, measurement, ledger + validated baseline (#708) by jhinpan · Pull Request #736 · ROCm/FlyDSL

jhinpan · 2026-06-25T01:55:08Z

Summary

Measurement + verification infrastructure for tuning the MXFP4 (per-1×32 microscale fp4) MoE 2-stage GEMM on gfx950 / MI350X, toward #708 (low MFU at large shapes, long latency at small tokens).

This is infrastructure only — it does NOT change any production kernel logic and does not yet contain a performance win. It is the legality / measurement / bookkeeping foundation a tuning campaign runs on top of, plus a validated locked baseline. Opening it first so the harness and baseline can be reviewed independently of the (separate, upcoming) tuning changes.

Refs #708.

What's here

kernels/moe_tuning.py — pre-compile legality filter for stage1/stage2 tile configs (LDS footprint, divisibility, MX-FP4 floors); mirrors the builders' real LDS sizing (stage1 full stride vs stage2 fp4-halved).
kernels/moe_tuning_spec.py — locked spec constants + win / no-regression predicates (win margins, regime-aware band, token grid, MFU denominator, metric formula).
scripts/moe_tuning_harness.py — provenance-complete measurement harness (verified clock pinning, idle-GPU check, faithful timed-loop median+p95) + a fail-closed candidate-sweep CLI (illegal / unmeasured configs are recorded as machine-readable rejections, never silently skipped).
scripts/moe_tuning_ledger.py — attempt ledger + full-coverage Pareto comparator with a single claimable_win gate (coverage + no kernel-path/e2e regression + a real win + AOT/correctness hard gate) and integrity scans (duplicate / replayable-command / supersede-link).
scripts/aiter_strict_point.py — strict, AOT-checked, model-correct aiter fused-MoE e2e + correctness guardrail (logits_diff <= 0.01).
scripts/sync_aiter_flydsl_kernels.sh — overlay current FlyDSL MoE kernels onto aiter's vendored copies so the e2e guardrail runs against the sources being tuned.
docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv — docs and a validated locked a4w4 baseline reference table.
Observability: timed-loop p95 printed alongside the median for MoE stage1/stage2; run_benchmark.sh gains the #708 target shapes (DeepSeek V3 / Kimi K2 / GPT-OSS).
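To make the pre-compile legality idea concrete, here is a minimal sketch of a filter that returns machine-readable rejection reasons. It is not the code in kernels/moe_tuning.py: the `TileConfig` fields, the `check_legal` name, the reason strings, and the LDS sizing model (one byte per K element for stage1, half that for stage2 because fp4 packs two values per byte) are all assumptions made for illustration. Only the 32-element microscale block comes from the description above.

```python
from dataclasses import dataclass

MX_BLOCK = 32  # per-1x32 microscale block size, taken from the PR summary


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical tile config; field names are illustrative only."""

    stage: int  # 1 or 2
    tile_m: int
    tile_n: int
    tile_k: int


def lds_bytes(cfg: TileConfig) -> int:
    """Assumed LDS footprint of one tile (a simplified model, not the builders')."""
    # Assumption: stage1 uses a full one-byte-per-element row stride,
    # stage2 uses half of that because two fp4 values share a byte.
    row_bytes = cfg.tile_k if cfg.stage == 1 else cfg.tile_k // 2
    return cfg.tile_m * row_bytes


def check_legal(cfg: TileConfig, n: int, k: int, lds_budget_bytes: int) -> list[str]:
    """Return rejection reasons; an empty list means the config is legal."""
    reasons = []
    if cfg.tile_k % MX_BLOCK != 0:
        reasons.append("tile_k_not_multiple_of_mx_block")
    if n % cfg.tile_n != 0:
        reasons.append("n_not_divisible_by_tile_n")
    if k % cfg.tile_k != 0:
        reasons.append("k_not_divisible_by_tile_k")
    if lds_bytes(cfg) > lds_budget_bytes:
        reasons.append("lds_over_budget")
    return reasons


if __name__ == "__main__":
    # The LDS budget is passed in rather than hard-coded; 65536 is a placeholder.
    print(check_legal(TileConfig(1, 64, 128, 256), n=256, k=512, lds_budget_bytes=65536))
    print(check_legal(TileConfig(2, 64, 128, 48), n=256, k=512, lds_budget_bytes=1024))
```

Returning every reason rather than stopping at the first one keeps a rejected config fully explained in a sweep log, which fits the fail-closed behaviour described for the candidate-sweep CLI.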

No production kernel logic changes: the two kernels/ files are new tuning-support modules; the kernel test-harness edits are additive p95 observability only.
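As an illustration of the additive p95 observability, the sketch below summarizes a list of per-iteration timings into a median and a 95th percentile. The function name, the dictionary keys, and the nearest-rank percentile definition are assumptions; the PR may compute p95 differently.

```python
import statistics


def summarize_timings(samples_us: list[float]) -> dict[str, float]:
    """Summarize per-iteration timings (microseconds) as median and p95.

    Uses the nearest-rank percentile: the smallest sample such that at
    least 95% of all samples are less than or equal to it.
    """
    if not samples_us:
        raise ValueError("no timing samples to summarize")
    ordered = sorted(samples_us)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    return {
        "median_us": statistics.median(ordered),
        "p95_us": ordered[rank - 1],
        "n": n,
    }


if __name__ == "__main__":
    # Synthetic data: 19 steady iterations and one slow outlier.
    timings = [100.0] * 19 + [250.0]
    print(summarize_timings(timings))
```

Reporting the two numbers together shows tail jitter that a median alone hides: in the synthetic run above the outlier leaves the median unchanged, and a larger share of slow iterations would surface in p95 first.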

Scope notes

Targets the a4w4 (fp4×fp4) path. a8w4 (fp8×fp4) e2e correctness is currently environment-blocked by an aiter non-fp4-activation wrapper/layout contract mismatch (not a FlyDSL kernel bug — this repo's own a8w4 MoE test passes with the reference check on); it is quarantined for win claims.
The tile/lever tuning that produces MFU/latency wins runs on top of this harness and will be submitted separately against #708.
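Since the scope notes turn on what may count as a win claim, a simplified sketch of a single gate combining coverage, the AOT/correctness hard gate, no-regression, and a real win may help. This is not the logic in scripts/moe_tuning_ledger.py: `PointResult`, the function signature, and the `win_margin` and `regress_band` parameters are hypothetical, and the kernel-path versus e2e distinction is collapsed into one latency per shape. Only the `logits_diff <= 0.01` threshold comes from the description.

```python
from dataclasses import dataclass

LOGITS_DIFF_MAX = 0.01  # correctness threshold quoted in the PR description


@dataclass(frozen=True)
class PointResult:
    """Hypothetical per-shape measurement record."""

    baseline_us: float
    candidate_us: float
    aot_ok: bool
    logits_diff: float


def claimable_win(
    results: dict[str, PointResult],
    required_shapes: list[str],
    win_margin: float,
    regress_band: float,
) -> bool:
    """Return True only if every condition of the gate holds."""
    # Coverage: every required shape must have been measured.
    if not set(required_shapes) <= set(results):
        return False
    points = [results[s] for s in required_shapes]
    # Hard gate: AOT check and correctness must pass on every shape.
    if not all(p.aot_ok and p.logits_diff <= LOGITS_DIFF_MAX for p in points):
        return False
    # No regression: no shape may be slower than baseline beyond the band.
    if any(p.candidate_us > p.baseline_us * (1 + regress_band) for p in points):
        return False
    # Real win: at least one shape must beat baseline by the margin.
    return any(p.candidate_us <= p.baseline_us * (1 - win_margin) for p in points)


if __name__ == "__main__":
    measured = {
        "shape_a": PointResult(100.0, 90.0, True, 0.001),
        "shape_b": PointResult(50.0, 50.5, True, 0.002),
    }
    print(claimable_win(measured, ["shape_a", "shape_b"], 0.05, 0.02))
```

Under this model, quarantining a path such as a8w4 amounts to its shapes never passing the hard gate, so they cannot contribute to a claim.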

Testing

python3 -m pytest tests/unit/test_moe_tuning_harness.py tests/unit/test_moe_tuning_legality.py -q
# 94 passed, 4 skipped (committed-ledger scans skip when no ledger is present)

black + ruff clean on the added/changed Python.

cc @coderfeli (assigned task / #708 reporter)

🤖 Generated with Claude Code

…t guardrail (ROCm#708)

Measurement + verification infrastructure for tuning the MXFP4 (per-1x32 fp4) MoE 2-stage GEMM on gfx950/MI350X, toward ROCm#708 (low MFU at large shapes, long latency at small tokens). Infrastructure only -- no production kernel logic changes.

- kernels/moe_tuning.py: pre-compile legality filter for stage1/stage2 tile configs (LDS footprint, divisibility, MX-FP4 floors); mirrors builder LDS sizing (stage1 full lds_stride vs stage2 fp4-halved).
- kernels/moe_tuning_spec.py: locked spec constants + win/no-regression predicates (win margins, regime-aware band, token grid, MFU denominator, metric formula).
- scripts/moe_tuning_harness.py: provenance-complete measurement harness (verified clock pinning, idle check, faithful timed-loop median+p95) + fail-closed candidate sweep CLI (illegal/unmeasured configs recorded as rejections).
- scripts/moe_tuning_ledger.py: attempt ledger + full-coverage Pareto comparator with a single claimable_win gate (coverage + no-regression + win + AOT/correctness hard gate) and integrity scans (duplicate / replay / supersede-link).
- scripts/aiter_strict_point.py: strict AOT-checked model-correct aiter e2e + correctness guardrail (logits_diff <= 0.01).
- scripts/sync_aiter_flydsl_kernels.sh: overlay FlyDSL MoE kernels onto aiter's vendored copies for the e2e guardrail.
- docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv: docs + a validated locked a4w4 baseline reference table.
- Host-side unit tests (no GPU required): 94 passed, 4 skipped (committed-ledger scans skip without a ledger). black + ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Cm#708)

- tests/test_common.py + tests/kernels/test_moe_gemm.py: capture and print the per-iteration timed-loop p95 alongside the median for MoE stage1/stage2 (additive observability; no kernel logic change).
- scripts/run_benchmark.sh: add the ROCm#708 MXFP4 MoE target shapes (DeepSeek V3, Kimi K2, GPT-OSS a4w4; plus a8w4 rows) bracketing the small-token latency and large-shape MFU regimes; document the model->shape mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jhinpan and others added 2 commits June 25, 2026 01:54

Copilot AI review requested due to automatic review settings June 25, 2026 01:55

This was referenced Jun 25, 2026

[Issue]: MXFP4 MoE low MFU at large shapes and long latency at small tokens #708

Open

[draft] MXFP4 MoE tuning: infrastructure + validated baseline/repeatability (checkpoint, no perf win yet) jhinpan/FlyDSL-lab#1

Closed

Copilot started reviewing on behalf of jhinpan June 25, 2026 01:56

Copilot AI reviewed Jun 25, 2026

Merge branch 'main' into feat/mxfp4-moe-tuning-harness

1e0ccf0

jhinpan marked this pull request as draft June 25, 2026 19:13

jhinpan wants to merge 3 commits into ROCm:main from jhinpan:feat/mxfp4-moe-tuning-harness

2 participants
