MXFP4 MoE tuning harness: legality filter, measurement, ledger + validated baseline (#708) #736

Draft
jhinpan wants to merge 3 commits into ROCm:main from jhinpan:feat/mxfp4-moe-tuning-harness

jhinpan commented Jun 25, 2026

Summary

Measurement + verification infrastructure for tuning the MXFP4 (per-1×32 microscale fp4) MoE 2-stage GEMM on gfx950 / MI350X, toward #708 (low MFU at large shapes, long latency at small tokens).

This is infrastructure only — it does NOT change any production kernel logic and does not yet contain a performance win. It is the legality / measurement / bookkeeping foundation a tuning campaign runs on top of, plus a validated locked baseline. Opening it first so the harness and baseline can be reviewed independently of the (separate, upcoming) tuning changes.

Refs #708.

What's here

  • kernels/moe_tuning.py — pre-compile legality filter for stage1/stage2 tile configs (LDS footprint, divisibility, MX-FP4 floors); mirrors the builders' real LDS sizing (stage1 full stride vs stage2 fp4-halved).
  • kernels/moe_tuning_spec.py — locked spec constants + win / no-regression predicates (win margins, regime-aware band, token grid, MFU denominator, metric formula).
  • scripts/moe_tuning_harness.py — provenance-complete measurement harness (verified clock pinning, idle-GPU check, faithful timed-loop median+p95) + a fail-closed candidate-sweep CLI (illegal / unmeasured configs are recorded as machine-readable rejections, never silently skipped).
  • scripts/moe_tuning_ledger.py — attempt ledger + full-coverage Pareto comparator with a single claimable_win gate (coverage + no kernel-path/e2e regression + a real win + AOT/correctness hard gate) and integrity scans (duplicate / replayable-command / supersede-link).
  • scripts/aiter_strict_point.py — strict, AOT-checked, model-correct aiter fused-MoE e2e + correctness guardrail (logits_diff <= 0.01).
  • scripts/sync_aiter_flydsl_kernels.sh — overlay current FlyDSL MoE kernels onto aiter's vendored copies so the e2e guardrail runs against the sources being tuned.
  • docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv — docs and a validated locked a4w4 baseline reference table.
  • observability: timed-loop p95 printed alongside the median for MoE stage1/stage2; run_benchmark.sh gains the #708 target shapes (DeepSeek V3 / Kimi K2 / GPT-OSS).
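
To make the fail-closed legality filter concrete, here is a minimal sketch of the idea, not the code in kernels/moe_tuning.py. `TileConfig`, `check_legal`, `sweep`, the reason strings, and the single-tile LDS footprint formula are all hypothetical; the real filter mirrors the builders' actual LDS sizing. The only value taken from the description is the 1×32 MXFP4 scale block.

```python
from dataclasses import dataclass

# Block size of the per-1x32 MXFP4 microscale (from the PR summary).
MX_BLOCK = 32


@dataclass(frozen=True)
class TileConfig:
    # Hypothetical tile description, not the harness's real config type.
    tile_m: int
    tile_n: int
    tile_k: int


def check_legal(cfg, n, k, lds_budget_bytes, bytes_per_elem=0.5):
    """Return a list of rejection reasons; an empty list means legal."""
    reasons = []
    # Assumed MX-FP4 floor: a K tile must cover whole 32-element scale blocks.
    if cfg.tile_k % MX_BLOCK != 0:
        reasons.append("tile_k_not_multiple_of_mx_block")
    if n % cfg.tile_n != 0:
        reasons.append("n_not_divisible_by_tile_n")
    if k % cfg.tile_k != 0:
        reasons.append("k_not_divisible_by_tile_k")
    # Assumed footprint: one tile_m x tile_k tile staged in LDS (fp4 = 0.5 B).
    lds_bytes = int(cfg.tile_m * cfg.tile_k * bytes_per_elem)
    if lds_bytes > lds_budget_bytes:
        reasons.append("lds_footprint_exceeds_budget")
    return reasons


def sweep(candidates, n, k, lds_budget_bytes):
    """Fail closed: every candidate yields a record, legal or rejected."""
    records = []
    for cfg in candidates:
        reasons = check_legal(cfg, n, k, lds_budget_bytes)
        status = "rejected" if reasons else "legal"
        records.append({"config": cfg, "status": status, "reasons": reasons})
    return records
```

The point of returning a record per candidate is that a reviewer can tell "illegal" apart from "never tried"; nothing drops out of the sweep silently.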

No production kernel logic changes: the two kernels/ files are new tuning-support modules; the kernel test-harness edits are additive p95 observability only.
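
As an illustration of what reporting p95 next to the median means, a minimal sketch follows. `time_loop` and `summarize` are hypothetical names, and host wall-clock timing is an assumption made to keep the example self-contained; GPU kernel timing would additionally need device synchronization or device-side events. The p95 here uses the nearest-rank definition, which may differ from what the harness computes.

```python
import statistics
import time


def time_loop(fn, iters=100, warmup=10):
    """Collect per-iteration wall-clock samples in microseconds."""
    for _ in range(warmup):
        fn()
    samples_us = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_us.append((time.perf_counter() - start) * 1e6)
    return samples_us


def summarize(samples_us):
    """Return the median and the nearest-rank 95th percentile."""
    if not samples_us:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples_us)
    # Nearest rank: ceil(0.95 * n), computed in integers to avoid float error.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "median_us": statistics.median(ordered),
        "p95_us": ordered[rank - 1],
    }
```

A call such as `summarize(time_loop(run_stage1))`, with `run_stage1` standing in for whatever launches the kernel, gives both numbers from the same set of samples, so a tail-latency regression cannot hide behind an unchanged median.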

Scope notes

  • Targets the a4w4 (fp4×fp4) path. a8w4 (fp8×fp4) e2e correctness is currently environment-blocked by an aiter non-fp4-activation wrapper/layout contract mismatch (not a FlyDSL kernel bug — this repo's own a8w4 MoE test passes with the reference check on); it is quarantined for win claims.
  • The tile/lever tuning that produces MFU/latency wins runs on top of this harness and will be submitted separately against #708.
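
For reviewers who want a mental model of how a win claim could be gated, including the a8w4 quarantine, here is a rough sketch. It is not the `claimable_win` implementation in scripts/moe_tuning_ledger.py: the `Point` type, the function signature, and the `win_margin` / `regress_band` values are hypothetical, and the real gate also checks e2e regression. Only the `logits_diff <= 0.01` bound comes from this description.

```python
from dataclasses import dataclass

LOGITS_DIFF_MAX = 0.01  # correctness bound quoted in the PR description


@dataclass(frozen=True)
class Point:
    # One measured token count: baseline vs candidate latency (hypothetical).
    tokens: int
    baseline_us: float
    candidate_us: float


def claimable_win(points, required_tokens, aot_ok, logits_diff,
                  path="a4w4", win_margin=0.03, regress_band=0.01):
    """Single gate: all conditions must hold before a win may be claimed."""
    # Quarantine: only the a4w4 path is eligible for win claims.
    if path != "a4w4":
        return False
    # Hard gate: AOT check and model correctness must both pass.
    if not aot_ok or logits_diff > LOGITS_DIFF_MAX:
        return False
    # Full coverage: every required token count must have been measured.
    if not set(required_tokens) <= {p.tokens for p in points}:
        return False
    ratios = [p.candidate_us / p.baseline_us for p in points]
    # No point may regress beyond the band (margins here are placeholders).
    no_regression = all(r <= 1.0 + regress_band for r in ratios)
    # At least one point must improve by the win margin.
    real_win = any(r <= 1.0 - win_margin for r in ratios)
    return no_regression and real_win
```

Keeping every condition behind one function means a partial result, for example a fast point measured on an incomplete token grid, cannot be reported as a win by accident.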

Testing

python3 -m pytest tests/unit/test_moe_tuning_harness.py tests/unit/test_moe_tuning_legality.py -q
# 94 passed, 4 skipped (committed-ledger scans skip when no ledger is present)

black + ruff clean on the added/changed Python.

cc @coderfeli (assigned task / #708 reporter)

🤖 Generated with Claude Code

jhinpan and others added 2 commits June 25, 2026 01:54
…t guardrail (ROCm#708)

Measurement + verification infrastructure for tuning the MXFP4 (per-1x32 fp4) MoE
2-stage GEMM on gfx950/MI350X, toward ROCm#708 (low MFU at large shapes,
long latency at small tokens). Infrastructure only -- no production kernel logic
changes.

- kernels/moe_tuning.py: pre-compile legality filter for stage1/stage2 tile
  configs (LDS footprint, divisibility, MX-FP4 floors); mirrors builder LDS sizing
  (stage1 full lds_stride vs stage2 fp4-halved).
- kernels/moe_tuning_spec.py: locked spec constants + win/no-regression predicates
  (win margins, regime-aware band, token grid, MFU denominator, metric formula).
- scripts/moe_tuning_harness.py: provenance-complete measurement harness (verified
  clock pinning, idle check, faithful timed-loop median+p95) + fail-closed
  candidate sweep CLI (illegal/unmeasured configs recorded as rejections).
- scripts/moe_tuning_ledger.py: attempt ledger + full-coverage Pareto comparator
  with a single claimable_win gate (coverage + no-regression + win + AOT/correctness
  hard gate) and integrity scans (duplicate / replay / supersede-link).
- scripts/aiter_strict_point.py: strict AOT-checked model-correct aiter e2e +
  correctness guardrail (logits_diff <= 0.01).
- scripts/sync_aiter_flydsl_kernels.sh: overlay FlyDSL MoE kernels onto aiter's
  vendored copies for the e2e guardrail.
- docs/mxfp4_moe_tuning.md + docs/baseline_523ca1c7_validated.csv: docs + a
  validated locked a4w4 baseline reference table.
- Host-side unit tests (no GPU required): 94 passed, 4 skipped (committed-ledger
  scans skip without a ledger). black + ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Cm#708)

- tests/test_common.py + tests/kernels/test_moe_gemm.py: capture and print the
  per-iteration timed-loop p95 alongside the median for MoE stage1/stage2 (additive
  observability; no kernel logic change).
- scripts/run_benchmark.sh: add the ROCm#708 MXFP4 MoE target shapes (DeepSeek V3,
  Kimi K2, GPT-OSS a4w4; plus a8w4 rows) bracketing the small-token latency and
  large-shape MFU regimes; document the model->shape mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jhinpan marked this pull request as draft June 25, 2026 19:13