fixes for fused moe (qwen3.6, GLM5.1) + MSE calibration #1382
Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 Walkthrough

Adds per-expert handling and safety fixes to MoE quantization/export, prevents FP8 scale underflow by clamping per-block scales, extends MSE calibration to discover per-expert quantizers, adds an NVFP4 experts-only PTQ recipe and tests, and tweaks LLM PTQ preview input decoding behavior.

Changes

MoE Quantization Infrastructure & Tests
LLM PTQ Preview Input Handling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 98-103: The temporary mutation of w_quantizer_src._amax before
calling copy.deepcopy may leave the source quantizer with _amax == None if
deepcopy raises; change the code around copy.deepcopy(w_quantizer_src) to save
_saved_amax, set w_quantizer_src._amax = None, then perform deepcopy inside a
try block and restore w_quantizer_src._amax = _saved_amax in a finally block;
after deepcopy set w_quantizer._amax = gu_amax_cpu as before so the source state
is always restored even on exceptions.
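A minimal sketch of the suggested pattern (the names `w_quantizer_src` and `gu_amax_cpu` come from the comment above; wrapping it in a helper is purely illustrative, not the repository's code):

```python
import copy


def clone_quantizer_with_amax(w_quantizer_src, gu_amax_cpu):
    """Deep-copy a quantizer without dragging its _amax along, restoring the
    source quantizer's state even if deepcopy raises."""
    _saved_amax = w_quantizer_src._amax
    w_quantizer_src._amax = None
    try:
        w_quantizer = copy.deepcopy(w_quantizer_src)
    finally:
        # Always put the original amax back on the source quantizer.
        w_quantizer_src._amax = _saved_amax
    # The copy receives the per-expert amax prepared on CPU, as before.
    w_quantizer._amax = gu_amax_cpu
    return w_quantizer
```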
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c7efeb50-0d25-4ef7-8b84-e1a0a66662b4
📒 Files selected for processing (7)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
This PR fixes several real bugs in the fused MoE quantization pipeline (MSE calibration discovery, FP8 scale underflow, uncalibrated expert export). The fixes are well-described in the PR body and address genuine correctness issues. However, there are several concerns:
- Missing unit tests (critical): No tests are added for any of the bug fixes. The existing `test_fused_experts.py` covers registration/conversion/basic export but doesn't exercise MSE calibration for fused experts, FP8 scale clamping, or the invalid-amax patching logic. Given the complexity of the moe_utils.py changes and the project's known pattern of missing tests, this is a blocking concern.
- Threshold inconsistency: `_MIN_VALID_AMAX = 1e-4` is below the FP8 E4M3FN minimum (2^-9 ≈ 0.00195), meaning values between 1e-4 and 2e-3 pass the validity check but could still underflow (see the snippet after this list).
- Hardcoded block_size=16: The fallback per-block amax computation in moe_utils.py hardcodes `16`. If the actual block size differs, the shape will be wrong.
- Copyright year: The new YAML file has `Copyright (c) 2024` but LICENSE_HEADER says 2026.
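To ground the threshold point above: a quick standalone check (assuming PyTorch ≥ 2.1, where `torch.float8_e4m3fn` is available) shows how a float32 value that clears a `1e-4` validity floor can still round to zero after the FP8 cast:

```python
import torch

fp8_min = 2.0 ** -9  # 0.001953125, smallest positive FP8 E4M3FN subnormal

for v in (1e-4, 5e-4, fp8_min, 2e-3):
    casted = torch.tensor(v, dtype=torch.float32).to(torch.float8_e4m3fn)
    # Values well below 2^-9 (e.g. 1e-4, 5e-4) print 0.0 here: they survive a
    # float32 validity check but underflow once cast to FP8 E4M3FN.
    print(f"{v:.6f} -> {casted.float().item():.6f}")
```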
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
728-737: ⚡ Quick win: Assert the repaired quantizer state, not just the warning.

This still passes if the warning is emitted but the fallback leaves `_amax` with the wrong per-block shape or `global_amax` stale/zero. Those are the export failures this change is fixing, so it would be worth capturing the mocked `wrapper` objects and asserting the repaired quantizer state directly.

Possible tightening:

```diff
+        captured = []
+
+        def _capture_export(wrapper, dtype):
+            captured.append((tuple(wrapper.weight.shape), wrapper.weight_quantizer))
+
         with (
-            patch("modelopt.torch.export.unified_export_hf._export_quantized_weight"),
+            patch(
+                "modelopt.torch.export.unified_export_hf._export_quantized_weight",
+                side_effect=_capture_export,
+            ),
             warnings.catch_warnings(record=True) as caught,
         ):
             warnings.simplefilter("always")
             _export_fused_experts(converted, torch.float16)
             assert any("weight-derived per-block amax" in str(w.message) for w in caught), (
                 f"No fallback warning emitted for {'zero' if zero_amax else 'None'} amax — Bug 3 regression"
             )
+        for weight_shape, weight_quantizer in captured:
+            assert weight_quantizer._amax is not None
+            assert weight_quantizer._amax.numel() == (weight_shape[0] * weight_shape[1]) // 16
+            assert weight_quantizer.global_amax is not None
+            assert weight_quantizer.global_amax.item() > 0
         self._cleanup_registry(expert_type)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 728 - 737, The test currently only checks for a fallback warning; update it to also capture the mocked export wrapper(s) and assert the repaired quantizer state after calling _export_fused_experts(converted, torch.float16): specifically, patch and capture the wrapper returned by modelopt.torch.export.unified_export_hf._export_quantized_weight (or the outer wrapper used in the test) and then assert that each quantizer's internal _amax has the expected per-block shape (not a scalar) and that global_amax is updated/non-zero (or not stale) for the converted object’s experts; keep the existing warning assertion but add these direct state assertions on converted (and its quantizer instances) to ensure the fallback actually fixes the quantizer internals.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 109-137: The invalid-_amax repair should only run for per-block
quantizers—before computing _block_size or reshaping weight_slice, check that
(getattr(w_quantizer, "block_sizes", None) or {}).get(-1) is not None and bail
out of this repair branch when it is None; do not fall back to a default block
size (remove the hardcoded 16 default), obtain _block_size from that block_sizes
entry, and only then compute per_block_fallback and assign into
w_quantizer._amax using weight_slice, per_block_fallback, invalid_mask as
currently done.
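A rough sketch of that guard, assuming the quantizer exposes its block configuration as a `block_sizes` dict keyed by dimension, and that `w_quantizer`, `weight_slice`, and `invalid_mask` are as in the comment above (the reshape layout is an assumption, not the repository's exact code):

```python
import torch

block_sizes = getattr(w_quantizer, "block_sizes", None) or {}
block_size = block_sizes.get(-1)
if block_size is not None:
    # Only per-block quantizers get the repair; no hardcoded fallback of 16.
    per_block_fallback = (
        weight_slice.detach()
        .abs()
        .reshape(weight_slice.shape[0], -1, block_size)
        .amax(dim=-1)
    )
    w_quantizer._amax = torch.where(
        invalid_mask,
        per_block_fallback.reshape_as(w_quantizer._amax),
        w_quantizer._amax,
    )
```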
In `@tests/unit/torch/quantization/test_nvfp4_tensor.py`:
- Around line 32-42: The test's wsf2 is too small and makes per_block_scale
large instead of exercising the FP8-min clamp path; update the wsf2 fixture used
before calling NVFP4QTensor.get_weights_scaling_factor so that per_block_scale
becomes very small (below _FP8_E4M3FN_MIN) for the tiny_weight case — e.g.,
increase wsf2 by many orders of magnitude (replace the current wsf2 value with a
larger magnitude such as using 1e-2/(6.0 * 448.0) or similar) so per_block_scale
= per_block_amax / (6 * wsf2) triggers the FP8 underflow/clamp behavior checked
against _FP8_E4M3FN_MIN.
---
Nitpick comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 728-737: The test currently only checks for a fallback warning;
update it to also capture the mocked export wrapper(s) and assert the repaired
quantizer state after calling _export_fused_experts(converted, torch.float16):
specifically, patch and capture the wrapper returned by
modelopt.torch.export.unified_export_hf._export_quantized_weight (or the outer
wrapper used in the test) and then assert that each quantizer's internal _amax
has the expected per-block shape (not a scalar) and that global_amax is
updated/non-zero (or not stale) for the converted object’s experts; keep the
existing warning assertion but add these direct state assertions on converted
(and its quantizer instances) to ensure the fallback actually fixes the
quantizer internals.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d9b0177c-879d-409f-a07b-6b174403d0a0
📒 Files selected for processing (5)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
- tests/unit/torch/quantization/plugins/test_fused_experts.py
- tests/unit/torch/quantization/test_nvfp4_tensor.py
✅ Files skipped from review due to trivial changes (1)
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/model_calib.py
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1382      +/-   ##
===========================================
+ Coverage   66.36%   76.85%   +10.49%
===========================================
  Files         471      471
  Lines       50510    50768     +258
===========================================
+ Hits        33522    39019    +5497
+ Misses      16988    11749    -5239
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 429-436: The loop assumes each f"{param_name}_weight_quantizers"
ModuleList length equals the leading expert dimension of the parameter, which
can drift and cause indexing errors; modify the branch that processes
parent_module.named_parameters(recurse=False) to retrieve the parameter tensor
(via parent_module.get_parameter(param_name) or getattr), read its leading
dimension, and assert it equals len(qlist) (or raise a clear ValueError) before
iterating experts; reference the symbols parent_module, param_name, qlist
(f"{param_name}_weight_quantizers"), TensorQuantizer, expert_idx, and
weight_quantizers so the check fails fast with a descriptive message when sizes
mismatch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4a9c7d3a-e170-4c6e-80d6-2422c258738e
📒 Files selected for processing (3)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
```python
torch.float8_e4m3fn
fp8_e4m3fn_min = 2**-9  # 0.001953125 — smallest positive subnormal
per_block_scale = (
    (per_block_scale * 448.0 / per_block_scale_max)
```

Suggested change:

```diff
-    (per_block_scale * 448.0 / per_block_scale_max)
+    (per_block_scale.float() * 448.0 / per_block_scale_max)
```

```python
# Clamp to the minimum positive FP8 E4M3FN subnormal (~0.00195 = 2^-9) before
# casting. Without this, blocks whose scale falls below the FP8 representable
# range silently underflow to 0, causing those blocks to produce zero output at
# inference even when the weights are non-trivial.
fp8_e4m3fn_min = 2**-9  # 0.001953125 — smallest positive subnormal
per_block_scale = per_block_scale.clamp(min=fp8_e4m3fn_min)
per_block_scale = per_block_scale.to(torch.float8_e4m3fn)
```
Can we create a helper method that does the FP8 quantization of the per-tensor scale and use it here and here: https://github.com/NVIDIA/Model-Optimizer/pull/1382/changes#r3191334011
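One possible shape for such a helper, as a sketch (the 448.0 E4M3FN max and the 2^-9 clamp mirror the hunk above; the function name and signature are illustrative, not the repository's API):

```python
import torch

_FP8_E4M3FN_MAX = 448.0
_FP8_E4M3FN_MIN = 2.0 ** -9  # smallest positive subnormal, 0.001953125


def fp8_quantize_per_block_scale(
    per_block_scale: torch.Tensor, per_block_scale_max: torch.Tensor
) -> torch.Tensor:
    """Normalize per-block scales into the FP8 E4M3FN range, clamp away from
    the underflow region, and cast; shared by the dynamic and calibrated paths."""
    scaled = per_block_scale.float() * _FP8_E4M3FN_MAX / per_block_scale_max
    scaled = scaled.clamp(min=_FP8_E4M3FN_MIN)
    return scaled.to(torch.float8_e4m3fn)
```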
```python
# Enqueue per-expert quantizers from {param}_weight_quantizers ModuleLists.
if _qfe_cls is not None and isinstance(parent_module, _qfe_cls):
    for param_name, param in parent_module.named_parameters(recurse=False):
        qlist = getattr(parent_module, f"{param_name}_weight_quantizers", None)
        if not isinstance(qlist, nn.ModuleList):
            continue
        if len(qlist) != param.shape[0]:
            warnings.warn(
                f"Skipping {param_name}_weight_quantizers: list length {len(qlist)} "
                f"does not match parameter leading dimension {param.shape[0]}. "
                "This may indicate a misconfigured fused-experts module.",
                stacklevel=2,
            )
            continue
        for expert_idx, wq in enumerate(qlist):
            if isinstance(wq, TensorQuantizer) and wq.is_enabled:
                if getattr(wq, "_calibrator", None) is not None:
                    weight_quantizers.append((parent_module, (param_name, expert_idx), wq))
```
Can we have a helper method `get_weight_quantizers(module)` that supports both MoE and regular weight quantizers? This would help avoid the code branching here.
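A sketch of what such a helper could look like, built on the names in the hunk above; the import path for `TensorQuantizer` and the attribute layout of fused-experts modules are assumptions:

```python
import torch.nn as nn

from modelopt.torch.quantization.nn.modules.tensor_quantizer import TensorQuantizer


def get_weight_quantizers(module: nn.Module):
    """Yield (key, quantizer) pairs for both regular and fused-MoE modules."""
    # Regular single-weight case: one quantizer on the module itself.
    wq = getattr(module, "weight_quantizer", None)
    if isinstance(wq, TensorQuantizer):
        yield "weight", wq
    # Fused-MoE case: one ModuleList of per-expert quantizers per 3-D parameter.
    for param_name, param in module.named_parameters(recurse=False):
        qlist = getattr(module, f"{param_name}_weight_quantizers", None)
        if isinstance(qlist, nn.ModuleList) and len(qlist) == param.shape[0]:
            for expert_idx, expert_wq in enumerate(qlist):
                if isinstance(expert_wq, TensorQuantizer):
                    yield (param_name, expert_idx), expert_wq
```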
```diff
 cal = getattr(module, "_calibrator", None)
 if cal and not getattr(module, "_dynamic", False):
-    if method in {"entropy"}:
+    if method == "entropy":
```
Can we rebase this PR onto the composable recipe PR, #1253?
What does this PR do?
Type of change: Bug fix
Fixes several issues with NVFP4 MSE calibration and export for fused MoE expert modules (_QuantFusedExperts — used by Qwen3.6, GLM-5.1, and other HF transformers 5.0+ models that store expert weights as 3-D nn.Parameters).
The weight-quantizer discovery loop in mse_calibrate used the singular attribute name gate_up_proj_weight_quantizer to look up quantizers, but _QuantFusedExperts stores them in a plural nn.ModuleList named gate_up_proj_weight_quantizers. All 20,480 expert quantizers were silently skipped, resulting in "MSE weight calibration: 0it" and no MSE-optimized scales.
Fix: add a second pass that detects plural {param}_weight_quantizers ModuleLists and enqueues each per-expert quantizer with a (param_name, expert_idx) tuple; step 3 unpacks the tuple to extract the per-expert weight slice.
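Roughly, the unpacking on the consumer side looks like the following sketch (variable names are illustrative; the only assumption beyond the description above is that fused expert weights are 3-D tensors indexed by expert along dim 0):

```python
# `key` is either a plain parameter name ("weight") or the
# (param_name, expert_idx) tuple enqueued by the second discovery pass.
if isinstance(key, tuple):
    param_name, expert_idx = key
    weight = getattr(parent_module, param_name)[expert_idx]  # 2-D per-expert slice
else:
    weight = getattr(parent_module, key)
# `weight` then feeds the MSE scale search exactly as in the single-weight case.
```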
Per-block weight scales can silently underflow to 0 when cast to FP8 E4M3FN. The existing scale == 0 guard only catches exact float32 zeros; values in (0, 2^-9) pass through and become 0 after the FP8 cast. This affects both the dynamic recompute path (get_weights_scaling_factor) and the static calibrated path (get_weights_scaling_factor_from_quantizer).
Fix: clamp per-block scales to 2^-9 (smallest positive FP8 E4M3FN subnormal) before the FP8 cast in both paths.
Experts that receive no tokens during calibration have _amax = 0 or uninitialized values. The existing scalar fallback used 1e-4 which itself underflows to 0 in FP8 E4M3FN (1e-4 < 2^-9 ≈ 0.00195). Additionally, the per-block fallback tensor had shape (H*W, 1) instead of (H, W), causing a shape mismatch that silently bypassed the fallback and fell through to the bad scalar. Finally, a stale zero global_amax from an uncalibrated expert was not recomputed, causing division-by-zero in the FP8 scale formula.
Fix: reshape the per-block fallback correctly; raise the clamp floor to 2e-3; always recompute global_amax from the current (possibly patched) per-block _amax.
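Putting the three parts of this fix together, the repair for an uncalibrated expert amounts to roughly the following sketch (shapes, `block_size`, and the `global_amax` attribute name are assumptions based on the description above, not the exact repository code):

```python
import torch

_MIN_VALID_AMAX = 2e-3  # safely above the smallest FP8 E4M3FN subnormal (2^-9 ≈ 0.00195)

# Weight-derived per-block fallback with a proper 2-D per-block layout,
# instead of a flattened tensor that never matched and fell through.
per_block_amax = (
    expert_weight.abs()
    .reshape(expert_weight.shape[0], -1, block_size)
    .amax(dim=-1)
    .clamp(min=_MIN_VALID_AMAX)
)
w_quantizer._amax = per_block_amax

# Always recompute the global amax from the (possibly patched) per-block values,
# so a stale zero never reaches the FP8 scale denominator.
w_quantizer.global_amax = per_block_amax.amax()
```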
Additional fixes:
Usage
Testing
Validated on Qwen3.6-35B-A3B (8× B200):
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Improvements
Tests