Fix weight-only quantization and export for TEGroupedMLP (MoE models)#971
jQizhang wants to merge 2 commits into NVIDIA:main from
Conversation
📝 Walkthrough
Adds a weight-calibration iterator API and uses it in the calibration flow; introduces an internal proxy and export path to emit per-expert weights from grouped-linear/TEGroupedMLP modules when local_experts are absent.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
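The per-expert export path described in the walkthrough can be sketched as follows. This is a minimal stand-in, not the actual `GPTModelExporter` code: the class, function, and checkpoint key names below are illustrative assumptions.

```python
class GroupedLinearStub:
    """Stand-in for _QuantTEGroupedLinear: weights are stored flat
    as weight0..weightN instead of a single self.weight."""

    def __init__(self, expert_weights):
        self.num_gemms = len(expert_weights)
        for i, w in enumerate(expert_weights):
            setattr(self, f"weight{i}", w)


def export_per_expert_weights(module, prefix):
    # Map the flat weight{i} attributes to per-expert checkpoint keys
    # (the HF-style key layout here is a guess for illustration).
    return {
        f"{prefix}.experts.{i}.weight": getattr(module, f"weight{i}")
        for i in range(module.num_gemms)
    }


mlp = GroupedLinearStub([[1.0], [2.0], [3.0]])
state = export_per_expert_weights(mlp, "model.layers.0.mlp")
print(sorted(state))
```

The point of the proxy is that export no longer depends on a `local_experts` list existing on the module; it walks the numbered weight attributes directly.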
Actionable comments posted: 1
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around lines 74-84: the QuantModule branch calls module.iter_weights_for_calibration() outside the enable_weight_access_and_writeback context, so remapped/sharded/offloaded weights may be stale. Move the call into the context so iteration happens while enable_weight_access_and_writeback(module, model) is active: enter the with block first, then call module.iter_weights_for_calibration() and invoke weight_quantizer(weight) inside that context. Keep the else branch behavior for weight_attr_names/quantizer_attr_names unchanged.
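The ordering fix suggested above can be sketched with stand-in objects. The context manager and module below are illustrative stubs, not the real modelopt classes; the point is only that iteration must happen while the context is active.

```python
from contextlib import contextmanager


@contextmanager
def enable_weight_access_and_writeback(module, model):
    # Stub for the real context manager: pretend weights are materialized
    # (gathered / un-offloaded) only while the context is active.
    module.materialized = True
    try:
        yield
    finally:
        module.materialized = False


class QuantModuleStub:
    def __init__(self, weights):
        self.materialized = False
        self._weights = weights

    def iter_weights_for_calibration(self):
        # Iterating outside the context would see stale/offloaded data.
        assert self.materialized, "weights iterated outside the writeback context"
        yield from self._weights


module = QuantModuleStub([1.0, -3.0, 2.0])
seen = []
# Enter the context FIRST, then iterate, so every yielded weight is fresh.
with enable_weight_access_and_writeback(module, model=None):
    for weight in module.iter_weights_for_calibration():
        seen.append(weight)  # stand-in for weight_quantizer(weight)
print(seen)
```

Because the generator is lazy, even calling `iter_weights_for_calibration()` before the `with` block would defer the actual weight access; entering the context first removes any ambiguity.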
📒 Files selected for processing (5)
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/nn/modules/quant_module.py
- modelopt/torch/quantization/plugins/transformer_engine.py
- modelopt/torch/quantization/utils.py
What does this PR do?
This PR fixes a critical issue where weight-only quantization fails for MoE models that use `TEGroupedMLP` (e.g., Qwen3-30B-A3B).

The Problem:
In `TEGroupedMLP`, weights are stored per-expert as `weight0`, `weight1`, ..., `weightN`. During `_QuantTEGroupedLinear._setup`, the standard `self.weight` attribute is deleted. The existing `weight_only_quantize` logic expects to find a `self.weight` associated with the quantizer. Because it cannot find these "hidden" expert weights, the `weight_quantizer` fails to calibrate, leaving the `_amax` attribute missing. This leads to a crash during export/inference.

The Solution:
- Added an `iter_weights_for_calibration` API to the `QuantModule` base class.
- Overrode it in `_QuantTEGroupedLinear` to yield all per-expert weights (`weight0` ... `weightN`) that share the same quantizer, so the calibrator "sees" every expert weight and computes a valid `_amax`.
- Updated `GPTModelExporter` to correctly handle the structure of `TEGroupedMLP` during HuggingFace format conversion, so MoE checkpoints can be exported after quantization.

2. Type of change
3. Usage / Reproduction
This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B:
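As a minimal illustration of the failure mode, the stub below (a pure-Python stand-in, not the real `TensorQuantizer`) shows how a quantizer whose calibration never ran is left without `_amax`, so the first attribute read at export time raises:

```python
class QuantizerStub:
    # Stand-in for a weight quantizer: _amax only exists after calibration.
    def calibrate(self, weight):
        self._amax = max(abs(x) for x in weight)


quantizer = QuantizerStub()
# weight_only_quantize cannot find self.weight on _QuantTEGroupedLinear
# (it was deleted in _setup), so calibrate() is never invoked...
crash = None
try:
    scale = quantizer._amax  # ...and export reads a missing attribute
except AttributeError as exc:
    crash = type(exc).__name__
print(crash)  # AttributeError
```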
4. Testing & Verification
- `QuantTEGroupedMLP` now correctly shows calculated `_amax` values in the quantization statistics table instead of remaining `dynamic`.
- Export/inference completes without the `AttributeError`.
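The `iter_weights_for_calibration` override described in the Solution section can be sketched like this. Both classes are illustrative stubs under the assumptions stated in the PR (per-expert weights stored flat as `weight0..weightN`, `self.weight` deleted); the real modelopt signatures may differ.

```python
class QuantModule:
    # Hypothetical base-class default: yield self.weight when it exists.
    def iter_weights_for_calibration(self):
        weight = getattr(self, "weight", None)
        if weight is not None:
            yield weight


class QuantTEGroupedLinearStub(QuantModule):
    # TEGroupedMLP stores per-expert weights flat as weight0..weightN and
    # deletes self.weight, so the override must yield every expert weight.
    def __init__(self, expert_weights):
        self.num_gemms = len(expert_weights)
        for i, w in enumerate(expert_weights):
            setattr(self, f"weight{i}", w)

    def iter_weights_for_calibration(self):
        for i in range(self.num_gemms):
            yield getattr(self, f"weight{i}")


mod = QuantTEGroupedLinearStub([[0.5], [-2.0], [1.5]])
# A max calibrator that consumes the iterator sees every expert weight,
# so the shared quantizer ends up with a valid _amax.
amax = max(abs(x) for w in mod.iter_weights_for_calibration() for x in w)
print(amax)  # 2.0
```

Because all experts share one quantizer, the computed `_amax` covers the full range across experts rather than just the first one.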