[NVBug 6108145] Fix PTQ calibration and export for fused-experts MoE (Qwen3.5-MoE VLM)#1340
Conversation
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

📝 Walkthrough

Adds end-to-end support for HuggingFace fused-experts in PTQ: normalizes per-expert quantizer names, forces the eager experts implementation at runtime, discovers representative weight quantizers, moves fused-expert export earlier, and adds tests plus a recipe tweak.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model
    participant Plugin as force_eager_experts_impl_on_the_fly
    participant Calibrator as PTQ Calibrator
    participant Exporter as Unified Exporter
    Model->>Plugin: detect fused-experts
    Plugin-->>Model: set _experts_implementation = "eager" (recursively)
    Model->>Calibrator: run calibration
    Calibrator->>Model: invoke _QuantFusedExperts.forward (eager path)
    Calibrator->>Calibrator: collect per-expert quantizer amax via hooks
    Model->>Exporter: request export
    Exporter->>Exporter: resolve representative_weight_quantizer for weight names
    Exporter->>Exporter: detect fused-experts early
    Exporter->>Exporter: call _export_fused_experts under fsdp2_aware_weight_update
    Exporter-->>Caller: produce export metadata (includes fused-experts quant spec)
```
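For context on the representative_weight_quantizer step above, a minimal sketch of the selection rule, assuming the plural per-expert attribute naming used in this PR (gate_up_proj_weight_quantizers); the toy module and helper are illustrative, not the exporter's actual implementation:

```python
import torch.nn as nn


class FusedExperts(nn.Module):
    def __init__(self, num_experts: int = 4):
        super().__init__()
        # nn.Identity stands in for TensorQuantizer instances
        self.gate_up_proj_weight_quantizers = nn.ModuleList(
            nn.Identity() for _ in range(num_experts)
        )


def representative_weight_quantizer(module: nn.Module, prefix: str):
    """Return one per-expert quantizer standing in for `prefix` (e.g. "gate_up_proj").

    All experts share one quantization format; only calibrated amax values
    differ, so any entry of the plural ModuleList can answer format queries.
    """
    quantizers = getattr(module, f"{prefix}_weight_quantizers", None)
    return quantizers[0] if quantizers else None


experts = FusedExperts()
print(representative_weight_quantizer(experts, "gate_up_proj"))  # Identity()
print(representative_weight_quantizer(experts, "down_proj"))     # None
```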
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 649-657: The elif branch that checks for the same attribute
gate_up_proj_weight_quantizers is dead code because the preceding if block
handles that case and continues; remove the unreachable elif block (the second
check for gate_up_proj_weight_quantizers and its body) so only the initial
handling using fsdp2_aware_weight_update and _export_fused_experts(sub_module,
dtype) remains, leaving no duplicate checks for gate_up_proj_weight_quantizers
in the loop.
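A minimal runnable schematic of the unreachable branch this comment describes, with toy stand-ins rather than the exporter's real code (the `True` guard stands in for the quantization-format check, and the prints stand in for the export helpers):

```python
# Toy schematic only; classes and the `True` guard stand in for the exporter's
# real helpers (get_quantization_format, _export_fused_experts, etc.).
class FusedExperts:
    gate_up_proj_weight_quantizers = ["q0", "q1"]


class PlainLinear:
    pass


for sub_module in (FusedExperts(), PlainLinear()):
    if hasattr(sub_module, "gate_up_proj_weight_quantizers"):
        print("early path: export fused experts")  # consumes every fused module
        continue
    if True:  # stand-in for: get_quantization_format(...) != QUANTIZATION_NONE
        if isinstance(sub_module, PlainLinear):
            print("standard export branch")
        elif hasattr(sub_module, "gate_up_proj_weight_quantizers"):
            print("unreachable")  # dead code: the early branch already continued
```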
📒 Files selected for processing (4)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/conversion.py
- modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1340      +/-   ##
==========================================
- Coverage   74.46%   65.19%    -9.28%
==========================================
  Files         464      485      +21
  Lines       50089    53337    +3248
==========================================
- Hits        37300    34771    -2529
- Misses      12789    18566    +5777
```
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
cjluo-nv left a comment
This is a well-structured bug fix with good test coverage. The 4-part fix (eager impl forcing, quantizer name normalization, export path hoisting, YAML layerwise change) is logically coherent and well-tested. However, there's a duplicate code path in unified_export_hf.py that should be cleaned up.
Design: Despite the complexity gate firing, this is a bug fix within existing systems, not an architectural change. No new abstractions are introduced.
Tests: Comprehensive — covers force_eager_experts_impl_on_the_fly edge cases, and an end-to-end calibration test that guards the full pipeline (name normalization → wildcard matching → amax collection). Good.
Issue: The new early-exit block in _process_quantized_modules makes the existing elif hasattr(sub_module, "gate_up_proj_weight_quantizers") block (deeper in the same function, inside the get_quantization_format() != QUANTIZATION_NONE guard) dead code. One of these should be removed.
🧹 Nitpick comments (2)
modelopt/torch/quantization/plugins/huggingface.py (1)
1441-1449: Harden recursive config traversal with a cycle guard.

The recursive _force() walk can loop forever on cyclic config graphs. Add a visited-set guard to make this robust.

Proposed patch

```diff
 nested_cfg_attrs = ("text_config", "vision_config", "audio_config", "speech_config")
+visited_cfg_ids = set()

 def _force(cfg):
     if cfg is None:
         return
+    cfg_id = id(cfg)
+    if cfg_id in visited_cfg_ids:
+        return
+    visited_cfg_ids.add(cfg_id)
     if hasattr(cfg, "_experts_implementation"):
         cfg._experts_implementation = "eager"
     for sub in nested_cfg_attrs:
         if hasattr(cfg, sub):
             _force(getattr(cfg, sub))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1441-1449: The recursive helper _force can loop on cyclic config graphs; modify _force to accept and maintain a visited set (e.g., of object ids) and skip recursing into cfg instances already seen, so each cfg is processed only once. Specifically, update the _force signature to take an optional visited set, add the current cfg to visited (use id(cfg) or cfg itself), return early if already visited, and keep the existing behavior of setting cfg._experts_implementation = "eager" and iterating nested_cfg_attrs; ensure recursive calls pass the same visited set. This change hardens _force and prevents infinite recursion on cycles while still touching the same symbols (_force, nested_cfg_attrs, cfg._experts_implementation).

tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
385-441: Make registry cleanup exception-safe in the calibration test.

If an assertion fails before the last line, the temporary registry entry may leak into subsequent tests.
Proposed patch
```diff
 def test_calibration_populates_all_expert_quantizers(self):
     """After PTQ, every input/weight quantizer on the fused-experts module has amax set."""
     import modelopt.torch.quantization as mtq

     model = _TinyMoEModel()
     expert_type = type(model.moe.experts)
     self._cleanup_registry(expert_type)
-
-    quant_cfg = {
+    try:
+        quant_cfg = {
             "quant_cfg": [
                 {"quantizer_name": "*", "enable": False},
                 {
                     "quantizer_name": "*gate_up_proj_input_quantizer",
                     "cfg": {"num_bits": 8, "axis": None},
@@
         for idx in range(NUM_EXPERTS):
             assert experts.gate_up_proj_weight_quantizers[idx].amax is not None, (
                 f"gate_up_proj_weight_quantizers[{idx}].amax is None — "
                 "plural ModuleList name normalization in _match_quantizer likely broken."
             )
             assert experts.down_proj_weight_quantizers[idx].amax is not None, (
                 f"down_proj_weight_quantizers[{idx}].amax is None."
             )
-
-    self._cleanup_registry(expert_type)
+    finally:
+        self._cleanup_registry(expert_type)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 385 - 441, The test creates a temporary registry entry via expert_type and calls self._cleanup_registry(expert_type) at the end but can leak if an assertion fails; wrap the main test actions (quant_cfg setup, forward_loop, mtq.quantize, and all asserts) in a try/finally and move the final self._cleanup_registry(expert_type) into the finally block so cleanup always runs; keep expert_type assigned before the try and leave the initial cleanup call (before quantization) as-is.
📒 Files selected for processing (2)
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
…standard export branches

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
cjluo-nv left a comment
All critical previous review comments have been addressed: dead code removed, import moved to top, control flow restructured, and singular/plural quantizer naming covered. The PR is a well-structured 4-part bug fix (eager impl forcing, quantizer name normalization, export path hoisting, YAML layerwise change) with comprehensive tests. No new abstractions — purely surgical fixes within existing systems. Code is correct and clean.
Complex PR: spans 6 directories (≥ 5). Looping in a human for approval.
What does this PR do?
Type of change: Bug fix
Fixes a 4-bug cascade that caused silent PTQ failure on Qwen3.5-MoE VLMs (Qwen3.6-35B-A3B): calibration
appeared to succeed but produced token-salad at inference. Root cause: HF's @use_experts_implementation
dispatches expert forward to torch._grouped_mm / torch.bmm, bypassing the F.linear hook that captures
activations — so gate_up_proj_input_quantizer / down_proj_input_quantizer never calibrated and no input_scale
tensors were emitted.
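A minimal sketch of that root cause, using a toy fused forward rather than Qwen's actual experts class: a calibration hook attached at F.linear only fires for calls routed through F.linear, so a torch.bmm-based expert forward leaves the input quantizers unobserved.

```python
# Toy repro of the bypass (not modelopt's actual hook machinery): a counter
# standing in for the amax-collecting F.linear hook never fires when the
# fused-experts forward goes through torch.bmm instead of F.linear.
import torch
import torch.nn.functional as F

calls = {"linear": 0}
_orig_linear = F.linear


def counting_linear(x, weight, bias=None):
    calls["linear"] += 1  # stand-in for the calibration hook
    return _orig_linear(x, weight, bias)


F.linear = counting_linear
try:
    x = torch.randn(4, 8, 16)   # [num_experts, tokens, hidden]
    w = torch.randn(4, 16, 32)  # fused per-expert weights

    torch.bmm(x, w)             # fused path: bypasses F.linear entirely
    print(calls["linear"])      # 0 -> input quantizers see no activations

    for e in range(4):          # eager path: one F.linear per expert
        F.linear(x[e], w[e].T)
    print(calls["linear"])      # 4 -> hook fires, amax gets collected
finally:
    F.linear = _orig_linear     # restore
```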
Changes:

- Force the eager experts implementation on the fly (recursing into nested configs such as text_config / vision_config / …) so per-expert F.linear calls are visible to the calibration hook.
- Normalize per-expert quantizer names (… → weight_quantizer) before fnmatch, so wildcards like mlp.experts*weight_quantizer match fused-expert quantizers (see the sketch after this list).
- Move the fused-experts export ahead of the get_quantization_format() gate so _export_fused_experts() runs even when the top-level format query returns QUANTIZATION_NONE (happens for experts-only recipes).
- Layerwise change in nvfp4_experts_only-fp8_kv.yaml (… breaks the layerwise walker).
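A minimal sketch of the normalization idea from the second bullet: collapse the plural ModuleList entry to the singular form before matching. The repo's _match_quantizer handles this; the regex and module path here are illustrative.

```python
import re
from fnmatch import fnmatch


def normalize_quantizer_name(name: str) -> str:
    # "mlp.experts.gate_up_proj_weight_quantizers.3" -> "...weight_quantizer"
    return re.sub(r"_quantizers\.\d+$", "_quantizer", name)


pattern = "*mlp.experts*weight_quantizer"
raw = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.3"

print(fnmatch(raw, pattern))                            # False: plural name + index
print(fnmatch(normalize_quantizer_name(raw), pattern))  # True after normalization
```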
Usage
Testing
End-to-end PTQ → vLLM deploy → NEL eval on Qwen3.6-35B-A3B (256 experts × 40 layers, 35B params):
Hook-call diagnostic: 0 → 6720 per-expert F.linear calls during calibration after the fix; 0 → 30720
input_scale tensors emitted in the exported checkpoint.
FP8 fused-MoE path still produces gibberish — separate follow-up (vLLM per-expert weight_scale handling).
vLLM full-FP8: the FlashInfer TRTLLM Fp8MoE loader doesn't stack the 256 per-expert scalar weight_scale tensors
into a [num_experts] per-expert vector — it ends up applying one expert's scale across all 256, so every
routed expert dequants with the wrong amplitude → coherent token stream collapses into multilingual gibberish.
SGLang full-FP8: qwen3_5.py::_make_packed_weight_loader rejects with AssertionError: Unexpected scalar for
tuple shard load: loaded_shard_id=(0,1,2), split_sizes=[1,1,1] — its packed-loader has no path for "N
independent per-tensor source scalars combining into one fused-shard parameter," so the fused QKV (or
in_proj_qkvz) load is structurally refused and the model never finishes loading.
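A minimal sketch of the stacking step the vLLM note says is missing, with made-up checkpoint keys rather than vLLM's actual state-dict layout or loader API:

```python
# Per-expert scalar weight_scale tensors must be stacked into one
# [num_experts] vector so each routed expert dequantizes with its own scale.
import torch

num_experts = 256
ckpt = {
    f"experts.{e}.down_proj.weight_scale": torch.rand(()) for e in range(num_experts)
}

# What a correct loader would do: one scale per routed expert.
stacked = torch.stack(
    [ckpt[f"experts.{e}.down_proj.weight_scale"] for e in range(num_experts)]
)
assert stacked.shape == (num_experts,)

# The failure mode described above: one expert's scalar applied to all 256,
# so every other expert dequantizes with the wrong amplitude.
wrong = ckpt["experts.0.down_proj.weight_scale"].expand(num_experts)
```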
Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests