Refactor HF _QuantSparseMoe: config-driven token counting, NemotronH detection #970

realAsma wants to merge 3 commits into `asma/nemotron_mixed` from
Conversation
…ection

- Accept `n_routed_experts` alongside `num_experts` in `_is_sparse_moe_block`
- Add `layer_sync_moe_local_experts_amax` to `_QuantSparseMoe`
- Make token counting and force-all-token calibration config-driven (`moe_count_expert_calib_tokens`, `moe_calib_experts_ratio`) with lazy init; forward is a zero-overhead pass-through when both are disabled

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor

Deduplicate `layer_sync_moe_local_experts_amax` into shared `sync_moe_experts_input_amax`

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor
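The deduplication described in the commits above might look roughly like the following. This is a hypothetical pure-Python sketch (the real utility operates on quantizer tensors inside ModelOpt); the function and attribute names mirror the commit message, but the body is an assumption.

```python
def sync_moe_experts_input_amax_sketch(experts):
    """Hypothetical sketch of a shared amax-sync utility for MoE experts.

    Takes the max input-quantizer amax over all calibrated experts and
    writes it back to every expert, so experts that received no tokens
    during calibration still end up with a valid amax.
    """
    # Collect amax values only from experts that were actually calibrated.
    amaxes = [e.input_quantizer.amax for e in experts
              if e.input_quantizer.amax is not None]
    if not amaxes:
        return  # nothing was calibrated; nothing to sync
    # In the real code this would be an elementwise max over tensors.
    shared = max(amaxes)
    for e in experts:
        # Experts with no recorded amax get the shared value too.
        e.input_quantizer.amax = shared
```

A shared utility like this lets both the HuggingFace and Megatron paths apply identical sync semantics instead of maintaining two copies.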
What does this PR do?
Type of change: New feature
Overview: Extend `_QuantSparseMoe` to support NemotronH-style MoE blocks (which use `n_routed_experts` instead of `num_experts`) and refactor the MoE calibration features to be config-driven and lazy-initialized.

Key changes:

- `_is_sparse_moe_block` in `plugins/huggingface.py` now accepts `n_routed_experts` (the NemotronH pattern) in addition to `num_experts`.
- `_QuantSparseMoe` is refactored: token counting and forced expert forwarding are now opt-in via the config knobs `moe_calib_experts_ratio` and `moe_count_expert_calib_tokens`. When both are off (the default), forward is a zero-overhead pass-through; calibration state is lazily initialized in `_setup`.
- `_QuantSparseMoe` gets `layer_sync_moe_local_experts_amax` to sync input-quantizer amax across experts (same as the Megatron path).
- The amax sync is deduplicated into a shared `sync_moe_experts_input_amax` utility in `utils.py`, which also fixes missing weight amax for experts that received no tokens during calibration. Megatron's `_MegatronSequentialMLP` now calls this shared utility.
- `SequentialQuantizer` delegates the `amax` property.

Testing
Added `test_sparse_moe.py` covering the default config, lazy init, token counting, top_k restoration, and an end-to-end quantize with both features enabled.

Before your PR is "Ready for review"
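For the `SequentialQuantizer` amax delegation mentioned in the key changes, a minimal sketch, assuming delegation to the child quantizers (hypothetical, not the modelopt API):

```python
class SequentialQuantizerSketch:
    """Hypothetical container that delegates `amax` to its children."""

    def __init__(self, *quantizers):
        self._quantizers = list(quantizers)

    @property
    def amax(self):
        # Read from the first child that has an amax recorded.
        for q in self._quantizers:
            if getattr(q, "amax", None) is not None:
                return q.amax
        return None

    @amax.setter
    def amax(self, value):
        # Propagate a new amax to every child quantizer.
        for q in self._quantizers:
            q.amax = value
```

A property keeps callers that expect a single quantizer's `amax` working unchanged when the module actually holds a sequence of quantizers.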