Refactor HF _QuantSparseMoe: config-driven token counting, NemotronH detection #970

realAsma wants to merge 3 commits into `asma/nemotron_mixed` from
Conversation
…ection

- Accept `n_routed_experts` alongside `num_experts` in `_is_sparse_moe_block`
- Add `layer_sync_moe_local_experts_amax` to `_QuantSparseMoe`
- Make token counting and force-all-token calibration config-driven (`moe_count_expert_calib_tokens`, `moe_calib_experts_ratio`) with lazy init; forward is a zero-overhead pass-through when both are disabled

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor

Deduplicate `layer_sync_moe_local_experts_amax` into shared `sync_moe_experts_input_amax`

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor
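The deduplication described in the commits above might look roughly like the following. This is a hypothetical pure-Python sketch (the real utility operates on quantizer tensors inside ModelOpt); the function and attribute names mirror the commit message, but the body is an assumption.

```python
def sync_moe_experts_input_amax_sketch(experts):
    """Hypothetical sketch of a shared amax-sync utility for MoE experts.

    Takes the max input-quantizer amax over all calibrated experts and
    writes it back to every expert, so experts that received no tokens
    during calibration still end up with a valid amax.
    """
    # Collect amax values only from experts that were actually calibrated.
    amaxes = [e.input_quantizer.amax for e in experts
              if e.input_quantizer.amax is not None]
    if not amaxes:
        return  # nothing was calibrated; nothing to sync
    # In the real code this would be an elementwise max over tensors.
    shared = max(amaxes)
    for e in experts:
        # Experts with no recorded amax get the shared value too.
        e.input_quantizer.amax = shared
```

A shared utility like this lets both the HuggingFace and Megatron paths apply identical sync semantics instead of maintaining two copies.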
What does this PR do?
Type of change: New feature
Overview: Extend `_QuantSparseMoe` to support NemotronH-style MoE blocks (which use `n_routed_experts` instead of `num_experts`) and refactor the MoE calibration features to be config-driven and lazy-initialized.

Key changes:

- `_is_sparse_moe_block` in `plugins/huggingface.py` now accepts `n_routed_experts` (the NemotronH pattern) in addition to `num_experts`.
- `_QuantSparseMoe` is refactored: token counting and forced expert forwarding are now opt-in via the config knobs `moe_calib_experts_ratio` and `moe_count_expert_calib_tokens`. When both are off (the default), forward is a zero-overhead pass-through; calibration state is lazily initialized in `_setup`.
- `_QuantSparseMoe` gets `layer_sync_moe_local_experts_amax` to sync input-quantizer amax across experts (same as the Megatron path).
- The amax sync is deduplicated into a shared `sync_moe_experts_input_amax` utility in `utils.py`, which also fixes missing weight amax for experts that received no tokens during calibration. Megatron's `_MegatronSequentialMLP` now calls this shared utility.
- `SequentialQuantizer` delegates the `amax` property.

Testing
Added `test_sparse_moe.py` covering the default config, lazy init, token counting, top_k restoration, and an end-to-end quantize with both features enabled.

Before your PR is "Ready for review"
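For the `SequentialQuantizer` amax delegation mentioned in the key changes, a minimal sketch, assuming delegation to the child quantizers (hypothetical, not the modelopt API):

```python
class SequentialQuantizerSketch:
    """Hypothetical container that delegates `amax` to its children."""

    def __init__(self, *quantizers):
        self._quantizers = list(quantizers)

    @property
    def amax(self):
        # Read from the first child that has an amax recorded.
        for q in self._quantizers:
            if getattr(q, "amax", None) is not None:
                return q.amax
        return None

    @amax.setter
    def amax(self, value):
        # Propagate a new amax to every child quantizer.
        for q in self._quantizers:
            q.amax = value
```

A property keeps callers that expect a single quantizer's `amax` working unchanged when the module actually holds a sequence of quantizers.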