Refactor HF _QuantSparseMoe: config-driven token counting, NemotronH detection#970

Open
realAsma wants to merge 3 commits into asma/nemotron_mixed from asma/nemotronh_moe_support
Conversation


realAsma (Contributor) commented Mar 4, 2026

What does this PR do?

Type of change: New feature

Overview: Extend _QuantSparseMoe to support NemotronH-style MoE blocks (which use n_routed_experts instead of num_experts) and refactor the MoE calibration features to be config-driven and lazy-initialized.

Key changes:

  • _is_sparse_moe_block in plugins/huggingface.py now accepts n_routed_experts (the NemotronH pattern) in addition to num_experts.
  • _QuantSparseMoe is refactored: token counting and forced expert forwarding are now opt-in via config knobs (moe_calib_experts_ratio, moe_count_expert_calib_tokens). When both are off (the default), forward is a zero-overhead pass-through.
  • The token-counting buffer and gate hook are lazy-initialized on first use instead of eagerly in _setup.
  • _QuantSparseMoe gains layer_sync_moe_local_experts_amax to sync input-quantizer amax across experts (mirroring the Megatron path).
  • The shared sync_moe_experts_input_amax utility is extracted into utils.py; it also fixes missing weight amax for experts that received no tokens during calibration. Megatron's _MegatronSequentialMLP now calls this shared utility.
  • SequentialQuantizer now delegates its amax property.

Testing

  • Updated and added unit tests in test_sparse_moe.py covering the default config, lazy init, token counting, top_k restoration, and end-to-end quantize with both features enabled.

Before your PR is "Ready for review"

  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

realAsma requested a review from a team as a code owner March 4, 2026 14:22
realAsma requested review from sugunav14 and removed request for a team March 4, 2026 14:22
realAsma marked this pull request as draft March 4, 2026 14:22

copy-pr-bot bot commented Mar 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.



coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (3)
  • main
  • release/.*
  • feature/.*

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95bb3fb9-4aaa-46ed-a202-f7e2caead8ff

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

realAsma force-pushed the asma/nemotron_mixed branch 2 times, most recently from 46b685d to 558c17c on March 4, 2026 17:33
realAsma force-pushed the asma/nemotronh_moe_support branch from 82366c6 to 449e700 on March 4, 2026 17:37
realAsma requested a review from Fridah-nv March 6, 2026 17:46
realAsma force-pushed the asma/nemotron_mixed branch from 558c17c to 2729ed6 on March 6, 2026 19:11
realAsma force-pushed the asma/nemotronh_moe_support branch from 9de3be9 to 1fc689e on March 6, 2026 19:12
realAsma changed the title from "[Draft] Refactor _QuantSparseMoe: config-driven token counting, NemotronH detection" to "Refactor HF _QuantSparseMoe: config-driven token counting, NemotronH detection" Mar 6, 2026
realAsma marked this pull request as ready for review March 6, 2026 19:26
realAsma requested review from a team, cjluo-nv and meenchen March 6, 2026 19:26
realAsma added 2 commits March 6, 2026 21:14
…ection

- Accept n_routed_experts alongside num_experts in _is_sparse_moe_block
- Add layer_sync_moe_local_experts_amax to _QuantSparseMoe
- Make token counting and force-all-token calibration config-driven
  (moe_count_expert_calib_tokens, moe_calib_experts_ratio) with lazy
  init; forward is zero-overhead pass-through when both are disabled

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor

Deduplicate layer_sync_moe_local_experts_amax into shared sync_moe_experts_input_amax

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor
realAsma force-pushed the asma/nemotronh_moe_support branch from 2ce6547 to 9fae261 on March 6, 2026 21:16
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Made-with: Cursor
