Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe#1327
…cipe
Extends ModelOpt PTQ so the input token embedding and output LM head can
participate in NVFP4 quantization, and wires that up for Nemotron-H where
those two 131072x3136 tables are ~21% of parameters and leaving them in
bf16 wastes most of the compression.
Changes
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new):
register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that
inherits QuantLinearConvBase but disables the input_quantizer by default
(embedding inputs are integer indices, not activations).
- modelopt/torch/quantization/config.py: add NVFP4_DEFAULT_WEIGHT_ONLY_CFG
(W4A16) via the existing _nvfp4_selective_quant_cfg(..., weight_only=True)
helper; export via the `choices` set.
- modelopt/torch/export/unified_export_hf.py: _process_quantized_modules now
also walks quantized Embedding modules (previously is_quantlinear-only),
so the NVFP4 packing + scale registration path runs for them on export.
- examples/llm_ptq/hf_ptq.py: add `nvfp4_wo` qformat. For model_type ==
"nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer
and target the backbone embedding (*embeddings* / *embed_tokens*),
overriding the default *lm_head* disable in _default_disabled_quantizer_cfg.
- examples/llm_ptq/example_utils.py: environment workarounds so the example
runs on transformers 5.5.x's partial Nemotron-H port (idempotent, no-op
on fixed transformers):
* NemotronHConfig._pattern_to_list: add `-` -> `mlp`
* ALLOWED_LAYER_TYPES: add `"mlp"`
* NemotronHConfig.validate_layers_block_type: accept `"mlp"` (also
update __class_validators__ since @strict_dataclass snapshots it)
* MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the
layer_idx kwarg passed by NemotronHBlock
* NemotronHBlock.__init__: alias block_type=="mlp" -> "moe" so the
inline block_type_to_mask lookup in NemotronHModel.forward resolves
to None (dispatch is unaffected — the block's forward routes both
through the same `else` branch)
* generation_config: set do_sample=True when sampling hyperparams are
set, so export's save_pretrained passes 5.x strict validation
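The idempotent-workaround idea above can be sketched roughly as follows. The stub class and function names here are illustrative stand-ins, not the real transformers or ModelOpt symbols: each patch inspects the current state before mutating, so re-running it (or running it against an already-fixed transformers) is a no-op.

```python
# Sketch of the idempotent monkey-patch pattern. NemotronHConfigStub and
# apply_mlp_workarounds are illustrative stand-ins for the real classes.
class NemotronHConfigStub:
    _pattern_to_list = {"*": "attention", "M": "mamba"}
    ALLOWED_LAYER_TYPES = ("attention", "mamba")


def apply_mlp_workarounds(cfg_cls):
    # Each step checks before mutating, so repeated application is harmless.
    if cfg_cls._pattern_to_list.get("-") != "mlp":
        cfg_cls._pattern_to_list = {**cfg_cls._pattern_to_list, "-": "mlp"}
    if "mlp" not in cfg_cls.ALLOWED_LAYER_TYPES:
        cfg_cls.ALLOWED_LAYER_TYPES = (*cfg_cls.ALLOWED_LAYER_TYPES, "mlp")


apply_mlp_workarounds(NemotronHConfigStub)
apply_mlp_workarounds(NemotronHConfigStub)  # second call changes nothing
```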
Validated end-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--qformat nvfp4_wo --kv_cache_qformat none \
--trust_remote_code --dataset cnn_dailymail \
--calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
--export_path /tmp/nemotron_3_nano_4b_nvfp4_wo
Produces a 2.1 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings and lm_head both exported as packed NVFP4 uint8 +
FP8 per-block scales + FP32 global scales.
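The override mechanics described above (re-enabling quantizers that a default entry disables) rely on more specific wildcard entries taking precedence over the blanket disable. A minimal sketch using a simplified stand-in config, not the real NVFP4 recipe dict:

```python
import copy

# Simplified stand-in for a ModelOpt-style quant config; the keys mimic the
# wildcard convention, but the values are placeholders.
quant_cfg = {
    "*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
    "*input_quantizer": {"enable": False},  # W4A16: weight-only
    "*lm_head*": {"enable": False},         # default disable to be overridden
}


def enable_lm_head_and_embedding(cfg):
    weight_cfg = cfg["*weight_quantizer"]
    # More specific patterns appended later override the blanket *lm_head* disable.
    for pattern in (
        "*lm_head*weight_quantizer",
        "*embeddings*weight_quantizer",
        "*embed_tokens*weight_quantizer",
    ):
        cfg[pattern] = copy.deepcopy(weight_cfg)
    return cfg


enable_lm_head_and_embedding(quant_cfg)
```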
Follow-ups (separate PRs):
- compressed-tensors conversion script for vLLM consumption (rename
weight -> weight_packed, weight_scale_2 -> weight_global_scale, rewrite
config.json quantization_config to format=nvfp4-pack-quantized).
- offline vLLM inference script for the converted checkpoint.
- Nemotron-H config.json post-export cleanup (transformers 5.x strips
hybrid_override_pattern in favor of the derived layers_block_type,
which breaks reload via the checkpoint's remote configuration_nemotron_h.py
because layers_block_type there is a read-only property).
- optional --vllm-compat hf_ptq flag that also excludes Mamba in_proj
(output dim 17504 not divisible by 64, violating vLLM's Marlin repack
alignment) so the export is consumable by vLLM out of the box.
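The planned compressed-tensors rename could look roughly like the function below. The key suffixes come from the PR text, but the function and the toy state dict are illustrative; the real follow-up script would also invert the global scale and rewrite config.json.

```python
def rename_for_compressed_tensors(state_dict, packed_keys):
    """Illustrative key rename only; the real script also rewrites config.json."""
    out = {}
    for key, value in state_dict.items():
        if key.endswith(".weight_scale_2"):
            key = key[: -len("weight_scale_2")] + "weight_global_scale"
        elif key.endswith(".weight") and key in packed_keys:
            key = key + "_packed"  # NVFP4-packed uint8 weights
        out[key] = value
    return out


renamed = rename_for_compressed_tensors(
    {
        "lm_head.weight": b"\x00",        # stands in for a packed uint8 tensor
        "lm_head.weight_scale_2": 1.0,
        "model.norm.weight": 0.5,          # unquantized: left untouched
    },
    packed_keys={"lm_head.weight"},
)
```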
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
📝 Walkthrough

Adds NVFP4 W4A16 weight-only quantization end-to-end (config, selection, exporter packing, tests), registers nn.Embedding quantization and exports embedding weights, introduces Nemotron-H mixer-type/runtime patches and a generation_config normalization context used during export, and exposes --exclude_modules in PTQ CLI and scripts.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant CLI
participant PTQ_Script
participant HF_PTQ
participant Quant_Config_System
participant Exporter
participant Runtime
User->>CLI: invoke script (QFORMAT=nvfp4_w4a16, EXCLUDE_MODULES)
CLI->>PTQ_Script: pass args (including --exclude_modules)
PTQ_Script->>HF_PTQ: call hf_ptq.py with args
HF_PTQ->>Quant_Config_System: build/validate quant config (apply NVFP4_W4A16_CFG, Nemotron-H extensions)
HF_PTQ->>Exporter: mono_quantize/export_quantized() wrapped by normalized_generation_config_for_export()
Exporter->>Runtime: write quantized safetensors (include embeddings, NVFP4 packing)
Exporter->>User: emit export path & warnings (nvfp4_w4a16 runtime notes)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Important: pre-merge checks failed. Please resolve all errors before merging; addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)
Lines 102-123: ⚠️ Potential issue | 🟠 Major

`nvfp4_wo` is still rejected by `auto_quantize()`.

The new choice is exposed here, but the hard-coded qformat allowlist in `auto_quantize()` (lines 325-344) was not updated. `--auto_quantize_bits --qformat nvfp4_wo,...` now fails the assertion even though the format is advertised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 102 - 123, The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but not included in the hard-coded qformat allowlist inside auto_quantize(), causing the assertion failure; update the allowlist in the auto_quantize() function to include "nvfp4_wo" (or extend the allowlist to derive keys from QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted; search for the auto_quantize function and add "nvfp4_wo" to the qformat/allowlist there (or replace the static list with QUANT_CFG_CHOICES.keys()) to keep them in sync.
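The fix the prompt suggests (deriving the allowlist from the choices dict so the two can't drift) is simple in spirit. `QUANT_CFG_CHOICES` below is a trimmed stand-in for the real mapping in hf_ptq.py, not its actual contents:

```python
# Trimmed stand-in; the real QUANT_CFG_CHOICES in hf_ptq.py maps qformat names
# to ModelOpt config objects.
QUANT_CFG_CHOICES = {"fp8": object(), "nvfp4": object(), "nvfp4_wo": object()}


def check_qformats(qformats: str) -> None:
    # Derive the allowlist from the dict so newly added formats are auto-included.
    allowed = set(QUANT_CFG_CHOICES)
    for fmt in qformats.split(","):
        assert fmt in allowed, f"unsupported qformat: {fmt}"


check_qformats("nvfp4_wo,fp8")  # passes instead of tripping a stale allowlist
```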
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 853-867: The current code mutates the live model.generation_config
(gen_cfg) which makes the same model instance used by get_model()
non-deterministic; instead, create a copy of the generation_config (e.g., via
copy.deepcopy or by constructing a new GenerationConfig from the dict) and
modify the copy’s do_sample flag, leaving model.generation_config unchanged;
update the export/normalization logic around gen_cfg to use this gen_cfg_copy
(or a temporary variable) so previews/full_model.generate() remain deterministic
and only the exported metadata contains the normalized setting.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only appends a "*lm_head*weight_quantizer" override which causes
activation-quantized recipes (e.g., fp8/nvfp4 applied by mono_quantize) to
become mixed-format at lm_head; update this function so it either (A) checks the
applied recipe/weight_quantizer_cfg and only appends the lm_head weight override
for weight-only formats, or (B) when adding "*lm_head*weight_quantizer" also
append a corresponding "*lm_head*input_quantizer" entry that mirrors the base
input-quantizer entry (use copy.deepcopy of the existing input_quantizer config)
so lm_head keeps the same activation format as the rest of the model; reference
_enable_lm_head_and_embedding_quantization, the "quant_cfg" list entries, and
mono_quantize when implementing the conditional or mirrored input_quantizer
addition.
---
Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 102-123: The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but
not included in the hard-coded qformat allowlist inside auto_quantize(), causing
the assertion failure; update the allowlist in the auto_quantize() function to
include "nvfp4_wo" (or extend the allowlist to derive keys from
QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted;
search for the auto_quantize function and add "nvfp4_wo" to the
qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a191d33d-6d3e-4cf2-abd0-8b9541d5908e
📒 Files selected for processing (6)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/nn/__init__.py
- modelopt/torch/quantization/nn/modules/quant_embedding.py
# Some model cards ship a generation_config.json that sets sampling hyperparameters
# (top_p, temperature) without ``do_sample=True`` (e.g. NVIDIA-Nemotron-3-Nano-4B-BF16).
# transformers 5.x strictly validates this on save_pretrained, so the export step
# fails with "GenerationConfig is invalid". Normalize by enabling do_sample whenever
# a sampling hyperparameter is set — this is only metadata, not behavior during
# calibration or export.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None and not getattr(gen_cfg, "do_sample", False):
    has_sampling_hyperparam = (
        getattr(gen_cfg, "top_p", None) not in (None, 1.0)
        or getattr(gen_cfg, "top_k", None) not in (None, 0, 50)
        or getattr(gen_cfg, "temperature", None) not in (None, 1.0)
    )
    if has_sampling_hyperparam:
        gen_cfg.do_sample = True
🧩 Analysis chain

🏁 Scripts executed against the repository (NVIDIA/Model-Optimizer); output lengths omitted:

cd examples/llm_ptq && head -n 1 example_utils.py && wc -l example_utils.py
cd examples/llm_ptq && sed -n '830,880p' example_utils.py
cd examples/llm_ptq && grep -n "def get_model" example_utils.py | head -5
cd examples/llm_ptq && rg "get_model\(" hf_ptq.py -A 5 -B 2
cd examples/llm_ptq && rg "\.generate\(" hf_ptq.py -B 3 -A 3
cd examples/llm_ptq && sed -n '1,50p' hf_ptq.py | head -30
cd examples/llm_ptq && grep -n "full_model" hf_ptq.py | head -20
cd examples/llm_ptq && sed -n '300,350p' hf_ptq.py
cd examples/llm_ptq && sed -n '334,345p' hf_ptq.py
cd examples/llm_ptq && sed -n '350,365p' hf_ptq.py
cd examples/llm_ptq && grep -n "generated_ids_before_ptq" hf_ptq.py
cd examples/llm_ptq && sed -n '257,270p' hf_ptq.py
cd examples/llm_ptq && sed -n '580,600p' hf_ptq.py
cd examples/llm_ptq && sed -n '880,925p' hf_ptq.py
cd examples/llm_ptq && sed -n '1100,1125p' hf_ptq.py
cd examples/llm_ptq && sed -n '970,1020p' hf_ptq.py
cd examples/llm_ptq && sed -n '1180,1220p' hf_ptq.py
Don't mutate the live generation_config in get_model().
The mutation persists on the returned model object, and both the before-PTQ and after-PTQ preview calls (full_model.generate() at lines 922 and 980 in hf_ptq.py) use that same model instance. For checkpoints with sampling hyperparameters, this makes the previews non-deterministic instead of deterministic, undermining PTQ smoke test comparisons. Normalize a copy during export instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 853 - 867, The current code
mutates the live model.generation_config (gen_cfg) which makes the same model
instance used by get_model() non-deterministic; instead, create a copy of the
generation_config (e.g., via copy.deepcopy or by constructing a new
GenerationConfig from the dict) and modify the copy’s do_sample flag, leaving
model.generation_config unchanged; update the export/normalization logic around
gen_cfg to use this gen_cfg_copy (or a temporary variable) so
previews/full_model.generate() remain deterministic and only the exported
metadata contains the normalized setting.
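The suggested fix can be sketched as a context manager that swaps in a deep copy only for the duration of export. `GenCfg` and `DummyModel` are stand-ins for transformers' GenerationConfig and the HF model, not the real classes:

```python
import copy
from contextlib import contextmanager


class GenCfg:  # stand-in for transformers.GenerationConfig
    def __init__(self, top_p=None, temperature=None, do_sample=False):
        self.top_p, self.temperature, self.do_sample = top_p, temperature, do_sample


@contextmanager
def normalized_generation_config_for_export(model):
    original = getattr(model, "generation_config", None)
    if original is None:
        yield
        return
    normalized = copy.deepcopy(original)
    has_sampling_hyperparam = (
        getattr(original, "top_p", None) not in (None, 1.0)
        or getattr(original, "temperature", None) not in (None, 1.0)
    )
    if not normalized.do_sample and has_sampling_hyperparam:
        normalized.do_sample = True
    model.generation_config = normalized  # visible only inside the with-block
    try:
        yield
    finally:
        model.generation_config = original  # live config stays untouched


class DummyModel:
    def __init__(self):
        self.generation_config = GenCfg(top_p=0.9)
```

Only the exported metadata sees the normalized flag; preview `generate()` calls on the same model instance stay deterministic because the original object is restored on exit.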
…mat ID Integrates the scaffolding from PR #1313 (NVFP4 W4A16 generic support) into this branch so the two PRs don't diverge on naming or export-path coverage. The embedding + Nemotron-H enablement in the previous commit is unchanged; this commit just adopts #1313's conventions for the pieces that overlap. Changes - modelopt/torch/quantization/config.py: rename the W4A16 recipe constant NVFP4_DEFAULT_WEIGHT_ONLY_CFG -> NVFP4_W4A16_CFG to match #1313. - modelopt/torch/export/model_config.py: add QUANTIZATION_NVFP4_W4A16 as a distinct format ID instead of relying on NVFP4 branches tolerating a disabled input_quantizer. - modelopt/torch/export/quant_utils.py: thread NVFP4_W4A16 through get_weight_scaling_factor, get_weight_scaling_factor_2, to_quantized_weight, and the nvfp4_w4a16 branch of process_layer_quant_config. Add explicit W4A16 detection in _get_quantization_from_layer when input_quantizer is absent/disabled. - modelopt/torch/export/unified_export_hf.py: add NVFP4_W4A16 to the weight_scale_2 registration and NVFP4 transpose lists. - modelopt/torch/export/convert_hf_config.py: add NVFP4_W4A16 mapping in _quant_algo_to_group_config and convert_hf_quant_config_format so the llm-compressor conversion emits a weight-only config group. - examples/llm_ptq/hf_ptq.py: rename qformat nvfp4_wo -> nvfp4_w4a16; add --exclude_modules CLI (composes with the Nemotron-H helpers added in the previous commit); emit a post-export vLLM deployment warning. - examples/llm_ptq/scripts/huggingface_example.sh: add nvfp4_w4a16 to the qformat allowlist, EXCLUDE_MODULES env pass-through, and a W4A16 export notice. - CHANGELOG.rst: document W4A16 (covers the overlap with #1313) and the Embedding / Nemotron-H enablement unique to this PR. - tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py: add an nvfp4_w4a16 parametrize entry for the tiny_llama fixture. pre-commit (ruff, ruff-format, mypy, bandit, insert-license, rst-lint) passes on all touched files. 
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)
Lines 597-637: ⚠️ Potential issue | 🟠 Major

Keep `lm_head` quantization aligned with the base Nemotron-H recipe.

This helper only re-enables `*lm_head*weight_quantizer`. When Nemotron-H runs with an activation-aware recipe such as `fp8` or `nvfp4`, `lm_head` stops matching the rest of the model; for NVFP4, modelopt/torch/export/quant_utils.py now even reclassifies it as `nvfp4_w4a16` because the input quantizer stays disabled. Either gate this helper to weight-only recipes or append a mirrored `*lm_head*input_quantizer` rule copied from the base config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 597 - 637, The helper _enable_lm_head_and_embedding_quantization currently only re-enables the lm_head weight quantizer which desynchronizes lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to either (A) only run when the active recipe is weight-only (check quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored "*lm_head*input_quantizer" entry copied from the base/input quantizer config so lm_head keeps the same input quantization as the rest of the model; modify _enable_lm_head_and_embedding_quantization to perform one of these two fixes and ensure the new entry uses copy.deepcopy like the existing weight entries.
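Option (B) from the prompt, mirroring the activation quantizer, amounts to copying the base input-quantizer entry alongside the weight one. A sketch over plain dicts: the keys follow ModelOpt's wildcard convention, but the values are placeholders rather than real recipe settings:

```python
import copy


def enable_lm_head(quant_cfg):
    quant_cfg["*lm_head*weight_quantizer"] = copy.deepcopy(quant_cfg["*weight_quantizer"])
    input_cfg = quant_cfg.get("*input_quantizer")
    # Mirror the activation quantizer only when the recipe actually quantizes
    # activations, so lm_head keeps the same format as the rest of the model.
    if input_cfg and input_cfg.get("enable", True):
        quant_cfg["*lm_head*input_quantizer"] = copy.deepcopy(input_cfg)


nvfp4_like = {
    "*weight_quantizer": {"num_bits": (2, 1)},
    "*input_quantizer": {"num_bits": (2, 1), "enable": True},
}
w4a16_like = {
    "*weight_quantizer": {"num_bits": (2, 1)},
    "*input_quantizer": {"enable": False},
}
enable_lm_head(nvfp4_like)  # gets both lm_head entries
enable_lm_head(w4a16_like)  # weight-only: no input mirror added
```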
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 686-697: The Nemotron-H opt-in currently adds enable/config
entries after quantize_main() has already appended user --exclude_modules rules
(so it can override user exclusions); update the flow so user exclusions are
respected by either moving the
_enable_lm_head_and_embedding_quantization(quant_cfg, weight_quantizer_cfg) call
to run before quantize_main()/before mono_quantize() applies exclude updates, or
(preferably) change _enable_lm_head_and_embedding_quantization to check
quant_cfg.exclude_modules (and any existing disable rules) and skip adding
enable/config entries for "lm_head" or "embeddings" if the user explicitly
excluded them; make this change around quantize_main(), mono_quantize(), and
_enable_lm_head_and_embedding_quantization so the user's --exclude_modules is
never silently undone.
In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 130-132: The wrapper currently expands shell globs because
EXCLUDE_MODULES is injected unquoted into PTQ_ARGS before calling hf_ptq.py; fix
this by preserving the literal pattern when forwarding --exclude_modules: stop
building a single unquoted string and use a bash array for PTQ_ARGS (e.g.,
append the two separate elements "--exclude_modules" and "$EXCLUDE_MODULES") or
otherwise ensure the variable is quoted when added so hf_ptq.py receives the
exact pattern (references: EXCLUDE_MODULES, PTQ_ARGS, and the --exclude_modules
argument passed to hf_ptq.py).
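The array-based forwarding the prompt describes might look like the snippet below. `EXCLUDE_MODULES` and `PTQ_ARGS` mirror the names in huggingface_example.sh, but this is a standalone illustration rather than the script itself:

```shell
#!/usr/bin/env bash
# Build args as an array so wildcard patterns survive word splitting and globbing.
EXCLUDE_MODULES='*embed_tokens*'
PTQ_ARGS=(--qformat nvfp4_w4a16)
if [ -n "$EXCLUDE_MODULES" ]; then
    # Two separate array elements; the quoted expansion keeps the literal '*'.
    PTQ_ARGS+=(--exclude_modules "$EXCLUDE_MODULES")
fi
# "${PTQ_ARGS[@]}" would be forwarded to hf_ptq.py; print to show what it receives.
printf '%s\n' "${PTQ_ARGS[@]}"
```

Because each element expands as its own word, argparse receives `*embed_tokens*` verbatim even if the current directory happens to contain files matching that glob.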
In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 191-198: The NVFP4_W4A16 branch sets config_group_details.targets
to only ["Linear"], which omits embeddings even though weight-only quantization
applies to Embedding layers; update the targets list in the NVFP4_W4A16 branch
(where quant_algo_value == "NVFP4_W4A16" and config_group_details is built) to
include "Embedding" (e.g., ["Linear", "Embedding"]) before assigning
new_config["config_groups"] so compressed-tensors exports match the actual
NVFP4_W4A16 quantization coverage.
---
Duplicate comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only re-enables the lm_head weight quantizer which desynchronizes
lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to
either (A) only run when the active recipe is weight-only (check
quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored
"*lm_head*input_quantizer" entry copied from the base/input quantizer config so
lm_head keeps the same input quantization as the rest of the model; modify
_enable_lm_head_and_embedding_quantization to perform one of these two fixes and
ensure the new entry uses copy.deepcopy like the existing weight entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7568812b-30c3-4c61-bbc4-d04ef8fc9364
📒 Files selected for processing (9)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
- modelopt/torch/export/model_config.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
- modelopt/torch/export/model_config.py
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/unified_export_hf.py
- example_utils: swap in a normalized generation_config via context manager during export instead of mutating the live one in get_model() — preview generate() calls stay deterministic
- hf_ptq: mirror *lm_head*input_quantizer for activation-aware recipes so lm_head doesn't silently downgrade to W4A16 under NVFP4/FP8
- hf_ptq: respect --exclude_modules in the Nemotron-H lm_head/embedding override so user exclusions aren't silently undone
- hf_ptq: add nvfp4_w4a16 to the auto_quantize qformat allowlist for consistency with QUANT_CFG_CHOICES
- huggingface_example.sh: pass --exclude_modules via a bash array (set -f) so wildcard patterns like *embed_tokens* reach argparse verbatim instead of being glob-expanded against the filesystem
- convert_hf_config: include Embedding in the NVFP4_W4A16 target set so compressed-tensors consumers dispatch on quantized embedding weights

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)
Lines 873-876: Consider documenting the `top_k=50` check.

The check `getattr(original, "top_k", None) not in (None, 0, 50)` treats `top_k=50` as a "default/unset" value. While 50 is indeed the transformers default, this implicit knowledge could benefit from a brief inline comment for maintainability.

💡 Suggested clarification

      has_sampling_hyperparam = (
          getattr(original, "top_p", None) not in (None, 1.0)
    -     or getattr(original, "top_k", None) not in (None, 0, 50)
    +     or getattr(original, "top_k", None) not in (None, 0, 50)  # 50 is transformers default
          or getattr(original, "temperature", None) not in (None, 1.0)
      )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/example_utils.py` around lines 873 - 876, The condition using getattr(original, "top_k", None) not in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a brief inline comment next to that expression (or expand the surrounding docstring) stating that 50 is the HuggingFace/transformers default so it should be treated as unset, e.g., “# 50 is transformers' default top_k, treat as unset”; ensure the comment references getattr(original, "top_k", None) so future readers understand why 50 is excluded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 873-876: The condition using getattr(original, "top_k", None) not
in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a
brief inline comment next to that expression (or expand the surrounding
docstring) stating that 50 is the HuggingFace/transformers default so it should
be treated as unset, e.g., “# 50 is transformers' default top_k, treat as
unset”; ensure the comment references getattr(original, "top_k", None) so future
readers understand why 50 is excluded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0eeb4272-c356-455a-9a92-0ae3a6fbd489
📒 Files selected for processing (4)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
mts.export(full_model)


def _enable_lm_head_and_embedding_quantization(
Can we define this in the modelopt_recipe if everything in modelopt_recipes/models can be captured with our yaml recipe system?
# For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
# extend quantization coverage to the lm_head and the input token embedding. On this
# architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
# them at bf16 wastes most of the NVFP4 memory benefit.
if model_type == "nemotron_h":
    weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
    if weight_quantizer_cfg is not None:
        # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
        # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
        # ``lm_head`` stays weight-only along with the embedding.
        input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
        print(
            "Nemotron-H detected: extending quantization to lm_head and input embedding "
            "(backbone.embeddings)."
        )
        _enable_lm_head_and_embedding_quantization(
            quant_cfg,
            weight_quantizer_cfg,
            input_quantizer_cfg=input_quantizer_cfg,
            user_excluded_modules=args.exclude_modules or None,
        )
    else:
        warnings.warn(
            "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
            "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
        )
Same as the previous comment, wondering if our recipe system can replace this ad hoc change to support specific models
Related PR

Aligns with #1313 (Support NVFP4 W4A16 quantization) — shares the NVFP4_W4A16_CFG recipe, the nvfp4_w4a16 qformat, and the QUANTIZATION_NVFP4_W4A16 format ID. This PR adds the embedding + lm_head quantization support on top.

Summary

Extends ModelOpt PTQ so the input token embedding and output LM head can participate in NVFP4 quantization, and wires that up for Nemotron-H where those two 131072×3136 tables are ~21% of parameters and leaving them in bf16 wastes most of the compression.

Changes

Core quantization library
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new) — Register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that inherits QuantLinearConvBase but disables the input_quantizer by default (embedding inputs are integer indices, not activations; output_quantizer is already disabled by QuantInputBase._setup).
- modelopt/torch/quantization/nn/__init__.py — Import the new quant_embedding module so registration fires at library import time.

Export
- modelopt/torch/export/unified_export_hf.py — _process_quantized_modules now also walks quantized Embedding modules (previously is_quantlinear-only), so the NVFP4 packing + scale registration path in _export_quantized_weight runs for them on export.

hf_ptq.py example
- For model_type == "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer and target the backbone embedding (*embeddings* / *embed_tokens*), overriding the default *lm_head* disable in _default_disabled_quantizer_cfg. Guarded helpers (_enable_lm_head_and_embedding_quantization, _extract_weight_quantizer_cfg) so the override only fires when a standard *weight_quantizer entry is present.

example_utils.py — environment workarounds
These are idempotent workarounds for transformers 5.5.x's partial Nemotron-H port; they no-op on a fixed transformers (e.g. inside the TRT Docker container's newer wheel):
- NemotronHConfig._pattern_to_list: add - → mlp
- ALLOWED_LAYER_TYPES: add "mlp"
- NemotronHConfig.validate_layers_block_type: accept "mlp" (also update __class_validators__ since huggingface_hub's @strict_dataclass snapshots validators at class-creation time, so overwriting the method attribute alone isn't enough)
- MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the layer_idx kwarg passed by NemotronHBlock
- NemotronHBlock.__init__: alias block_type == "mlp" → "moe" so the inline block_type_to_mask lookup in NemotronHModel.forward resolves to None (dispatch is unaffected — the block's forward routes both through the same else branch that calls self.mixer(hidden_states))
- generation_config: set do_sample=True when sampling hyperparams are set, so export's save_pretrained passes transformers 5.x strict validation

Validation

End-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --qformat nvfp4_w4a16 --kv_cache_qformat none \
    --trust_remote_code --dataset cnn_dailymail \
    --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
    --export_path /tmp/nemotron_3_nano_4b_nvfp4_w4a16

Produces a 2.2 GB unified HF checkpoint (vs 7.5 GB bf16), with model.embeddings.weight and lm_head.weight both stored as packed NVFP4 uint8 + FP8 per-block scales + FP32 global scale. hf_quant_config.json reports quant_algo: NVFP4_W4A16, group_size: 16, and exclude_modules contains only the 21 Mamba conv1d layers (the default _default_disabled_quantizer_cfg entry for *mixer.conv1d*). pre-commit run --files <staged> passes (ruff, ruff-format, mypy, bandit, insert-license, rst-lint).

Follow-ups (separate PRs)
- compressed-tensors conversion script for vLLM consumption: renames *.weight → *.weight_packed and *.weight_scale_2 → *.weight_global_scale (inverted), and rewrites config.json quantization_config to format: nvfp4-pack-quantized / quant_method: compressed-tensors. Already prototyped out-of-tree; just needs cleanup + tests.
- offline vLLM inference script for the converted checkpoint (vllm.LLM with chat-template rendering, max_model_len cap, --enforce-eager default for Mamba/SSM). Already prototyped out-of-tree.
- Nemotron-H config.json post-export cleanup: transformers 5.x strips hybrid_override_pattern in favor of the derived layers_block_type list, which breaks reload via the checkpoint's remote configuration_nemotron_h.py (its layers_block_type is a read-only @property). The export path should restore hybrid_override_pattern and set num_hidden_layers explicitly for model_type == "nemotron_h".
- optional --vllm-compat hf_ptq flag that additionally excludes Mamba in_proj (output dim 17504 = intermediate + conv_dim + num_heads isn't divisible by 64, violating vLLM's Marlin repack alignment) and leaves lm_head / model.embeddings in bf16 (vLLM's ParallelLMHead / VocabParallelEmbedding don't consume compressed-tensors scales), so the export is consumable by vLLM out of the box.
- the example_utils.py monkey-patches can be dropped once transformers ships a fixed Nemotron-H port.

Test plan
- nn.Embedding registers and is replaced with QuantEmbedding under mtq.quantize(..., NVFP4_W4A16_CFG, forward_loop=None) on a toy Sequential(Embedding, Linear) model; verified forward pass on CUDA.
- End-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (see Validation above).
- GPU unit test in tests/gpu/torch/export/ for nn.Embedding weight packing — follow-up once the conversion/load path lands.
- hf_ptq.py, and this PR doesn't change that.

Summary by CodeRabbit

Release Notes

New Features
- --exclude_modules CLI flag to selectively exclude specific modules from quantization

Tests