
Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe#1327

Open
ajrasane wants to merge 3 commits into main from ajrasane/nemotron-3-nano

Conversation

@ajrasane
Contributor

@ajrasane ajrasane commented Apr 22, 2026

Related PR

Aligns with #1313 (Support NVFP4 W4A16 quantization) — shares the NVFP4_W4A16_CFG recipe, the nvfp4_w4a16 qformat, and the QUANTIZATION_NVFP4_W4A16 format ID. This PR adds the embedding + lm_head quantization support on top.

Summary

Extends ModelOpt PTQ so the input token embedding and output LM head can participate in NVFP4 quantization, and wires that up for Nemotron-H where those two 131072×3136 tables are ~21% of parameters and leaving them in bf16 wastes most of the compression.

Changes

Core quantization library

  • modelopt/torch/quantization/nn/modules/quant_embedding.py (new) — Register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that inherits QuantLinearConvBase but disables the input_quantizer by default (embedding inputs are integer indices, not activations; output_quantizer is already disabled by QuantInputBase._setup).
  • modelopt/torch/quantization/nn/__init__.py — Import the new quant_embedding module so registration fires at library import time.
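The registration pattern described above can be sketched with stand-in classes. The names (QuantLinearConvBase, input_quantizer, output_quantizer) mirror the PR description, but the registry mechanics below are simplified stand-ins for illustration, not modelopt's real API:

```python
class Quantizer:
    """Stand-in for a fake-quantizer with an enable flag."""
    def __init__(self):
        self.enabled = True
    def disable(self):
        self.enabled = False

class QuantLinearConvBase:
    """Stand-in for the shared weight-quantizing base class."""
    def _setup(self):
        self.weight_quantizer = Quantizer()
        self.input_quantizer = Quantizer()
        self.output_quantizer = Quantizer()
        self.output_quantizer.disable()  # mirrors QuantInputBase._setup

QUANT_MODULE_REGISTRY = {}

def register(original_cls):
    """Map an original module class to its quantized replacement."""
    def deco(quant_cls):
        QUANT_MODULE_REGISTRY[original_cls] = quant_cls
        return quant_cls
    return deco

class Embedding:  # stand-in for nn.Embedding
    pass

@register(Embedding)
class QuantEmbedding(QuantLinearConvBase):
    def _setup(self):
        super()._setup()
        # Embedding inputs are integer indices, not activations:
        self.input_quantizer.disable()
```

On this sketch, quantize-time module replacement consults the registry, and only the weight quantizer of the embedding stays enabled.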

Export

  • modelopt/torch/export/unified_export_hf.py — _process_quantized_modules now also walks quantized Embedding modules (previously it visited is_quantlinear modules only), so the NVFP4 packing + scale-registration path in _export_quantized_weight runs for them on export.

hf_ptq.py example

  • For model_type == "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer and target the backbone embedding (*embeddings* / *embed_tokens*), overriding the default *lm_head* disable in _default_disabled_quantizer_cfg. Guarded helpers (_enable_lm_head_and_embedding_quantization, _extract_weight_quantizer_cfg) so the override only fires when a standard *weight_quantizer entry is present.
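The override logic can be sketched on a flat pattern-to-settings dict. The config layout and the NVFP4-style values here are assumptions for illustration; the real recipe lives in modelopt's NVFP4_W4A16_CFG:

```python
import copy

quant_cfg = {
    "quant_cfg": {
        # illustrative NVFP4-style weight entry; real values come from NVFP4_W4A16_CFG
        "*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        "*lm_head*": {"enable": False},  # the default lm_head disable
    }
}

def enable_lm_head_and_embedding(cfg):
    """Guarded: only fire when a standard *weight_quantizer entry is present."""
    base = cfg["quant_cfg"].get("*weight_quantizer")
    if base is None:
        return cfg
    for pattern in (
        "*lm_head*weight_quantizer",
        "*embeddings*weight_quantizer",
        "*embed_tokens*weight_quantizer",
    ):
        # more-specific patterns override the default *lm_head* disable
        cfg["quant_cfg"][pattern] = copy.deepcopy(base)
    return cfg
```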

example_utils.py — environment workarounds

These are idempotent workarounds for transformers 5.5.x's partial Nemotron-H port; they no-op on a fixed transformers (e.g. inside the TRT Docker container's newer wheel):

  • NemotronHConfig._pattern_to_list: add a "-" → "mlp" mapping
  • ALLOWED_LAYER_TYPES: add "mlp"
  • NemotronHConfig.validate_layers_block_type: accept "mlp" (also update __class_validators__ since huggingface_hub's @strict_dataclass snapshots validators at class-creation time, so overwriting the method attribute alone isn't enough)
  • MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the layer_idx kwarg passed by NemotronHBlock
  • NemotronHBlock.__init__: alias block_type == "mlp" to "moe" so the inline block_type_to_mask lookup in NemotronHModel.forward resolves to None (dispatch is unaffected — the block's forward routes both through the same else branch that calls self.mixer(hidden_states))
  • generation_config: set do_sample=True when sampling hyperparams are set, so export's save_pretrained passes transformers 5.x strict validation
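The do_sample normalization can be sketched as follows. The "hyperparameter is set" checks compare against transformers' GenerationConfig defaults (top_p=1.0, top_k=50, temperature=1.0), but the class below is a minimal stand-in, and returning a copy (rather than mutating the model's config) is one way to keep preview generation untouched:

```python
import copy

class GenerationConfig:
    """Minimal stand-in for transformers' GenerationConfig."""
    def __init__(self, **kw):
        self.do_sample = kw.get("do_sample", False)
        self.top_p = kw.get("top_p")
        self.top_k = kw.get("top_k")
        self.temperature = kw.get("temperature")

def normalized_for_export(gen_cfg):
    """Return a copy with do_sample enabled when a sampling hyperparam is set."""
    cfg = copy.deepcopy(gen_cfg)
    has_sampling_hyperparam = (
        cfg.top_p not in (None, 1.0)
        or cfg.top_k not in (None, 0, 50)
        or cfg.temperature not in (None, 1.0)
    )
    if has_sampling_hyperparam and not cfg.do_sample:
        cfg.do_sample = True
    return cfg
```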

Validation

End-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --qformat nvfp4_w4a16 --kv_cache_qformat none \
    --trust_remote_code --dataset cnn_dailymail \
    --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
    --export_path /tmp/nemotron_3_nano_4b_nvfp4_w4a16

Produces a 2.2 GB unified HF checkpoint (vs 7.5 GB bf16), with model.embeddings.weight and lm_head.weight both stored as packed NVFP4 uint8 + FP8 per-block scales + FP32 global scale. hf_quant_config.json reports quant_algo: NVFP4_W4A16, group_size: 16, and exclude_modules contains only the 21 Mamba conv1d layers (the default _default_disabled_quantizer_cfg entry for *mixer.conv1d*).
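A back-of-envelope check of those numbers, as a sketch: the parameter count is inferred from the 7.5 GB bf16 size, and metadata plus the unquantized conv1d layers are ignored, so the result lands within about a point of the reported ~21% share and near the 2.2 GB checkpoint size:

```python
vocab, hidden = 131072, 3136
embed_params = vocab * hidden              # one 131072x3136 table
total_params = 7.5e9 / 2                   # bf16 stores 2 bytes per parameter
share = 2 * embed_params / total_params    # embedding + lm_head together

# NVFP4 with group_size=16: 4-bit weights plus one FP8 scale per 16 weights;
# the single FP32 global scale per tensor is negligible.
bits_per_param = 4 + 8 / 16
nvfp4_gb = total_params * bits_per_param / 8 / 1e9
print(f"embedding+lm_head share: {share:.0%}, NVFP4 size: ~{nvfp4_gb:.1f} GB")
```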

pre-commit run --files <staged> passes (ruff, ruff-format, mypy, bandit, insert-license, rst-lint).

Follow-ups (separate PRs)

  • Compressed-tensors conversion script for vLLM consumption: renames *.weight → *.weight_packed, *.weight_scale_2 → *.weight_global_scale (inverted), and rewrites config.json quantization_config to format: nvfp4-pack-quantized / quant_method: compressed-tensors. Already prototyped out-of-tree; just needs cleanup + tests.
  • Offline vLLM inference script for the converted checkpoint (CLI wrapping vllm.LLM with chat-template rendering, max_model_len cap, --enforce-eager default for Mamba/SSM). Already prototyped out-of-tree.
  • Nemotron-H config.json post-export cleanup: transformers 5.x strips hybrid_override_pattern in favor of the derived layers_block_type list, which breaks reload via the checkpoint's remote configuration_nemotron_h.py (its layers_block_type is a read-only @property). The export path should restore hybrid_override_pattern and set num_hidden_layers explicitly for model_type == "nemotron_h".
  • Optional --vllm-compat hf_ptq flag that additionally excludes Mamba in_proj (output dim 17504 = intermediate + conv_dim + num_heads isn't divisible by 64, violating vLLM's Marlin repack alignment) and leaves lm_head / model.embeddings in bf16 (vLLM's ParallelLMHead / VocabParallelEmbedding don't consume compressed-tensors scales), so the export is consumable by vLLM out of the box.
  • Upstream the transformers 5.5.x Nemotron-H fixes so the example_utils.py monkey-patches can be dropped.
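The tensor-name mapping for the first follow-up can be sketched like this. It covers only the renames described above, and the exact compressed-tensors layout should be verified against vLLM/llm-compressor before relying on it:

```python
def convert_entry(name, value):
    """Map a ModelOpt tensor entry to its compressed-tensors counterpart.

    `value` stands in for the tensor; a plain float suffices for the
    global scale in this sketch.
    """
    if name.endswith(".weight_scale_2"):
        # compressed-tensors expects the inverted ("reciprocal") global scale
        new_name = name[: -len("weight_scale_2")] + "weight_global_scale"
        return new_name, 1.0 / value
    if name.endswith(".weight"):
        return name + "_packed", value  # packed NVFP4 payload
    return name, value                  # e.g. .weight_scale stays as-is
```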

Test plan

  • pre-commit (ruff, ruff-format, mypy, bandit, insert-license) passes on the staged files.
  • Smoke test that nn.Embedding registers and is replaced with QuantEmbedding under mtq.quantize(..., NVFP4_W4A16_CFG, forward_loop=None) on a toy Sequential(Embedding, Linear) model; verified forward pass on CUDA.
  • End-to-end PTQ + unified HF export on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (see Validation above).
  • GPU unit/integration test under tests/gpu/torch/export/ for nn.Embedding weight packing — follow-up once the conversion/load path lands.
  • Multi-GPU / tensor-parallel export path — not exercised; Nemotron-H's accelerate-plus-multi-GPU path is already flagged as known-broken in hf_ptq.py, and this PR doesn't change that.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced NVFP4 W4A16 weight-only quantization format for optimized model deployment
    • Added --exclude_modules CLI flag to selectively exclude specific modules from quantization
    • Extended quantization pipeline to support embedding layer quantization
    • Enhanced HuggingFace export pipeline with quantized embedding weight packing and Nemotron-H model compatibility
  • Tests

    • Added test coverage for NVFP4 W4A16 quantization format validation

…cipe

Extends ModelOpt PTQ so the input token embedding and output LM head can
participate in NVFP4 quantization, and wires that up for Nemotron-H where
those two 131072x3136 tables are ~21% of parameters and leaving them in
bf16 wastes most of the compression.

Changes
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new):
  register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that
  inherits QuantLinearConvBase but disables the input_quantizer by default
  (embedding inputs are integer indices, not activations).
- modelopt/torch/quantization/config.py: add NVFP4_DEFAULT_WEIGHT_ONLY_CFG
  (W4A16) via the existing _nvfp4_selective_quant_cfg(..., weight_only=True)
  helper; export via the `choices` set.
- modelopt/torch/export/unified_export_hf.py: _process_quantized_modules now
  also walks quantized Embedding modules (previously is_quantlinear-only),
  so the NVFP4 packing + scale registration path runs for them on export.
- examples/llm_ptq/hf_ptq.py: add `nvfp4_wo` qformat. For model_type ==
  "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer
  and target the backbone embedding (*embeddings* / *embed_tokens*),
  overriding the default *lm_head* disable in _default_disabled_quantizer_cfg.
- examples/llm_ptq/example_utils.py: environment workarounds so the example
  runs on transformers 5.5.x's partial Nemotron-H port (idempotent, no-op
  on fixed transformers):
    * NemotronHConfig._pattern_to_list: add `-` -> `mlp`
    * ALLOWED_LAYER_TYPES: add `"mlp"`
    * NemotronHConfig.validate_layers_block_type: accept `"mlp"` (also
      update __class_validators__ since @strict_dataclass snapshots it)
    * MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the
      layer_idx kwarg passed by NemotronHBlock
    * NemotronHBlock.__init__: alias block_type=="mlp" -> "moe" so the
      inline block_type_to_mask lookup in NemotronHModel.forward resolves
      to None (dispatch is unaffected — the block's forward routes both
      through the same `else` branch)
    * generation_config: set do_sample=True when sampling hyperparams are
      set, so export's save_pretrained passes 5.x strict validation

Validated end-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:
  python examples/llm_ptq/hf_ptq.py \
      --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
      --qformat nvfp4_wo --kv_cache_qformat none \
      --trust_remote_code --dataset cnn_dailymail \
      --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
      --export_path /tmp/nemotron_3_nano_4b_nvfp4_wo
Produces a 2.1 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings and lm_head both exported as packed NVFP4 uint8 +
FP8 per-block scales + FP32 global scales.

Follow-ups (separate PRs):
- compressed-tensors conversion script for vLLM consumption (rename
  weight -> weight_packed, weight_scale_2 -> weight_global_scale, rewrite
  config.json quantization_config to format=nvfp4-pack-quantized).
- offline vLLM inference script for the converted checkpoint.
- Nemotron-H config.json post-export cleanup (transformers 5.x strips
  hybrid_override_pattern in favor of the derived layers_block_type,
  which breaks reload via the checkpoint's remote configuration_nemotron_h.py
  because layers_block_type there is a read-only property).
- optional --vllm-compat hf_ptq flag that also excludes Mamba in_proj
  (output dim 17504 not divisible by 64, violating vLLM's Marlin repack
  alignment) so the export is consumable by vLLM out of the box.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane requested review from a team as code owners April 22, 2026 22:33
@ajrasane ajrasane requested review from meenchen and realAsma April 22, 2026 22:33
@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Walkthrough

Adds NVFP4 W4A16 weight‑only quantization end‑to‑end (config, selection, exporter packing, tests), registers nn.Embedding quantization and exports embedding weights, introduces Nemotron‑H mixer-type/runtime patches and a generation_config normalization context used during export, and exposes --exclude_modules in PTQ CLI and scripts.

Changes

Cohort / File(s) Summary
Nemotron‑H compatibility & model loading
examples/llm_ptq/example_utils.py
Adds _maybe_patch_transformers_nemotron_h_mixer_types() to register "mlp" mixer type, relax Nemotron‑H config parsing/validation, and provides normalized_generation_config_for_export(model) context manager invoked before HF config/model export.
PTQ CLI & example workflow
examples/llm_ptq/hf_ptq.py, examples/llm_ptq/scripts/huggingface_example.sh
Adds nvfp4_w4a16 qformat, introduces --exclude_modules CLI handling (and propagation from script), wraps export with generation-config normalization, and implements Nemotron‑H-specific quant config extensions for lm_head/embeddings during mono_quantize plus export-time warning for NVFP4_W4A16 runtime support.
Exporter constants & conversion
modelopt/torch/export/model_config.py, modelopt/torch/export/convert_hf_config.py, modelopt/torch/export/unified_export_hf.py
Adds QUANTIZATION_NVFP4_W4A16, maps it to a weights-only FP4 group in HF quant-config conversion, recognizes the format in unified exporter weight export/packing, and includes embeddings in the quantized-weight export path.
Quantization core: NVFP4 weight‑only flow
modelopt/torch/quantization/config.py, modelopt/torch/export/quant_utils.py
Adds NVFP4_W4A16_CFG (weight-only selective config) to choices and wires nvfp4_w4a16 into detection, scaling/packing, TRT export config generation, and to_quantized_weight handling (treats absent input_quantizer as weight‑only).
Embedding quantization module & package export
modelopt/torch/quantization/nn/modules/quant_embedding.py, modelopt/torch/quantization/nn/__init__.py
Implements _QuantEmbedding/QuantEmbedding (registered with QuantModuleRegistry, per-row linear weight quantization, disables input fake‑quant for indices) and re-exports embedding quantization at package level.
Tests
tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
Adds nvfp4_w4a16 test entry to the unified HF export + safetensors verification matrix (sanity/presence checks for the exported artifact).
Changelog
CHANGELOG.rst
Documents NVFP4 W4A16 weight‑only quantization, embedding export support, Nemotron‑H runtime enablement, and new CLI --exclude_modules hook.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant PTQ_Script
    participant HF_PTQ
    participant Quant_Config_System
    participant Exporter
    participant Runtime
    User->>CLI: invoke script (QFORMAT=nvfp4_w4a16, EXCLUDE_MODULES)
    CLI->>PTQ_Script: pass args (including --exclude_modules)
    PTQ_Script->>HF_PTQ: call hf_ptq.py with args
    HF_PTQ->>Quant_Config_System: build/validate quant config (apply NVFP4_W4A16_CFG, Nemotron-H extensions)
    HF_PTQ->>Exporter: mono_quantize/export_quantized() wrapped by normalized_generation_config_for_export()
    Exporter->>Runtime: write quantized safetensors (include embeddings, NVFP4 packing)
    Exporter->>User: emit export path & warnings (nvfp4_w4a16 runtime notes)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Security Anti-Patterns ❓ Inconclusive Security audit completed for torch.load usage, numpy.load with pickling, hardcoded trust_remote_code=True, eval/exec calls, nosec comments, and new dependencies. No modified Python files or repository content was provided for analysis. Please provide the files or repository details to complete the security audit.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main changes: quantizing lm_head and embedding for Nemotron-H, and adding NVFP4 W4A16 recipe support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

@ajrasane ajrasane changed the title feat: quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe Apr 22, 2026
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1327/

Built to branch gh-pages at 2026-04-24 16:32 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)

102-123: ⚠️ Potential issue | 🟠 Major

nvfp4_wo is still rejected by auto_quantize().

The new choice is exposed here, but the hard-coded qformat allowlist in auto_quantize() (Lines 325-344) was not updated. --auto_quantize_bits --qformat nvfp4_wo,... now fails the assertion even though the format is advertised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 102 - 123, The qformat "nvfp4_wo"
was added to QUANT_CFG_CHOICES but not included in the hard-coded qformat
allowlist inside auto_quantize(), causing the assertion failure; update the
allowlist in the auto_quantize() function to include "nvfp4_wo" (or extend the
allowlist to derive keys from QUANT_CFG_CHOICES) so that --auto_quantize_bits
--qformat nvfp4_wo is accepted; search for the auto_quantize function and add
"nvfp4_wo" to the qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 853-867: The current code mutates the live model.generation_config
(gen_cfg) which makes the same model instance used by get_model()
non-deterministic; instead, create a copy of the generation_config (e.g., via
copy.deepcopy or by constructing a new GenerationConfig from the dict) and
modify the copy’s do_sample flag, leaving model.generation_config unchanged;
update the export/normalization logic around gen_cfg to use this gen_cfg_copy
(or a temporary variable) so previews/full_model.generate() remain deterministic
and only the exported metadata contains the normalized setting.

In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only appends a "*lm_head*weight_quantizer" override which causes
activation-quantized recipes (e.g., fp8/nvfp4 applied by mono_quantize) to
become mixed-format at lm_head; update this function so it either (A) checks the
applied recipe/weight_quantizer_cfg and only appends the lm_head weight override
for weight-only formats, or (B) when adding "*lm_head*weight_quantizer" also
append a corresponding "*lm_head*input_quantizer" entry that mirrors the base
input-quantizer entry (use copy.deepcopy of the existing input_quantizer config)
so lm_head keeps the same activation format as the rest of the model; reference
_enable_lm_head_and_embedding_quantization, the "quant_cfg" list entries, and
mono_quantize when implementing the conditional or mirrored input_quantizer
addition.

---

Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 102-123: The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but
not included in the hard-coded qformat allowlist inside auto_quantize(), causing
the assertion failure; update the allowlist in the auto_quantize() function to
include "nvfp4_wo" (or extend the allowlist to derive keys from
QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted;
search for the auto_quantize function and add "nvfp4_wo" to the
qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a191d33d-6d3e-4cf2-abd0-8b9541d5908e

📥 Commits

Reviewing files that changed from the base of the PR and between e56682e and 43c3454.

📒 Files selected for processing (6)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/nn/__init__.py
  • modelopt/torch/quantization/nn/modules/quant_embedding.py

Comment thread examples/llm_ptq/example_utils.py Outdated
Comment on lines +853 to +867
# Some model cards ship a generation_config.json that sets sampling hyperparameters
# (top_p, temperature) without ``do_sample=True`` (e.g. NVIDIA-Nemotron-3-Nano-4B-BF16).
# transformers 5.x strictly validates this on save_pretrained, so the export step
# fails with "GenerationConfig is invalid". Normalize by enabling do_sample whenever
# a sampling hyperparameter is set — this is only metadata, not behavior during
# calibration or export.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None and not getattr(gen_cfg, "do_sample", False):
    has_sampling_hyperparam = (
        getattr(gen_cfg, "top_p", None) not in (None, 1.0)
        or getattr(gen_cfg, "top_k", None) not in (None, 0, 50)
        or getattr(gen_cfg, "temperature", None) not in (None, 1.0)
    )
    if has_sampling_hyperparam:
        gen_cfg.do_sample = True

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: NVIDIA/Model-Optimizer; outputs collapsed in the original review):

cd examples/llm_ptq && head -n 1 example_utils.py && wc -l example_utils.py
cd examples/llm_ptq && sed -n '830,880p' example_utils.py
cd examples/llm_ptq && grep -n "def get_model" example_utils.py | head -5
cd examples/llm_ptq && rg "get_model\(" hf_ptq.py -A 5 -B 2
cd examples/llm_ptq && rg "\.generate\(" hf_ptq.py -B 3 -A 3
cd examples/llm_ptq && sed -n '1,50p' hf_ptq.py | head -30
cd examples/llm_ptq && grep -n "full_model" hf_ptq.py | head -20
cd examples/llm_ptq && sed -n '300,350p' hf_ptq.py
cd examples/llm_ptq && sed -n '334,345p' hf_ptq.py
cd examples/llm_ptq && sed -n '350,365p' hf_ptq.py
cd examples/llm_ptq && grep -n "generated_ids_before_ptq" hf_ptq.py
cd examples/llm_ptq && sed -n '257,270p' hf_ptq.py
cd examples/llm_ptq && sed -n '580,600p' hf_ptq.py
cd examples/llm_ptq && sed -n '880,925p' hf_ptq.py
cd examples/llm_ptq && sed -n '1100,1125p' hf_ptq.py
cd examples/llm_ptq && sed -n '970,1020p' hf_ptq.py
cd examples/llm_ptq && sed -n '1180,1220p' hf_ptq.py



Don't mutate the live generation_config in get_model().

The mutation persists on the returned model object, and both the before-PTQ and after-PTQ preview calls (full_model.generate() at lines 922 and 980 in hf_ptq.py) use that same model instance. For checkpoints with sampling hyperparameters, this makes the previews non-deterministic instead of deterministic, undermining PTQ smoke test comparisons. Normalize a copy during export instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 853 - 867, The current code
mutates the live model.generation_config (gen_cfg) which makes the same model
instance used by get_model() non-deterministic; instead, create a copy of the
generation_config (e.g., via copy.deepcopy or by constructing a new
GenerationConfig from the dict) and modify the copy’s do_sample flag, leaving
model.generation_config unchanged; update the export/normalization logic around
gen_cfg to use this gen_cfg_copy (or a temporary variable) so
previews/full_model.generate() remain deterministic and only the exported
metadata contains the normalized setting.

Comment thread examples/llm_ptq/hf_ptq.py Outdated
…mat ID

Integrates the scaffolding from PR #1313 (NVFP4 W4A16 generic support) into
this branch so the two PRs don't diverge on naming or export-path coverage.
The embedding + Nemotron-H enablement in the previous commit is unchanged;
this commit just adopts #1313's conventions for the pieces that overlap.

Changes
- modelopt/torch/quantization/config.py: rename the W4A16 recipe constant
  NVFP4_DEFAULT_WEIGHT_ONLY_CFG -> NVFP4_W4A16_CFG to match #1313.
- modelopt/torch/export/model_config.py: add QUANTIZATION_NVFP4_W4A16 as a
  distinct format ID instead of relying on NVFP4 branches tolerating a
  disabled input_quantizer.
- modelopt/torch/export/quant_utils.py: thread NVFP4_W4A16 through
  get_weight_scaling_factor, get_weight_scaling_factor_2,
  to_quantized_weight, and the nvfp4_w4a16 branch of
  process_layer_quant_config. Add explicit W4A16 detection in
  _get_quantization_from_layer when input_quantizer is absent/disabled.
- modelopt/torch/export/unified_export_hf.py: add NVFP4_W4A16 to the
  weight_scale_2 registration and NVFP4 transpose lists.
- modelopt/torch/export/convert_hf_config.py: add NVFP4_W4A16 mapping in
  _quant_algo_to_group_config and convert_hf_quant_config_format so the
  llm-compressor conversion emits a weight-only config group.
- examples/llm_ptq/hf_ptq.py: rename qformat nvfp4_wo -> nvfp4_w4a16; add
  --exclude_modules CLI (composes with the Nemotron-H helpers added in
  the previous commit); emit a post-export vLLM deployment warning.
- examples/llm_ptq/scripts/huggingface_example.sh: add nvfp4_w4a16 to the
  qformat allowlist, EXCLUDE_MODULES env pass-through, and a W4A16 export
  notice.
- CHANGELOG.rst: document W4A16 (covers the overlap with #1313) and the
  Embedding / Nemotron-H enablement unique to this PR.
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py:
  add an nvfp4_w4a16 parametrize entry for the tiny_llama fixture.
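The W4A16 detection rule described for _get_quantization_from_layer can be condensed into a sketch. The objects below are stand-ins (the real helper inspects TensorQuantizer state), and num_bits == (2, 1) is used here as the FP4 E2M1 marker by assumption:

```python
class FakeQuantizer:
    """Stand-in for a TensorQuantizer with an enable flag and bit config."""
    def __init__(self, enabled=True, num_bits=(2, 1)):
        self.is_enabled = enabled
        self.num_bits = num_bits  # (2, 1) ~ FP4 E2M1 in this sketch

def detect_format(weight_quantizer, input_quantizer):
    if weight_quantizer is None or not weight_quantizer.is_enabled:
        return None  # layer is not weight-quantized at all
    if weight_quantizer.num_bits == (2, 1):  # NVFP4-style weight quantizer
        if input_quantizer is None or not input_quantizer.is_enabled:
            return "NVFP4_W4A16"  # weight-only: activations stay high precision
        return "NVFP4"
    return "OTHER"
```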

pre-commit (ruff, ruff-format, mypy, bandit, insert-license, rst-lint)
passes on all touched files.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)

597-637: ⚠️ Potential issue | 🟠 Major

Keep lm_head quantization aligned with the base Nemotron-H recipe.

This helper only re-enables *lm_head*weight_quantizer. When Nemotron-H runs with an activation-aware recipe such as fp8 or nvfp4, lm_head stops matching the rest of the model; for NVFP4, modelopt/torch/export/quant_utils.py now even reclassifies it as nvfp4_w4a16 because the input quantizer stays disabled. Either gate this helper to weight-only recipes or append a mirrored *lm_head*input_quantizer rule copied from the base config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 597 - 637, The helper
_enable_lm_head_and_embedding_quantization currently only re-enables the lm_head
weight quantizer which desynchronizes lm_head when activation-aware recipes
(e.g. fp8/nvfp4) are used; update it to either (A) only run when the active
recipe is weight-only (check quant_cfg["algorithm"] or similar indicator) OR (B)
also append a mirrored "*lm_head*input_quantizer" entry copied from the
base/input quantizer config so lm_head keeps the same input quantization as the
rest of the model; modify _enable_lm_head_and_embedding_quantization to perform
one of these two fixes and ensure the new entry uses copy.deepcopy like the
existing weight entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 686-697: The Nemotron-H opt-in currently adds enable/config
entries after quantize_main() has already appended user --exclude_modules rules
(so it can override user exclusions); update the flow so user exclusions are
respected by either moving the
_enable_lm_head_and_embedding_quantization(quant_cfg, weight_quantizer_cfg) call
to run before quantize_main()/before mono_quantize() applies exclude updates, or
(preferably) change _enable_lm_head_and_embedding_quantization to check
quant_cfg.exclude_modules (and any existing disable rules) and skip adding
enable/config entries for "lm_head" or "embeddings" if the user explicitly
excluded them; make this change around quantize_main(), mono_quantize(), and
_enable_lm_head_and_embedding_quantization so the user's --exclude_modules is
never silently undone.

In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 130-132: The wrapper currently expands shell globs because
EXCLUDE_MODULES is injected unquoted into PTQ_ARGS before calling hf_ptq.py; fix
this by preserving the literal pattern when forwarding --exclude_modules: stop
building a single unquoted string and use a bash array for PTQ_ARGS (e.g.,
append the two separate elements "--exclude_modules" and "$EXCLUDE_MODULES") or
otherwise ensure the variable is quoted when added so hf_ptq.py receives the
exact pattern (references: EXCLUDE_MODULES, PTQ_ARGS, and the --exclude_modules
argument passed to hf_ptq.py).

In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 191-198: The NVFP4_W4A16 branch sets config_group_details.targets
to only ["Linear"], which omits embeddings even though weight-only quantization
applies to Embedding layers; update the targets list in the NVFP4_W4A16 branch
(where quant_algo_value == "NVFP4_W4A16" and config_group_details is built) to
include "Embedding" (e.g., ["Linear", "Embedding"]) before assigning
new_config["config_groups"] so compressed-tensors exports match the actual
NVFP4_W4A16 quantization coverage.
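Both hf_ptq.py findings (mirror the input quantizer under activation-aware recipes; respect user --exclude_modules) can be addressed in one guard. This is a sketch assuming a flat pattern→settings dict for quant_cfg; the real helper is _enable_lm_head_and_embedding_quantization:

```python
import copy
from fnmatch import fnmatch

def enable_lm_head_and_embedding_respecting_excludes(quant_cfg, exclude_patterns):
    cfg = quant_cfg["quant_cfg"]
    base_w = cfg.get("*weight_quantizer")
    base_in = cfg.get("*input_quantizer")
    if base_w is None:
        return
    for target in ("lm_head", "embeddings", "embed_tokens"):
        # respect user --exclude_modules: never re-enable an excluded target
        if any(fnmatch(target, pat) for pat in exclude_patterns):
            continue
        cfg[f"*{target}*weight_quantizer"] = copy.deepcopy(base_w)
        if base_in is not None and base_in.get("enable", True):
            # mirror the activation format so lm_head does not silently
            # downgrade to weight-only under fp8/nvfp4 recipes
            cfg[f"*{target}*input_quantizer"] = copy.deepcopy(base_in)
```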

---

Duplicate comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only re-enables the lm_head weight quantizer which desynchronizes
lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to
either (A) only run when the active recipe is weight-only (check
quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored
"*lm_head*input_quantizer" entry copied from the base/input quantizer config so
lm_head keeps the same input quantization as the rest of the model; modify
_enable_lm_head_and_embedding_quantization to perform one of these two fixes and
ensure the new entry uses copy.deepcopy like the existing weight entries.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7568812b-30c3-4c61-bbc4-d04ef8fc9364

📥 Commits

Reviewing files that changed from the base of the PR and between 43c3454 and 490b6b2.

📒 Files selected for processing (9)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/convert_hf_config.py
  • modelopt/torch/export/model_config.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
  • modelopt/torch/export/model_config.py
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/unified_export_hf.py

  • Comment thread: examples/llm_ptq/hf_ptq.py (outdated)
  • Comment thread: examples/llm_ptq/scripts/huggingface_example.sh
  • Comment thread: modelopt/torch/export/convert_hf_config.py
@ajrasane ajrasane self-assigned this Apr 23, 2026
- example_utils: swap in a normalized generation_config via context manager
  during export instead of mutating the live one in get_model() — preview
  generate() calls stay deterministic
- hf_ptq: mirror *lm_head*input_quantizer for activation-aware recipes so
  lm_head doesn't silently downgrade to W4A16 under NVFP4/FP8
- hf_ptq: respect --exclude_modules in the Nemotron-H lm_head/embedding
  override so user exclusions aren't silently undone
- hf_ptq: add nvfp4_w4a16 to the auto_quantize qformat allowlist for
  consistency with QUANT_CFG_CHOICES
- huggingface_example.sh: pass --exclude_modules via a bash array (set -f)
  so wildcard patterns like *embed_tokens* reach argparse verbatim instead
  of being glob-expanded against the filesystem
- convert_hf_config: include Embedding in the NVFP4_W4A16 target set so
  compressed-tensors consumers dispatch on quantized embedding weights

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
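The export-time generation_config swap described in the first bullet of this commit message can be sketched as a context manager; only the `model.generation_config` attribute mirrors the real transformers API, the rest is illustrative:

```python
from contextlib import contextmanager

# Illustrative sketch of the export-time swap: temporarily install a
# normalized generation config, then restore the live one so later
# generate() preview calls keep their original sampling settings.
@contextmanager
def normalized_generation_config(model, normalized_cfg):
    original = model.generation_config
    model.generation_config = normalized_cfg
    try:
        yield model
    finally:
        # Restored even if export raises, so get_model()'s state is untouched.
        model.generation_config = original
```

Usage would look like `with normalized_generation_config(model, cfg): export(model)`.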
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)

873-876: Consider documenting the top_k=50 check.

The check getattr(original, "top_k", None) not in (None, 0, 50) treats top_k=50 as a "default/unset" value. While 50 is indeed the transformers default, this implicit knowledge could benefit from a brief inline comment for maintainability.

💡 Suggested clarification
         has_sampling_hyperparam = (
             getattr(original, "top_p", None) not in (None, 1.0)
-            or getattr(original, "top_k", None) not in (None, 0, 50)
+            or getattr(original, "top_k", None) not in (None, 0, 50)  # 50 is transformers default
             or getattr(original, "temperature", None) not in (None, 1.0)
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 873 - 876, The condition
using getattr(original, "top_k", None) not in (None, 0, 50) implicitly treats
top_k==50 as a default/unset value; add a brief inline comment next to that
expression (or expand the surrounding docstring) stating that 50 is the
HuggingFace/transformers default so it should be treated as unset, e.g., “# 50
is transformers' default top_k, treat as unset”; ensure the comment references
getattr(original, "top_k", None) so future readers understand why 50 is
excluded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0eeb4272-c356-455a-9a92-0ae3a6fbd489

📥 Commits

Reviewing files that changed from the base of the PR and between 490b6b2 and a115c88.

📒 Files selected for processing (4)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/convert_hf_config.py

mts.export(full_model)


def _enable_lm_head_and_embedding_quantization(

Can we define this in the modelopt_recipe, if everything in modelopt_recipes/models can be captured with our YAML recipe system?

Comment on lines +728 to +754
# For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
# extend quantization coverage to the lm_head and the input token embedding. On this
# architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
# them at bf16 wastes most of the NVFP4 memory benefit.
if model_type == "nemotron_h":
    weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
    if weight_quantizer_cfg is not None:
        # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
        # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
        # ``lm_head`` stays weight-only along with the embedding.
        input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
        print(
            "Nemotron-H detected: extending quantization to lm_head and input embedding "
            "(backbone.embeddings)."
        )
        _enable_lm_head_and_embedding_quantization(
            quant_cfg,
            weight_quantizer_cfg,
            input_quantizer_cfg=input_quantizer_cfg,
            user_excluded_modules=args.exclude_modules or None,
        )
    else:
        warnings.warn(
            "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
            "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
        )
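The `_extract_wildcard_quantizer_cfg` helper called in this hunk is defined elsewhere in hf_ptq.py; a plausible sketch of its contract (the config layout and the disabled-entry check are assumptions):

```python
import copy

# Hypothetical sketch of _extract_wildcard_quantizer_cfg: return a deep copy
# of the wildcard entry (e.g. "*weight_quantizer") from the recipe, or None
# when the recipe has no such entry or it is explicitly disabled. The
# quant_cfg dictionary layout is assumed for illustration.
def extract_wildcard_quantizer_cfg(quant_cfg: dict, quantizer_name: str):
    entry = quant_cfg.get("quant_cfg", {}).get(f"*{quantizer_name}")
    if entry is None or entry.get("enable", True) is False:
        return None
    return copy.deepcopy(entry)
```

Returning a copy (rather than the shared entry) keeps the caller's lm_head/embedding overrides from mutating the base recipe, matching the copy.deepcopy usage the review asks for.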


Same as the previous comment: wondering whether our recipe system can replace this ad hoc change for supporting specific models.
