1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See :meth:`register_quant_backend <modelopt.torch.quantization.nn.modules.tensor_quantizer.register_quant_backend>` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
- Add ``examples/llm_qad`` for QAD training with Megatron-LM.

**Deprecations**

170 changes: 170 additions & 0 deletions examples/llm_qad/README.md
@@ -0,0 +1,170 @@
# QAD Training Scripts

Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.

## Overview

| Script | Purpose |
|--------|---------|
| `qad.sh` | Main training script (run inside container) |
| `sbatch_qad.sh` | SLURM batch submission wrapper |
| `configs/*.conf` | Model-specific configuration files |
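
At a high level, the pieces fit together like this (commands are abbreviated here; each step is detailed in the sections below):

```bash
# 1. Build the default datablend (see "Dataset Generation")
bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <HF-model>

# 2. Create a config from a template and fill in the required fields
cp configs/qwen3-8b_template.conf configs/my-experiment.conf

# 3. Submit the training job (see "Quick Start")
sbatch --account=<your-account> sbatch_qad.sh --config configs/my-experiment.conf
```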

## Requirements

### Clone Required Repositories

```bash
# Set your workspace directory
export WORKSPACE=/path/to/your/workspace

# Clone Megatron-LM (with ModelOpt integration)
git clone https://github.com/NVIDIA/Megatron-LM.git ${WORKSPACE}/Megatron-LM

# Clone Model-Optimizer
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git ${WORKSPACE}/Model-Optimizer
```

### Prepare Checkpoints

You need the following checkpoints before training:

1. **Student checkpoint**: Quantized (e.g., NVFP4) model in Megatron-LM format
2. **Teacher checkpoint**: Full-precision (BF16) model in Megatron-LM format
3. **Teacher config YAML**: Model architecture configuration

See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt) for checkpoint conversion from HuggingFace format.

## Creating a Configuration

### Available Templates

| Config | Model | Type |
|--------|-------|------|
| `qwen3-30b-a3b-instruct-2507-moe_template.conf` | Qwen3-30B-A3B-Instruct | MoE |
| `qwen3-8b_template.conf` | Qwen3-8B | Dense |

### Create Your Config

1. Copy a template:

```bash
# For MoE models
cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf

# For Dense models
cp configs/qwen3-8b_template.conf configs/my-experiment.conf
```

2. Fill in the required fields (a filled-in example follows the tables below):

**Checkpoints** (required):

| Variable | Description |
|----------|-------------|
| `STUDENT_CKPT` | Path to quantized student MLM checkpoint |
| `TEACHER_CKPT` | Path to teacher MLM checkpoint |
| `TEACHER_MODEL_CONFIG` | Path to teacher YAML config (see below) |

**Paths** (required):

| Variable | Description |
|----------|-------------|
| `MLM_DIR` | Path to Megatron-LM directory |
| `BLEND_PATH` | Path to datablend JSON (from dataset generation) |

**Parallelism** (adjust for your hardware):

| Variable | Dense Model | MoE Model |
|----------|-------------|-----------|
| `IS_MOE` | `false` | `true` |
| `TP_SIZE` | `1` | `2` |
| `EP_SIZE` | `1` | `4` |
| `MBS` | `4` | `2` |

**Training** (tune as needed):

| Variable | Default | Description |
|----------|---------|-------------|
| `LR` | `1e-5` | Learning rate |
| `GBS` | `256` | Global batch size |
| `SAVE_INTERVAL` | `200` | Checkpoint interval |
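
Putting it together, a minimal filled-in excerpt of `configs/my-experiment.conf` for a dense model might look like the following. All paths here are placeholders for your own environment; the templates list the full set of variables.

```bash
# Checkpoints (placeholder paths -- point these at your own converted checkpoints)
export STUDENT_CKPT=/workspace/checkpoints/qwen3-8b-nvfp4-mlm
export TEACHER_CKPT=/workspace/checkpoints/qwen3-8b-bf16-mlm
export TEACHER_MODEL_CONFIG=configs/Qwen3-8B-teacher.yaml

# Paths
export MLM_DIR=${WORKSPACE}/Megatron-LM
export BLEND_PATH=/workspace/datasets/datablend_combined.json

# Parallelism (dense model)
export IS_MOE=false
export TP_SIZE=1
export EP_SIZE=1
export MBS=4
```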

### Teacher Model Config (YAML)

Create a YAML file with teacher model architecture (example: `configs/Qwen3-30B-A3B-teacher.yaml`):

```yaml
num_layers: 48
hidden_size: 2048
num_attention_heads: 32
num_query_groups: 4
kv_channels: 128
ffn_hidden_size: 6144
```
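
If you are unsure of the values, most of them can be read from the model's HuggingFace `config.json`. The sketch below assumes the common field mapping (`num_hidden_layers` -> `num_layers`, `num_key_value_heads` -> `num_query_groups`, `head_dim` -> `kv_channels`, `intermediate_size` -> `ffn_hidden_size`) and requires `jq`; verify the result against your model before training.

```bash
# Hypothetical helper: derive a teacher YAML from a local HuggingFace config.json.
# The field mapping is an assumption -- double-check it for your architecture.
HF_CONFIG=/path/to/hf_model/config.json   # placeholder path

cat > configs/my-teacher.yaml <<EOF
num_layers: $(jq '.num_hidden_layers' ${HF_CONFIG})
hidden_size: $(jq '.hidden_size' ${HF_CONFIG})
num_attention_heads: $(jq '.num_attention_heads' ${HF_CONFIG})
num_query_groups: $(jq '.num_key_value_heads' ${HF_CONFIG})
kv_channels: $(jq '.head_dim' ${HF_CONFIG})
ffn_hidden_size: $(jq '.intermediate_size' ${HF_CONFIG})
EOF
```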

## Dataset Generation

Use the one-button script to generate the default datablend:

```bash
cd data_utils/

bash generate_dataset.sh \
--output-dir /path/to/datasets \
--mlm-path /path/to/Megatron-LM \
--tokenizer <HF-model> # e.g., Qwen/Qwen3-30B-A3B-Instruct-2507
```

**Requirements**: HuggingFace token for `nvidia/Nemotron-Post-Training-Dataset-v2`. Log in first: `huggingface-cli login`

**Output**: Creates `datablend_combined.json` with OpenScience + Nemotron-v2 datasets. Set `BLEND_PATH` in your config to point to this file.
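
For example, if you passed `--output-dir /data/qad_datasets`, the config entry would look roughly like this (the exact location where `generate_dataset.sh` writes the file may differ, so confirm the path):

```bash
export BLEND_PATH=/data/qad_datasets/datablend_combined.json
```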

## Quick Start

### SLURM Batch Submission (Recommended)

First, update the SLURM header in `sbatch_qad.sh` with your cluster settings:

- `--account=<your-account>`
- `--nodes`, `--gres=gpu`, `-t` as needed

```bash
# Submit training job (override account on command line)
sbatch --account=<your-account> sbatch_qad.sh --config configs/my-experiment.conf

# With HuggingFace token (for gated models)
sbatch --account=<your-account> sbatch_qad.sh --hf-token $HF_TOKEN --config configs/my-experiment.conf

# Adjust nodes and time
sbatch --account=<your-account> --nodes=4 -t 8:00:00 sbatch_qad.sh --config configs/my-experiment.conf
```
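
Once submitted, the job can be monitored with standard SLURM tools. The log file name below is a placeholder and depends on the `#SBATCH --output` setting in `sbatch_qad.sh`:

```bash
# Check queue status for your jobs
squeue -u $USER

# Follow the training log (adjust the file name to your --output setting)
tail -f slurm-<jobid>.out
```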

### Interactive Mode

```bash
# Get interactive node
srun -A <account> --nodes=1 -p batch --mpi=pmix \
--container-image=nvcr.io/nvidia/pytorch:25.06-py3 \
--container-mounts="..." \
-t 4:0:0 --pty bash

# Run training
bash qad.sh --config configs/qwen3-8b.conf
```

## Resuming Training

Training automatically resumes from checkpoints. To force a fresh start:

```bash
rm -rf /path/to/checkpoints/*/latest_checkpointed_iteration.txt
```

## Troubleshooting

### OOM Errors

- Reduce `MBS` (micro-batch size per GPU), as shown in the sketch below
- Increase `EP_SIZE`, `TP_SIZE`, or `PP_SIZE` to shard the student and teacher models across more GPUs
- Add more nodes
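
As an illustration, a lower-memory variant of the MoE parallelism settings might look like the following. These values are illustrative, not tuned; they must still divide the model's expert count, attention heads, and total GPU count.

```bash
# Illustrative lower-memory settings for an OOM-ing MoE run (not tuned)
export MBS=1       # smaller micro-batch per GPU
export TP_SIZE=4   # shard weights across more GPUs
export EP_SIZE=8   # spread experts across more GPUs
```
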
73 changes: 73 additions & 0 deletions examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
@@ -0,0 +1,73 @@
#!/bin/bash
########################################################
# QAD Configuration: Qwen3-30B-A3B Instruct (MoE)
# Mixture of Experts - requires more resources
#
# Usage:
# sbatch sbatch_qad.sh --config configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TEACHER_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TOKENIZER_MODEL="Qwen/Qwen3-30B-A3B-Instruct-2507"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT="" # Student MLM checkpoint path
export TEACHER_CKPT="" # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG="" # Teacher MLM model config yaml file, e.g., configs/Qwen3-30B-A3B-teacher.yaml

########################################################
# TRAINING (REQUIRED - no defaults in qad.sh)
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron" # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM
# Note: QAD loads both the student and teacher models, so it requires more memory
########################################################
export TP_SIZE=2
export PP_SIZE=1
export MBS=2
export NUM_GPUS=4
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=4
export IS_MOE=true

########################################################
# PATHS (REQUIRED - no defaults in qad.sh)
########################################################
export MLM_DIR="" # path to Megatron-LM source directory
export MODELOPT_DIR="" # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-30B-A3B.sh
export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
export DATACACHE_DIR="" # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory, e.g., "<path-to-modelopt>/Model-Optimizer/examples/llm_qad"


########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH="" # path to datablend_combined.json from generate_dataset.sh
71 changes: 71 additions & 0 deletions examples/llm_qad/configs/qwen3-8b_template.conf
@@ -0,0 +1,71 @@
#!/bin/bash
########################################################
# QAD Configuration: Qwen3-8B (Dense Model)
#
# Usage:
# sbatch sbatch_qad.sh --config configs/qwen3-8b_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-8B"
export TEACHER_MODEL="Qwen3-8B"
export TOKENIZER_MODEL="Qwen/Qwen3-8B"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT="" # Student MLM checkpoint path
export TEACHER_CKPT="" # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG="" # Teacher MLM model config yaml file

########################################################
# TRAINING
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron" # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM (Dense model - simpler settings)
########################################################
export TP_SIZE=1
export PP_SIZE=1
export MBS=4
export NUM_GPUS=8
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=1
export IS_MOE=false

########################################################
# PATHS (REQUIRED)
########################################################
export MLM_DIR="" # path to Megatron-LM source directory
export MODELOPT_DIR="" # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
export DATACACHE_DIR="" # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory

########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH="" # path to datablend_combined.json from generate_dataset.sh
