1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See :meth:`register_quant_backend <modelopt.torch.quantization.nn.modules.tensor_quantizer.register_quant_backend>` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
- Add ``examples/llm_qad`` for QAD training with Megatron-LM.

**Deprecations**

170 changes: 170 additions & 0 deletions examples/llm_qad/README.md
@@ -0,0 +1,170 @@
# QAD Training Scripts

Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.

## Overview

| Script | Purpose |
|--------|---------|
| `qad.sh` | Main training script (run inside container) |
| `sbatch_qad.sh` | SLURM batch submission wrapper |
| `configs/*.conf` | Model-specific configuration files |
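
At a high level, the pieces fit together like this (commands are abbreviated here; each step is detailed in the sections below):

```bash
# 1. Build the default datablend (see "Dataset Generation")
bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <HF-model>

# 2. Create a config from a template and fill in the required fields
cp configs/qwen3-8b_template.conf configs/my-experiment.conf

# 3. Submit the training job (see "Quick Start")
sbatch --account=<your-account> sbatch_qad.sh --config configs/my-experiment.conf
```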

## Requirements

### Clone Required Repositories

```bash
# Set your workspace directory
export WORKSPACE=/path/to/your/workspace

# Clone Megatron-LM (with ModelOpt integration)
git clone https://github.com/NVIDIA/Megatron-LM.git ${WORKSPACE}/Megatron-LM

# Clone Model-Optimizer
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git ${WORKSPACE}/Model-Optimizer
```

### Prepare Checkpoints

You need the following checkpoints before training:

1. **Student checkpoint**: Quantized (e.g., NVFP4) model in Megatron-LM format
2. **Teacher checkpoint**: Full-precision (BF16) model in Megatron-LM format
3. **Teacher config YAML**: Model architecture configuration

See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt) for checkpoint conversion from HuggingFace format.

## Creating a Configuration

### Available Templates

| Config | Model | Type |
|--------|-------|------|
| `qwen3-30b-a3b-instruct-2507-moe_template.conf` | Qwen3-30B-A3B-Instruct | MoE |
| `qwen3-8b_template.conf` | Qwen3-8B | Dense |

### Create Your Config

1. Copy a template:

```bash
# For MoE models
cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf

# For Dense models
cp configs/qwen3-8b_template.conf configs/my-experiment.conf
```

2. Fill in the required fields (a filled-in example follows the tables below):

**Checkpoints** (required):

| Variable | Description |
|----------|-------------|
| `STUDENT_CKPT` | Path to quantized student MLM checkpoint |
| `TEACHER_CKPT` | Path to teacher MLM checkpoint |
| `TEACHER_MODEL_CONFIG` | Path to teacher YAML config (see below) |

**Paths** (required):

| Variable | Description |
|----------|-------------|
| `MLM_DIR` | Path to Megatron-LM directory |
| `BLEND_PATH` | Path to datablend JSON (from dataset generation) |

**Parallelism** (adjust for your hardware):

| Variable | Dense Model | MoE Model |
|----------|-------------|-----------|
| `IS_MOE` | `false` | `true` |
| `TP_SIZE` | `1` | `2` |
| `EP_SIZE` | `1` | `4` |
| `MBS` | `4` | `2` |

**Training** (tune as needed):

| Variable | Default | Description |
|----------|---------|-------------|
| `LR` | `1e-5` | Learning rate |
| `GBS` | `256` | Global batch size |
| `SAVE_INTERVAL` | `200` | Checkpoint interval |
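
Putting it together, a minimal filled-in excerpt of `configs/my-experiment.conf` for a dense model might look like the following. All paths here are placeholders for your own environment; the templates list the full set of variables.

```bash
# Checkpoints (placeholder paths -- point these at your own converted checkpoints)
export STUDENT_CKPT=/workspace/checkpoints/qwen3-8b-nvfp4-mlm
export TEACHER_CKPT=/workspace/checkpoints/qwen3-8b-bf16-mlm
export TEACHER_MODEL_CONFIG=configs/Qwen3-8B-teacher.yaml

# Paths
export MLM_DIR=${WORKSPACE}/Megatron-LM
export BLEND_PATH=/workspace/datasets/datablend_combined.json

# Parallelism (dense model)
export IS_MOE=false
export TP_SIZE=1
export EP_SIZE=1
export MBS=4
```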

### Teacher Model Config (YAML)

Create a YAML file with teacher model architecture (example: `configs/Qwen3-30B-A3B-teacher.yaml`):

```yaml
num_layers: 48
hidden_size: 2048
num_attention_heads: 32
num_query_groups: 4
kv_channels: 128
ffn_hidden_size: 6144
```
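
If you are unsure of the values, most of them can be read from the model's HuggingFace `config.json`. The sketch below assumes the common field mapping (`num_hidden_layers` -> `num_layers`, `num_key_value_heads` -> `num_query_groups`, `head_dim` -> `kv_channels`, `intermediate_size` -> `ffn_hidden_size`) and requires `jq`; verify the result against your model before training.

```bash
# Hypothetical helper: derive a teacher YAML from a local HuggingFace config.json.
# The field mapping is an assumption -- double-check it for your architecture.
HF_CONFIG=/path/to/hf_model/config.json   # placeholder path

cat > configs/my-teacher.yaml <<EOF
num_layers: $(jq '.num_hidden_layers' ${HF_CONFIG})
hidden_size: $(jq '.hidden_size' ${HF_CONFIG})
num_attention_heads: $(jq '.num_attention_heads' ${HF_CONFIG})
num_query_groups: $(jq '.num_key_value_heads' ${HF_CONFIG})
kv_channels: $(jq '.head_dim' ${HF_CONFIG})
ffn_hidden_size: $(jq '.intermediate_size' ${HF_CONFIG})
EOF
```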

## Dataset Generation

Use the one-button script to generate the default datablend:

```bash
cd data_utils/

bash generate_dataset.sh \
--output-dir /path/to/datasets \
--mlm-path /path/to/Megatron-LM \
--tokenizer <HF-model> # e.g., Qwen/Qwen3-30B-A3B-Instruct-2507
```

**Requirements**: HuggingFace token for `nvidia/Nemotron-Post-Training-Dataset-v2`. Log in first: `huggingface-cli login`

**Output**: Creates `datablend_combined.json` with OpenScience + Nemotron-v2 datasets. Set `BLEND_PATH` in your config to point to this file.
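
For example, if you passed `--output-dir /data/qad_datasets`, the config entry would look roughly like this (the exact location where `generate_dataset.sh` writes the file may differ, so confirm the path):

```bash
export BLEND_PATH=/data/qad_datasets/datablend_combined.json
```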

## Quick Start

### SLURM Batch Submission (Recommended)

First, update the SLURM header in `sbatch_qad.sh` with your cluster settings:

- `--account=<your-account>`
- `--nodes`, `--gres=gpu`, `-t` as needed

```bash
# Submit training job (override account on command line)
sbatch --account=<your-account> sbatch_qad.sh --config configs/my-experiment.conf

# With HuggingFace token (for gated models)
sbatch --account=<your-account> sbatch_qad.sh --hf-token $HF_TOKEN --config configs/my-experiment.conf

# Adjust nodes and time
sbatch --account=<your-account> --nodes=4 -t 8:00:00 sbatch_qad.sh --config configs/my-experiment.conf
```
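
Once submitted, the job can be monitored with standard SLURM tools. The log file name below is a placeholder and depends on the `#SBATCH --output` setting in `sbatch_qad.sh`:

```bash
# Check queue status for your jobs
squeue -u $USER

# Follow the training log (adjust the file name to your --output setting)
tail -f slurm-<jobid>.out
```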

### Interactive Mode

```bash
# Get interactive node
srun -A <account> --nodes=1 -p batch --mpi=pmix \
--container-image=nvcr.io/nvidia/pytorch:25.06-py3 \
--container-mounts="..." \
-t 4:0:0 --pty bash

# Run training
bash qad.sh --config configs/qwen3-8b.conf
```

## Resuming Training

Training automatically resumes from checkpoints. To force a fresh start:

```bash
rm -rf /path/to/checkpoints/*/latest_checkpointed_iteration.txt
```

## Troubleshooting

### OOM Errors

- Reduce `MBS` (micro-batch size per GPU), as shown in the sketch below
- Increase `EP_SIZE`, `TP_SIZE`, or `PP_SIZE` to shard the student and teacher models across more GPUs
- Add more nodes
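
As an illustration, a lower-memory variant of the MoE parallelism settings might look like the following. These values are illustrative, not tuned; they must still divide the model's expert count, attention heads, and total GPU count.

```bash
# Illustrative lower-memory settings for an OOM-ing MoE run (not tuned)
export MBS=1       # smaller micro-batch per GPU
export TP_SIZE=4   # shard weights across more GPUs
export EP_SIZE=8   # spread experts across more GPUs
```
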
73 changes: 73 additions & 0 deletions examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
@@ -0,0 +1,73 @@
#!/bin/bash
########################################################
# QAD Configuration: Qwen3-30B-A3B Instruct (MoE)
# Mixture of Experts - requires more resources
#
# Usage:
# sbatch sbatch_qad.sh --config configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TEACHER_MODEL="Qwen3-30B-A3B-Instruct-2507"
export TOKENIZER_MODEL="Qwen/Qwen3-30B-A3B-Instruct-2507"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT="" # Student MLM checkpoint path
export TEACHER_CKPT="" # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG="" # Teacher MLM model config yaml file, e.g., configs/Qwen3-30B-A3B-teacher.yaml

########################################################
# TRAINING (REQUIRED - no defaults in qad.sh)
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron" # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM
# Note: QAD loads both the student and teacher models, so it requires more memory
########################################################
export TP_SIZE=2
export PP_SIZE=1
export MBS=2
export NUM_GPUS=4
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=4
export IS_MOE=true

########################################################
# PATHS (REQUIRED - no defaults in qad.sh)
########################################################
export MLM_DIR="" # path to Megatron-LM source directory
export MODELOPT_DIR="" # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-30B-A3B.sh
export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
export DATACACHE_DIR="" # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory, e.g., "<path-to-modelopt>/Model-Optimizer/examples/llm_qad"


########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH="" # path to datablend_combined.json from generate_dataset.sh
71 changes: 71 additions & 0 deletions examples/llm_qad/configs/qwen3-8b_template.conf
@@ -0,0 +1,71 @@
#!/bin/bash
########################################################
# QAD Configuration: Qwen3-8B (Dense Model)
#
# Usage:
# sbatch sbatch_qad.sh --config configs/qwen3-8b_template.conf
########################################################

########################################################
# MODEL
########################################################
export STUDENT_MODEL="Qwen3-8B"
export TEACHER_MODEL="Qwen3-8B"
export TOKENIZER_MODEL="Qwen/Qwen3-8B"

########################################################
# CHECKPOINTS (REQUIRED)
########################################################
export STUDENT_CKPT="" # Student MLM checkpoint path
export TEACHER_CKPT="" # Teacher MLM checkpoint path
export TEACHER_MODEL_CONFIG="" # Teacher MLM model config yaml file

########################################################
# TRAINING
########################################################
export LR="5e-6"
export GBS=64
export MIN_LR="1e-8"
export LR_DECAY_STYLE="cosine"
export SAVE_INTERVAL=200
export LOG_INTERVAL=10
export DATASET_NAME="openscience_nemotron" # used for logging
export TRAIN_SAMPLES=5120000

########################################################
# PARALLELISM (Dense model - simpler settings)
########################################################
export TP_SIZE=1
export PP_SIZE=1
export MBS=4
export NUM_GPUS=8
export MASTER_PORT=29500

########################################################
# MOE
########################################################
export EP_SIZE=1
export IS_MOE=false

########################################################
# PATHS (REQUIRED)
########################################################
export MLM_DIR="" # path to Megatron-LM source directory
export MODELOPT_DIR="" # path to Model-Optimizer source directory
export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
export DATACACHE_DIR="" # path to data cache directory

########################################################
# CONTAINER
########################################################
export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory

########################################################
# DATASET
########################################################
# Generate with: bash data_utils/generate_dataset.sh --output-dir <path> --mlm-path <path> --tokenizer <model>
export BLEND_PATH="" # path to datablend_combined.json from generate_dataset.sh
