diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 61c198026..826b51160 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add support for subgraphs in ONNX autocast.
 - Add support for parallel draft heads in Eagle speculative decoding.
 - Add support to enable custom emulated quantization backend. See :meth:`register_quant_backend` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
+- Add ``examples/llm_qad`` for QAD training with Megatron-LM.
 
 **Deprecations**
 
diff --git a/examples/llm_qad/README.md b/examples/llm_qad/README.md
new file mode 100644
index 000000000..68fd01849
--- /dev/null
+++ b/examples/llm_qad/README.md
@@ -0,0 +1,170 @@
+# QAD Training Scripts
+
+Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.
+
+## Overview
+
+| Script | Purpose |
+|--------|---------|
+| `qad.sh` | Main training script (run inside container) |
+| `sbatch_qad.sh` | SLURM batch submission wrapper |
+| `configs/*.conf` | Model-specific configuration files |
+
+## Requirements
+
+### Clone Required Repositories
+
+```bash
+# Set your workspace directory
+export WORKSPACE=/path/to/your/workspace
+
+# Clone Megatron-LM (with ModelOpt integration)
+git clone https://github.com/NVIDIA/Megatron-LM.git ${WORKSPACE}/Megatron-LM
+
+# Clone Model-Optimizer
+git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git ${WORKSPACE}/Model-Optimizer
+```
+
+### Prepare Checkpoints
+
+You need the following checkpoints before training:
+
+1. **Student checkpoint**: Quantized (e.g., NVFP4) model in Megatron-LM format
+2. **Teacher checkpoint**: Full-precision (BF16) model in Megatron-LM format
+3. **Teacher config YAML**: Model architecture configuration
+
+See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt) for checkpoint conversion from HuggingFace format.
+
+## Creating a Configuration
+
+### Available Templates
+
+| Config | Model | Type |
+|--------|-------|------|
+| `qwen3-30b-a3b-instruct-2507-moe_template.conf` | Qwen3-30B-A3B-Instruct | MoE |
+| `qwen3-8b_template.conf` | Qwen3-8B | Dense |
+
+### Create Your Config
+
+1. Copy a template:
+
+   ```bash
+   # For MoE models
+   cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf
+
+   # For Dense models
+   cp configs/qwen3-8b_template.conf configs/my-experiment.conf
+   ```
+
+2. Fill in the required fields (a filled-in sketch follows the tables below):
+
+   **Checkpoints** (required):
+
+   | Variable | Description |
+   |----------|-------------|
+   | `STUDENT_CKPT` | Path to quantized student MLM checkpoint |
+   | `TEACHER_CKPT` | Path to teacher MLM checkpoint |
+   | `TEACHER_MODEL_CONFIG` | Path to teacher YAML config (see below) |
+
+   **Paths** (required):
+
+   | Variable | Description |
+   |----------|-------------|
+   | `MLM_DIR` | Path to Megatron-LM directory |
+   | `BLEND_PATH` | Path to datablend JSON (from dataset generation) |
+
+   **Parallelism** (adjust for your hardware):
+
+   | Variable | Dense Model | MoE Model |
+   |----------|-------------|-----------|
+   | `IS_MOE` | `false` | `true` |
+   | `TP_SIZE` | `1` | `2` |
+   | `EP_SIZE` | `1` | `4` |
+   | `MBS` | `4` | `2` |
+
+   **Training** (tune as needed; defaults shown are the template values):
+
+   | Variable | Default | Description |
+   |----------|---------|-------------|
+   | `LR` | `5e-6` | Learning rate |
+   | `GBS` | `64` | Global batch size |
+   | `SAVE_INTERVAL` | `200` | Checkpoint interval |
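+
+   For reference, a minimal filled-in dense-model config might look like this (all paths below are hypothetical placeholders; substitute your own):
+
+   ```bash
+   # configs/my-experiment.conf (sketch)
+   export STUDENT_CKPT=/checkpoints/Qwen3-8B-NVFP4-mlm       # quantized student (MLM format)
+   export TEACHER_CKPT=/checkpoints/Qwen3-8B-BF16-mlm        # full-precision teacher (MLM format)
+   export TEACHER_MODEL_CONFIG=configs/Qwen3-8B-teacher.yaml
+   export MLM_DIR=${WORKSPACE}/Megatron-LM
+   export MODELOPT_DIR=${WORKSPACE}/Model-Optimizer
+   export BLEND_PATH=/datasets/datablend_combined.json
+   export IS_MOE=false
+   export TP_SIZE=1
+   export EP_SIZE=1
+   export MBS=4
+   ```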
+
+### Teacher Model Config (YAML)
+
+Create a YAML file with the teacher model architecture (example: `configs/Qwen3-30B-A3B-teacher.yaml`):
+
+```yaml
+num_layers: 48
+hidden_size: 2048
+num_attention_heads: 32
+num_query_groups: 4
+kv_channels: 128
+ffn_hidden_size: 6144
+```
+
+## Dataset Generation
+
+Use the all-in-one script to generate the default datablend:
+
+```bash
+cd data_utils/
+
+bash generate_dataset.sh \
+    --output-dir /path/to/datasets \
+    --mlm-path /path/to/Megatron-LM \
+    --tokenizer <tokenizer>  # e.g., Qwen/Qwen3-30B-A3B-Instruct-2507
+```
+
+**Requirements**: HuggingFace token for `nvidia/Nemotron-Post-Training-Dataset-v2`. Log in first: `huggingface-cli login`
+
+**Output**: Creates `datablend_combined.json` with the OpenScience + Nemotron-v2 datasets. Set `BLEND_PATH` in your config to point to this file.
+
+## Quick Start
+
+### SLURM Batch Submission (Recommended)
+
+First, update the `sbatch_qad.sh` SLURM header with your cluster settings:
+
+- `--account=<your_account>`
+- `--nodes`, `--gres=gpu`, `-t` as needed
+
+```bash
+# Submit training job (override account on the command line)
+sbatch --account=<account> sbatch_qad.sh --config configs/my-experiment.conf
+
+# With HuggingFace token (for gated models)
+sbatch --account=<account> sbatch_qad.sh --hf-token $HF_TOKEN --config configs/my-experiment.conf
+
+# Adjust nodes and time
+sbatch --account=<account> --nodes=4 -t 8:00:00 sbatch_qad.sh --config configs/my-experiment.conf
+```
+
+### Interactive Mode
+
+```bash
+# Get an interactive node
+srun -A <account> --nodes=1 -p batch --mpi=pmix \
+    --container-image=nvcr.io/nvidia/pytorch:25.06-py3 \
+    --container-mounts="..." \
+    -t 4:0:0 --pty bash
+
+# Run training
+bash qad.sh --config configs/qwen3-8b.conf
+```
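+
+### Monitoring
+
+`qad.sh` writes logs and TensorBoard events under `QAD_CHECKPOINT_ROOT`, in a run directory named from the student/teacher checkpoint names, dataset name, and LR settings. A minimal sketch for following a run (`<run-dir>` is a placeholder for that directory):
+
+```bash
+# Tail the latest training log
+tail -f ${QAD_CHECKPOINT_ROOT}/<run-dir>/logs/*.log
+
+# Inspect loss curves in TensorBoard
+tensorboard --logdir ${QAD_CHECKPOINT_ROOT}/<run-dir>/tensorboard
+```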
+
+## Resuming Training
+
+Training resumes automatically from the latest checkpoint. To force a fresh start:
+
+```bash
+rm -rf /path/to/checkpoints/*/latest_checkpointed_iteration.txt
+```
+
+## Troubleshooting
+
+### OOM Errors
+
+- Reduce `MBS`
+- Increase `EP_SIZE`, `TP_SIZE`, `PP_SIZE`
+- Add more nodes
diff --git a/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf b/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
new file mode 100644
index 000000000..52ca5efe0
--- /dev/null
+++ b/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
@@ -0,0 +1,73 @@
+#!/bin/bash
+########################################################
+# QAD Configuration: Qwen3-30B-A3B Instruct (MoE)
+# Mixture of Experts - requires more resources
+#
+# Usage:
+#   sbatch sbatch_qad.sh --config configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
+########################################################
+
+########################################################
+# MODEL
+########################################################
+export STUDENT_MODEL="Qwen3-30B-A3B-Instruct-2507"
+export TEACHER_MODEL="Qwen3-30B-A3B-Instruct-2507"
+export TOKENIZER_MODEL="Qwen/Qwen3-30B-A3B-Instruct-2507"
+
+########################################################
+# CHECKPOINTS (REQUIRED)
+########################################################
+export STUDENT_CKPT=""          # Student MLM checkpoint path
+export TEACHER_CKPT=""          # Teacher MLM checkpoint path
+export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config yaml file, e.g., configs/Qwen3-30B-A3B-teacher.yaml
+
+########################################################
+# TRAINING (REQUIRED - no defaults in qad.sh)
+########################################################
+export LR="5e-6"
+export GBS=64
+export MIN_LR="1e-8"
+export LR_DECAY_STYLE="cosine"
+export SAVE_INTERVAL=200
+export LOG_INTERVAL=10
+export DATASET_NAME="openscience_nemotron"  # used for logging
+export TRAIN_SAMPLES=5120000
+
+########################################################
+# PARALLELISM
+# Note: QAD loads both student + teacher models, requires more memory
+########################################################
+export TP_SIZE=2
+export PP_SIZE=1
+export MBS=2
+export NUM_GPUS=4
+export MASTER_PORT=29500
+
+########################################################
+# MOE
+########################################################
+export EP_SIZE=4
+export IS_MOE=true
+
+########################################################
+# PATHS (REQUIRED - no defaults in qad.sh)
+########################################################
+export MLM_DIR=""              # path to Megatron-LM source directory
+export MODELOPT_DIR=""         # path to Model-Optimizer source directory
+export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-30B-A3B.sh
+export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
+export DATACACHE_DIR=""        # path to data cache directory
+
+########################################################
+# CONTAINER
+########################################################
+export CONTAINER_IMAGE=""      # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
+export CONTAINER_MOUNTS=""     # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
+export CONTAINER_WORKDIR=""    # container work directory, e.g., "/Model-Optimizer/examples/llm_qad"
+
+########################################################
+# DATASET
+########################################################
+# Generate with: bash data_utils/generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
\ No newline at end of file
diff --git a/examples/llm_qad/configs/qwen3-8b_template.conf b/examples/llm_qad/configs/qwen3-8b_template.conf
new file mode 100644
index 000000000..1af932b39
--- /dev/null
+++ b/examples/llm_qad/configs/qwen3-8b_template.conf
@@ -0,0 +1,71 @@
+#!/bin/bash
+########################################################
+# QAD Configuration: Qwen3-8B (Dense Model)
+#
+# Usage:
+#   sbatch sbatch_qad.sh --config configs/qwen3-8b_template.conf
+########################################################
+
+########################################################
+# MODEL
+########################################################
+export STUDENT_MODEL="Qwen3-8B"
+export TEACHER_MODEL="Qwen3-8B"
+export TOKENIZER_MODEL="Qwen/Qwen3-8B"
+
+########################################################
+# CHECKPOINTS (REQUIRED)
+########################################################
+export STUDENT_CKPT=""          # Student MLM checkpoint path
+export TEACHER_CKPT=""          # Teacher MLM checkpoint path
+export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config yaml file
+
+########################################################
+# TRAINING
+########################################################
+export LR="5e-6"
+export GBS=64
+export MIN_LR="1e-8"
+export LR_DECAY_STYLE="cosine"
+export SAVE_INTERVAL=200
+export LOG_INTERVAL=10
+export DATASET_NAME="openscience_nemotron"  # used for logging
+export TRAIN_SAMPLES=5120000
+
+########################################################
+# PARALLELISM (Dense model - simpler settings)
+########################################################
+export TP_SIZE=1
+export PP_SIZE=1
+export MBS=4
+export NUM_GPUS=8
+export MASTER_PORT=29500
+
+########################################################
+# MOE
+########################################################
+export EP_SIZE=1
+export IS_MOE=false
+
+########################################################
+# PATHS (REQUIRED)
+########################################################
+export MLM_DIR=""              # path to Megatron-LM source directory
+export MODELOPT_DIR=""         # path to Model-Optimizer source directory
+export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
+export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
+export DATACACHE_DIR=""        # path to data cache directory
+
+########################################################
+# CONTAINER
+########################################################
+export CONTAINER_IMAGE=""      # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
+export CONTAINER_MOUNTS=""     # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
+export CONTAINER_WORKDIR=""    # container work directory
+
+########################################################
+# DATASET
+########################################################
+# Generate with: bash data_utils/generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
+
diff --git a/examples/llm_qad/data_utils/download_dataset.py b/examples/llm_qad/data_utils/download_dataset.py
new file mode 100644
index 000000000..e3e3d0646
--- /dev/null
+++ b/examples/llm_qad/data_utils/download_dataset.py
@@ -0,0 +1,201 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Download datasets for QAD training (OpenScience, Nemotron-v2)."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import random
+from typing import Any
+
+from tqdm import tqdm
+
+SEED = 42
+TRAIN_RATIO, VALID_RATIO = 0.95, 0.025
+_TOKENIZER = None
+
+
+def init_tokenizer(name: str) -> None:
+    """Load HuggingFace tokenizer for chat template."""
+    global _TOKENIZER
+    if name:
+        from transformers import AutoTokenizer
+
+        print(f"Loading tokenizer: {name}")
+        _TOKENIZER = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
+
+
+def format_text(messages: list[dict], reasoning: str = "") -> str:
+    """Format messages to text using tokenizer chat template or simple format."""
+    # Add reasoning as a <think>...</think> block if provided
+    if reasoning.strip():
+        messages = messages.copy()
+        for i, m in enumerate(messages):
+            if m.get("role") == "assistant" and i == len(messages) - 1:
+                messages[i] = {
+                    "role": "assistant",
+                    "content": f"<think>\n{reasoning}\n</think>\n\n{m.get('content', '')}",
+                }
+
+    if _TOKENIZER:
+        try:
+            return _TOKENIZER.apply_chat_template(messages, tokenize=False)
+        except Exception:
+            pass
+
+    # Fallback
+    return "\n\n".join(f"{m['role'].title()}: {m['content']}" for m in messages if m.get("content"))
+
+
+def split_and_save(examples: list[dict], output_dir: str, prefix: str) -> dict[str, int]:
+    """Shuffle, split into train/valid/test, and save as JSONL."""
+    random.seed(SEED)
+    random.shuffle(examples)
+
+    n = len(examples)
+    train_end = int(n * TRAIN_RATIO)
+    valid_end = train_end + int(n * VALID_RATIO)
+
+    splits = {
+        "train": examples[:train_end],
+        "validation": examples[train_end:valid_end],
+        "test": examples[valid_end:],
+    }
+
+    os.makedirs(output_dir, exist_ok=True)
+    counts = {}
+    for name, data in splits.items():
+        path = os.path.join(output_dir, f"{prefix}_{name}.jsonl")
+        with open(path, "w") as f:
+            f.writelines(json.dumps(d, ensure_ascii=False) + "\n" for d in data)
+        counts[name] = len(data)
+        print(f"  {name}: {len(data):,}")
+
+    return counts
+
+
+def download_openscience(output_dir: str, use_chat: bool) -> dict[str, Any]:
+    """Download nvidia/OpenScience dataset."""
+    from datasets import load_dataset
+
+    print("\nDownloading nvidia/OpenScience...")
+    ds = load_dataset("nvidia/OpenScience", "OS-Q3-235B-4")
+    data = ds["train"] if "train" in ds else ds[next(iter(ds.keys()))]
+
+    print(f"Processing {len(data)} examples...")
+    suffix = "_chat" if use_chat else ""
+    examples = []
+    for ex in tqdm(data.shuffle(seed=SEED), desc="openscience"):
+        msgs = [
+            {"role": "user", "content": ex.get("input", "")},
+            {"role": "assistant", "content": ex.get("output", "")},
+        ]
+        examples.append({"text": format_text(msgs)})
+
+    counts = split_and_save(examples, output_dir, f"openscience{suffix}")
+    return {"dataset": "openscience", "total": len(examples), **counts}
+
+
+def download_nemotron_v2(
+    output_dir: str, splits: list[str], sample_pct: float, suffix: str, include_reasoning: bool
+) -> list[dict[str, Any]]:
+    """Download nvidia/Nemotron-Post-Training-Dataset-v2 splits."""
+    from datasets import load_dataset
+
+    print(f"\nDownloading Nemotron-v2 ({', '.join(splits)}) @ {sample_pct}%...")
+    results = []
+
+    for split in splits:
+        print(f"\n{split}:")
+        ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", split=split, streaming=True)
+
+        examples = []
+        for ex in tqdm(ds, desc=split):
+            msgs = ex.get("messages", [])
+            reasoning = ex.get("reasoning", "") if include_reasoning else ""
+            text = format_text(msgs, reasoning)
+            if text.strip():
+                examples.append({"text": text})
+
+        # Sample if needed
+        if sample_pct < 100:
+            random.seed(SEED)
+            target = int(len(examples) * sample_pct / 100)
+            examples = random.sample(examples, min(target, len(examples)))
+            print(f"  Sampled to {len(examples):,}")
+
+        if not examples:
+            continue
+
+        split_dir = os.path.join(output_dir, split)
+        counts = split_and_save(examples, split_dir, f"{split}_{suffix}")
+        results.append({"split_name": split, "total": len(examples), **counts})
+
+    return results
+
+
+def main():
+    p = argparse.ArgumentParser(description="Download QAD datasets")
+    p.add_argument("--dataset", required=True, choices=["openscience", "nemotron-v2", "all"])
+    p.add_argument("--output-dir", required=True)
+    p.add_argument("--tokenizer", help="HuggingFace tokenizer for chat template")
+    p.add_argument("--splits", default="stem,math,code,chat", help="Nemotron-v2 splits")
+    p.add_argument("--sample-percent", type=float, default=30.0)
+    p.add_argument(
+        "--include-reasoning", action="store_true", help="Include CoT reasoning for thinking models"
+    )
+    args = p.parse_args()
+
+    if args.tokenizer:
+        init_tokenizer(args.tokenizer)
+
+    # Build suffix
+    suffix = f"{int(args.sample_percent)}pct"
+    if args.include_reasoning:
+        suffix += "_cot"
+    if args.tokenizer:
+        suffix += "_chat"
+
+    results = []
+
+    if args.dataset in ["openscience", "all"]:
+        info = download_openscience(
+            os.path.join(args.output_dir, "openscience_splits"), args.tokenizer is not None
+        )
+        results.append(info)
+
+    if args.dataset in ["nemotron-v2", "all"]:
+        infos = download_nemotron_v2(
+            os.path.join(args.output_dir, "nemotron_v2"),
+            [s.strip() for s in args.splits.split(",")],
+            args.sample_percent,
+            suffix,
+            args.include_reasoning,
+        )
+        results.extend(infos)
+
+    print("\n" + "=" * 50)
+    print("Download complete!")
+    for r in results:
+        name = r.get("dataset") or r.get("split_name")
+        print(f"  {name}: {r['total']:,} (train={r['train']:,})")
+    print("=" * 50)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/llm_qad/data_utils/generate_dataset.sh b/examples/llm_qad/data_utils/generate_dataset.sh
new file mode 100755
index 000000000..39d678df9
--- /dev/null
+++ b/examples/llm_qad/data_utils/generate_dataset.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Download and preprocess OpenScience + Nemotron-v2 datasets for QAD training.
+# Usage: bash generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+
+set -e
+
+# Defaults
+OUTPUT_DIR="" MLM_DIR="" TOKENIZER="" SAMPLE_PERCENT=30 INCLUDE_REASONING=false WORKERS=32
+
+# Parse args
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --output-dir) OUTPUT_DIR="$2"; shift 2;;
+        --mlm-path) MLM_DIR="$2"; shift 2;;
+        --tokenizer) TOKENIZER="$2"; shift 2;;
+        --sample-percent) SAMPLE_PERCENT="$2"; shift 2;;
+        --include-reasoning) INCLUDE_REASONING=true; shift;;
+        --workers) WORKERS="$2"; shift 2;;
+        *) echo "Unknown: $1"; exit 1;;
+    esac
+done
+
+# Validate
+if [ -z "$OUTPUT_DIR" ] || [ -z "$MLM_DIR" ] || [ -z "$TOKENIZER" ]; then
+    echo "Usage: bash generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>"
+    echo "Optional: --sample-percent N --include-reasoning --workers N"
+    exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SUFFIX="${SAMPLE_PERCENT}pct$( [ "$INCLUDE_REASONING" = true ] && echo "_cot" )_chat"
+REASONING_FLAG=$( [ "$INCLUDE_REASONING" = true ] && echo "--include-reasoning" )
+
+echo "=== QAD Dataset Generation ==="
+echo "Output: $OUTPUT_DIR | Tokenizer: $TOKENIZER | Sample: ${SAMPLE_PERCENT}%"
+
+# Helper: preprocess JSONL to Megatron format
+preprocess() {
+    [ -f "$1" ] && python "$MLM_DIR/tools/preprocess_data.py" \
+        --input "$1" --output-prefix "$2" \
+        --tokenizer-type HuggingFaceTokenizer --tokenizer-model "$TOKENIZER" \
+        --append-eod --workers "$WORKERS" --json-keys text
+}
+
+# Step 1: Download
+echo -e "\n=== Downloading ==="
+python "$SCRIPT_DIR/download_dataset.py" --dataset openscience --output-dir "$OUTPUT_DIR" --tokenizer "$TOKENIZER"
+python "$SCRIPT_DIR/download_dataset.py" --dataset nemotron-v2 --output-dir "$OUTPUT_DIR" \
+    --sample-percent "$SAMPLE_PERCENT" $REASONING_FLAG --tokenizer "$TOKENIZER"
+
+# Step 2: Preprocess
+echo -e "\n=== Preprocessing ==="
+OS_IN="$OUTPUT_DIR/openscience_splits" OS_OUT="$OUTPUT_DIR/openscience_splits_preprocessed"
+NV_IN="$OUTPUT_DIR/nemotron_v2" NV_OUT="$OUTPUT_DIR/nemotron_v2_preprocessed"
+mkdir -p "$OS_OUT"
+
+for s in train validation test; do preprocess "$OS_IN/openscience_chat_$s.jsonl" "$OS_OUT/openscience_chat_$s" || true; done
+
+for split in code math stem chat; do
+    mkdir -p "$NV_OUT/$split"
+    for s in train validation test; do
+        preprocess "$NV_IN/$split/${split}_${SUFFIX}_$s.jsonl" "$NV_OUT/$split/${split}_${SUFFIX}_$s" || true
+    done
+done
+
+# Step 3: Create combined datablend
+BLEND="$OUTPUT_DIR/datablend_combined.json"
+cat > "$BLEND" << EOF
+{
+    "train": [
+        0.3, "$NV_OUT/code/code_${SUFFIX}_train_text_document",
+        0.2, "$NV_OUT/math/math_${SUFFIX}_train_text_document",
+        0.2, "$NV_OUT/stem/stem_${SUFFIX}_train_text_document",
+        0.1, "$NV_OUT/chat/chat_${SUFFIX}_train_text_document",
+        0.2, "$OS_OUT/openscience_chat_train_text_document"
+    ],
+    "valid": [
+        0.5, "$NV_OUT/stem/stem_${SUFFIX}_validation_text_document",
+        0.5, "$OS_OUT/openscience_chat_validation_text_document"
+    ],
+    "test": [
+        0.5, "$NV_OUT/stem/stem_${SUFFIX}_test_text_document",
+        0.5, "$OS_OUT/openscience_chat_test_text_document"
+    ]
+}
+EOF
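+
+# Note: the numeric values in each split above are relative sampling weights for
+# Megatron-LM's blended dataset builder, not sample counts: the training blend
+# draws ~30% from Nemotron-v2 code, 20% each from math and stem, 10% from chat,
+# and 20% from OpenScience. Adjust the weights here to change the mixture.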
===" +echo "Datablend: $BLEND" +echo "Set BLEND_PATH in your config and run: sbatch sbatch_qad.sh --config " diff --git a/examples/llm_qad/qad.sh b/examples/llm_qad/qad.sh new file mode 100644 index 000000000..52ec2bd6a --- /dev/null +++ b/examples/llm_qad/qad.sh @@ -0,0 +1,348 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# QAD (Quantization-Aware Distillation) Training Script +# Usage: bash qad.sh --config configs/your-config.conf + +set -euo pipefail + +# === Helpers === +die() { echo "[ERROR] $*" >&2; exit 1; } +log_info() { echo "[INFO] $*"; } +log_warn() { echo "[WARN] $*"; } +require_var() { [[ -n "${!1:-}" ]] || die "$1 must be set in config"; } +require_file() { [[ -f "$1" ]] || die "${2:-File} not found: $1"; } +require_dir() { [[ -d "$1" ]] || die "${2:-Directory} not found: $1"; } +sanitize() { echo "$1" | sed -e 's/[\/ :]/_/g' -e 's/[=]/_/g'; } + +# === Environment === +export NCCL_IB_SL=1 +export NCCL_IB_TIMEOUT=19 +export NCCL_P2P_NET_CHUNKSIZE=2097152 +export NCCL_DEBUG=WARN +export NCCL_SHM_DISABLE=1 +export NCCL_NVLS_ENABLE=0 +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export UB_TIMEOUT=720 +export NVTE_FWD_LAYERNORM_SM_MARGIN=16 +export NVTE_BWD_LAYERNORM_SM_MARGIN=16 +export TORCHINDUCTOR_COMPILE_THREADS=1 +export TORCH_COMPILE_DISABLE=1 +export PYTORCH_NO_CUDA_MEMORY_CACHING=0 +export TORCH_DISTRIBUTED_DEBUG=OFF +export PYTORCH_JIT=0 +export TORCH_USE_CUDA_DSA=0 +export GLOO_SOCKET_IFNAME=ibp26s0 + +# === Argument Parsing === +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CONFIG_FILE="" +HF_TOKEN_ARG="" + +while [[ $# -gt 0 ]]; do + case $1 in + --config|-c) CONFIG_FILE="$2"; shift 2;; + --hf-token) HF_TOKEN_ARG="$2"; shift 2;; + *) die "Unknown argument: $1";; + esac +done + +# HuggingFace token +[[ -n "$HF_TOKEN_ARG" ]] && export HF_TOKEN="$HF_TOKEN_ARG" +[[ -n "${HF_TOKEN:-}" ]] && export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" && log_info "HuggingFace token configured" + +# === Load Config === +if [[ -z "$CONFIG_FILE" ]]; then + die "Config file required. 
+
+# === Load Config ===
+if [[ -z "$CONFIG_FILE" ]]; then
+    die "Config file required. Use --config <config_file>. Available: $(ls -1 "${SCRIPT_DIR}/configs/"*.conf 2>/dev/null | tr '\n' ' ')"
+fi
+[[ "$CONFIG_FILE" = /* ]] || CONFIG_FILE="${SCRIPT_DIR}/${CONFIG_FILE}"
+require_file "$CONFIG_FILE" "Config file"
+log_info "Loading config: ${CONFIG_FILE}"
+source "$CONFIG_FILE"
+
+# === Validate Required Config ===
+for v in LR GBS MIN_LR LR_DECAY_STYLE SAVE_INTERVAL LOG_INTERVAL \
+    STUDENT_MODEL TEACHER_MODEL DATASET_NAME BLEND_PATH TRAIN_SAMPLES IS_MOE TOKENIZER_MODEL \
+    TP_SIZE MBS STUDENT_CKPT TEACHER_CKPT TEACHER_MODEL_CONFIG \
+    STUDENT_CONFIG_FILE MLM_DIR MODELOPT_DIR QAD_CHECKPOINT_ROOT DATACACHE_DIR; do
+    require_var "$v"
+done
+
+# === Defaults for Optional Config ===
+EP_SIZE="${EP_SIZE:-1}"
+PP_SIZE="${PP_SIZE:-1}"
+NUM_GPUS="${NUM_GPUS:-8}"
+NNODES="${NNODES:-1}"
+NODE_RANK="${NODE_RANK:-0}"
+MASTER_ADDR="${MASTER_ADDR:-localhost}"
+MASTER_PORT="${MASTER_PORT:-29500}"
+LR_DECAY_SAMPLES="${LR_DECAY_SAMPLES:-$(( TRAIN_SAMPLES * 99 / 100 ))}"
+LR_WARMUP_SAMPLES="${LR_WARMUP_SAMPLES:-$(( TRAIN_SAMPLES / 100 ))}"
+SAVE_RETAIN_INTERVAL="${SAVE_RETAIN_INTERVAL:-$SAVE_INTERVAL}"
+EVAL_INTERVAL="${EVAL_INTERVAL:-$SAVE_INTERVAL}"
+EVAL_ITERS="${EVAL_ITERS:-20}"
+MAX_SEQ="${MAX_SEQ:-}"
+RUN_TAG="${RUN_TAG:-}"
+KD_CFG_PATH="${KD_CFG_PATH:-}"
+ITERATIONS_TO_SKIP="${ITERATIONS_TO_SKIP:-}"
+ENABLE_MOE_PERF="${ENABLE_MOE_PERF:-1}"
+ENABLE_MOE_EXPERIMENTAL="${ENABLE_MOE_EXPERIMENTAL:-0}"
+LOG_PARAMS_NORM="${LOG_PARAMS_NORM:-}"
+
+# === Load Student Model Config ===
+require_file "$STUDENT_CONFIG_FILE" "Student model config"
+log_info "Loading student model config: ${STUDENT_CONFIG_FILE}"
+set +u; source "$STUDENT_CONFIG_FILE"; set -u
+STUDENT_MODEL_ARGS="${MODEL_ARGS}"
+
+# Log params norm (disabled for MoE to save memory)
+if [[ "${LOG_PARAMS_NORM}" == "1" ]]; then
+    LOG_PARAMS_NORM_ARG="--log-params-norm"
+elif [[ "$IS_MOE" == "true" ]]; then
+    LOG_PARAMS_NORM_ARG=""
+    log_warn "log-params-norm disabled for MoE model"
+else
+    LOG_PARAMS_NORM_ARG="--log-params-norm"
+fi
+
+log_info "Model: ${STUDENT_MODEL} | TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} MBS=${MBS} MoE=${IS_MOE}"
+
+# === Validate Checkpoints ===
+require_dir "$STUDENT_CKPT" "Student checkpoint"
+require_dir "$TEACHER_CKPT" "Teacher checkpoint"
+require_file "$TEACHER_MODEL_CONFIG" "Teacher model config"
+log_info "Student: ${STUDENT_CKPT}"
+log_info "Teacher: ${TEACHER_CKPT}"
+
+# === Output Paths ===
+DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
+STUDENT_CKPT_NAME=$(basename "${STUDENT_CKPT}")
+TEACHER_CKPT_NAME=$(basename "${TEACHER_CKPT}")
+
+TAG_PARTS="lr$(sanitize "$LR")-minlr$(sanitize "$MIN_LR")-decay$(sanitize "$LR_DECAY_STYLE")"
+[[ -n "$MAX_SEQ" ]] && TAG_PARTS="${TAG_PARTS}-seq${MAX_SEQ}"
+[[ -n "$RUN_TAG" ]] && TAG_PARTS="${TAG_PARTS}-tag$(sanitize "$RUN_TAG")"
+
+OUTPUT_ROOT="${QAD_CHECKPOINT_ROOT}/${STUDENT_CKPT_NAME}-Teacher-${TEACHER_CKPT_NAME}-Data-${DATASET_NAME}-${TAG_PARTS}"
+CHECKPOINT_DIR="${OUTPUT_ROOT}/checkpoints/${STUDENT_CKPT_NAME}"
+TENSORBOARD_DIR="${OUTPUT_ROOT}/tensorboard/${STUDENT_CKPT_NAME}"
+LOGS_DIR="${OUTPUT_ROOT}/logs"
+mkdir -p "${LOGS_DIR}" "${CHECKPOINT_DIR}" "${DATACACHE_DIR}" "${TENSORBOARD_DIR}"
+
+# === Resume Logic ===
+if [[ -f "${CHECKPOINT_DIR}/latest_checkpointed_iteration.txt" ]]; then
+    log_info "Resuming from: ${CHECKPOINT_DIR}"
+    LOAD_CHECKPOINT_DIR="${CHECKPOINT_DIR}"
+    FINETUNE_FLAG=""
+    LOAD_OPTIM_ARGS=""
+    CKPT_PARALLEL_LOAD_ARG="--ckpt-fully-parallel-load"
+else
+    log_info "Starting fresh from base checkpoint"
+    LOAD_CHECKPOINT_DIR="${STUDENT_CKPT}"
+    FINETUNE_FLAG="--finetune"
+    LOAD_OPTIM_ARGS="--no-load-optim --no-load-rng"
+    CKPT_PARALLEL_LOAD_ARG=""
+fi
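+
+# Note: --finetune together with --no-load-optim/--no-load-rng applies only to
+# the first run, when loading the base quantized student checkpoint. Once a QAD
+# checkpoint exists in CHECKPOINT_DIR, subsequent runs resume from it with
+# optimizer and RNG state intact.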
+
+# === Log Configuration ===
+ENV_LOG="${LOGS_DIR}/${STUDENT_CKPT_NAME}_${DATETIME}.env.log"
+{
+    echo "=== QAD Training: ${STUDENT_MODEL} ==="
+    echo "Time: ${DATETIME}"
+    echo "LR=${LR} MinLR=${MIN_LR} Decay=${LR_DECAY_STYLE} GBS=${GBS} MBS=${MBS}"
+    echo "TrainSamples=${TRAIN_SAMPLES} SaveInterval=${SAVE_INTERVAL} LogInterval=${LOG_INTERVAL}"
+    echo "TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} Nodes=${NNODES} GPUs/node=${NUM_GPUS}"
+    echo "Checkpoint: ${CHECKPOINT_DIR}"
+    echo "TensorBoard: ${TENSORBOARD_DIR}"
+    env
+} > "$ENV_LOG"
+
+# === Build Training Arguments ===
+
+# Checkpoint loading
+CHECKPOINT_ARGS=" \
+    --auto-detect-ckpt-format \
+    --export-te-mcore-model \
+    --dist-ckpt-strictness log_unexpected \
+    ${FINETUNE_FLAG} \
+    ${LOAD_OPTIM_ARGS} \
+    --load ${LOAD_CHECKPOINT_DIR} \
+    --export-kd-teacher-load ${TEACHER_CKPT} \
+    --teacher-model-config ${TEACHER_MODEL_CONFIG}"
+
+# KD config (optional)
+if [[ -n "$KD_CFG_PATH" && -f "$KD_CFG_PATH" ]]; then
+    CHECKPOINT_ARGS="${CHECKPOINT_ARGS} --export-kd-cfg ${KD_CFG_PATH}"
+    log_info "Using KD config: ${KD_CFG_PATH}"
+fi
+
+# Tokenizer
+TOKENIZER_ARGS=" \
+    --tokenizer-type HuggingFaceTokenizer \
+    --tokenizer-model ${TOKENIZER_MODEL}"
+
+# Data
+DATA_ARGS=" \
+    --per-split-data-args-path ${BLEND_PATH} \
+    --data-cache-path ${DATACACHE_DIR} \
+    --no-mmap-bin-files \
+    --num-dataset-builder-threads 16 \
+    --no-create-attention-mask-in-dataloader"
+
+# Sequence length override
+SEQ_ARGS=""
+if [[ -n "$MAX_SEQ" ]]; then
+    SEQ_ARGS="--seq-length ${MAX_SEQ} --max-position-embeddings ${MAX_SEQ}"
+    log_info "Sequence length override: ${MAX_SEQ}"
+fi
+
+# Training
+TRAINING_ARGS=" \
+    --micro-batch-size ${MBS} \
+    --global-batch-size ${GBS} \
+    --train-samples ${TRAIN_SAMPLES} \
+    --lr-decay-samples ${LR_DECAY_SAMPLES} \
+    --lr-warmup-samples ${LR_WARMUP_SAMPLES} \
+    --attention-dropout 0.0 \
+    --hidden-dropout 0.0 \
+    --bf16 \
+    ${SEQ_ARGS}"
+
+# Optimizer
+OPTIMIZER_ARGS=" \
+    --lr ${LR} \
+    --min-lr ${MIN_LR} \
+    --weight-decay 0.1 \
+    --clip-grad 1.0 \
+    --lr-decay-style ${LR_DECAY_STYLE} \
+    --adam-beta1 0.9 \
+    --adam-beta2 0.95 \
+    --use-distributed-optimizer \
+    --overlap-grad-reduce \
+    --overlap-param-gather"
+
+# Parallelism
+PARALLEL_ARGS=" \
+    --tensor-model-parallel-size ${TP_SIZE} \
+    --pipeline-model-parallel-size ${PP_SIZE} \
+    --distributed-timeout-minutes 360 \
+    --disable-gloo-process-groups \
+    --ddp-num-buckets 7"
+
+# Expert parallelism for MoE
+if [[ "$IS_MOE" == "true" && "$EP_SIZE" -gt 1 ]]; then
+    PARALLEL_ARGS="${PARALLEL_ARGS} --expert-model-parallel-size ${EP_SIZE}"
+    log_info "MoE Expert Parallelism: EP=${EP_SIZE}"
+fi
+
+# Sequence parallel (add if not in model config)
+if ! echo "$STUDENT_MODEL_ARGS" | grep -q "sequence-parallel"; then
+    PARALLEL_ARGS="${PARALLEL_ARGS} --sequence-parallel"
+fi
echo "$STUDENT_MODEL_ARGS" | grep -q "sequence-parallel"; then + PARALLEL_ARGS="${PARALLEL_ARGS} --sequence-parallel" +fi + +# MoE performance optimizations +MOE_PERF_ARGS="" +if [[ "$IS_MOE" == "true" && "$ENABLE_MOE_PERF" == "1" ]]; then + log_info "MoE Performance Optimizations: ENABLED" + MOE_PERF_ARGS=" \ + --moe-token-dispatcher-type alltoall \ + --moe-shared-expert-overlap \ + --moe-permute-fusion \ + --moe-grouped-gemm \ + --cross-entropy-loss-fusion \ + --cross-entropy-fusion-impl native" + + if [[ "$ENABLE_MOE_EXPERIMENTAL" == "1" ]]; then + MOE_PERF_ARGS="${MOE_PERF_ARGS} --enable-experimental" + log_warn "Experimental MoE features enabled" + fi +elif [[ "$IS_MOE" == "true" ]]; then + log_warn "MoE Performance Optimizations: DISABLED" +fi + +# Memory optimization +MEMORY_ARGS=" \ + --recompute-granularity full \ + --recompute-method uniform \ + --recompute-num-layers 1 \ + --no-gradient-accumulation-fusion" + +# Checkpoint saving +SAVE_ARGS=" \ + --save ${CHECKPOINT_DIR} \ + --save-interval ${SAVE_INTERVAL} \ + --save-retain-interval ${SAVE_RETAIN_INTERVAL} \ + --ckpt-format torch_dist \ + --ckpt-fully-parallel-save \ + --ckpt-assume-constant-structure \ + ${CKPT_PARALLEL_LOAD_ARG}" + +# Logging +LOGGING_ARGS=" \ + --log-interval ${LOG_INTERVAL} \ + --eval-iters ${EVAL_ITERS} \ + --eval-interval ${EVAL_INTERVAL} \ + --log-progress \ + --timing-log-option minmax \ + ${LOG_PARAMS_NORM_ARG:-} \ + --log-num-zeros-in-grad \ + --log-throughput \ + --log-straggler \ + --disable-straggler-on-startup \ + --straggler-minmax-count 16 \ + --tensorboard-dir ${TENSORBOARD_DIR}" + +# Runtime +RUNTIME_ARGS=" \ + --exit-duration-in-mins 1200 \ + --num-workers 8 \ + --no-check-for-nan-in-loss-and-grad" + +# Combine all arguments +ALL_ARGS=" \ + ${CHECKPOINT_ARGS} \ + ${STUDENT_MODEL_ARGS} \ + ${TOKENIZER_ARGS} \ + ${DATA_ARGS} \ + ${TRAINING_ARGS} \ + ${OPTIMIZER_ARGS} \ + ${PARALLEL_ARGS} \ + ${MOE_PERF_ARGS} \ + ${MEMORY_ARGS} \ + ${SAVE_ARGS} \ + ${LOGGING_ARGS} \ + ${RUNTIME_ARGS}" + +# Optional: iterations to skip +[[ -n "$ITERATIONS_TO_SKIP" ]] && ALL_ARGS="${ALL_ARGS} --iterations-to-skip ${ITERATIONS_TO_SKIP}" + +# === Launch Training === +export PYTHONPATH="${MODELOPT_DIR}:${MLM_DIR}:${PYTHONPATH:-}" +LOG_FILE="${LOGS_DIR}/${STUDENT_CKPT_NAME}_qad_${DATETIME}.log" + +log_info "Starting training..." +log_info "Log file: ${LOG_FILE}" +log_info "Distributed: ${NNODES} nodes x ${NUM_GPUS} GPUs = $((NNODES * NUM_GPUS)) total" + +torchrun \ + --nproc_per_node="${NUM_GPUS}" \ + --nnodes="${NNODES}" \ + --node_rank="${NODE_RANK}" \ + --master_addr="${MASTER_ADDR}" \ + --master_port="${MASTER_PORT}" \ + "${MLM_DIR}/pretrain_gpt.py" ${ALL_ARGS} 2>&1 | tee "${LOG_FILE}" + +log_info "Training completed. Logs: ${LOG_FILE}" diff --git a/examples/llm_qad/sbatch_qad.sh b/examples/llm_qad/sbatch_qad.sh new file mode 100755 index 000000000..613b9bc27 --- /dev/null +++ b/examples/llm_qad/sbatch_qad.sh @@ -0,0 +1,168 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# QAD SLURM Batch Submission Script
+# Usage: sbatch sbatch_qad.sh --config configs/your-config.conf
+# Override: sbatch --nodes=4 --account=<account> sbatch_qad.sh --config ...
+
+#SBATCH -p batch
+#SBATCH --account=<account>
+#SBATCH --nodes=4
+#SBATCH -t 4:00:00
+#SBATCH --exclusive
+#SBATCH --mem=0
+#SBATCH --gres=gpu:4
+#SBATCH --ntasks-per-node=1
+#SBATCH --job-name=qad-training
+
+set -x -e
+
+# === Parse Arguments ===
+SCRIPT_DIR="${SLURM_SUBMIT_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"
+CONFIG_FILE=""
+HF_TOKEN_ARG=""
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --config|-c) CONFIG_FILE="$2"; shift 2;;
+        --hf-token) HF_TOKEN_ARG="$2"; shift 2;;
+        *) break;;
+    esac
+done
+
+[[ -n "$HF_TOKEN_ARG" ]] && export HF_TOKEN="$HF_TOKEN_ARG"
+
+# === Load Config ===
+if [[ -n "$CONFIG_FILE" ]]; then
+    [[ "$CONFIG_FILE" = /* ]] || CONFIG_FILE="${SCRIPT_DIR}/${CONFIG_FILE}"
+    if [[ -f "$CONFIG_FILE" ]]; then
+        echo "Loading config: ${CONFIG_FILE}"
+        source "$CONFIG_FILE"
+    else
+        echo "ERROR: Config not found: ${CONFIG_FILE}"
+        ls -1 "${SCRIPT_DIR}/configs/"*.conf 2>/dev/null || echo "(no configs found)"
+        exit 1
+    fi
+fi
+
+# === Default Paths (override in config) ===
+MLM_DIR="${MLM_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/Megatron-LM}"
+MODELOPT_DIR="${MODELOPT_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer}"
+MODELS_ROOT="${MODELS_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/models}"
+QAD_CHECKPOINT_ROOT="${QAD_CHECKPOINT_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/checkpoints}"
+DATACACHE_DIR="${DATACACHE_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/data_cache}"
+LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}"
+
+# Container settings
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/containers/pytorch_25.06-py3.sqsh}"
+CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-/lustre/fs1:/lustre/fs1}"
+CONTAINER_WORKDIR="${CONTAINER_WORKDIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer/examples/llm_qad}"
+
+# Parallelism (required from config)
+TP_SIZE="${TP_SIZE:?ERROR: TP_SIZE must be set in config}"
+MBS="${MBS:?ERROR: MBS must be set in config}"
+PP_SIZE="${PP_SIZE:-1}"
+EP_SIZE="${EP_SIZE:-1}"
+NUM_GPUS="${NUM_GPUS:-8}"
+MASTER_PORT="${MASTER_PORT:-29500}"
+
+# Multi-node from SLURM
+NNODES="${SLURM_NNODES:-4}"
+MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
+
+mkdir -p "${LOG_DIR}"
+DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
+
+# === Display Configuration ===
+echo "========================================"
+echo "QAD Training Configuration"
+echo "========================================"
+[[ -n "$CONFIG_FILE" ]] && echo "Config: ${CONFIG_FILE}"
+echo "Model: ${STUDENT_MODEL:-unknown} -> Teacher: ${TEACHER_MODEL:-unknown}"
+echo "LR: ${LR:-?} | Dataset: ${DATASET_NAME:-?}"
+echo "Parallelism: TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} MBS=${MBS}"
+echo "Nodes: ${NNODES} x ${NUM_GPUS} GPUs = $((NNODES * NUM_GPUS)) total"
+echo "Master: ${MASTER_ADDR}:${MASTER_PORT}"
+echo ""
+echo "Paths:"
+echo "  MLM_DIR: ${MLM_DIR}"
+echo "  MODELOPT_DIR: ${MODELOPT_DIR}"
+echo "  Checkpoints: ${QAD_CHECKPOINT_ROOT}"
+echo ""
+echo "Container: ${CONTAINER_IMAGE}"
+echo ""
+echo "Checkpoints:"
+echo "  Student: ${STUDENT_CKPT:-NOT SET}"
+echo "  Teacher: ${TEACHER_CKPT:-NOT SET}"
+[[ -n "${BLEND_PATH:-}" ]] && echo "  Blend: ${BLEND_PATH}"
+echo "========================================"
+
+# Validate required
+[[ -z "${STUDENT_CKPT:-}" ]] && echo "ERROR: STUDENT_CKPT required" && exit 1
+[[ -z "${TEACHER_CKPT:-}" ]] && echo "ERROR: TEACHER_CKPT required" && exit 1
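+
+# Note: the resolved configuration is passed to each rank explicitly via the
+# EXPORTS string below (NODE_RANK is derived per rank from SLURM_PROCID), so
+# qad.sh sees a consistent, fully resolved environment inside the container on
+# every node.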
+
+# === Build Container Exports ===
+# Use local /tmp for Triton cache to avoid race conditions
+EXPORTS="export TRITON_CACHE_DIR=/tmp/triton_cache_\${SLURM_JOB_ID}_\${SLURM_PROCID}"
+EXPORTS="${EXPORTS} && export NODE_RANK=\${SLURM_PROCID}"
+EXPORTS="${EXPORTS} && export NNODES=${NNODES} NUM_GPUS=${NUM_GPUS}"
+EXPORTS="${EXPORTS} && export TP_SIZE=${TP_SIZE} PP_SIZE=${PP_SIZE} EP_SIZE=${EP_SIZE} MBS=${MBS}"
+EXPORTS="${EXPORTS} && export IS_MOE=${IS_MOE:-false}"
+EXPORTS="${EXPORTS} && export MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT}"
+EXPORTS="${EXPORTS} && export MLM_DIR=${MLM_DIR} MODELOPT_DIR=${MODELOPT_DIR}"
+EXPORTS="${EXPORTS} && export QAD_CHECKPOINT_ROOT=${QAD_CHECKPOINT_ROOT} DATACACHE_DIR=${DATACACHE_DIR}"
+EXPORTS="${EXPORTS} && export STUDENT_CKPT=${STUDENT_CKPT} TEACHER_CKPT=${TEACHER_CKPT}"
+
+# Training hyperparameters
+for v in LR GBS MIN_LR LR_DECAY_STYLE SAVE_INTERVAL LOG_INTERVAL STUDENT_MODEL TEACHER_MODEL DATASET_NAME; do
+    [[ -n "${!v:-}" ]] && EXPORTS="${EXPORTS} && export ${v}=${!v}"
+done
+
+# Model config
+[[ -n "${STUDENT_CONFIG_FILE:-}" ]] && EXPORTS="${EXPORTS} && export STUDENT_CONFIG_FILE=${STUDENT_CONFIG_FILE}"
+[[ -n "${TOKENIZER_MODEL:-}" ]] && EXPORTS="${EXPORTS} && export TOKENIZER_MODEL=${TOKENIZER_MODEL}"
+[[ -n "${TEACHER_MODEL_CONFIG:-}" ]] && EXPORTS="${EXPORTS} && export TEACHER_MODEL_CONFIG=${TEACHER_MODEL_CONFIG}"
+
+# Dataset
+[[ -n "${BLEND_PATH:-}" ]] && EXPORTS="${EXPORTS} && export BLEND_PATH=${BLEND_PATH}"
+[[ -n "${TRAIN_SAMPLES:-}" ]] && EXPORTS="${EXPORTS} && export TRAIN_SAMPLES=${TRAIN_SAMPLES}"
+
+# Optional
+[[ -n "${HF_TOKEN:-}" ]] && EXPORTS="${EXPORTS} && export HF_TOKEN=${HF_TOKEN} HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}"
+[[ -n "${ITERATIONS_TO_SKIP:-}" ]] && EXPORTS="${EXPORTS} && export ITERATIONS_TO_SKIP=${ITERATIONS_TO_SKIP}"
+[[ -n "${DISTILL_CONFIG_PATH:-}" ]] && EXPORTS="${EXPORTS} && export DISTILL_CONFIG_PATH=${DISTILL_CONFIG_PATH}"
+
+# === Launch ===
+CONFIG_ARGS=""
+[[ -n "${CONFIG_FILE}" ]] && CONFIG_ARGS="--config ${CONFIG_FILE}"
+[[ -n "${HF_TOKEN:-}" ]] && CONFIG_ARGS="${CONFIG_ARGS} --hf-token ${HF_TOKEN}"
+
+run_cmd="pip install transformers==4.54 && ${EXPORTS} && cd ${CONTAINER_WORKDIR} && bash qad.sh ${CONFIG_ARGS}"
+
+echo "Running: ${run_cmd}"
+
+srun -l \
+    --output=${LOG_DIR}/%x_%j_${DATETIME}.log \
+    --error=${LOG_DIR}/err_%x_%j_${DATETIME}.log \
+    --container-image ${CONTAINER_IMAGE} \
+    --container-mounts ${CONTAINER_MOUNTS} \
+    --container-workdir ${CONTAINER_WORKDIR} \
+    sh -c "${run_cmd}"
+
+echo "========================================"
+echo "QAD Training completed at $(date)"
+echo "Logs: ${LOG_DIR}/"
+echo "========================================"