diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 61c198026..826b51160 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add support for subgraphs in ONNX autocast.
 - Add support for parallel draft heads in Eagle speculative decoding.
 - Add support to enable custom emulated quantization backend. See :meth:`register_quant_backend` for more details. See an example in ``tests/unit/torch/quantization/test_custom_backend.py``.
+- Add ``examples/llm_qad`` for QAD training with Megatron-LM.
 
 **Deprecations**
 
diff --git a/examples/llm_qad/README.md b/examples/llm_qad/README.md
new file mode 100644
index 000000000..68fd01849
--- /dev/null
+++ b/examples/llm_qad/README.md
@@ -0,0 +1,170 @@
+# QAD Training Scripts
+
+Quantization-Aware Distillation (QAD) training scripts for language models using Megatron-LM. These scripts enable training quantized (e.g., NVFP4) student models with knowledge distillation from full-precision teacher models.
+
+## Overview
+
+| Script | Purpose |
+|--------|---------|
+| `qad.sh` | Main training script (run inside container) |
+| `sbatch_qad.sh` | SLURM batch submission wrapper |
+| `configs/*.conf` | Model-specific configuration files |
+
+## Requirements
+
+### Clone Required Repositories
+
+```bash
+# Set your workspace directory
+export WORKSPACE=/path/to/your/workspace
+
+# Clone Megatron-LM (with ModelOpt integration)
+git clone https://github.com/NVIDIA/Megatron-LM.git ${WORKSPACE}/Megatron-LM
+
+# Clone Model-Optimizer
+git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git ${WORKSPACE}/Model-Optimizer
+```
+
+### Prepare Checkpoints
+
+You need the following checkpoints before training:
+
+1. **Student checkpoint**: Quantized (e.g., NVFP4) model in Megatron-LM format
+2. **Teacher checkpoint**: Full-precision (BF16) model in Megatron-LM format
+3. **Teacher config YAML**: Model architecture configuration
+
+See [Megatron-LM ModelOpt examples](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt) for checkpoint conversion from HuggingFace format.
+
+## Creating a Configuration
+
+### Available Templates
+
+| Config | Model | Type |
+|--------|-------|------|
+| `qwen3-30b-a3b-instruct-2507-moe_template.conf` | Qwen3-30B-A3B-Instruct | MoE |
+| `qwen3-8b_template.conf` | Qwen3-8B | Dense |
+
+### Create Your Config
+
+1. Copy a template:
+
+   ```bash
+   # For MoE models
+   cp configs/qwen3-30b-a3b-instruct-2507-moe_template.conf configs/my-experiment.conf
+
+   # For Dense models
+   cp configs/qwen3-8b_template.conf configs/my-experiment.conf
+   ```
+
+2. Fill in the required fields (a filled-in sketch follows the tables below):
+
+   **Checkpoints** (required):
+
+   | Variable | Description |
+   |----------|-------------|
+   | `STUDENT_CKPT` | Path to quantized student MLM checkpoint |
+   | `TEACHER_CKPT` | Path to teacher MLM checkpoint |
+   | `TEACHER_MODEL_CONFIG` | Path to teacher YAML config (see below) |
+
+   **Paths** (required):
+
+   | Variable | Description |
+   |----------|-------------|
+   | `MLM_DIR` | Path to Megatron-LM directory |
+   | `BLEND_PATH` | Path to datablend JSON (from dataset generation) |
+
+   **Parallelism** (adjust for your hardware):
+
+   | Variable | Dense Model | MoE Model |
+   |----------|-------------|-----------|
+   | `IS_MOE` | `false` | `true` |
+   | `TP_SIZE` | `1` | `2` |
+   | `EP_SIZE` | `1` | `4` |
+   | `MBS` | `4` | `2` |
+
+   **Training** (tune as needed; defaults shown are the template values):
+
+   | Variable | Default | Description |
+   |----------|---------|-------------|
+   | `LR` | `5e-6` | Learning rate |
+   | `GBS` | `64` | Global batch size |
+   | `SAVE_INTERVAL` | `200` | Checkpoint interval |
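+
+   For reference, a minimal filled-in dense-model config might look like this (all paths below are hypothetical placeholders; substitute your own):
+
+   ```bash
+   # configs/my-experiment.conf (sketch)
+   export STUDENT_CKPT=/checkpoints/Qwen3-8B-NVFP4-mlm       # quantized student (MLM format)
+   export TEACHER_CKPT=/checkpoints/Qwen3-8B-BF16-mlm        # full-precision teacher (MLM format)
+   export TEACHER_MODEL_CONFIG=configs/Qwen3-8B-teacher.yaml
+   export MLM_DIR=${WORKSPACE}/Megatron-LM
+   export MODELOPT_DIR=${WORKSPACE}/Model-Optimizer
+   export BLEND_PATH=/datasets/datablend_combined.json
+   export IS_MOE=false
+   export TP_SIZE=1
+   export EP_SIZE=1
+   export MBS=4
+   ```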
+
+### Teacher Model Config (YAML)
+
+Create a YAML file with the teacher model architecture (example: `configs/Qwen3-30B-A3B-teacher.yaml`):
+
+```yaml
+num_layers: 48
+hidden_size: 2048
+num_attention_heads: 32
+num_query_groups: 4
+kv_channels: 128
+ffn_hidden_size: 6144
+```
+
+## Dataset Generation
+
+Use the all-in-one script to generate the default datablend:
+
+```bash
+cd data_utils/
+
+bash generate_dataset.sh \
+    --output-dir /path/to/datasets \
+    --mlm-path /path/to/Megatron-LM \
+    --tokenizer <tokenizer>  # e.g., Qwen/Qwen3-30B-A3B-Instruct-2507
+```
+
+**Requirements**: HuggingFace token for `nvidia/Nemotron-Post-Training-Dataset-v2`. Log in first: `huggingface-cli login`
+
+**Output**: Creates `datablend_combined.json` with the OpenScience + Nemotron-v2 datasets. Set `BLEND_PATH` in your config to point to this file.
+
+## Quick Start
+
+### SLURM Batch Submission (Recommended)
+
+First, update the `sbatch_qad.sh` SLURM header with your cluster settings:
+
+- `--account=<your_account>`
+- `--nodes`, `--gres=gpu`, `-t` as needed
+
+```bash
+# Submit training job (override account on the command line)
+sbatch --account=<account> sbatch_qad.sh --config configs/my-experiment.conf
+
+# With HuggingFace token (for gated models)
+sbatch --account=<account> sbatch_qad.sh --hf-token $HF_TOKEN --config configs/my-experiment.conf
+
+# Adjust nodes and time
+sbatch --account=<account> --nodes=4 -t 8:00:00 sbatch_qad.sh --config configs/my-experiment.conf
+```
+
+### Interactive Mode
+
+```bash
+# Get an interactive node
+srun -A <account> --nodes=1 -p batch --mpi=pmix \
+    --container-image=nvcr.io/nvidia/pytorch:25.06-py3 \
+    --container-mounts="..." \
+    -t 4:0:0 --pty bash
+
+# Run training
+bash qad.sh --config configs/qwen3-8b.conf
+```
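+
+### Monitoring
+
+`qad.sh` writes logs and TensorBoard events under `QAD_CHECKPOINT_ROOT`, in a run directory named from the student/teacher checkpoint names, dataset name, and LR settings. A minimal sketch for following a run (`<run-dir>` is a placeholder for that directory):
+
+```bash
+# Tail the latest training log
+tail -f ${QAD_CHECKPOINT_ROOT}/<run-dir>/logs/*.log
+
+# Inspect loss curves in TensorBoard
+tensorboard --logdir ${QAD_CHECKPOINT_ROOT}/<run-dir>/tensorboard
+```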
+
+## Resuming Training
+
+Training resumes automatically from the latest checkpoint. To force a fresh start:
+
+```bash
+rm -rf /path/to/checkpoints/*/latest_checkpointed_iteration.txt
+```
+
+## Troubleshooting
+
+### OOM Errors
+
+- Reduce `MBS`
+- Increase `EP_SIZE`, `TP_SIZE`, `PP_SIZE`
+- Add more nodes
diff --git a/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf b/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
new file mode 100644
index 000000000..52ca5efe0
--- /dev/null
+++ b/examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
@@ -0,0 +1,73 @@
+#!/bin/bash
+########################################################
+# QAD Configuration: Qwen3-30B-A3B Instruct (MoE)
+# Mixture of Experts - requires more resources
+#
+# Usage:
+#   sbatch sbatch_qad.sh --config configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
+########################################################
+
+########################################################
+# MODEL
+########################################################
+export STUDENT_MODEL="Qwen3-30B-A3B-Instruct-2507"
+export TEACHER_MODEL="Qwen3-30B-A3B-Instruct-2507"
+export TOKENIZER_MODEL="Qwen/Qwen3-30B-A3B-Instruct-2507"
+
+########################################################
+# CHECKPOINTS (REQUIRED)
+########################################################
+export STUDENT_CKPT=""          # Student MLM checkpoint path
+export TEACHER_CKPT=""          # Teacher MLM checkpoint path
+export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config yaml file, e.g., configs/Qwen3-30B-A3B-teacher.yaml
+
+########################################################
+# TRAINING (REQUIRED - no defaults in qad.sh)
+########################################################
+export LR="5e-6"
+export GBS=64
+export MIN_LR="1e-8"
+export LR_DECAY_STYLE="cosine"
+export SAVE_INTERVAL=200
+export LOG_INTERVAL=10
+export DATASET_NAME="openscience_nemotron"  # used for logging
+export TRAIN_SAMPLES=5120000
+
+########################################################
+# PARALLELISM
+# Note: QAD loads both student + teacher models, requires more memory
+########################################################
+export TP_SIZE=2
+export PP_SIZE=1
+export MBS=2
+export NUM_GPUS=4
+export MASTER_PORT=29500
+
+########################################################
+# MOE
+########################################################
+export EP_SIZE=4
+export IS_MOE=true
+
+########################################################
+# PATHS (REQUIRED - no defaults in qad.sh)
+########################################################
+export MLM_DIR=""              # path to Megatron-LM source directory
+export MODELOPT_DIR=""         # path to Model-Optimizer source directory
+export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-30B-A3B.sh
+export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
+export DATACACHE_DIR=""        # path to data cache directory
+
+########################################################
+# CONTAINER
+########################################################
+export CONTAINER_IMAGE=""      # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
+export CONTAINER_MOUNTS=""     # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
+export CONTAINER_WORKDIR=""    # container work directory, e.g., "/Model-Optimizer/examples/llm_qad"
+
+########################################################
+# DATASET
+########################################################
+# Generate with: bash data_utils/generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
\ No newline at end of file
diff --git a/examples/llm_qad/configs/qwen3-8b_template.conf b/examples/llm_qad/configs/qwen3-8b_template.conf
new file mode 100644
index 000000000..1af932b39
--- /dev/null
+++ b/examples/llm_qad/configs/qwen3-8b_template.conf
@@ -0,0 +1,71 @@
+#!/bin/bash
+########################################################
+# QAD Configuration: Qwen3-8B (Dense Model)
+#
+# Usage:
+#   sbatch sbatch_qad.sh --config configs/qwen3-8b_template.conf
+########################################################
+
+########################################################
+# MODEL
+########################################################
+export STUDENT_MODEL="Qwen3-8B"
+export TEACHER_MODEL="Qwen3-8B"
+export TOKENIZER_MODEL="Qwen/Qwen3-8B"
+
+########################################################
+# CHECKPOINTS (REQUIRED)
+########################################################
+export STUDENT_CKPT=""          # Student MLM checkpoint path
+export TEACHER_CKPT=""          # Teacher MLM checkpoint path
+export TEACHER_MODEL_CONFIG=""  # Teacher MLM model config yaml file
+
+########################################################
+# TRAINING
+########################################################
+export LR="5e-6"
+export GBS=64
+export MIN_LR="1e-8"
+export LR_DECAY_STYLE="cosine"
+export SAVE_INTERVAL=200
+export LOG_INTERVAL=10
+export DATASET_NAME="openscience_nemotron"  # used for logging
+export TRAIN_SAMPLES=5120000
+
+########################################################
+# PARALLELISM (Dense model - simpler settings)
+########################################################
+export TP_SIZE=1
+export PP_SIZE=1
+export MBS=4
+export NUM_GPUS=8
+export MASTER_PORT=29500
+
+########################################################
+# MOE
+########################################################
+export EP_SIZE=1
+export IS_MOE=false
+
+########################################################
+# PATHS (REQUIRED)
+########################################################
+export MLM_DIR=""              # path to Megatron-LM source directory
+export MODELOPT_DIR=""         # path to Model-Optimizer source directory
+export STUDENT_CONFIG_FILE=""  # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
+export QAD_CHECKPOINT_ROOT=""  # path to store QAD checkpoints
+export DATACACHE_DIR=""        # path to data cache directory
+
+########################################################
+# CONTAINER
+########################################################
+export CONTAINER_IMAGE=""      # path to container image, e.g., nvcr.io/nvidia/pytorch:25.06-py3
+export CONTAINER_MOUNTS=""     # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
+export CONTAINER_WORKDIR=""    # container work directory
+
+########################################################
+# DATASET
+########################################################
+# Generate with: bash data_utils/generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+export BLEND_PATH=""  # path to datablend_combined.json from generate_dataset.sh
+
diff --git a/examples/llm_qad/data_utils/download_dataset.py b/examples/llm_qad/data_utils/download_dataset.py
new file mode 100644
index 000000000..e3e3d0646
--- /dev/null
+++ b/examples/llm_qad/data_utils/download_dataset.py
@@ -0,0 +1,201 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Download datasets for QAD training (OpenScience, Nemotron-v2)."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import random
+from typing import Any
+
+from tqdm import tqdm
+
+SEED = 42
+TRAIN_RATIO, VALID_RATIO = 0.95, 0.025
+_TOKENIZER = None
+
+
+def init_tokenizer(name: str) -> None:
+    """Load HuggingFace tokenizer for chat template."""
+    global _TOKENIZER
+    if name:
+        from transformers import AutoTokenizer
+
+        print(f"Loading tokenizer: {name}")
+        _TOKENIZER = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
+
+
+def format_text(messages: list[dict], reasoning: str = "") -> str:
+    """Format messages to text using tokenizer chat template or simple format."""
+    # Add reasoning as a <think>...</think> block if provided
+    if reasoning.strip():
+        messages = messages.copy()
+        for i, m in enumerate(messages):
+            if m.get("role") == "assistant" and i == len(messages) - 1:
+                messages[i] = {
+                    "role": "assistant",
+                    "content": f"<think>\n{reasoning}\n</think>\n\n{m.get('content', '')}",
+                }
+
+    if _TOKENIZER:
+        try:
+            return _TOKENIZER.apply_chat_template(messages, tokenize=False)
+        except Exception:
+            pass
+
+    # Fallback
+    return "\n\n".join(f"{m['role'].title()}: {m['content']}" for m in messages if m.get("content"))
+
+
+def split_and_save(examples: list[dict], output_dir: str, prefix: str) -> dict[str, int]:
+    """Shuffle, split into train/valid/test, and save as JSONL."""
+    random.seed(SEED)
+    random.shuffle(examples)
+
+    n = len(examples)
+    train_end = int(n * TRAIN_RATIO)
+    valid_end = train_end + int(n * VALID_RATIO)
+
+    splits = {
+        "train": examples[:train_end],
+        "validation": examples[train_end:valid_end],
+        "test": examples[valid_end:],
+    }
+
+    os.makedirs(output_dir, exist_ok=True)
+    counts = {}
+    for name, data in splits.items():
+        path = os.path.join(output_dir, f"{prefix}_{name}.jsonl")
+        with open(path, "w") as f:
+            f.writelines(json.dumps(d, ensure_ascii=False) + "\n" for d in data)
+        counts[name] = len(data)
+        print(f"  {name}: {len(data):,}")
+
+    return counts
+
+
+def download_openscience(output_dir: str, use_chat: bool) -> dict[str, Any]:
+    """Download nvidia/OpenScience dataset."""
+    from datasets import load_dataset
+
+    print("\nDownloading nvidia/OpenScience...")
+    ds = load_dataset("nvidia/OpenScience", "OS-Q3-235B-4")
+    data = ds["train"] if "train" in ds else ds[next(iter(ds.keys()))]
+
+    print(f"Processing {len(data)} examples...")
+    suffix = "_chat" if use_chat else ""
+    examples = []
+    for ex in tqdm(data.shuffle(seed=SEED), desc="openscience"):
+        msgs = [
+            {"role": "user", "content": ex.get("input", "")},
+            {"role": "assistant", "content": ex.get("output", "")},
+        ]
+        examples.append({"text": format_text(msgs)})
+
+    counts = split_and_save(examples, output_dir, f"openscience{suffix}")
+    return {"dataset": "openscience", "total": len(examples), **counts}
+
+
+def download_nemotron_v2(
+    output_dir: str, splits: list[str], sample_pct: float, suffix: str, include_reasoning: bool
+) -> list[dict[str, Any]]:
+    """Download nvidia/Nemotron-Post-Training-Dataset-v2 splits."""
+    from datasets import load_dataset
+
+    print(f"\nDownloading Nemotron-v2 ({', '.join(splits)}) @ {sample_pct}%...")
+    results = []
+
+    for split in splits:
+        print(f"\n{split}:")
+        ds = load_dataset("nvidia/Nemotron-Post-Training-Dataset-v2", split=split, streaming=True)
+
+        examples = []
+        for ex in tqdm(ds, desc=split):
+            msgs = ex.get("messages", [])
+            reasoning = ex.get("reasoning", "") if include_reasoning else ""
+            text = format_text(msgs, reasoning)
+            if text.strip():
+                examples.append({"text": text})
+
+        # Sample if needed
+        if sample_pct < 100:
+            random.seed(SEED)
+            target = int(len(examples) * sample_pct / 100)
+            examples = random.sample(examples, min(target, len(examples)))
+            print(f"  Sampled to {len(examples):,}")
+
+        if not examples:
+            continue
+
+        split_dir = os.path.join(output_dir, split)
+        counts = split_and_save(examples, split_dir, f"{split}_{suffix}")
+        results.append({"split_name": split, "total": len(examples), **counts})
+
+    return results
+
+
+def main():
+    p = argparse.ArgumentParser(description="Download QAD datasets")
+    p.add_argument("--dataset", required=True, choices=["openscience", "nemotron-v2", "all"])
+    p.add_argument("--output-dir", required=True)
+    p.add_argument("--tokenizer", help="HuggingFace tokenizer for chat template")
+    p.add_argument("--splits", default="stem,math,code,chat", help="Nemotron-v2 splits")
+    p.add_argument("--sample-percent", type=float, default=30.0)
+    p.add_argument(
+        "--include-reasoning", action="store_true", help="Include CoT reasoning for thinking models"
+    )
+    args = p.parse_args()
+
+    if args.tokenizer:
+        init_tokenizer(args.tokenizer)
+
+    # Build suffix
+    suffix = f"{int(args.sample_percent)}pct"
+    if args.include_reasoning:
+        suffix += "_cot"
+    if args.tokenizer:
+        suffix += "_chat"
+
+    results = []
+
+    if args.dataset in ["openscience", "all"]:
+        info = download_openscience(
+            os.path.join(args.output_dir, "openscience_splits"), args.tokenizer is not None
+        )
+        results.append(info)
+
+    if args.dataset in ["nemotron-v2", "all"]:
+        infos = download_nemotron_v2(
+            os.path.join(args.output_dir, "nemotron_v2"),
+            [s.strip() for s in args.splits.split(",")],
+            args.sample_percent,
+            suffix,
+            args.include_reasoning,
+        )
+        results.extend(infos)
+
+    print("\n" + "=" * 50)
+    print("Download complete!")
+    for r in results:
+        name = r.get("dataset") or r.get("split_name")
+        print(f"  {name}: {r['total']:,} (train={r['train']:,})")
+    print("=" * 50)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/llm_qad/data_utils/generate_dataset.sh b/examples/llm_qad/data_utils/generate_dataset.sh
new file mode 100755
index 000000000..39d678df9
--- /dev/null
+++ b/examples/llm_qad/data_utils/generate_dataset.sh
@@ -0,0 +1,105 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Download and preprocess OpenScience + Nemotron-v2 datasets for QAD training.
+# Usage: bash generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>
+
+set -e
+
+# Defaults
+OUTPUT_DIR="" MLM_DIR="" TOKENIZER="" SAMPLE_PERCENT=30 INCLUDE_REASONING=false WORKERS=32
+
+# Parse args
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --output-dir) OUTPUT_DIR="$2"; shift 2;;
+        --mlm-path) MLM_DIR="$2"; shift 2;;
+        --tokenizer) TOKENIZER="$2"; shift 2;;
+        --sample-percent) SAMPLE_PERCENT="$2"; shift 2;;
+        --include-reasoning) INCLUDE_REASONING=true; shift;;
+        --workers) WORKERS="$2"; shift 2;;
+        *) echo "Unknown: $1"; exit 1;;
+    esac
+done
+
+# Validate
+if [ -z "$OUTPUT_DIR" ] || [ -z "$MLM_DIR" ] || [ -z "$TOKENIZER" ]; then
+    echo "Usage: bash generate_dataset.sh --output-dir <output_dir> --mlm-path <mlm_path> --tokenizer <tokenizer>"
+    echo "Optional: --sample-percent N --include-reasoning --workers N"
+    exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SUFFIX="${SAMPLE_PERCENT}pct$( [ "$INCLUDE_REASONING" = true ] && echo "_cot" )_chat"
+REASONING_FLAG=$( [ "$INCLUDE_REASONING" = true ] && echo "--include-reasoning" )
+
+echo "=== QAD Dataset Generation ==="
+echo "Output: $OUTPUT_DIR | Tokenizer: $TOKENIZER | Sample: ${SAMPLE_PERCENT}%"
+
+# Helper: preprocess JSONL to Megatron format
+preprocess() {
+    [ -f "$1" ] && python "$MLM_DIR/tools/preprocess_data.py" \
+        --input "$1" --output-prefix "$2" \
+        --tokenizer-type HuggingFaceTokenizer --tokenizer-model "$TOKENIZER" \
+        --append-eod --workers "$WORKERS" --json-keys text
+}
+
+# Step 1: Download
+echo -e "\n=== Downloading ==="
+python "$SCRIPT_DIR/download_dataset.py" --dataset openscience --output-dir "$OUTPUT_DIR" --tokenizer "$TOKENIZER"
+python "$SCRIPT_DIR/download_dataset.py" --dataset nemotron-v2 --output-dir "$OUTPUT_DIR" \
+    --sample-percent "$SAMPLE_PERCENT" $REASONING_FLAG --tokenizer "$TOKENIZER"
+
+# Step 2: Preprocess
+echo -e "\n=== Preprocessing ==="
+OS_IN="$OUTPUT_DIR/openscience_splits" OS_OUT="$OUTPUT_DIR/openscience_splits_preprocessed"
+NV_IN="$OUTPUT_DIR/nemotron_v2" NV_OUT="$OUTPUT_DIR/nemotron_v2_preprocessed"
+mkdir -p "$OS_OUT"
+
+for s in train validation test; do preprocess "$OS_IN/openscience_chat_$s.jsonl" "$OS_OUT/openscience_chat_$s" || true; done
+
+for split in code math stem chat; do
+    mkdir -p "$NV_OUT/$split"
+    for s in train validation test; do
+        preprocess "$NV_IN/$split/${split}_${SUFFIX}_$s.jsonl" "$NV_OUT/$split/${split}_${SUFFIX}_$s" || true
+    done
+done
+
+# Step 3: Create combined datablend
+BLEND="$OUTPUT_DIR/datablend_combined.json"
+cat > "$BLEND" << EOF
+{
+    "train": [
+        0.3, "$NV_OUT/code/code_${SUFFIX}_train_text_document",
+        0.2, "$NV_OUT/math/math_${SUFFIX}_train_text_document",
+        0.2, "$NV_OUT/stem/stem_${SUFFIX}_train_text_document",
+        0.1, "$NV_OUT/chat/chat_${SUFFIX}_train_text_document",
+        0.2, "$OS_OUT/openscience_chat_train_text_document"
+    ],
+    "valid": [
+        0.5, "$NV_OUT/stem/stem_${SUFFIX}_validation_text_document",
+        0.5, "$OS_OUT/openscience_chat_validation_text_document"
+    ],
+    "test": [
+        0.5, "$NV_OUT/stem/stem_${SUFFIX}_test_text_document",
+        0.5, "$OS_OUT/openscience_chat_test_text_document"
+    ]
+}
+EOF
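+
+# Note: the numeric values in each split above are relative sampling weights for
+# Megatron-LM's blended dataset builder, not sample counts: the training blend
+# draws ~30% from Nemotron-v2 code, 20% each from math and stem, 10% from chat,
+# and 20% from OpenScience. Adjust the weights here to change the mixture.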
===" +echo "Datablend: $BLEND" +echo "Set BLEND_PATH in your config and run: sbatch sbatch_qad.sh --config " diff --git a/examples/llm_qad/qad.sh b/examples/llm_qad/qad.sh new file mode 100644 index 000000000..52ec2bd6a --- /dev/null +++ b/examples/llm_qad/qad.sh @@ -0,0 +1,348 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# QAD (Quantization-Aware Distillation) Training Script +# Usage: bash qad.sh --config configs/your-config.conf + +set -euo pipefail + +# === Helpers === +die() { echo "[ERROR] $*" >&2; exit 1; } +log_info() { echo "[INFO] $*"; } +log_warn() { echo "[WARN] $*"; } +require_var() { [[ -n "${!1:-}" ]] || die "$1 must be set in config"; } +require_file() { [[ -f "$1" ]] || die "${2:-File} not found: $1"; } +require_dir() { [[ -d "$1" ]] || die "${2:-Directory} not found: $1"; } +sanitize() { echo "$1" | sed -e 's/[\/ :]/_/g' -e 's/[=]/_/g'; } + +# === Environment === +export NCCL_IB_SL=1 +export NCCL_IB_TIMEOUT=19 +export NCCL_P2P_NET_CHUNKSIZE=2097152 +export NCCL_DEBUG=WARN +export NCCL_SHM_DISABLE=1 +export NCCL_NVLS_ENABLE=0 +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export UB_TIMEOUT=720 +export NVTE_FWD_LAYERNORM_SM_MARGIN=16 +export NVTE_BWD_LAYERNORM_SM_MARGIN=16 +export TORCHINDUCTOR_COMPILE_THREADS=1 +export TORCH_COMPILE_DISABLE=1 +export PYTORCH_NO_CUDA_MEMORY_CACHING=0 +export TORCH_DISTRIBUTED_DEBUG=OFF +export PYTORCH_JIT=0 +export TORCH_USE_CUDA_DSA=0 +export GLOO_SOCKET_IFNAME=ibp26s0 + +# === Argument Parsing === +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +CONFIG_FILE="" +HF_TOKEN_ARG="" + +while [[ $# -gt 0 ]]; do + case $1 in + --config|-c) CONFIG_FILE="$2"; shift 2;; + --hf-token) HF_TOKEN_ARG="$2"; shift 2;; + *) die "Unknown argument: $1";; + esac +done + +# HuggingFace token +[[ -n "$HF_TOKEN_ARG" ]] && export HF_TOKEN="$HF_TOKEN_ARG" +[[ -n "${HF_TOKEN:-}" ]] && export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" && log_info "HuggingFace token configured" + +# === Load Config === +if [[ -z "$CONFIG_FILE" ]]; then + die "Config file required. 
+
+# === Load Config ===
+if [[ -z "$CONFIG_FILE" ]]; then
+    die "Config file required. Use --config <config_file>. Available: $(ls -1 "${SCRIPT_DIR}/configs/"*.conf 2>/dev/null | tr '\n' ' ')"
+fi
+[[ "$CONFIG_FILE" = /* ]] || CONFIG_FILE="${SCRIPT_DIR}/${CONFIG_FILE}"
+require_file "$CONFIG_FILE" "Config file"
+log_info "Loading config: ${CONFIG_FILE}"
+source "$CONFIG_FILE"
+
+# === Validate Required Config ===
+for v in LR GBS MIN_LR LR_DECAY_STYLE SAVE_INTERVAL LOG_INTERVAL \
+    STUDENT_MODEL TEACHER_MODEL DATASET_NAME BLEND_PATH TRAIN_SAMPLES IS_MOE TOKENIZER_MODEL \
+    TP_SIZE MBS STUDENT_CKPT TEACHER_CKPT TEACHER_MODEL_CONFIG \
+    STUDENT_CONFIG_FILE MLM_DIR MODELOPT_DIR QAD_CHECKPOINT_ROOT DATACACHE_DIR; do
+    require_var "$v"
+done
+
+# === Defaults for Optional Config ===
+EP_SIZE="${EP_SIZE:-1}"
+PP_SIZE="${PP_SIZE:-1}"
+NUM_GPUS="${NUM_GPUS:-8}"
+NNODES="${NNODES:-1}"
+NODE_RANK="${NODE_RANK:-0}"
+MASTER_ADDR="${MASTER_ADDR:-localhost}"
+MASTER_PORT="${MASTER_PORT:-29500}"
+LR_DECAY_SAMPLES="${LR_DECAY_SAMPLES:-$(( TRAIN_SAMPLES * 99 / 100 ))}"
+LR_WARMUP_SAMPLES="${LR_WARMUP_SAMPLES:-$(( TRAIN_SAMPLES / 100 ))}"
+SAVE_RETAIN_INTERVAL="${SAVE_RETAIN_INTERVAL:-$SAVE_INTERVAL}"
+EVAL_INTERVAL="${EVAL_INTERVAL:-$SAVE_INTERVAL}"
+EVAL_ITERS="${EVAL_ITERS:-20}"
+MAX_SEQ="${MAX_SEQ:-}"
+RUN_TAG="${RUN_TAG:-}"
+KD_CFG_PATH="${KD_CFG_PATH:-}"
+ITERATIONS_TO_SKIP="${ITERATIONS_TO_SKIP:-}"
+ENABLE_MOE_PERF="${ENABLE_MOE_PERF:-1}"
+ENABLE_MOE_EXPERIMENTAL="${ENABLE_MOE_EXPERIMENTAL:-0}"
+LOG_PARAMS_NORM="${LOG_PARAMS_NORM:-}"
+
+# === Load Student Model Config ===
+require_file "$STUDENT_CONFIG_FILE" "Student model config"
+log_info "Loading student model config: ${STUDENT_CONFIG_FILE}"
+set +u; source "$STUDENT_CONFIG_FILE"; set -u
+STUDENT_MODEL_ARGS="${MODEL_ARGS}"
+
+# Log params norm (disabled for MoE to save memory)
+if [[ "${LOG_PARAMS_NORM}" == "1" ]]; then
+    LOG_PARAMS_NORM_ARG="--log-params-norm"
+elif [[ "$IS_MOE" == "true" ]]; then
+    LOG_PARAMS_NORM_ARG=""
+    log_warn "log-params-norm disabled for MoE model"
+else
+    LOG_PARAMS_NORM_ARG="--log-params-norm"
+fi
+
+log_info "Model: ${STUDENT_MODEL} | TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} MBS=${MBS} MoE=${IS_MOE}"
+
+# === Validate Checkpoints ===
+require_dir "$STUDENT_CKPT" "Student checkpoint"
+require_dir "$TEACHER_CKPT" "Teacher checkpoint"
+require_file "$TEACHER_MODEL_CONFIG" "Teacher model config"
+log_info "Student: ${STUDENT_CKPT}"
+log_info "Teacher: ${TEACHER_CKPT}"
+
+# === Output Paths ===
+DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
+STUDENT_CKPT_NAME=$(basename "${STUDENT_CKPT}")
+TEACHER_CKPT_NAME=$(basename "${TEACHER_CKPT}")
+
+TAG_PARTS="lr$(sanitize "$LR")-minlr$(sanitize "$MIN_LR")-decay$(sanitize "$LR_DECAY_STYLE")"
+[[ -n "$MAX_SEQ" ]] && TAG_PARTS="${TAG_PARTS}-seq${MAX_SEQ}"
+[[ -n "$RUN_TAG" ]] && TAG_PARTS="${TAG_PARTS}-tag$(sanitize "$RUN_TAG")"
+
+OUTPUT_ROOT="${QAD_CHECKPOINT_ROOT}/${STUDENT_CKPT_NAME}-Teacher-${TEACHER_CKPT_NAME}-Data-${DATASET_NAME}-${TAG_PARTS}"
+CHECKPOINT_DIR="${OUTPUT_ROOT}/checkpoints/${STUDENT_CKPT_NAME}"
+TENSORBOARD_DIR="${OUTPUT_ROOT}/tensorboard/${STUDENT_CKPT_NAME}"
+LOGS_DIR="${OUTPUT_ROOT}/logs"
+mkdir -p "${LOGS_DIR}" "${CHECKPOINT_DIR}" "${DATACACHE_DIR}" "${TENSORBOARD_DIR}"
+
+# === Resume Logic ===
+if [[ -f "${CHECKPOINT_DIR}/latest_checkpointed_iteration.txt" ]]; then
+    log_info "Resuming from: ${CHECKPOINT_DIR}"
+    LOAD_CHECKPOINT_DIR="${CHECKPOINT_DIR}"
+    FINETUNE_FLAG=""
+    LOAD_OPTIM_ARGS=""
+    CKPT_PARALLEL_LOAD_ARG="--ckpt-fully-parallel-load"
+else
+    log_info "Starting fresh from base checkpoint"
+    LOAD_CHECKPOINT_DIR="${STUDENT_CKPT}"
+    FINETUNE_FLAG="--finetune"
+    LOAD_OPTIM_ARGS="--no-load-optim --no-load-rng"
+    CKPT_PARALLEL_LOAD_ARG=""
+fi
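+
+# Note: --finetune together with --no-load-optim/--no-load-rng applies only to
+# the first run, when loading the base quantized student checkpoint. Once a QAD
+# checkpoint exists in CHECKPOINT_DIR, subsequent runs resume from it with
+# optimizer and RNG state intact.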
+
+# === Log Configuration ===
+ENV_LOG="${LOGS_DIR}/${STUDENT_CKPT_NAME}_${DATETIME}.env.log"
+{
+    echo "=== QAD Training: ${STUDENT_MODEL} ==="
+    echo "Time: ${DATETIME}"
+    echo "LR=${LR} MinLR=${MIN_LR} Decay=${LR_DECAY_STYLE} GBS=${GBS} MBS=${MBS}"
+    echo "TrainSamples=${TRAIN_SAMPLES} SaveInterval=${SAVE_INTERVAL} LogInterval=${LOG_INTERVAL}"
+    echo "TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} Nodes=${NNODES} GPUs/node=${NUM_GPUS}"
+    echo "Checkpoint: ${CHECKPOINT_DIR}"
+    echo "TensorBoard: ${TENSORBOARD_DIR}"
+    env
+} > "$ENV_LOG"
+
+# === Build Training Arguments ===
+
+# Checkpoint loading
+CHECKPOINT_ARGS=" \
+    --auto-detect-ckpt-format \
+    --export-te-mcore-model \
+    --dist-ckpt-strictness log_unexpected \
+    ${FINETUNE_FLAG} \
+    ${LOAD_OPTIM_ARGS} \
+    --load ${LOAD_CHECKPOINT_DIR} \
+    --export-kd-teacher-load ${TEACHER_CKPT} \
+    --teacher-model-config ${TEACHER_MODEL_CONFIG}"
+
+# KD config (optional)
+if [[ -n "$KD_CFG_PATH" && -f "$KD_CFG_PATH" ]]; then
+    CHECKPOINT_ARGS="${CHECKPOINT_ARGS} --export-kd-cfg ${KD_CFG_PATH}"
+    log_info "Using KD config: ${KD_CFG_PATH}"
+fi
+
+# Tokenizer
+TOKENIZER_ARGS=" \
+    --tokenizer-type HuggingFaceTokenizer \
+    --tokenizer-model ${TOKENIZER_MODEL}"
+
+# Data
+DATA_ARGS=" \
+    --per-split-data-args-path ${BLEND_PATH} \
+    --data-cache-path ${DATACACHE_DIR} \
+    --no-mmap-bin-files \
+    --num-dataset-builder-threads 16 \
+    --no-create-attention-mask-in-dataloader"
+
+# Sequence length override
+SEQ_ARGS=""
+if [[ -n "$MAX_SEQ" ]]; then
+    SEQ_ARGS="--seq-length ${MAX_SEQ} --max-position-embeddings ${MAX_SEQ}"
+    log_info "Sequence length override: ${MAX_SEQ}"
+fi
+
+# Training
+TRAINING_ARGS=" \
+    --micro-batch-size ${MBS} \
+    --global-batch-size ${GBS} \
+    --train-samples ${TRAIN_SAMPLES} \
+    --lr-decay-samples ${LR_DECAY_SAMPLES} \
+    --lr-warmup-samples ${LR_WARMUP_SAMPLES} \
+    --attention-dropout 0.0 \
+    --hidden-dropout 0.0 \
+    --bf16 \
+    ${SEQ_ARGS}"
+
+# Optimizer
+OPTIMIZER_ARGS=" \
+    --lr ${LR} \
+    --min-lr ${MIN_LR} \
+    --weight-decay 0.1 \
+    --clip-grad 1.0 \
+    --lr-decay-style ${LR_DECAY_STYLE} \
+    --adam-beta1 0.9 \
+    --adam-beta2 0.95 \
+    --use-distributed-optimizer \
+    --overlap-grad-reduce \
+    --overlap-param-gather"
+
+# Parallelism
+PARALLEL_ARGS=" \
+    --tensor-model-parallel-size ${TP_SIZE} \
+    --pipeline-model-parallel-size ${PP_SIZE} \
+    --distributed-timeout-minutes 360 \
+    --disable-gloo-process-groups \
+    --ddp-num-buckets 7"
+
+# Expert parallelism for MoE
+if [[ "$IS_MOE" == "true" && "$EP_SIZE" -gt 1 ]]; then
+    PARALLEL_ARGS="${PARALLEL_ARGS} --expert-model-parallel-size ${EP_SIZE}"
+    log_info "MoE Expert Parallelism: EP=${EP_SIZE}"
+fi
+
+# Sequence parallel (add if not in model config)
+if ! echo "$STUDENT_MODEL_ARGS" | grep -q "sequence-parallel"; then
+    PARALLEL_ARGS="${PARALLEL_ARGS} --sequence-parallel"
+fi
echo "$STUDENT_MODEL_ARGS" | grep -q "sequence-parallel"; then + PARALLEL_ARGS="${PARALLEL_ARGS} --sequence-parallel" +fi + +# MoE performance optimizations +MOE_PERF_ARGS="" +if [[ "$IS_MOE" == "true" && "$ENABLE_MOE_PERF" == "1" ]]; then + log_info "MoE Performance Optimizations: ENABLED" + MOE_PERF_ARGS=" \ + --moe-token-dispatcher-type alltoall \ + --moe-shared-expert-overlap \ + --moe-permute-fusion \ + --moe-grouped-gemm \ + --cross-entropy-loss-fusion \ + --cross-entropy-fusion-impl native" + + if [[ "$ENABLE_MOE_EXPERIMENTAL" == "1" ]]; then + MOE_PERF_ARGS="${MOE_PERF_ARGS} --enable-experimental" + log_warn "Experimental MoE features enabled" + fi +elif [[ "$IS_MOE" == "true" ]]; then + log_warn "MoE Performance Optimizations: DISABLED" +fi + +# Memory optimization +MEMORY_ARGS=" \ + --recompute-granularity full \ + --recompute-method uniform \ + --recompute-num-layers 1 \ + --no-gradient-accumulation-fusion" + +# Checkpoint saving +SAVE_ARGS=" \ + --save ${CHECKPOINT_DIR} \ + --save-interval ${SAVE_INTERVAL} \ + --save-retain-interval ${SAVE_RETAIN_INTERVAL} \ + --ckpt-format torch_dist \ + --ckpt-fully-parallel-save \ + --ckpt-assume-constant-structure \ + ${CKPT_PARALLEL_LOAD_ARG}" + +# Logging +LOGGING_ARGS=" \ + --log-interval ${LOG_INTERVAL} \ + --eval-iters ${EVAL_ITERS} \ + --eval-interval ${EVAL_INTERVAL} \ + --log-progress \ + --timing-log-option minmax \ + ${LOG_PARAMS_NORM_ARG:-} \ + --log-num-zeros-in-grad \ + --log-throughput \ + --log-straggler \ + --disable-straggler-on-startup \ + --straggler-minmax-count 16 \ + --tensorboard-dir ${TENSORBOARD_DIR}" + +# Runtime +RUNTIME_ARGS=" \ + --exit-duration-in-mins 1200 \ + --num-workers 8 \ + --no-check-for-nan-in-loss-and-grad" + +# Combine all arguments +ALL_ARGS=" \ + ${CHECKPOINT_ARGS} \ + ${STUDENT_MODEL_ARGS} \ + ${TOKENIZER_ARGS} \ + ${DATA_ARGS} \ + ${TRAINING_ARGS} \ + ${OPTIMIZER_ARGS} \ + ${PARALLEL_ARGS} \ + ${MOE_PERF_ARGS} \ + ${MEMORY_ARGS} \ + ${SAVE_ARGS} \ + ${LOGGING_ARGS} \ + ${RUNTIME_ARGS}" + +# Optional: iterations to skip +[[ -n "$ITERATIONS_TO_SKIP" ]] && ALL_ARGS="${ALL_ARGS} --iterations-to-skip ${ITERATIONS_TO_SKIP}" + +# === Launch Training === +export PYTHONPATH="${MODELOPT_DIR}:${MLM_DIR}:${PYTHONPATH:-}" +LOG_FILE="${LOGS_DIR}/${STUDENT_CKPT_NAME}_qad_${DATETIME}.log" + +log_info "Starting training..." +log_info "Log file: ${LOG_FILE}" +log_info "Distributed: ${NNODES} nodes x ${NUM_GPUS} GPUs = $((NNODES * NUM_GPUS)) total" + +torchrun \ + --nproc_per_node="${NUM_GPUS}" \ + --nnodes="${NNODES}" \ + --node_rank="${NODE_RANK}" \ + --master_addr="${MASTER_ADDR}" \ + --master_port="${MASTER_PORT}" \ + "${MLM_DIR}/pretrain_gpt.py" ${ALL_ARGS} 2>&1 | tee "${LOG_FILE}" + +log_info "Training completed. Logs: ${LOG_FILE}" diff --git a/examples/llm_qad/sbatch_qad.sh b/examples/llm_qad/sbatch_qad.sh new file mode 100755 index 000000000..613b9bc27 --- /dev/null +++ b/examples/llm_qad/sbatch_qad.sh @@ -0,0 +1,168 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# QAD SLURM Batch Submission Script
+# Usage: sbatch sbatch_qad.sh --config configs/your-config.conf
+# Override: sbatch --nodes=4 --account=<account> sbatch_qad.sh --config ...
+
+#SBATCH -p batch
+#SBATCH --account=<account>
+#SBATCH --nodes=4
+#SBATCH -t 4:00:00
+#SBATCH --exclusive
+#SBATCH --mem=0
+#SBATCH --gres=gpu:4
+#SBATCH --ntasks-per-node=1
+#SBATCH --job-name=qad-training
+
+set -x -e
+
+# === Parse Arguments ===
+SCRIPT_DIR="${SLURM_SUBMIT_DIR:-$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)}"
+CONFIG_FILE=""
+HF_TOKEN_ARG=""
+
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --config|-c) CONFIG_FILE="$2"; shift 2;;
+        --hf-token) HF_TOKEN_ARG="$2"; shift 2;;
+        *) break;;
+    esac
+done
+
+[[ -n "$HF_TOKEN_ARG" ]] && export HF_TOKEN="$HF_TOKEN_ARG"
+
+# === Load Config ===
+if [[ -n "$CONFIG_FILE" ]]; then
+    [[ "$CONFIG_FILE" = /* ]] || CONFIG_FILE="${SCRIPT_DIR}/${CONFIG_FILE}"
+    if [[ -f "$CONFIG_FILE" ]]; then
+        echo "Loading config: ${CONFIG_FILE}"
+        source "$CONFIG_FILE"
+    else
+        echo "ERROR: Config not found: ${CONFIG_FILE}"
+        ls -1 "${SCRIPT_DIR}/configs/"*.conf 2>/dev/null || echo "(no configs found)"
+        exit 1
+    fi
+fi
+
+# === Default Paths (override in config) ===
+MLM_DIR="${MLM_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/Megatron-LM}"
+MODELOPT_DIR="${MODELOPT_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer}"
+MODELS_ROOT="${MODELS_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/models}"
+QAD_CHECKPOINT_ROOT="${QAD_CHECKPOINT_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/checkpoints}"
+DATACACHE_DIR="${DATACACHE_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/data_cache}"
+LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}"
+
+# Container settings
+CONTAINER_IMAGE="${CONTAINER_IMAGE:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/containers/pytorch_25.06-py3.sqsh}"
+CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-/lustre/fs1:/lustre/fs1}"
+CONTAINER_WORKDIR="${CONTAINER_WORKDIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer/examples/llm_qad}"
+
+# Parallelism (required from config)
+TP_SIZE="${TP_SIZE:?ERROR: TP_SIZE must be set in config}"
+MBS="${MBS:?ERROR: MBS must be set in config}"
+PP_SIZE="${PP_SIZE:-1}"
+EP_SIZE="${EP_SIZE:-1}"
+NUM_GPUS="${NUM_GPUS:-8}"
+MASTER_PORT="${MASTER_PORT:-29500}"
+
+# Multi-node from SLURM
+NNODES="${SLURM_NNODES:-4}"
+MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
+
+mkdir -p "${LOG_DIR}"
+DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
+
+# === Display Configuration ===
+echo "========================================"
+echo "QAD Training Configuration"
+echo "========================================"
+[[ -n "$CONFIG_FILE" ]] && echo "Config: ${CONFIG_FILE}"
+echo "Model: ${STUDENT_MODEL:-unknown} -> Teacher: ${TEACHER_MODEL:-unknown}"
+echo "LR: ${LR:-?} | Dataset: ${DATASET_NAME:-?}"
+echo "Parallelism: TP=${TP_SIZE} PP=${PP_SIZE} EP=${EP_SIZE} MBS=${MBS}"
+echo "Nodes: ${NNODES} x ${NUM_GPUS} GPUs = $((NNODES * NUM_GPUS)) total"
+echo "Master: ${MASTER_ADDR}:${MASTER_PORT}"
+echo ""
+echo "Paths:"
+echo "  MLM_DIR: ${MLM_DIR}"
+echo "  MODELOPT_DIR: ${MODELOPT_DIR}"
+echo "  Checkpoints: ${QAD_CHECKPOINT_ROOT}"
+echo ""
+echo "Container: ${CONTAINER_IMAGE}"
+echo ""
+echo "Checkpoints:"
+echo "  Student: ${STUDENT_CKPT:-NOT SET}"
+echo "  Teacher: ${TEACHER_CKPT:-NOT SET}"
+[[ -n "${BLEND_PATH:-}" ]] && echo "  Blend: ${BLEND_PATH}"
+echo "========================================"
+
+# Validate required
+[[ -z "${STUDENT_CKPT:-}" ]] && echo "ERROR: STUDENT_CKPT required" && exit 1
+[[ -z "${TEACHER_CKPT:-}" ]] && echo "ERROR: TEACHER_CKPT required" && exit 1
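+
+# Note: the resolved configuration is passed to each rank explicitly via the
+# EXPORTS string below (NODE_RANK is derived per rank from SLURM_PROCID), so
+# qad.sh sees a consistent, fully resolved environment inside the container on
+# every node.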
+
+# === Build Container Exports ===
+# Use local /tmp for Triton cache to avoid race conditions
+EXPORTS="export TRITON_CACHE_DIR=/tmp/triton_cache_\${SLURM_JOB_ID}_\${SLURM_PROCID}"
+EXPORTS="${EXPORTS} && export NODE_RANK=\${SLURM_PROCID}"
+EXPORTS="${EXPORTS} && export NNODES=${NNODES} NUM_GPUS=${NUM_GPUS}"
+EXPORTS="${EXPORTS} && export TP_SIZE=${TP_SIZE} PP_SIZE=${PP_SIZE} EP_SIZE=${EP_SIZE} MBS=${MBS}"
+EXPORTS="${EXPORTS} && export IS_MOE=${IS_MOE:-false}"
+EXPORTS="${EXPORTS} && export MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT}"
+EXPORTS="${EXPORTS} && export MLM_DIR=${MLM_DIR} MODELOPT_DIR=${MODELOPT_DIR}"
+EXPORTS="${EXPORTS} && export QAD_CHECKPOINT_ROOT=${QAD_CHECKPOINT_ROOT} DATACACHE_DIR=${DATACACHE_DIR}"
+EXPORTS="${EXPORTS} && export STUDENT_CKPT=${STUDENT_CKPT} TEACHER_CKPT=${TEACHER_CKPT}"
+
+# Training hyperparameters
+for v in LR GBS MIN_LR LR_DECAY_STYLE SAVE_INTERVAL LOG_INTERVAL STUDENT_MODEL TEACHER_MODEL DATASET_NAME; do
+    [[ -n "${!v:-}" ]] && EXPORTS="${EXPORTS} && export ${v}=${!v}"
+done
+
+# Model config
+[[ -n "${STUDENT_CONFIG_FILE:-}" ]] && EXPORTS="${EXPORTS} && export STUDENT_CONFIG_FILE=${STUDENT_CONFIG_FILE}"
+[[ -n "${TOKENIZER_MODEL:-}" ]] && EXPORTS="${EXPORTS} && export TOKENIZER_MODEL=${TOKENIZER_MODEL}"
+[[ -n "${TEACHER_MODEL_CONFIG:-}" ]] && EXPORTS="${EXPORTS} && export TEACHER_MODEL_CONFIG=${TEACHER_MODEL_CONFIG}"
+
+# Dataset
+[[ -n "${BLEND_PATH:-}" ]] && EXPORTS="${EXPORTS} && export BLEND_PATH=${BLEND_PATH}"
+[[ -n "${TRAIN_SAMPLES:-}" ]] && EXPORTS="${EXPORTS} && export TRAIN_SAMPLES=${TRAIN_SAMPLES}"
+
+# Optional
+[[ -n "${HF_TOKEN:-}" ]] && EXPORTS="${EXPORTS} && export HF_TOKEN=${HF_TOKEN} HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}"
+[[ -n "${ITERATIONS_TO_SKIP:-}" ]] && EXPORTS="${EXPORTS} && export ITERATIONS_TO_SKIP=${ITERATIONS_TO_SKIP}"
+[[ -n "${DISTILL_CONFIG_PATH:-}" ]] && EXPORTS="${EXPORTS} && export DISTILL_CONFIG_PATH=${DISTILL_CONFIG_PATH}"
+
+# === Launch ===
+CONFIG_ARGS=""
+[[ -n "${CONFIG_FILE}" ]] && CONFIG_ARGS="--config ${CONFIG_FILE}"
+[[ -n "${HF_TOKEN:-}" ]] && CONFIG_ARGS="${CONFIG_ARGS} --hf-token ${HF_TOKEN}"
+
+run_cmd="pip install transformers==4.54 && ${EXPORTS} && cd ${CONTAINER_WORKDIR} && bash qad.sh ${CONFIG_ARGS}"
+
+echo "Running: ${run_cmd}"
+
+srun -l \
+    --output=${LOG_DIR}/%x_%j_${DATETIME}.log \
+    --error=${LOG_DIR}/err_%x_%j_${DATETIME}.log \
+    --container-image ${CONTAINER_IMAGE} \
+    --container-mounts ${CONTAINER_MOUNTS} \
+    --container-workdir ${CONTAINER_WORKDIR} \
+    sh -c "${run_cmd}"
+
+echo "========================================"
+echo "QAD Training completed at $(date)"
+echo "Logs: ${LOG_DIR}/"
+echo "========================================"