4 changes: 3 additions & 1 deletion .gitignore
@@ -21,4 +21,6 @@ benchmark-*.json
datasets/
abnormal.txt
gpu-rank-map.txt
.specstory
.specstory
!datasets/
!datasets/**
4 changes: 2 additions & 2 deletions CODEOWNERS
@@ -47,6 +47,6 @@ megatron/core/inference/
.gitlab/
.github/
.gitlab-ci.yml
Dockerfile.ci.lts
Dockerfile.ci.dev
docker/Dockerfile.ci.lts
docker/Dockerfile.ci.dev
tests/
71 changes: 40 additions & 31 deletions README.md
@@ -1,9 +1,18 @@
<div align="center">
<img src="images/megatronapp.png" alt="MegatronApp logo" height="96">
</div>
<h1 align="center">MegatronApp: Toolchain Built around Megatron-LM for Distributed Training</h1>

<p align="center">
An extension for performance tuning, slow-node detection, and training-process visualization.
</p>

**MegatronApp: Toolchain built around Megatron-LM for Distributed Training**
<!-- > Sample: AI Health Assistant | Powered by Your Data -->
<p align="center">
<a href="https://github.com/OpenSQZ/MegatronApp/blob/main/docker/DockerUsage.md">🍳 Cookbook</a> |
<a href="https://arxiv.org/pdf/2507.19845">📄 Technical Report</a>
</p>

An extension for performance tuning, slow-node detection, and training-process visualization.
</div>

<!-- **📢 Announcements**

@@ -18,7 +27,7 @@ An extension for performance tuning, slow-node detection, and training-process v
# News <!-- omit in toc -->

### 📌 Pinned
* [2025.10.17] 🔥🔥🔥 We provide user-friendly [docker guidance](./DockerUsage.md) for all four features of MegatronApp. Please try it out!
* [2025.10.17] 🔥🔥🔥 We provide user-friendly [docker guidance](./docker/DockerUsage.md) for all four features of MegatronApp. Please try it out!
* [2025.07.27] 📢📢📢 The MegatronApp technical report has been released! See [here](https://arxiv.org/pdf/2507.19845).
* [2025.07.04] 🔥🔥🔥 MegatronApp is officially launched at WAIC 2025! Our code is available [here](https://github.com/OpenSQZ/MegatronApp). Come and try it out!

@@ -167,7 +176,7 @@ docker run --runtime --nvidia --gpus all -it --rm \
To install additional required packages, run

```bash
pip install -r requirements.txt
pip install -r requirements/requirements.txt
```

## MegaScan
@@ -197,7 +206,7 @@ Alternatively, you can use elastic training. See [torchrun](https://docs.pytorch
2. After training, you will find separate trace files in the current directory, named `benchmark-data-{}-pipeline-{}-tensor-{}.json`, where each `{}` is the corresponding rank number. Next, aggregate them into a single trace file:

```bash
python scripts/aggregate.py --b trace_output --output benchmark.json
python tools/aggregate.py --b trace_output --output benchmark.json
```

3. You can visualize the trace file using Chrome Tracing (or Perfetto UI). Open the trace file in Chrome Tracing by navigating to `chrome://tracing` in your browser (or https://ui.perfetto.dev/). Now you can explore the trace data, zoom in on specific events, and analyze the performance characteristics of your distributed training run.
@@ -218,7 +227,7 @@ python scripts/aggregate.py --b trace_output --output benchmark.json
2. Run the training script. Then aggregate the trace files as described above, but with an additional command line argument to enable the detection algorithm:

```bash
python scripts/aggregate.py \
python tools/aggregate.py \
  -b . \
  -d  # -b is equivalent to --bench-dir; -d enables the detection algorithm (equivalent to --detect)
```
@@ -245,12 +254,12 @@ bash a_pretrain_script.sh $RANK
```
For example
```bash
bash pretrain_gpt.sh 0
bash scripts/pretrain_gpt.sh 0
```

**Frontend (Vue)**: Navigate to the frontend directory and start the development server.
```bash
cd transformer-visualize
cd tools/visualization/transformer-visualize
npm run dev
```
After launching both, open your browser to the specified address (usually http://localhost:5173). You will see the main interface.
@@ -307,13 +316,13 @@ UseIB: true
- The Python environment in the image automatically includes almost all of the required packages. To install additional required packages, run

```bash
pip install -r requirements.txt
pip install -r requirements/requirements.txt
```

- Install InfiniBand prerequisites

```bash
bash prerequisite.sh
bash scripts/prerequisite.sh
```

- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules.
@@ -341,17 +350,17 @@ First, prepare your dataset in the following `.json` format with one sample per
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```
note that we have provided a sample dataset under `datasets_gpt/` and `datasets_bert/`.
Note that we have provided a sample dataset under `datasets/gpt/` and `datasets/bert/`.
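If you want to experiment with a tiny dataset of your own, the sketch below writes a file in the same one-sample-per-line layout; the file name and all field values are illustrative placeholders, not part of the provided samples.

```bash
# Create a minimal JSON-lines dataset with the fields shown above.
cat > datasets/gpt/my_dataset.json <<'EOF'
{"src": "example", "text": "A short sample document used only for illustration.", "type": "Eng", "id": "1", "title": "Sample one"}
{"src": "example", "text": "A second sample document, also used only for illustration.", "type": "Eng", "id": "2", "title": "Sample two"}
EOF
```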

Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only). We have provided these files in the respective directories.

For BERT, run the following
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_bert/dataset.json \
--input ../datasets/bert/dataset.json \
--output-prefix bert \
--vocab-file ../datasets_bert/vocab.txt \
--vocab-file ../datasets/bert/vocab.txt \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences \
--workers $(nproc)
@@ -362,11 +371,11 @@ For GPT, run the following
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_gpt/dataset.json \
--input ../datasets/gpt/dataset.json \
--output-prefix gpt \
--vocab-file ../datasets_gpt/vocab.json \
--vocab-file ../datasets/gpt/vocab.json \
--tokenizer-type GPT2BPETokenizer \
--merge-file ../datasets_gpt/merges.txt \
--merge-file ../datasets/gpt/merges.txt \
--append-eod \
--workers $(nproc)
```
@@ -377,18 +386,18 @@ For other models, please refer to `nvidia/megatron` for the corresponding datase
To run distributed training on a single node, go to the project root directory and run

```bash
bash run_single_gpt.sh
bash scripts/run_single_gpt.sh
```

for GPT and

```bash
bash run_single_bert.sh
bash scripts/run_single_bert.sh
```

for BERT.

The `run_single_<model>.sh` files have the following structure:
The `scripts/run_single_<model>.sh` files have the following structure:

- Parameters include `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- The `virtual_stage_layer` parameter specifies how many layers there are in a single virtual pipeline stage. It is calculated as
@@ -405,46 +414,46 @@ There are also several critical parameters in `examples/gpt3/train_gpt3_175b_dis
- `--workload` specifies the workload of each single thread, and hence determines the number of threads used in P2P communication
- `--num-gpus` specifies the number of GPUs on the current node (single node training)
- Other critical parameters include the number of layers of the model, the global batch size and the sequence length
- Note that currently the global batch size value is 16 and is static in `run_single_<model>.sh`. It needs to simultaneously modify `run_single_<model>.sh` if adjusting the layers.
- Note that the global batch size is currently fixed at 16 in `scripts/run_single_<model>.sh`; if you adjust the number of layers, you also need to modify `scripts/run_single_<model>.sh` accordingly (see the sketch below).
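The sketch below shows how these settings might be laid out in a training script. Only `pipeline_parallel`, `model_chunks`, `tensor_parallel`, `--workload`, `--num-gpus`, and the batch size of 16 come from the description above; every other flag and value is an assumed placeholder rather than a verbatim copy of the repository scripts.

```bash
# Hypothetical excerpt in the spirit of scripts/run_single_gpt.sh /
# examples/gpt3/train_gpt3_175b_distributed.sh.
pipeline_parallel=2   # pipeline-parallel size
model_chunks=2        # virtual pipeline chunks per rank
tensor_parallel=2     # tensor-parallel size

TRAINING_ARGS="
    --global-batch-size 16
    --seq-length 2048
    --num-layers 24
    --workload 4
    --num-gpus 8
"
```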

For the remaining models, you can either directly run
```bash
bash examples/<model>/<train_file>.sh
```
or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up configurations and runs the shell under `examples/`
or write a file similar to `scripts/run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the corresponding script under `examples/`.

#### Multinode Distributed Training
To run distributed training on multiple nodes, go to the root directory. First run

```bash
bash run_master_<model>.sh
bash scripts/run_master_<model>.sh
```

and then start another pod and run

```bash
bash run_worker_<model>.sh
bash scripts/run_worker_<model>.sh
```

The `run_master_<model>.sh` has the following parameters
The `scripts/run_master_<model>.sh` has the following parameters

- Similar to `run_single_<model>.sh`, we have `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- Similar to `scripts/run_single_<model>.sh`, we have `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- Writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and to `train_gpt3_175b_distributed_worker.sh` (for BERT, the scripts in the corresponding directory)
- Sets the number of nodes to 2, with the master node at rank 0
- Starts the script under `examples`

and `run_worker_<model>.sh` does the following
and `scripts/run_worker_<model>.sh` does the following
- Sets the number of nodes to 2, with the worker node at rank 1
- Starts the script under `examples`

The `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` scripts are similar to the single-node version, except that `--node-ips` is mandatory: it lists the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag must also be turned on.
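As a rough illustration of the two multi-node requirements just described, the fragment below adds `--node-ips` and `--multi-node` to the argument list; the addresses and the comma-separated format are assumptions for illustration, so check the actual scripts for the exact syntax.

```bash
# Hypothetical fragment for the multi-node training scripts; replace the
# addresses with the InfiniBand IPs of your pods, ordered by their GPU ranks.
MULTI_NODE_ARGS="
    --multi-node
    --node-ips 192.168.10.11,192.168.10.12
"
```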

### Profiling

Each run will generate a trace dir in `benchmark`. Go to the `profiling` directory and run
Each run will generate a trace directory under `benchmark`. From the project root directory, run

```
python aggregate.py --benchmark_dir benchmark/your-benchmark-dir
python tools/aggregate.py --benchmark_dir benchmark/your-benchmark-dir
```

to produce an aggregated trace file.
@@ -464,10 +473,10 @@ Just follow above installation instructions.
$\quad$ To run distributed training on a single node, go to the project root directory and run

```bash
bash pretrain_gpt.sh $RANK
bash scripts/pretrain_gpt.sh $RANK
```

Here `pretrain_gpt.sh` is an example pretraining `Bash` script.
Here `scripts/pretrain_gpt.sh` is an example pretraining `Bash` script.

There are two extra options: `--forward-backward-disaggregating` and `--ignore-forward-tensor-parallel` in `TRAINING_ARGS`.
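A minimal sketch of turning both options on, assuming `TRAINING_ARGS` is the usual argument string assembled by the example scripts:

```bash
# Append the two flags described above to the existing training arguments.
TRAINING_ARGS="$TRAINING_ARGS --forward-backward-disaggregating --ignore-forward-tensor-parallel"
```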
