4 changes: 3 additions & 1 deletion .gitignore
@@ -21,4 +21,6 @@ benchmark-*.json
datasets/
abnormal.txt
gpu-rank-map.txt
.specstory
.specstory
!datasets/
!datasets/**
4 changes: 2 additions & 2 deletions CODEOWNERS
@@ -47,6 +47,6 @@ megatron/core/inference/
.gitlab/
.github/
.gitlab-ci.yml
Dockerfile.ci.lts
Dockerfile.ci.dev
docker/Dockerfile.ci.lts
docker/Dockerfile.ci.dev
tests/
71 changes: 40 additions & 31 deletions README.md
@@ -1,9 +1,18 @@
<div align="center">
<img src="images/megatronapp.png" alt="MegatronApp logo" height="96">
</div>
<h1 align="center">MegatronApp: Toolchain Built around Megatron-LM for Distributed Training</h1>

<p align="center">
An extension for performance tuning, slow-node detection, and training-process visualization.
</p>

**MegatronApp: Toolchain built around Megatron-LM for Distributed Training**
<!-- > Sample: AI Health Assistant | Powered by Your Data -->
<p align="center">
<a href="https://github.com/OpenSQZ/MegatronApp/blob/main/docker/DockerUsage.md">🍳 Cookbook</a> |
<a href="https://arxiv.org/pdf/2507.19845">📄 Technical Report</a>
</p>

An extension for performance tuning, slow-node detection, and training-process visualization.
</div>

<!-- **📢 Announcements**

@@ -18,7 +27,7 @@ An extension for performance tuning, slow-node detection, and training-process v
# News <!-- omit in toc -->

### 📌 Pinned
* [2025.10.17] 🔥🔥🔥 We provide user-friendly [docker guidance](./DockerUsage.md) for all four features of MegatronApp. Please try it out!
* [2025.10.17] 🔥🔥🔥 We provide user-friendly [docker guidance](./docker/DockerUsage.md) for all four features of MegatronApp. Please try it out!
* [2025.07.27] 📢📢📢 The MegatronApp technical report has been released! See [here](https://arxiv.org/pdf/2507.19845).
* [2025.07.04] 🔥🔥🔥 MegatronApp is officially launched at WAIC 2025! Our code is available [here](https://github.com/OpenSQZ/MegatronApp). Come and try it out!

@@ -167,7 +176,7 @@ docker run --runtime --nvidia --gpus all -it --rm \
To install additional required packages, run

```bash
pip install -r requirements.txt
pip install -r requirements/requirements.txt
```

## MegaScan
@@ -197,7 +206,7 @@ Alternatively, you can use elastic training. See [torchrun](https://docs.pytorch
2. After training, you will find separate trace files in the current directory, named `benchmark-data-{}-pipeline-{}-tensor-{}.json`, where each `{}` is the corresponding rank number. Next, aggregate them into a single trace file:

```bash
python scripts/aggregate.py --b trace_output --output benchmark.json
python tools/aggregate.py --b trace_output --output benchmark.json
```

3. You can visualize the trace file using Chrome Tracing (or Perfetto UI). Open the trace file in Chrome Tracing by navigating to `chrome://tracing` in your browser (or https://ui.perfetto.dev/). Now you can explore the trace data, zoom in on specific events, and analyze the performance characteristics of your distributed training run.
@@ -218,7 +227,7 @@ python scripts/aggregate.py --b trace_output --output benchmark.json
2. Run the training script. Then aggregate the trace files as described above, but with an additional command line argument to enable the detection algorithm:

```bash
python scripts/aggregate.py \
python tools/aggregate.py \
  -b . \
  -d  # -b is equivalent to --bench-dir; -d enables the detection algorithm (equivalent to --detect)
```
@@ -245,12 +254,12 @@ bash a_pretrain_script.sh $RANK
```
For example
```bash
bash pretrain_gpt.sh 0
bash scripts/pretrain_gpt.sh 0
```

**Frontend (Vue)**: Navigate to the frontend directory and start the development server.
```bash
cd transformer-visualize
cd tools/visualization/transformer-visualize
npm run dev
```
After launching both, open your browser to the specified address (usually http://localhost:5173). You will see the main interface.
@@ -307,13 +316,13 @@ UseIB: true
- The Python environment in the image automatically includes almost all of the required packages. To install additional required packages, run

```bash
pip install -r requirements.txt
pip install -r requirements/requirements.txt
```

- Install InfiniBand prerequisites

```bash
bash prerequisite.sh
bash scripts/prerequisite.sh
```

- Build the `shm_tensor_new_rdma` (for multinode) and `shm_tensor_new_rdma_pre_alloc` modules.
@@ -341,17 +350,17 @@ First, prepare your dataset in the following `.json` format with one sample per
{"src": "bloomberg", "text": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini MILAN, Sept 26 (Reuters) - Var Energi AS, the Norwegian oil and gas group 69.6% owned by Italian major Eni, has agreed to buy the Norwegian upstream assets of ExxonMobil for $4.5 billion. The deal is expected to be completed in the final quarter of this year, Var Energi said on Thursday. Reporting by Stephen Jewkes; editing by Francesca Landini", "type": "Eng", "id": "1", "title": "Var Energi agrees to buy Exxonmobil's Norway assets for $4.5 bln. "}
{"src": "bloomberg", "text": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam WASHINGTON (Reuters) - U.S. President Donald Trump on Sunday appeared to play down the chances that he might be willing to meet with Iranian officials, saying reports that he would do so without conditions were not accurate. \u201cThe Fake News is saying that I am willing to meet with Iran, \u2018No Conditions.\u2019 That is an incorrect statement (as usual!),\u201d Trump said on Twitter. In fact, as recently as on Sept. 10, U.S. Secretary of State Mike Pompeo said \u201cHe (Trump) is prepared to meet with no preconditions.\u201d Reporting By Arshad Mohammed; Editing by Shri Navaratnam", "type": "Eng", "id": "2", "title": "Trump says 'incorrect' he is willing to meet Iran with 'no conditions'. "}
```
note that we have provided a sample dataset under `datasets_gpt/` and `datasets_bert/`.
Note that we have provided a sample dataset under `datasets/gpt/` and `datasets/bert/`.
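If you want to experiment with a tiny dataset of your own, the sketch below writes a file in the same one-sample-per-line layout; the file name and all field values are illustrative placeholders, not part of the provided samples.

```bash
# Create a minimal JSON-lines dataset with the fields shown above.
cat > datasets/gpt/my_dataset.json <<'EOF'
{"src": "example", "text": "A short sample document used only for illustration.", "type": "Eng", "id": "1", "title": "Sample one"}
{"src": "example", "text": "A second sample document, also used only for illustration.", "type": "Eng", "id": "2", "title": "Sample two"}
EOF
```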

Then, prepare the vocab file (GPT and BERT) and the merges file (GPT only). We have provided these files in the respective directories.

For BERT, run the following
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_bert/dataset.json \
--input ../datasets/bert/dataset.json \
--output-prefix bert \
--vocab-file ../datasets_bert/vocab.txt \
--vocab-file ../datasets/bert/vocab.txt \
--tokenizer-type BertWordPieceLowerCase \
--split-sentences \
--workers $(nproc)
@@ -362,11 +371,11 @@ For GPT, run the following
```bash
cd datasets
python ../tools/preprocess_data.py \
--input ../datasets_gpt/dataset.json \
--input ../datasets/gpt/dataset.json \
--output-prefix gpt \
--vocab-file ../datasets_gpt/vocab.json \
--vocab-file ../datasets/gpt/vocab.json \
--tokenizer-type GPT2BPETokenizer \
--merge-file ../datasets_gpt/merges.txt \
--merge-file ../datasets/gpt/merges.txt \
--append-eod \
--workers $(nproc)
```
@@ -377,18 +386,18 @@ For other models, please refer to `nvidia/megatron` for the corresponding datase
To run distributed training on a single node, go to the project root directory and run

```bash
bash run_single_gpt.sh
bash scripts/run_single_gpt.sh
```

for GPT and

```bash
bash run_single_bert.sh
bash scripts/run_single_bert.sh
```

for BERT.

The `run_single_<model>.sh` files have the following structure:
The `scripts/run_single_<model>.sh` files have the following structure:

- Parameters include `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- The `virtual_stage_layer` parameter specifies how many layers there are in a single virtual pipeline stage. It is calculated as
@@ -405,46 +414,46 @@ There are also several critical parameters in `examples/gpt3/train_gpt3_175b_dis
- `--workload` specifies the workload of each single thread, and hence determines the number of threads used in P2P communication
- `--num-gpus` specifies the number of GPUs on the current node (single node training)
- Other critical parameters include the number of layers of the model, the global batch size and the sequence length
- Note that currently the global batch size value is 16 and is static in `run_single_<model>.sh`. It needs to simultaneously modify `run_single_<model>.sh` if adjusting the layers.
- Note that the global batch size is currently fixed at 16 in `scripts/run_single_<model>.sh`; if you adjust the number of layers, you also need to modify `scripts/run_single_<model>.sh` accordingly (see the sketch below).
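The sketch below shows how these settings might be laid out in a training script. Only `pipeline_parallel`, `model_chunks`, `tensor_parallel`, `--workload`, `--num-gpus`, and the batch size of 16 come from the description above; every other flag and value is an assumed placeholder rather than a verbatim copy of the repository scripts.

```bash
# Hypothetical excerpt in the spirit of scripts/run_single_gpt.sh /
# examples/gpt3/train_gpt3_175b_distributed.sh.
pipeline_parallel=2   # pipeline-parallel size
model_chunks=2        # virtual pipeline chunks per rank
tensor_parallel=2     # tensor-parallel size

TRAINING_ARGS="
    --global-batch-size 16
    --seq-length 2048
    --num-layers 24
    --workload 4
    --num-gpus 8
"
```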

For the remaining models, you can either directly run
```bash
bash examples/<model>/<train_file>.sh
```
or write a file similar to `run_{single,master,worker}_<model>.sh` that sets up configurations and runs the shell under `examples/`
or write a file similar to `scripts/run_{single,master,worker}_<model>.sh` that sets up the configuration and runs the corresponding script under `examples/`.

#### Multinode Distributed Training
To run distributed training on multiple nodes, go to the root directory. First run

```bash
bash run_master_<model>.sh
bash scripts/run_master_<model>.sh
```

and then start another pod and run

```bash
bash run_worker_<model>.sh
bash scripts/run_worker_<model>.sh
```

The `run_master_<model>.sh` has the following parameters
The `scripts/run_master_<model>.sh` has the following parameters

- Similar to `run_single_<model>.sh`, we have `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- Similar to `scripts/run_single_<model>.sh`, we have `pipeline_parallel`, `model_chunks` and `tensor_parallel`
- Writes the master pod IP to `examples/gpt3/train_gpt3_175b_distributed_master.sh` and to `train_gpt3_175b_distributed_worker.sh` (for BERT, the scripts in the corresponding directory)
- Sets the number of nodes to 2, with the master node at rank 0
- Starts the script under `examples`

and `run_worker_<model>.sh` does the following
and `scripts/run_worker_<model>.sh` does the following
- Sets the number of nodes to 2, with the worker node at rank 1
- Starts the script under `examples`

The `examples/gpt3/train_gpt3_175b_distributed_master.sh` and `examples/gpt3/train_gpt3_175b_distributed_worker.sh` scripts are similar to the single-node version, except that `--node-ips` is mandatory: it lists the InfiniBand IPs of the pods in the order of their GPU ranks. The `--multi-node` flag must also be turned on.
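As a rough illustration of the two multi-node requirements just described, the fragment below adds `--node-ips` and `--multi-node` to the argument list; the addresses and the comma-separated format are assumptions for illustration, so check the actual scripts for the exact syntax.

```bash
# Hypothetical fragment for the multi-node training scripts; replace the
# addresses with the InfiniBand IPs of your pods, ordered by their GPU ranks.
MULTI_NODE_ARGS="
    --multi-node
    --node-ips 192.168.10.11,192.168.10.12
"
```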

### Profiling

Each run will generate a trace dir in `benchmark`. Go to the `profiling` directory and run
Each run will generate a trace directory under `benchmark`. From the project root directory, run

```
python aggregate.py --benchmark_dir benchmark/your-benchmark-dir
python tools/aggregate.py --benchmark_dir benchmark/your-benchmark-dir
```

to produce an aggregated trace file.
@@ -464,10 +473,10 @@ Just follow above installation instructions.
$\quad$ To run distributed training on a single node, go to the project root directory and run

```bash
bash pretrain_gpt.sh $RANK
bash scripts/pretrain_gpt.sh $RANK
```

Here `pretrain_gpt.sh` is an example pretraining `Bash` script.
Here `scripts/pretrain_gpt.sh` is an example pretraining `Bash` script.

There are two extra options: `--forward-backward-disaggregating` and `--ignore-forward-tensor-parallel` in `TRAINING_ARGS`.
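A minimal sketch of turning both options on, assuming `TRAINING_ARGS` is the usual argument string assembled by the example scripts:

```bash
# Append the two flags described above to the existing training arguments.
TRAINING_ARGS="$TRAINING_ARGS --forward-backward-disaggregating --ignore-forward-tensor-parallel"
```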
