
Conversation

Contributor

@zzhfz zzhfz commented Nov 21, 2025

Description

Add a modular architecture refactoring script for the training framework.

Evidence

(min, max) time across ranks (ms):
    evaluate .......................................: (2344.29, 2344.29)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
 validation loss at iteration 10 on test set | lm loss value: 1.137142E+01 | lm loss PPL: 8.680520E+04 |
----------------------------------------------------------------------------------------------------------
[rank0]:[W1201 17:19:41.845778097 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Result JSON written to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json
Log written to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train.log
CSV files: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv
Peak GPU memory: 3461 MiB (3.3799 GiB)

Training completed. Results saved to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json

{
  "config": {
    "command": "torchrun --nproc_per_node=1 --master_port=22455 pretrain_gpt.py --tensor-model-parallel-size=1 --pipeline-model-parallel-size=1 --micro-batch-size=2 --global-batch-size=8 --seq-length=128 --lr=0.00015 --train-iters=10 --num-layers=2 --hidden-size=512 --num-attention-heads=8 --max-position-embeddings=128 --vocab-size=128256 --mock-data --tokenizer-type NullTokenizer --vocab-size 128256 --transformer-impl local --bf16 --no-gradient-accumulation-fusion --no-persist-layer-norm --log-interval 1 --log-throughput",
    "model": "gpt",
    "model_config": "./train/train_gpt_config.json",
    "train_dataset": "mock",
    "validation_dataset": null,
    "test_dataset": null,
    "train_args": {
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "warmup_iterations": 2,
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": {
          "size": 1,
          "type": "default"
        },
        "sp": 0
      }
    },
    "timeout_ms": 10000,
    "warmup_iterations": 2,
    "measured_iterations": 10
  },
  "metrics": [
    {
      "name": "train.throughput",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv",
      "unit": "tokens/s/gpu"
    },
    {
      "name": "train.peak_memory_usage",
      "type": "scalar",
      "value": 3.379883,
      "unit": "GB"
    },
    {
      "name": "train.loss",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv",
      "unit": ""
    },
    {
      "name": "train.ppl",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv",
      "unit": null
    }
  ]
}

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv
iteration,loss
3,11.75254
4,11.68381
5,11.58956
6,11.52593
7,11.47907
8,11.49399
9,11.62943
10,11.31772

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv
iteration,ppl
3,127075.922273655
4,118635.37588116364
5,107964.74309662386
6,101308.94617402623
7,96671.12181752129
8,98124.2684725397
9,112356.26058455171
10,82266.56117246421

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv
iteration,throughput
3,2010.9976433621368
4,1772.8531855955678
5,1693.1216931216932
6,1710.0868403473614
7,1711.2299465240644
8,1684.2105263157896
9,1692.0026437541308
10,1686.4295125164688

@zzhfz zzhfz requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:21
@baominghelly baominghelly requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:22
@zzhfz zzhfz requested review from baominghelly and removed request for Chamberlain0w0 November 21, 2025 06:24
@baominghelly baominghelly requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:30

dp = parallel.get("dp", 1)
tp = parallel.get("tp", 1)
pp = parallel.get("pp", {}).get("value", 1)
Collaborator

What should we do if the pp type is not G-pipe?
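
One possible way to handle this, as a rough sketch (assuming the pp entry follows the example JSON above with "size" and "type" keys, and that only the default/G-pipe schedule is currently wired up):

pp_cfg = parallel.get("pp", {}) or {}
pp = pp_cfg.get("size", 1)
pp_type = pp_cfg.get("type", "default")
if pp_type not in ("default", "gpipe"):
    # Fail fast rather than silently running an unsupported schedule.
    raise ValueError(f"Unsupported pipeline schedule type: {pp_type!r}")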

]

megatron_args = [
    "pretrain_gpt.py",
Collaborator

Is this script only for GPT-type models?
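
If other model families should be supported, one sketch is to select the entry script from the config's "model" field; the mapping below is hypothetical and would need to match the scripts actually present in the repo:

ENTRY_SCRIPTS = {
    "gpt": "pretrain_gpt.py",
    "bert": "pretrain_bert.py",
    "t5": "pretrain_t5.py",
}
model_name = self.config.model.lower()
if model_name not in ENTRY_SCRIPTS:
    raise ValueError(f"Unsupported model type: {model_name!r}")
megatron_args = [ENTRY_SCRIPTS[model_name]]  # remaining args appended below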

torchrun_cmd = [
    "torchrun",
    f"--nproc_per_node={nproc_per_node}",
    "--master_port=29501"
Collaborator

It would be best not to hard-code this port.
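
One common alternative, sketched here: honor MASTER_PORT if it is set, otherwise let the OS pick a free port instead of the hard-coded 29501.

import os
import socket

def pick_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

master_port = int(os.environ.get("MASTER_PORT", 0)) or pick_free_port()
torchrun_cmd = [
    "torchrun",
    f"--nproc_per_node={nproc_per_node}",
    f"--master_port={master_port}",
]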

f"--num-attention-heads={num_attention_heads}",
f"--max-position-embeddings={max_position_embeddings}",
f"--vocab-size={vocab_size}",
"--mock-data",
Collaborator

If our config includes a dataset, passing mock data here won't work, will it?
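
A sketch of branching on the configured dataset instead of always passing --mock-data; the tokenizer_model field and the use of HuggingFaceTokenizer are assumptions and would need to match the actual config schema:

if self.config.train_dataset in (None, "mock"):
    megatron_args += ["--mock-data", "--tokenizer-type", "NullTokenizer",
                      "--vocab-size", str(self.config.vocab_size)]
else:
    # Real dataset: point Megatron at the preprocessed data prefix.
    megatron_args += ["--data-path", str(self.config.train_dataset),
                      "--tokenizer-type", "HuggingFaceTokenizer",
                      "--tokenizer-model", str(self.config.tokenizer_model)]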


# regex patterns
loss_pattern = re.compile(r"lm loss:\s*([+\-]?\d+(?:\.\d+)?(?:[Ee][+\-]?\d+)?)", re.IGNORECASE)
#ppl_pattern_alt = re.compile(r"lm loss PPL:\s*([+\-]?\d+(?:\.\d+)?(?:[Ee][+\-]?\d+)?)", re.IGNORECASE)
Collaborator

Is this still needed? If not, it can be removed.

if me:
    try:
        elapsed_ms = float(me.group(1))
        tokens_per_iter = mbs * seq
Collaborator

Shouldn't this also be divided by the number of GPUs?
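
Since the reported unit is tokens/s/gpu, one consistent way to compute it is sketched below, assuming elapsed_ms is the time for one full training iteration and world_size (the total number of ranks, e.g. from the WORLD_SIZE environment variable or dp * tp * pp) is available; note the global token count per iteration is gbs * seq rather than mbs * seq:

if me:
    try:
        elapsed_ms = float(me.group(1))
        # Global tokens processed per iteration, normalized per GPU.
        tokens_per_iter = gbs * seq
        tokens_per_sec_per_gpu = tokens_per_iter / (elapsed_ms / 1000.0) / world_size
    except ValueError:
        pass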

flog.write(line)

# try match loss
m = loss_pattern.search(line)
Collaborator

Does this skip the warmup training iterations?
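
A sketch of skipping warmup iterations when collecting loss points; iter_pattern is a hypothetical regex that would need to match the iteration counter in the actual log line, and loss_rows stands in for whatever accumulator the script uses:

iter_pattern = re.compile(r"iteration\s+(\d+)\s*/\s*\d+")  # hypothetical; adjust to the real log format

mi = iter_pattern.search(line)
m = loss_pattern.search(line)
if mi and m:
    iteration = int(mi.group(1))
    if iteration > warmup_iterations:  # warmup_iterations comes from the config
        loss_rows.append((iteration, float(m.group(1))))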

@zzhfz zzhfz requested a review from baominghelly November 24, 2025 07:26
@zzhfz zzhfz requested review from Chamberlain0w0 and baominghelly and removed request for baominghelly December 1, 2025 09:32
@zzhfz zzhfz changed the title from "Refine Megatron script: fix run_id" to "feat: Refactor training framework with modular architecture" Dec 1, 2025
}

# Start training process
print("Launching:", " ".join(cmd))
Collaborator

We should not use print for logging; use the logging module instead.
Please apply this to all code in the PR.
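
For example, a minimal switch to the standard logging module (the logger name and format are just placeholders):

import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger(__name__)

# Start training process
logger.info("Launching: %s", " ".join(cmd))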


# Add data configuration
if self.config.train_dataset is None or (isinstance(self.config.train_dataset, str) and self.config.train_dataset.lower() == "mock"):
    megatron_args += ["--mock-data", "--tokenizer-type", "NullTokenizer", "--vocab-size", str(self.config.vocab_size)]
Collaborator

Do we need to use different tokenizers for different models here?

f"--micro-batch-size={self.config.mbs}",
f"--global-batch-size={self.config.gbs}",
f"--seq-length={self.config.seq_len}",
f"--lr={self.config.lr}",
Collaborator

Please also consider LR decay here.

We could add these configs to our JSON file:

f"--lr_scheduler_type=cosine",        # recommended: cosine annealing or linear
f"--warmup_ratio=0.03",               # recommended: warm up over the first 3% of steps
# or use a step count instead
# f"--warmup_steps=100",

# Add common parameters
megatron_args += [
    "--transformer-impl", "local",
    "--bf16",
Collaborator

We should get the precision from our JSON file.
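
A sketch of deriving the precision flag from the config instead of hard-coding --bf16 (the precision field name is an assumption):

precision = getattr(self.config, "precision", "bf16")  # hypothetical config field: "bf16" | "fp16" | "fp32"
if precision == "bf16":
    megatron_args.append("--bf16")
elif precision == "fp16":
    megatron_args.append("--fp16")
# fp32: pass neither flag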

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to config.json")
    parser.add_argument("--framework", default="megatron", choices=["megatron", "infinitrain"],
Collaborator

Could we get this config from the JSON file?
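
A sketch of reading the framework from config.json and keeping the CLI flag only as an override (the "framework" field name is an assumption; assumes json is imported):

parser.add_argument("--framework", default=None, choices=["megatron", "infinitrain"],
                    help="training framework; overrides the 'framework' field in config.json")
args = parser.parse_args()

with open(args.config) as f:
    cfg = json.load(f)
framework = args.framework or cfg.get("framework", "megatron")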

parser.add_argument("--config", required=True, help="path to config.json")
parser.add_argument("--framework", default="megatron", choices=["megatron", "infinitrain"],
help="training framework to use")
parser.add_argument("--gpu-platform", default="nvidia", choices=["nvidia", "other"],
Collaborator

Please add a device argument to the config file.
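
Likewise, a sketch of a device field in the config with the CLI option as an override (the "device" field name is an assumption):

parser.add_argument("--gpu-platform", default=None, choices=["nvidia", "other"],
                    help="GPU platform; overrides the 'device' field in config.json")
# after parsing args and loading cfg (see the framework sketch above):
gpu_platform = args.gpu_platform or cfg.get("device", "nvidia")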

Collaborator

@baominghelly baominghelly left a comment

Comments added in the PR.

@zzhfz zzhfz requested review from baominghelly and removed request for Chamberlain0w0 December 3, 2025 10:10
Contributor Author

@zzhfz zzhfz left a comment

@baominghelly The training script changes are complete; please review the latest version.

Main updates

  • Standardized output format: run_id/testcase conform to the spec
  • Full configuration support
  • Resolved 6 review comments
  • Successfully passed the Megatron-LM test

