
Conversation

Contributor

@zzhfz zzhfz commented Nov 21, 2025

Description

Add a modular architecture refactoring script for the training framework.

Evidence

(min, max) time across ranks (ms):
    evaluate .......................................: (2344.29, 2344.29)
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
----------------------------------------------------------------------------------------------------------
 validation loss at iteration 10 on test set | lm loss value: 1.137142E+01 | lm loss PPL: 8.680520E+04 |
----------------------------------------------------------------------------------------------------------
[rank0]:[W1201 17:19:41.845778097 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Result JSON written to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json
Log written to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train.log
CSV files: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv
Peak GPU memory: 3461 MiB (3.3799 GiB)

Training completed. Results saved to: ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_result.json

{
  "config": {
    "command": "torchrun --nproc_per_node=1 --master_port=22455 pretrain_gpt.py --tensor-model-parallel-size=1 --pipeline-model-parallel-size=1 --micro-batch-size=2 --global-batch-size=8 --seq-length=128 --lr=0.00015 --train-iters=10 --num-layers=2 --hidden-size=512 --num-attention-heads=8 --max-position-embeddings=128 --vocab-size=128256 --mock-data --tokenizer-type NullTokenizer --vocab-size 128256 --transformer-impl local --bf16 --no-gradient-accumulation-fusion --no-persist-layer-norm --log-interval 1 --log-throughput",
    "model": "gpt",
    "model_config": "./train/train_gpt_config.json",
    "train_dataset": "mock",
    "validation_dataset": null,
    "test_dataset": null,
    "train_args": {
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "warmup_iterations": 2,
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": {
          "size": 1,
          "type": "default"
        },
        "sp": 0
      }
    },
    "timeout_ms": 10000,
    "warmup_iterations": 2,
    "measured_iterations": 10
  },
  "metrics": [
    {
      "name": "train.throughput",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv",
      "unit": "tokens/s/gpu"
    },
    {
      "name": "train.peak_memory_usage",
      "type": "scalar",
      "value": 3.379883,
      "unit": "GB"
    },
    {
      "name": "train.loss",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv",
      "unit": ""
    },
    {
      "name": "train.ppl",
      "type": "timeseries",
      "raw_data_url": "./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv",
      "unit": null
    }
  ]
}

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_loss.csv
iteration,loss
3,11.75254
4,11.68381
5,11.58956
6,11.52593
7,11.47907
8,11.49399
9,11.62943
10,11.31772

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_ppl.csv
iteration,ppl
3,127075.922273655
4,118635.37588116364
5,107964.74309662386
6,101308.94617402623
7,96671.12181752129
8,98124.2684725397
9,112356.26058455171
10,82266.56117246421

./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv

(megatron) sunjinge@node1:~/Megatron-LM$ cat ./train/train.gpt.946e7e31-adaa-4421-bef1-63eb7726402f_train_throughput.csv
iteration,throughput
3,2010.9976433621368
4,1772.8531855955678
5,1693.1216931216932
6,1710.0868403473614
7,1711.2299465240644
8,1684.2105263157896
9,1692.0026437541308
10,1686.4295125164688

@zzhfz zzhfz requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:21
@baominghelly baominghelly requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:22
@zzhfz zzhfz requested review from baominghelly and removed request for Chamberlain0w0 November 21, 2025 06:24
@baominghelly baominghelly requested review from Chamberlain0w0 and baominghelly and removed request for Chamberlain0w0 and baominghelly November 21, 2025 06:30

dp = parallel.get("dp", 1)
tp = parallel.get("tp", 1)
pp = parallel.get("pp", {}).get("value", 1)
Collaborator

What should we do if the pp type is not G-pipe?
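
One possible way to handle this, as a rough sketch (assuming the pp entry follows the example JSON above with "size" and "type" keys, and that only the default/G-pipe schedule is currently wired up):

pp_cfg = parallel.get("pp", {}) or {}
pp = pp_cfg.get("size", 1)
pp_type = pp_cfg.get("type", "default")
if pp_type not in ("default", "gpipe"):
    # Fail fast rather than silently running an unsupported schedule.
    raise ValueError(f"Unsupported pipeline schedule type: {pp_type!r}")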

]

megatron_args = [
    "pretrain_gpt.py",
Collaborator

Is this script only for GPT-type models?
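
If other model families should be supported, one sketch is to select the entry script from the config's "model" field; the mapping below is hypothetical and would need to match the scripts actually present in the repo:

ENTRY_SCRIPTS = {
    "gpt": "pretrain_gpt.py",
    "bert": "pretrain_bert.py",
    "t5": "pretrain_t5.py",
}
model_name = self.config.model.lower()
if model_name not in ENTRY_SCRIPTS:
    raise ValueError(f"Unsupported model type: {model_name!r}")
megatron_args = [ENTRY_SCRIPTS[model_name]]  # remaining args appended below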

torchrun_cmd = [
    "torchrun",
    f"--nproc_per_node={nproc_per_node}",
    "--master_port=29501"
Collaborator

It would be best not to hard-code this port.
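
One common alternative, sketched here: honor MASTER_PORT if it is set, otherwise let the OS pick a free port instead of the hard-coded 29501.

import os
import socket

def pick_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

master_port = int(os.environ.get("MASTER_PORT", 0)) or pick_free_port()
torchrun_cmd = [
    "torchrun",
    f"--nproc_per_node={nproc_per_node}",
    f"--master_port={master_port}",
]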

f"--num-attention-heads={num_attention_heads}",
f"--max-position-embeddings={max_position_embeddings}",
f"--vocab-size={vocab_size}",
"--mock-data",
Collaborator

If our config includes a dataset, passing mock data here won't work, will it?
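
A sketch of branching on the configured dataset instead of always passing --mock-data; the tokenizer_model field and the use of HuggingFaceTokenizer are assumptions and would need to match the actual config schema:

if self.config.train_dataset in (None, "mock"):
    megatron_args += ["--mock-data", "--tokenizer-type", "NullTokenizer",
                      "--vocab-size", str(self.config.vocab_size)]
else:
    # Real dataset: point Megatron at the preprocessed data prefix.
    megatron_args += ["--data-path", str(self.config.train_dataset),
                      "--tokenizer-type", "HuggingFaceTokenizer",
                      "--tokenizer-model", str(self.config.tokenizer_model)]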


# regex patterns
loss_pattern = re.compile(r"lm loss:\s*([+\-]?\d+(?:\.\d+)?(?:[Ee][+\-]?\d+)?)", re.IGNORECASE)
#ppl_pattern_alt = re.compile(r"lm loss PPL:\s*([+\-]?\d+(?:\.\d+)?(?:[Ee][+\-]?\d+)?)", re.IGNORECASE)
Collaborator

Is this still needed? If not, it can be removed.

if me:
    try:
        elapsed_ms = float(me.group(1))
        tokens_per_iter = mbs * seq
Collaborator

Shouldn't this also be divided by the number of GPUs?
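
Since the reported unit is tokens/s/gpu, one consistent way to compute it is sketched below, assuming elapsed_ms is the time for one full training iteration and world_size (the total number of ranks, e.g. from the WORLD_SIZE environment variable or dp * tp * pp) is available; note the global token count per iteration is gbs * seq rather than mbs * seq:

if me:
    try:
        elapsed_ms = float(me.group(1))
        # Global tokens processed per iteration, normalized per GPU.
        tokens_per_iter = gbs * seq
        tokens_per_sec_per_gpu = tokens_per_iter / (elapsed_ms / 1000.0) / world_size
    except ValueError:
        pass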

flog.write(line)

# try match loss
m = loss_pattern.search(line)
Collaborator

Does this skip the warmup training iterations?
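
A sketch of skipping warmup iterations when collecting loss points; iter_pattern is a hypothetical regex that would need to match the iteration counter in the actual log line, and loss_rows stands in for whatever accumulator the script uses:

iter_pattern = re.compile(r"iteration\s+(\d+)\s*/\s*\d+")  # hypothetical; adjust to the real log format

mi = iter_pattern.search(line)
m = loss_pattern.search(line)
if mi and m:
    iteration = int(mi.group(1))
    if iteration > warmup_iterations:  # warmup_iterations comes from the config
        loss_rows.append((iteration, float(m.group(1))))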

@zzhfz zzhfz requested a review from baominghelly November 24, 2025 07:26
@zzhfz zzhfz requested review from Chamberlain0w0 and baominghelly and removed request for baominghelly December 1, 2025 09:32
@zzhfz zzhfz changed the title from "Refine Megatron script: fix run_id" to "feat: Refactor training framework with modular architecture" Dec 1, 2025
}

# Start training process
print("Launching:", " ".join(cmd))
Collaborator

We should not use print for logging; use the logging module instead.
Please apply this to all code in the PR.
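
For example, a minimal switch to the standard logging module (the logger name and format are just placeholders):

import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger(__name__)

# Start training process
logger.info("Launching: %s", " ".join(cmd))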


# Add data configuration
if self.config.train_dataset is None or (isinstance(self.config.train_dataset, str) and self.config.train_dataset.lower() == "mock"):
    megatron_args += ["--mock-data", "--tokenizer-type", "NullTokenizer", "--vocab-size", str(self.config.vocab_size)]
Collaborator

Do we need to use different tokenizers for different models here?

f"--micro-batch-size={self.config.mbs}",
f"--global-batch-size={self.config.gbs}",
f"--seq-length={self.config.seq_len}",
f"--lr={self.config.lr}",
Collaborator

Please also consider LR decay here.

We could add these configs to our JSON file:

f"--lr_scheduler_type=cosine",        # recommended: cosine annealing or linear
f"--warmup_ratio=0.03",               # recommended: warm up over the first 3% of steps
# or use a step count instead
# f"--warmup_steps=100",

# Add common parameters
megatron_args += [
    "--transformer-impl", "local",
    "--bf16",
Collaborator

We should get the precision from our JSON file.
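
A sketch of deriving the precision flag from the config instead of hard-coding --bf16 (the precision field name is an assumption):

precision = getattr(self.config, "precision", "bf16")  # hypothetical config field: "bf16" | "fp16" | "fp32"
if precision == "bf16":
    megatron_args.append("--bf16")
elif precision == "fp16":
    megatron_args.append("--fp16")
# fp32: pass neither flag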

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to config.json")
    parser.add_argument("--framework", default="megatron", choices=["megatron", "infinitrain"],
Collaborator

Could we get this config from the JSON file?
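
A sketch of reading the framework from config.json and keeping the CLI flag only as an override (the "framework" field name is an assumption; assumes json is imported):

parser.add_argument("--framework", default=None, choices=["megatron", "infinitrain"],
                    help="training framework; overrides the 'framework' field in config.json")
args = parser.parse_args()

with open(args.config) as f:
    cfg = json.load(f)
framework = args.framework or cfg.get("framework", "megatron")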

parser.add_argument("--config", required=True, help="path to config.json")
parser.add_argument("--framework", default="megatron", choices=["megatron", "infinitrain"],
help="training framework to use")
parser.add_argument("--gpu-platform", default="nvidia", choices=["nvidia", "other"],
Collaborator

Please add a device argument to the config file.
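
Likewise, a sketch of a device field in the config with the CLI option as an override (the "device" field name is an assumption):

parser.add_argument("--gpu-platform", default=None, choices=["nvidia", "other"],
                    help="GPU platform; overrides the 'device' field in config.json")
# after parsing args and loading cfg (see the framework sketch above):
gpu_platform = args.gpu_platform or cfg.get("device", "nvidia")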

Collaborator

@baominghelly baominghelly left a comment

Comments added in the PR.

@zzhfz zzhfz requested review from baominghelly and removed request for Chamberlain0w0 December 3, 2025 10:10
Contributor Author

@zzhfz zzhfz left a comment

@baominghelly The training script changes are complete; please review the latest version.

Main updates

  • Standardized output format: run_id/testcase conform to the spec
  • Full configuration support
  • Resolved 6 review comments
  • Successfully passed the Megatron-LM test

