Add Megatron-FSDP E2E integration test to TE CI/CD (L1). #2845
cspades wants to merge 5 commits into NVIDIA:main from
Conversation
Signed-off-by: Cory Ye <cye@nvidia.com>
Greptile Summary

This PR adds a new `qa/L1_pytorch_mcore_fsdp_integration` E2E test. The two blocking issues flagged in earlier review rounds have been addressed.

Confidence Score: 5/5 — Safe to merge. All previously identified blocking issues have been resolved and only a minor P2 CI-speed suggestion remains. The two P0/P1 issues from prior review rounds (`unset` swallowing `python3`, and the `bash -c` newline-splitting bug) are both fixed. The only remaining observation is a P2 suggestion to shallow-clone Megatron-LM to speed up CI; this does not affect correctness. No files require special attention.
| Filename | Overview |
|---|---|
| qa/L1_pytorch_mcore_fsdp_integration/test.sh | New CI test script; previous blocking issues (unset swallowing python3, bash -c newline splitting) are fixed; one P2 suggestion to shallow-clone Megatron-LM for CI speed |
| qa/L1_pytorch_mcore_fsdp_integration/.gitignore | Correctly ignores the cloned Megatron-LM directory and generated vocab.json |
| qa/L1_pytorch_mcore_fsdp_integration/merges.txt | Stub BPE merges file (version header only) consistent with the sibling test's approach for mock data |
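The two fixed P0/P1 issues named above are generic bash pitfalls. The snippet below is a minimal sketch of both failure modes and their fixes; the commands are illustrative stand-ins, not the actual contents of `test.sh`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pitfall 1: if a line break is lost, `unset` swallows the next word as a
# variable name. For example, `unset CUDA_DEVICE_MAX_CONNECTIONS python3 ...`
# quietly unsets a variable called "python3" and never launches Python.
# Fix: terminate the unset statement before the command that follows it.
export CUDA_DEVICE_MAX_CONNECTIONS=1
unset CUDA_DEVICE_MAX_CONNECTIONS
echo "max_connections=${CUDA_DEVICE_MAX_CONNECTIONS:-unset}"

# Pitfall 2: a multi-line command stored in a string and re-executed via
# `bash -c "$CMD"` (or unquoted expansion) can be split at embedded
# newlines, breaking one logical command into several. Building the
# command as an array and expanding it with "${cmd[@]}" keeps each
# argument intact.
cmd=(printf '%s\n' "arg one" "arg two")
"${cmd[@]}"
```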
Sequence Diagram

```mermaid
sequenceDiagram
    participant CI as CI Runner
    participant GH as GitHub (Megatron-LM)
    participant FS as Filesystem
    participant GPU as GPU (GB200)
    CI->>FS: Check if Megatron-LM dir exists
    alt Not present
        CI->>GH: git clone Megatron-LM
        CI->>GH: git checkout 8cbc45b (pinned)
    end
    CI->>FS: Write mock vocab.json (4096 tokens)
    CI->>FS: unset CUDA_DEVICE_MAX_CONNECTIONS
    CI->>FS: export NVTE_* env vars
    CI->>GPU: python3 -m torch.distributed.launch pretrain_gpt.py (FSDP, FP8, BF16)
    GPU-->>CI: Train 10 iters + final eval
    CI-->>CI: Exit 0 (pass) or non-zero (fail)
```
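The mock-data step in the diagram can be sketched as below. The file names (`vocab.json`, `merges.txt`) and the 4096-token count come from this PR, but the token naming scheme, the special token, and the shallow-clone variant are assumptions, not the actual `test.sh`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Review's P2 suggestion: a shallow/partial clone of the pinned commit
# would speed up CI (fetching an arbitrary SHA may require server
# support), e.g.:
#   git clone --no-checkout --filter=blob:none https://github.com/NVIDIA/Megatron-LM.git
#   git -C Megatron-LM checkout 8cbc45b

# Write a mock GPT-2-style vocab.json with 4096 entries so the embedding
# table stays small for the 10-iteration functional run.
python3 - <<'PY'
import json
vocab = {"<|endoftext|>": 0}  # assumed special token
vocab.update({f"tok{i}": i for i in range(1, 4096)})
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
PY

# Stub BPE merges file: version header only, mirroring the sibling test.
echo '#version: 0.2' > merges.txt
```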
Reviews (8): Last reviewed commit: "Remove CPU initialization, add FW args."
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Pipeline 47956532
Force-pushed from f93ebbc to e521694
Signed-off-by: Cory Ye <cye@nvidia.com>
Force-pushed from 5fb4871 to fce5369
Depends on: NVIDIA/Megatron-LM#4133. This PR correctly uses
Description

Details

`decoupled_grad` bugs related to FusedAdam, and other less obvious CPU offloading & Tensor API bugs, are difficult to catch without running Megatron-FSDP. This functional test aims to reduce the frequency of such regressions.

Type of change
Changes
Please list the changes introduced in this PR:
Checklist: