
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 5 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation


@cspades cspades commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Cory Ye <cye@nvidia.com>

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR adds a new L1_pytorch_mcore_fsdp_integration CI test that clones a pinned Megatron-LM commit, creates a mock BPE vocab, sets relevant NVTE_* env vars, and runs a short Megatron-FSDP GPT training job with FP8, CPU offloading, and DCP checkpointing flags.

The two blocking issues flagged in earlier review rounds have been addressed: unset CUDA_DEVICE_MAX_CONNECTIONS is now a standalone shell statement (no longer swallowing python3), and the Python invocation uses direct backslash line-continuations rather than a bash -c "${COMMAND}" string that would split at newlines.
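The two fixed patterns can be illustrated with a minimal, hedged sketch (this is not the actual qa/L1_pytorch_mcore_fsdp_integration/test.sh; the command and flag names here are placeholders):

```shell
#!/usr/bin/env bash
# Hedged sketch of the two fixes described above, not the real test script.
set -u

# Fix 1: "unset" must be a standalone statement. Written inline as
#   unset CUDA_DEVICE_MAX_CONNECTIONS python3 -m ... pretrain_gpt.py ...
# the shell treats the following words as more names to unset (and chokes
# on flags like -m), so the training job never launches. Standalone form:
export CUDA_DEVICE_MAX_CONNECTIONS=1
unset CUDA_DEVICE_MAX_CONNECTIONS

# Fix 2: invoke python3 directly with backslash line-continuations rather
# than expanding a multi-line "${COMMAND}" string through bash -c, which
# would split the command at the embedded newlines.
python3 \
    -c \
    'print("launch ok")'
```

The same shape applies to the real torch.distributed launch: one statement per side effect, and one directly invoked command spread across continuation lines.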

Confidence Score: 5/5

Safe to merge — all previously identified blocking issues have been resolved and only a minor P2 CI-speed suggestion remains.

The two P0/P1 issues from prior review rounds (unset swallowing python3, and the bash -c newline-splitting bug) are both fixed. The only remaining observation is a P2 suggestion to shallow-clone Megatron-LM to speed up CI; this does not affect correctness.
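The P2 suggestion could look roughly like the sketch below. A throwaway local repository stands in for NVIDIA/Megatron-LM so the example runs offline; in CI the remote would be the GitHub URL and PINNED_SHA the pinned Megatron-LM commit. Fetching a bare SHA needs server-side support (uploadpack.allowAnySHA1InWant, which GitHub enables), which is why this uses init + fetch rather than git clone --branch.

```shell
#!/usr/bin/env bash
# Hedged sketch of shallow-fetching only a pinned commit; the local
# "upstream" repo is a stand-in for NVIDIA/Megatron-LM.
set -euo pipefail

# Stand-in for the upstream repository with one "pinned" commit.
UPSTREAM=$(mktemp -d)
git init -q "$UPSTREAM"
git -C "$UPSTREAM" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "pinned commit"
git -C "$UPSTREAM" config uploadpack.allowAnySHA1InWant true
PINNED_SHA=$(git -C "$UPSTREAM" rev-parse HEAD)

# Shallow-fetch just the pinned commit instead of the full history.
CLONE=$(mktemp -d)
git init -q "$CLONE"
git -C "$CLONE" remote add origin "$UPSTREAM"
git -C "$CLONE" fetch -q --depth 1 origin "$PINNED_SHA"
git -C "$CLONE" checkout -q FETCH_HEAD

echo "checked out $(git -C "$CLONE" rev-parse --short HEAD)"
```

For a repository the size of Megatron-LM this trades full history for a single-commit download, which is where the CI-speed win comes from.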

No files require special attention.

Vulnerabilities

No security concerns identified. The script clones from the official NVIDIA Megatron-LM repository and checks out a pinned commit SHA, which limits supply-chain risk.

Important Files Changed

  • qa/L1_pytorch_mcore_fsdp_integration/test.sh: New CI test script; previous blocking issues (unset swallowing python3, bash -c newline splitting) are fixed; one P2 suggestion to shallow-clone Megatron-LM for CI speed
  • qa/L1_pytorch_mcore_fsdp_integration/.gitignore: Correctly ignores the cloned Megatron-LM directory and generated vocab.json
  • qa/L1_pytorch_mcore_fsdp_integration/merges.txt: Stub BPE merges file (version header only), consistent with the sibling test's approach for mock data

Sequence Diagram

sequenceDiagram
    participant CI as CI Runner
    participant GH as GitHub (Megatron-LM)
    participant FS as Filesystem
    participant GPU as GPU (GB200)

    CI->>FS: Check if Megatron-LM dir exists
    alt Not present
        CI->>GH: git clone Megatron-LM
        CI->>GH: git checkout 8cbc45b (pinned)
    end
    CI->>FS: Write mock vocab.json (4096 tokens)
    CI->>FS: unset CUDA_DEVICE_MAX_CONNECTIONS
    CI->>FS: export NVTE_* env vars
    CI->>GPU: python3 -m torch.distributed.launch pretrain_gpt.py (FSDP, FP8, BF16)
    GPU-->>CI: Train 10 iters + final eval
    CI-->>CI: Exit 0 (pass) or non-zero (fail)
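The "mock vocab.json (4096 tokens)" step in the diagram can be sketched as below. The token-key format ("tok&lt;i&gt;") is an assumption for illustration; the actual script's vocab contents are not shown here, and a real GPT-2-style BPE vocab would also carry special tokens.

```shell
#!/usr/bin/env bash
# Hedged sketch of generating a mock 4096-entry BPE vocab for the test;
# the key naming scheme is invented, not taken from the real script.
set -euo pipefail

python3 - <<'EOF'
import json

# Map 4096 placeholder tokens to consecutive ids.
vocab = {f"tok{i}": i for i in range(4096)}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
EOF

echo "wrote vocab.json with $(python3 -c \
    'import json; print(len(json.load(open("vocab.json"))))') entries"
```

Paired with the stub merges.txt (version header only), this gives the GPT2BPETokenizer enough structure to initialize without shipping real tokenizer data in the repo.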

Reviews (8). Last reviewed commit: "Remove CPU initialization, add FW args."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@timmoon10 (Collaborator) previously approved these changes Apr 7, 2026

@timmoon10 left a comment

LGTM, pending CI

@timmoon10 (Collaborator)

Pipeline 47956532

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades force-pushed the cye/mfsdp-te-e2e-test branch from 5fb4871 to fce5369 on April 8, 2026 at 00:51

cspades commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly uses decoupled_grad depending on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.
