
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 5 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation


@cspades cspades commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Cory Ye <cye@nvidia.com>

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR adds a new L1_pytorch_mcore_fsdp_integration CI test that clones a pinned Megatron-LM commit, creates a mock BPE vocab, sets relevant NVTE_* env vars, and runs a short Megatron-FSDP GPT training job with FP8, CPU offloading, and DCP checkpointing flags.

The two blocking issues flagged in earlier review rounds have been addressed: unset CUDA_DEVICE_MAX_CONNECTIONS is now a standalone shell statement (no longer swallowing python3), and the Python invocation uses direct backslash line-continuations rather than a bash -c "${COMMAND}" string that would split at newlines.
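The two fixed patterns can be illustrated with a minimal, hedged sketch (this is not the actual qa/L1_pytorch_mcore_fsdp_integration/test.sh; the command and flag names here are placeholders):

```shell
#!/usr/bin/env bash
# Hedged sketch of the two fixes described above, not the real test script.
set -u

# Fix 1: "unset" must be a standalone statement. Written inline as
#   unset CUDA_DEVICE_MAX_CONNECTIONS python3 -m ... pretrain_gpt.py ...
# the shell treats the following words as more names to unset (and chokes
# on flags like -m), so the training job never launches. Standalone form:
export CUDA_DEVICE_MAX_CONNECTIONS=1
unset CUDA_DEVICE_MAX_CONNECTIONS

# Fix 2: invoke python3 directly with backslash line-continuations rather
# than expanding a multi-line "${COMMAND}" string through bash -c, which
# would split the command at the embedded newlines.
python3 \
    -c \
    'print("launch ok")'
```

The same shape applies to the real torch.distributed launch: one statement per side effect, and one directly invoked command spread across continuation lines.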

Confidence Score: 5/5

Safe to merge — all previously identified blocking issues have been resolved and only a minor P2 CI-speed suggestion remains.

The two P0/P1 issues from prior review rounds (unset swallowing python3, and the bash -c newline-splitting bug) are both fixed. The only remaining observation is a P2 suggestion to shallow-clone Megatron-LM to speed up CI; this does not affect correctness.
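The P2 suggestion could look roughly like the sketch below. A throwaway local repository stands in for NVIDIA/Megatron-LM so the example runs offline; in CI the remote would be the GitHub URL and PINNED_SHA the pinned Megatron-LM commit. Fetching a bare SHA needs server-side support (uploadpack.allowAnySHA1InWant, which GitHub enables), which is why this uses init + fetch rather than git clone --branch.

```shell
#!/usr/bin/env bash
# Hedged sketch of shallow-fetching only a pinned commit; the local
# "upstream" repo is a stand-in for NVIDIA/Megatron-LM.
set -euo pipefail

# Stand-in for the upstream repository with one "pinned" commit.
UPSTREAM=$(mktemp -d)
git init -q "$UPSTREAM"
git -C "$UPSTREAM" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "pinned commit"
git -C "$UPSTREAM" config uploadpack.allowAnySHA1InWant true
PINNED_SHA=$(git -C "$UPSTREAM" rev-parse HEAD)

# Shallow-fetch just the pinned commit instead of the full history.
CLONE=$(mktemp -d)
git init -q "$CLONE"
git -C "$CLONE" remote add origin "$UPSTREAM"
git -C "$CLONE" fetch -q --depth 1 origin "$PINNED_SHA"
git -C "$CLONE" checkout -q FETCH_HEAD

echo "checked out $(git -C "$CLONE" rev-parse --short HEAD)"
```

For a repository the size of Megatron-LM this trades full history for a single-commit download, which is where the CI-speed win comes from.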

No files require special attention.

Vulnerabilities

No security concerns identified. The script clones from the official NVIDIA Megatron-LM repository and checks out a pinned commit SHA, which limits supply-chain risk.

Important Files Changed

  • qa/L1_pytorch_mcore_fsdp_integration/test.sh: New CI test script; previous blocking issues (unset swallowing python3, bash -c newline splitting) are fixed; one P2 suggestion to shallow-clone Megatron-LM for CI speed
  • qa/L1_pytorch_mcore_fsdp_integration/.gitignore: Correctly ignores the cloned Megatron-LM directory and generated vocab.json
  • qa/L1_pytorch_mcore_fsdp_integration/merges.txt: Stub BPE merges file (version header only), consistent with the sibling test's approach for mock data

Sequence Diagram

sequenceDiagram
    participant CI as CI Runner
    participant GH as GitHub (Megatron-LM)
    participant FS as Filesystem
    participant GPU as GPU (GB200)

    CI->>FS: Check if Megatron-LM dir exists
    alt Not present
        CI->>GH: git clone Megatron-LM
        CI->>GH: git checkout 8cbc45b (pinned)
    end
    CI->>FS: Write mock vocab.json (4096 tokens)
    CI->>FS: unset CUDA_DEVICE_MAX_CONNECTIONS
    CI->>FS: export NVTE_* env vars
    CI->>GPU: python3 -m torch.distributed.launch pretrain_gpt.py (FSDP, FP8, BF16)
    GPU-->>CI: Train 10 iters + final eval
    CI-->>CI: Exit 0 (pass) or non-zero (fail)
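The "mock vocab.json (4096 tokens)" step in the diagram can be sketched as below. The token-key format ("tok&lt;i&gt;") is an assumption for illustration; the actual script's vocab contents are not shown here, and a real GPT-2-style BPE vocab would also carry special tokens.

```shell
#!/usr/bin/env bash
# Hedged sketch of generating a mock 4096-entry BPE vocab for the test;
# the key naming scheme is invented, not taken from the real script.
set -euo pipefail

python3 - <<'EOF'
import json

# Map 4096 placeholder tokens to consecutive ids.
vocab = {f"tok{i}": i for i in range(4096)}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
EOF

echo "wrote vocab.json with $(python3 -c \
    'import json; print(len(json.load(open("vocab.json"))))') entries"
```

Paired with the stub merges.txt (version header only), this gives the GPT2BPETokenizer enough structure to initialize without shipping real tokenizer data in the repo.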

Reviews (8). Last reviewed commit: "Remove CPU initialization, add FW args."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@timmoon10 (Collaborator) previously approved these changes Apr 7, 2026

@timmoon10 left a comment

LGTM, pending CI

@timmoon10 (Collaborator)

Pipeline 47956532

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades force-pushed the cye/mfsdp-te-e2e-test branch from 5fb4871 to fce5369 on April 8, 2026 at 00:51

cspades commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly uses decoupled_grad depending on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.
