[PyT][Test] Add xfailing FSDP2 memory leak detection tests #2803
vthumbe1503 merged 3 commits into NVIDIA:main
Conversation
Greptile Summary

This PR adds a new test file, run_fsdp2_mem_leak.py, with FSDP2 memory leak detection tests.
Confidence Score: 5/5 — safe to merge. All three previously raised concerns (standalone-runner crash, stale layer-count comment, unused MEASURED_STEPS) are resolved, and no new P0/P1 issues were found. The only remaining note is a P2 style suggestion about the tolerance floor in the bf16 control test. The code follows existing patterns, the xfail decoration is correct with strict=False, and the conftest.py fixture wiring is sound. No files require special attention.
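The strict=False xfail decoration mentioned above can be sketched as follows. This is illustrative, not the PR's actual code — the mark name and reason string are assumptions:

```python
import pytest

# Hypothetical shared mark for the known FSDP2 + FP8 leaks.  strict=False
# means the test reports XPASS (instead of failing the suite) once the
# underlying leak is fixed, so the fix does not require a test change.
fp8_leak_xfail = pytest.mark.xfail(
    reason="Known FSDP2+FP8 memory leak (Issues #2681 / #2717)",
    strict=False,
)

@fp8_leak_xfail
def test_fp8_temp_accumulation_across_layers():
    # Real test compares FP8 per-layer increments against a bf16 baseline.
    ...
```

With strict=True the eventual fix would instead turn these tests into hard XPASS failures, forcing the decorator to be removed in the same PR as the fix.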
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test_fsdp2_mem_leak_tests\nouter pytest runner] -->|torchrun -m pytest| B[run_fsdp2_mem_leak.py]
    B --> C[test_bf16_no_excess_forward_memory\ncontrol - PASS]
    B --> D[test_bf16_no_excess_backward_memory\ncontrol - PASS]
    B --> E[test_fp8_temp_accumulation_across_layers\nxfail - Issue 2681]
    B --> F[test_transpose_cache_retained_after_backward\nxfail - Issue 2717]
    C --> G[_LayerMemoryTracker hooks\nper-layer forward increments]
    G --> H{max_deviation ≤\n10% avg + 1KiB?}
    H -->|yes| I[PASS]
    D --> J[_measure_backward_memory_delta\nbf16 vs bf16]
    J --> K{abs excess ≤\n256 KiB?}
    K -->|yes| I
    E --> L[bf16 baseline\n_measure_forward_increments]
    E --> M[FP8 model\n_measure_forward_increments]
    L & M --> N{fp8_avg - bf16_avg ≤\n50 KiB/layer?}
    N -->|no - xfail expected| O[XFAIL]
    F --> P[bf16 baseline\n_measure_backward_memory_delta]
    F --> Q[FP8 model\n_measure_backward_memory_delta]
    P & Q --> R{fp8_delta - bf16_delta ≤\n256 KiB?}
    R -->|no - xfail expected| O
```
LGTM.

/te-ci L1 pytorch
vthumbe1503 left a comment
CI is green. Changes LGTM. Hopefully this PR fixes the xfailing tests.
Add tests that demonstrate two known memory issues with FSDP2 + FP8:

- Issue #2681: FP8 weight copies created during the te.autocast() forward pass accumulate across layers instead of being freed between layers, defeating FSDP2's memory efficiency. Detected by comparing per-layer forward memory increments against a bf16 baseline using layer hooks.
- Issue #2717: Transpose cache tensors (_create_transpose) allocated during backward persist until the next forward pass instead of being freed after backward completes. Detected by comparing the backward memory delta (post_bwd - post_fwd) against a bf16 baseline.

New tests:
- test_bf16_no_excess_forward_memory: control, validates per-layer measurement
- test_bf16_no_excess_backward_memory: control, validates backward delta comparison
- test_fp8_temp_accumulation_across_layers: xfail, detects #2681
- test_transpose_cache_retained_after_backward: xfail, detects #2717

All parametrized over 5 FP8 recipes × {no_quant_init, quant_init}.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Summary
Issue #2681: FP8 weight copy accumulation during forward

FP8 weight copies created by `te.autocast()` accumulate across layers (~0.68 MiB/layer excess over the bf16 baseline). Detected for all 5 recipes with `no_quant_init`.

Issue #2717: Transpose cache retained after backward

`_create_transpose` tensors persist after backward until the next forward pass frees them (~3 MiB excess over bf16). Detected for `DelayedScaling` and `Float8CurrentScaling` with `quant_init`.

New tests (in `run_fsdp2_mem_leak.py`):

- `test_bf16_no_excess_forward_memory`
- `test_bf16_no_excess_backward_memory`
- `test_fp8_temp_accumulation_across_layers`
- `test_transpose_cache_retained_after_backward`

All FP8 tests parametrized over 5 recipes × {no_quant_init, quant_init}.
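The 50 KiB/layer tolerance from the flowchart amounts to a simple comparison of average per-layer increments. A sketch of that check (the helper name and exact form are assumptions):

```python
KIB = 1024

def fp8_excess_per_layer(fp8_increments, bf16_increments, tol_bytes=50 * KIB):
    """Average per-layer FP8 forward increment minus the bf16 baseline.

    Returns (excess_bytes, within_tolerance).  Issue #2681-style accumulation
    makes excess_bytes grow roughly linearly with layer count, so it blows
    well past any fixed per-layer tolerance.
    """
    fp8_avg = sum(fp8_increments) / len(fp8_increments)
    bf16_avg = sum(bf16_increments) / len(bf16_increments)
    excess = fp8_avg - bf16_avg
    return excess, excess <= tol_bytes
```

The reported ~0.68 MiB/layer excess is more than ten times this tolerance, which is why the leak is detected reliably rather than hiding inside allocator noise.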
Test plan
`pytest tests/pytorch/distributed/test_torch_fsdp2.py` — all 4 outer tests pass (including the existing model and fused_adam tests)