
Conversation

@xmfan
Member

@xmfan xmfan commented Nov 11, 2025

Stacked PRs:


Currently, the forward outputs match per microbatch (no batch invariance).

Intended usage:

> torchrun --standalone --nproc-per-node 4 examples/example_ds3_local_map.py --rng-seed 42; torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 42

(a) [14:59:59] ~/core/a/autoparallel (mybranch) > diff out/0/diff.log out/1/diff.log 
(a) [15:00:07] ~/core/a/autoparallel (mybranch) > diff out/0/weights.log out/1/pp_weights.log 
--- out/0/weights.log   2025-11-19 14:23:31.313739075 -0800
+++ out/1/pp_weights.log        2025-11-19 14:24:33.369228991 -0800
@@ -60,9 +60,12 @@
 name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.0.moe.expert_bias' hash=DTensor(0)
 name='layers.0.moe.tokens_per_expert' hash=DTensor(0)
+name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.1.moe.expert_bias' hash=DTensor(0)
 name='layers.1.moe.tokens_per_expert' hash=DTensor(0)
+name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.2.moe.expert_bias' hash=DTensor(0)
 name='layers.2.moe.tokens_per_expert' hash=DTensor(0)
+name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.3.moe.expert_bias' hash=DTensor(0)
 name='layers.3.moe.tokens_per_expert' hash=DTensor(0)

Currently, fw ins are the same, but the forward is being run with a different rng state between the two setups, so there are some numerical differences.
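The hash= values above are content hashes of each parameter/buffer, logged per run so the two setups can be diffed. As a rough illustration of the idea only (hypothetical helper, not the actual NumericsLogger code), such a hash can be computed from the raw bit pattern of a tensor:

```python
import torch

def tensor_hash(t: torch.Tensor) -> int:
    # Hypothetical sketch: sum the raw byte values of the tensor so that
    # bit-identical tensors produce identical hashes across runs/ranks.
    t = t.detach().contiguous().flatten()
    return int(t.view(torch.uint8).to(torch.int64).sum().item())
```

Complex buffers such as freqs_cis would be hashed per component (real/imag), matching the log lines above; writing one `name=... hash=...` line per parameter makes `diff out/0/weights.log out/1/pp_weights.log` a quick divergence check.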

xmfan added a commit that referenced this pull request Nov 11, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@meta-cla meta-cla bot added the CLA Signed label Nov 11, 2025
@xmfan xmfan changed the title Log forward intermediates hashes w/pp vs w/o pp Log forward intermediates/output hashes w/o pp Nov 11, 2025
@xmfan xmfan changed the title Log forward intermediates/output hashes w/o pp Log forward intermediates hashes w/pp vs w/o pp Nov 12, 2025
@xmfan xmfan marked this pull request as ready for review November 12, 2025 07:18
@xmfan xmfan marked this pull request as draft November 13, 2025 20:09
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 13, 2025 22:55
xmfan added a commit that referenced this pull request Nov 13, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the title Log forward intermediates hashes w/pp vs w/o pp Compare microbatch forward outputs and gradients Nov 13, 2025
@xmfan xmfan marked this pull request as ready for review November 13, 2025 22:57
xmfan added a commit that referenced this pull request Nov 14, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@wconstab
Contributor

granted the rng affects the grads, why does the diff show 'none' rather than a different hash?

if rng_seed is not None:
    numerics_logger = NumericsLogger(logs_dir)
with AutoParallel(
    model, input_fn, mesh, dynamic=True, numerics_logger=None
Contributor

should this be numerics_logger = numerics_logger?

Member Author

It's too noisy, it logs the intermediates for each op in the graph. I haven't thought of how to address it yet.
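To illustrate the volume involved (a sketch only, assuming an FX-style graph and a `log_diff(tensor, prefix=...)`-style method on the logger), per-op intermediate logging means one entry per graph node:

```python
import torch
from torch.fx import GraphModule, Interpreter

def log_graph_intermediates(gm: GraphModule, args, numerics_logger):
    # One log line per graph node: for a full model graph this is thousands
    # of entries per forward, which is why the logger is not threaded into
    # AutoParallel here.
    class LoggingInterpreter(Interpreter):
        def run_node(self, node):
            out = super().run_node(node)
            if isinstance(out, torch.Tensor):
                numerics_logger.log_diff(out, prefix=f"node {node.name}")
            return out

    return LoggingInterpreter(gm).run(*args)
```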

return

rank = torch.distributed.get_rank()
if rank == 4:
Contributor

can you somehow not hardcode this

Member Author

take a look at the new logic

action: _Action,
ctx: _PipelineContext,
numerics_logs: Optional[list[str]] = None,
forward_hook: Callable | None = None,
Contributor

nit: Optional[Callable]

if self.rank == 0:
    print(f"Weight hashes written to {path}")

def log_pp_grads(self, orig_mod, stage_mods, num_world_stages, ranks):
Contributor

What is num_world_stages?

Member Author

number of stages

rank = torch.distributed.get_rank()
if rank == 4:
    numerics_logger.log_diff(
        output, rank=4, prefix=f"mb{action.microbatch_index} fwd out"
Contributor

Yeah, very confusing. Also, do we care about pp_rank or global rank? Finally, V-style schedules will have the last stage on rank 0?

Member Author

I just want to log from the last pp stage, and only want to log it once

Member Author

take a look at the new logic
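For reference, last-stage-only logging can be expressed without a hardcoded global rank; a minimal sketch under assumed names (`stage_index` on the stage, `num_world_stages` as the total stage count), which may differ from the PR's actual new logic:

```python
def should_log_fwd_output(stage_index: int, num_world_stages: int) -> bool:
    # Log only from the rank that owns the final pipeline stage, whatever its
    # global rank happens to be (V-style schedules can place it on rank 0).
    # The last stage sees each microbatch exactly once, so each fwd output is
    # logged exactly once.
    return stage_index == num_world_stages - 1
```

The forward action would then guard its log_diff call with this check instead of `rank == 4`.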

@sanketpurandare
Contributor

> But for the backward, all grads are None
> Currently, fw ins are the same, but the forward is being run with a different rng state between the two setups, so there are some numerical differences

If we land #250 first, it fixes the grad issue.

@sanketpurandare
Contributor

> granted the rng affects the grads, why does the diff show 'none' rather than a different hash?

There was a bug in gradient accumulation that is fixed by #250.
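For context on why buggy accumulation shows up as 'none' rather than a different hash: grads only appear in param.grad if the per-microbatch accumulation step actually runs. A generic sketch of that step (not the actual fix in #250):

```python
import torch

def accumulate_grads(params, microbatch_grads):
    # Generic per-microbatch accumulation. If this step is skipped, or grads
    # are stashed somewhere other than param.grad, param.grad stays None and
    # a later log/diff prints 'None' instead of a hash.
    for param, mb_grad in zip(params, microbatch_grads):
        if mb_grad is None:
            continue
        if param.grad is None:
            param.grad = mb_grad.detach().clone()
        else:
            param.grad.add_(mb_grad)
```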

xmfan added a commit that referenced this pull request Nov 19, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@sanketpurandare
Contributor

@xmfan Would it be possible to add the numerics logging logic to the GraphPipelineStage class? That way, when we create the stage, we can pass the numerics logging args to the stage itself. Then, from the stage object, you can grab the variables or callables you want and create logs for any of the action methods. This way you don't need to change the signature of stage_forward etc.
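A rough sketch of that suggestion (constructor args and method names are assumptions, and the real GraphPipelineStage takes many more arguments): store the logging state on the stage at construction time, and have the action methods read it from the stage instead of taking new parameters.

```python
from typing import Any, Optional

class GraphPipelineStage:
    # Hypothetical sketch: only the numerics-logging plumbing is shown.
    def __init__(self, stage_index: int, numerics_logger: Optional[Any] = None):
        self.stage_index = stage_index
        self.numerics_logger = numerics_logger

    def maybe_log(self, prefix: str, value: Any) -> None:
        # No-op unless this stage was constructed with a logger, so action
        # methods like stage_forward can call it unconditionally.
        if self.numerics_logger is not None:
            self.numerics_logger.log_diff(value, prefix=prefix)
```

stage_forward could then call something like `stage.maybe_log(f"mb{action.microbatch_index} fwd out", output)` without the extra `numerics_logs`/`forward_hook` parameters in its signature.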

@xmfan
Member Author

xmfan commented Nov 20, 2025

Verified that the failures are due to recent nightlies.

@xmfan xmfan merged commit 10d8208 into main Nov 20, 2025
4 of 6 checks passed