ZeRO 1/2: wait on all IPG-bucket producer streams in average_tensor (#8061)#8080
Open
arunshar wants to merge 1 commit into
Open
ZeRO 1/2: wait on all IPG-bucket producer streams in average_tensor (#8061)#8080arunshar wants to merge 1 commit into
arunshar wants to merge 1 commit into
Conversation
…eepspeedai#8061) With overlap_comm, the per-parameter gradient copies into the contiguous IPG bucket can be issued on multiple streams (e.g. under torch.compile, gradient hooks run on different autograd streams). average_tensor waited the reduction stream on only the current stream before reducing the bucket, so the reduction could read the bucket before another producer finished, corrupting gradients (NaN loss). Track the set of producer streams per IPG bucket and wait on all of them. The single-stream path is unchanged (the set is just {current_stream}), so there is no behavior change when overlap_comm copies stay on one stream. Adds CPU unit tests in tests/unit/v1/zero/test_overlap_comm_record_stream.py for the producer-stream wait, the empty fallback to current_stream, and the IPGBucket.copy_streams reset. Fixes deepspeedai#8061. Signed-off-by: Arun Sharma <sharm485@umn.edu>
a2d31ca to
35b4262
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes #8061. In ZeRO stage 1/2 with
overlap_comm,average_tensorwaits the reduction stream on only the current stream before reducing the contiguous IPG gradient bucket:But the per-parameter gradient copies that fill the bucket (
reduce_independent_p_g_buckets_and_remove_grads, thenew_grad_tensor.copy_(...)intobucket.buffer[bucket.index]) can be issued on multiple streams. That is exactly the scenario #8061 reports undertorch.compile, where gradient hooks run on different autograd streams and several device streams write slices into the same IPG buffer. Waiting on only the current stream lets the all-reduce read the bucket before the other producers finish, corrupting gradients (NaN loss from step 1).Change
Implements the fix direction proposed in #8061 (record the streams used for the IPG bucket copies, then make the reduction stream wait on all of them):
IPGBucketgains acopy_streamsset, cleared inclear().bucket.buffer[bucket.index], recordcurrent_stream()(overlap path only).average_tensorwaits the reduction stream on every recorded producer stream, falling back tocurrent_stream()when the set is empty (e.g. the extra-large-param path that reduces without a bucket copy).The single-stream case is unchanged: when all copies are on one stream,
copy_streams == {current_stream}, so the wait is identical to before, and there is no behavior change for the common path.Tests / validation
tests/unit/v1/zero/test_overlap_comm_record_stream.py(fakes + monkeypatch, no GPU): the reduction stream waits on all producer streams, the empty-set fallback tocurrent_stream, andIPGBucket.copy_streamsreset. The two pre-existing tests in that file still pass (5 passed).pre-commit run --filesis green on both changed files (yapf, flake8, check-torchdist, check-license, codespell).contiguous_gradients, overlap_comm on vs off, eager andtorch.compile): before and after this change, suspect == baseline on every repeat with byte-identical final-param hashes, and the baseline-vs-baseline determinism gate passes. So the fix does not change results when grad-bucket copies stay on a single stream.Honest scope on reproduction
I could not deterministically trigger the multi-stream NaN on the available hardware (A40, small MLP): neither the
torch.compileA/B nor a synthetic two-stream microbenchmark surfaced it (PyTorch's caching allocator inserts implicit cross-stream syncs that mask the race in a microbenchmark, and a plaintorch.compile(model)on this model kept the grad-bucket copies on one stream). This PR is therefore offered as the minimal correct synchronization for the clearly-missing producer-stream wait identified inaverage_tensor, validated for no-regression and with unit coverage; a reviewer with the originaltorch.compilemulti-stream repro from #8061 can confirm the NaN is resolved.Opened as a draft.