fix(npu): synchronize stream after backward for dp replicate training#60

Open
danieldale2026 wants to merge 1 commit into OpenMOSS:feat/npu from danieldale2026:fix/npu-sync-after-backward-dp-replicate

Conversation

@danieldale2026

Summary

Fixes #59.

This PR fixes a loss=nan failure in NPU training when dp_replicate_size > 1 by synchronizing the current NPU stream immediately after the backward pass.

Details

  • Adds torch_npu import in mova/engine/trainer/accelerate/accelerate_trainer.py
  • Calls torch_npu.npu.current_stream().synchronize() right after self.accelerator.backward(loss) (see the sketch after this list)
  • Keeps the change scoped to NPU stream ordering after backward
  • Does not modify other training logic
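
A minimal sketch of the patched step, assuming a typical Accelerate-style training loop. The class, method, and optimizer attribute names below are illustrative; only the torch_npu import and the synchronize() call reflect the actual diff:

```python
import torch_npu  # new import added by this PR


class AccelerateTrainer:  # illustrative stand-in for the real trainer class
    def _training_step(self, loss):
        # Existing backward call (unchanged).
        self.accelerator.backward(loss)

        # Added by this PR: block the host until every kernel queued on the
        # current NPU stream (including any dp-replicate gradient
        # communication launched during backward) has finished. Without this
        # barrier, the issue reported in #59 manifests as loss=nan when
        # dp_replicate_size > 1.
        torch_npu.npu.current_stream().synchronize()

        # Remaining training logic (optimizer step, etc.) is untouched.
        self.optimizer.step()
        self.optimizer.zero_grad()
```

Synchronizing the whole current stream is a conservative barrier: it trades a small host-side stall per step for deterministic ordering between backward and the optimizer update.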

Test

  • python3 -m compileall mova/engine/trainer/accelerate/accelerate_trainer.py
  • git diff --check
