[bugfix] Fix Megatron model device placement when use_cpu_initialization is enabled #9446
ShiroNyaa wants to merge 2 commits into
Code Review
This pull request modifies the condition under which the model is moved to the CUDA device, changing it from when CPU initialization is disabled to when it is enabled. The reviewer suggests moving the model to the GPU unconditionally instead, which would align with Megatron-LM's standard behavior and ensure that all CPU-initialized parameters are correctly transferred to the GPU.
if args.use_cpu_initialization:
    m.cuda(torch.cuda.current_device())
Instead of conditionally moving the model to the CUDA device only when use_cpu_initialization is enabled, it is safer and more robust to move the model to the current CUDA device unconditionally. This aligns with Megatron-LM's standard behavior and ensures that any parameters initialized on the CPU (such as newly added adapter weights or embeddings) are correctly transferred to the GPU regardless of the initialization setting.
m.cuda(torch.cuda.current_device())
Let's remove this.
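For reference, the unconditional placement suggested in this thread could look roughly like the sketch below. The helper name `move_chunks_to_device` and the `model_chunks` argument are hypothetical and not part of this PR; the sketch only assumes each chunk exposes `cuda(device)`, as `torch.nn.Module` does, and that the real call site would pass `torch.cuda.current_device()`.

```python
from typing import Any, Iterable, List


def move_chunks_to_device(model_chunks: Iterable[Any], device_index: int) -> List[Any]:
    """Move every model chunk to the given CUDA device, with no condition.

    Hypothetical helper for illustration only. Each chunk is assumed to
    behave like torch.nn.Module, whose cuda(device) method moves parameters
    and buffers in place and returns the module itself.
    """
    moved = []
    for m in model_chunks:
        # No check on use_cpu_initialization: chunks that are already on the
        # GPU are unaffected, and CPU-initialized ones get transferred.
        moved.append(m.cuda(device_index))
    return moved
```

In a training script this would be called as `move_chunks_to_device(model, torch.cuda.current_device())` before the fp16/DDP wrapping step, assuming `model` is the list of modules being wrapped.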
PR type
PR information
Fix Megatron model device placement when use_cpu_initialization is enabled.
Previously, wrap_model only moved model modules to CUDA when args.use_cpu_initialization was disabled. With CPU initialization enabled, some modules could remain on CPU before fp16/DDP wrapping, which may cause device mismatch errors during training, especially in LoRA tuning where frozen base modules such as embeddings are still used in forward.
This change moves the model to the current CUDA device when args.use_cpu_initialization is enabled, avoiding CPU/GPU tensor mismatch during Megatron training.
Experiment results
Before this change, LoRA training with use_cpu_initialization=True could fail with errors like: