Fix ZeRO-3: Use per-param dtype for output buffers in _allgather_params_coalesced#8073
Open
albertvillanova wants to merge 1 commit into
Open
Fix ZeRO-3: Use per-param dtype for output buffers in _allgather_params_coalesced#8073albertvillanova wants to merge 1 commit into
albertvillanova wants to merge 1 commit into
Conversation
Signed-off-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
b6007b5 to
c43a5a5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR fixes the
_allgather_params_coalescedmethod inpartition_parameters.py. The change ensures that eachflat_tensoris created with the correct data type by referencing the corresponding parameter inparam_list, rather than always using the first parameter's data type.Fix #8072.
Problem
_allgather_params_coalescedallocates all output buffers using the dtype of the first parameter in param_list:This assumed every persistent parameter shares the same dtype. The assumption was incidentally maintained before 0.19.2 because
_configure_distributed_modelcalledmodule.bfloat16()unconditionally, normalising all persistent parameters (including PEFT LoRA adapters) to a uniform dtype.PR #8066 "Mixed-precision: per-policy param/buffer dtype cast (preserve fp32 buffers)" (commit b919284) correctly stopped casting ZeRO-Init model params, but exposed the latent bug: PEFT's default
autocast_adapter_dtype=Truekeeps LoRA adapters in fp32 even when the base model is bf16.persistent_parameterstherefore ends up with mixed dtypes (bf16 base-model params + fp32 LoRA params), and the mismatch between a bf16 output buffer and a fp32 input tensor raises:Solution
Allocate each output buffer with the dtype of its own parameter:
This removes the shared-dtype assumption at the source rather than relying on upstream callers to normalise dtypes before calling
_allgather_params_coalesced.Changes
_allgather_params_coalescedto use the data type of each parameter inparam_list, ensuring proper handling of mixed data types.