Refactor megatron to mcore_bridge#134
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the Megatron-Core integration by offloading model configuration, creation, and weight loading logic to the mcore_bridge dependency, which allows for the removal of significant internal boilerplate code. However, the review highlights several critical issues introduced during the refactoring: the removal of the _BASE_LAYER_SUFFIXES constant and the self.hf_config attribute will lead to runtime errors since they are still referenced in the codebase. Furthermore, the send_weights method contains a NameError due to the use of an undefined args variable, which should be replaced with values from the strategy configuration.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the Megatron-Core integration by delegating model configuration and creation to a new MegatronStrategy class and the mcore_bridge library, while removing the internal args.py and the TorchSampler module. The review identifies a critical AttributeError in multi_lora.py where get_target_modules is called on the wrong class. Additionally, the review highlights that the refactored _add_base_layer_suffix is now too generic and may incorrectly modify non-LoRA layer names, and that MultiLoraMegatron initializes parameter configurations without an optimizer, which could prevent the setup of training-specific features like gradient reduction overlap.
… into feat/mbridge
… into feat/mbridge
… into feat/mbridge
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a GRPO training script for the OlympiadBench dataset, updates the lazy dataset implementation, and includes several improvements to the checkpointing and reward computation logic. My feedback highlights an unsafe mutation of _model_keys during iteration, a likely bug in the packing detection logic, inefficient instantiation of reward classes, and excessive logging that could impact performance.
… into feat/mbridge
PR type
PR information
Write the detail information belongs to this PR.
Experiment results
Paste your experiment result here(if needed).