Fix RL LR-schedule regression: default schedule length to max_train_steps#4225
Open
py4 wants to merge 1 commit into
Open
Fix RL LR-schedule regression: default schedule length to max_train_steps#4225py4 wants to merge 1 commit into
py4 wants to merge 1 commit into
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…teps PR AI-Hypercomputer#4029 made get_optimizer (post_train/rl) size the LR warmup/decay schedule from learning_rate_schedule_steps, with a `<= 0` fallback to max_train_steps for the default (-1) case. That fallback is dead code: MaxTextConfig's validator (set_derived_and_validate_values) rewrites learning_rate_schedule_steps == -1 to `steps` (base.yml default 150_001) before get_optimizer runs. So a default RL run sizes warmup to 0.1 * 150_001 = 15_000 steps; on a 500-step run the LR never finishes warming up and is ~300x too low at the same step. rl.yml and rl_mt_jt.yml inherit these defaults and are affected. RL's run length is max_train_steps (num_batches * num_iterations * train_fraction * num_epoch), not `steps` (a pretraining concept). Default the schedule to max_train_steps and honor learning_rate_schedule_steps only when it diverges from `steps` (the validator makes them equal exactly when the user left it unset), which preserves the deliberate-override capability AI-Hypercomputer#4029 added. The change is RL-local; pretrain/SFT/DPO use create_learning_rate_schedule and are unaffected. Adds tests/post_training/unit/rl_lr_schedule_test.py, which builds the config through the real pyconfig path so the validator runs. The existing TestGetOptimizer uses a SimpleNamespace, which bypasses the validator and is why the regression shipped.
6df89c2 to
016a47f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
PR #4029 made
get_optimizer(post_train/rl) size the LR warmup/decay schedule fromlearning_rate_schedule_steps, with a<= 0fallback tomax_train_stepsfor the default (-1) case. That fallback is dead code:MaxTextConfig.set_derived_and_validate_valuesrewriteslearning_rate_schedule_steps == -1tosteps(base.yml default150_001) beforeget_optimizerruns. So a default RL run sizes warmup to0.1 * 150_001 = 15_000steps. On a 500-step run the LR never finishes warming up and is ~300x too low at the same step.rl.ymlandrl_mt_jt.ymlinherit these defaults and are affected.Fix
RL's run length is
max_train_steps(num_batches * num_iterations * train_fraction * num_epoch), notsteps(a pretraining concept). Default the schedule tomax_train_steps, and honorlearning_rate_schedule_stepsonly when it diverges fromsteps(the validator makes them equal exactly when the user left it unset). This restores correct default behavior while preserving the deliberate-override capability #4029 added. The change is RL-local; pretrain/SFT/DPO usecreate_learning_rate_schedule, wherestepsis the real run length, and are unaffected.Tests
Adds
tests/post_training/unit/rl_lr_schedule_test.py, which builds the config through the realpyconfig.initialize_pydanticpath so the validator runs. The existingTestGetOptimizeruses aSimpleNamespace, which bypasses the validator and is why the regression shipped.Verified on CPU against post-#4029 code: the regression test fails before the fix (
LR reached only 1.08e-08 by step 55 ... learning_rate_schedule_steps in effect=150001) and passes after; the fullrl_utils_test.py(28 tests) stays green.Checklist
learning_rate_schedule_stepsoverridetests/post_training/unit/rl_lr_schedule_test.py)