Fix RL LR-schedule regression: default schedule length to max_train_steps #4225

Open
py4 wants to merge 1 commit into AI-Hypercomputer:main from py4:fix/rl-lr-schedule-steps

py4 (Collaborator) commented Jun 22, 2026

What

PR #4029 made get_optimizer (post_train/rl) size the LR warmup/decay schedule from learning_rate_schedule_steps, with a <= 0 fallback to max_train_steps for the default (-1) case. That fallback is dead code: MaxTextConfig.set_derived_and_validate_values rewrites learning_rate_schedule_steps == -1 to steps (base.yml default 150_001) before get_optimizer runs, so get_optimizer never sees a value <= 0. As a result, a default RL run sizes warmup to 0.1 * 150_001 ≈ 15_000 steps. On a 500-step run the LR never finishes warming up and is ~300x lower than intended at the same step (a 15_000-step warmup instead of roughly 0.1 * 500 = 50 steps). rl.yml and rl_mt_jt.yml inherit these defaults and are affected.
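
The ~300x figure can be reproduced with a small sketch. It assumes a linear warmup from 0 to the peak LR over 10% of the schedule length; the actual schedule shape in get_optimizer may differ. `warmup_lr` and the peak value of 1e-6 are hypothetical, chosen only for illustration.

```python
def warmup_lr(step: int, schedule_steps: int, peak_lr: float = 1e-6,
              warmup_fraction: float = 0.1) -> float:
    """Linear warmup from 0 to peak_lr (assumed shape, illustration only)."""
    warmup_steps = int(warmup_fraction * schedule_steps)
    if warmup_steps <= 0:
        return peak_lr
    # Ramp linearly, then hold at peak_lr (decay is omitted in this sketch).
    return peak_lr * min(step, warmup_steps) / warmup_steps


step = 25  # a step that lies inside both warmup windows
intended = warmup_lr(step, schedule_steps=500)       # warmup spans 50 steps
regressed = warmup_lr(step, schedule_steps=150_001)  # warmup spans 15_000 steps
ratio = intended / regressed                         # roughly 15_000 / 50

# Even at the last step of a 500-step run, the regressed LR is below peak.
final_regressed = warmup_lr(500, schedule_steps=150_001)
print(f"intended/regressed at step {step}: {ratio:.1f}")
print(f"regressed LR at step 500 as a fraction of peak: {final_regressed / 1e-6:.4f}")
```

Under these assumptions the ratio depends only on the two warmup lengths, which is why it holds at any step inside both warmup windows.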

Fix

RL's run length is max_train_steps (num_batches * num_iterations * train_fraction * num_epoch), not steps (a pretraining concept). Default the schedule to max_train_steps, and honor learning_rate_schedule_steps only when it diverges from steps (the validator makes them equal exactly when the user left it unset). This restores correct default behavior while preserving the deliberate-override capability #4029 added. The change is RL-local; pretrain/SFT/DPO use create_learning_rate_schedule, where steps is the real run length, and are unaffected.
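
The selection rule described above can be sketched as a small pure function. `resolve_rl_schedule_steps` is a hypothetical name used for illustration, not the function in this PR; it assumes the validator has already rewritten -1 to steps by the time it is called.

```python
def resolve_rl_schedule_steps(learning_rate_schedule_steps: int,
                              steps: int,
                              max_train_steps: int) -> int:
    """Hypothetical helper: choose the LR schedule length for an RL run."""
    # After validation, an unset knob equals `steps`, so only a value that
    # diverges from `steps` is treated as a deliberate user override.
    user_overrode = (
        learning_rate_schedule_steps > 0
        and learning_rate_schedule_steps != steps
    )
    return learning_rate_schedule_steps if user_overrode else max_train_steps


# Default config: the knob was rewritten to `steps`, so track the RL run length.
default_len = resolve_rl_schedule_steps(150_001, steps=150_001, max_train_steps=500)
# Explicit override: the knob diverges from `steps`, so honor it.
override_len = resolve_rl_schedule_steps(2_000, steps=150_001, max_train_steps=500)
```

One consequence of this rule, as sketched here, is that a user who explicitly sets learning_rate_schedule_steps to the same value as steps is indistinguishable from one who left it unset.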

Tests

Adds tests/post_training/unit/rl_lr_schedule_test.py, which builds the config through the real pyconfig.initialize_pydantic path so the validator runs. The existing TestGetOptimizer uses a SimpleNamespace, which bypasses the validator and is why the regression shipped.
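
A toy illustration of why a SimpleNamespace-based test cannot catch this class of bug. `ToyConfig` is a hypothetical stand-in for the real config class, with `__post_init__` mimicking only the -1 rewrite described above; it is not MaxText code.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a validated config (not MaxTextConfig)."""
    steps: int = 150_001
    learning_rate_schedule_steps: int = -1

    def __post_init__(self):
        # Mimics the validator: an unset (-1) knob is rewritten to `steps`.
        if self.learning_rate_schedule_steps == -1:
            self.learning_rate_schedule_steps = self.steps


# Built through the "real" path: the rewrite runs, so downstream code sees 150_001.
validated = ToyConfig()
# Built as a bare namespace: no validator runs, so downstream code still sees -1
# and a `<= 0` fallback appears to work in the test.
mocked = SimpleNamespace(steps=150_001, learning_rate_schedule_steps=-1)

print(validated.learning_rate_schedule_steps, mocked.learning_rate_schedule_steps)
```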

Verified on CPU against post-#4029 code: the regression test fails before the fix (LR reached only 1.08e-08 by step 55 ... learning_rate_schedule_steps in effect=150001) and passes after; the full rl_utils_test.py (28 tests) stays green.

Checklist

  • Restores pre-#4029 default behavior (schedule tracks the RL run length)
  • Preserves explicit learning_rate_schedule_steps override
  • New CPU-only unit test added (tests/post_training/unit/rl_lr_schedule_test.py)
  • No change to non-RL paths (pretrain/SFT/DPO unaffected)
  • pyink and pylint clean

google-cla Bot commented Jun 22, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

…teps

PR AI-Hypercomputer#4029 made get_optimizer (post_train/rl) size the LR warmup/decay schedule
from learning_rate_schedule_steps, with a `<= 0` fallback to max_train_steps for
the default (-1) case. That fallback is dead code: MaxTextConfig's validator
(set_derived_and_validate_values) rewrites learning_rate_schedule_steps == -1 to
`steps` (base.yml default 150_001) before get_optimizer runs. So a default RL
run sizes warmup to 0.1 * 150_001 = 15_000 steps; on a 500-step run the LR never
finishes warming up and is ~300x too low at the same step. rl.yml and
rl_mt_jt.yml inherit these defaults and are affected.

RL's run length is max_train_steps (num_batches * num_iterations *
train_fraction * num_epoch), not `steps` (a pretraining concept). Default the
schedule to max_train_steps and honor learning_rate_schedule_steps only when it
diverges from `steps` (the validator makes them equal exactly when the user left
it unset), which preserves the deliberate-override capability AI-Hypercomputer#4029 added. The
change is RL-local; pretrain/SFT/DPO use create_learning_rate_schedule and are
unaffected.

Adds tests/post_training/unit/rl_lr_schedule_test.py, which builds the config
through the real pyconfig path so the validator runs. The existing
TestGetOptimizer uses a SimpleNamespace, which bypasses the validator and is why
the regression shipped.

py4 force-pushed the fix/rl-lr-schedule-steps branch from 6df89c2 to 016a47f on June 22, 2026 20:38