fix(pt): base LambdaLR on configured start_lr #5434

OutisLi wants to merge 3 commits into deepmodeling:master
Conversation
📝 Walkthrough

Adds start_lr positivity/finiteness validation and reorders BaseLR initialization, extracts the Trainer's LR-scheduler creation into _create_lr_scheduler (which sets per-group initial_lr and returns a LambdaLR), updates the Trainer to use it, and adds/repairs tests for restart behavior and invalid start_lr values.

Changes: learning-rate validation, factory, and tests.
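A minimal sketch of the validation described in the walkthrough, assuming a BaseLR constructor of roughly this shape (the real class in deepmd/dpmodel/utils/learning_rate.py may take different arguments):

```python
import math


class BaseLR:
    """Sketch only; every argument name besides start_lr is an assumption."""

    def __init__(self, start_lr, stop_lr, decay_steps, stop_steps):
        # Validate start_lr before computing anything derived from it, so a
        # zero, negative, or NaN value fails fast instead of silently
        # producing a NaN or infinite decay rate later on.
        if not math.isfinite(start_lr) or start_lr <= 0.0:
            raise ValueError(
                f"start_lr must be a positive finite number, got {start_lr}"
            )
        self.start_lr = start_lr
        self.stop_lr = stop_lr
        self.decay_steps = decay_steps
        self.stop_steps = stop_steps
```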
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Pull request overview
This PR adjusts PyTorch learning-rate scheduler initialization during (re)start and finetune so that LambdaLR scaling is anchored to the configured start_lr, and checkpoint-resume behavior does not inadvertently rescale training due to stale initial_lr values or double-counted start_step.
Changes:
- Replace the inline LambdaLR construction with a helper that bases LambdaLR on lr_schedule.start_lr (see the sketch after this list).
- Reset the optimizer param-group initial_lr after loading optimizer state, preventing stale checkpoint values from influencing resumed LR scaling.
- Use last_epoch=start_step-1 and remove the extra + start_step in the lambda to avoid double-counting the resume position.
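A minimal sketch of what such a helper could look like, assuming the schedule object exposes start_lr and a value(step) method; the actual _create_lr_scheduler in this PR may differ in signature and placement:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR


def _create_lr_scheduler(optimizer, lr_schedule, start_step):
    # Anchor every param group to the configured start_lr so that a stale
    # "initial_lr" loaded from a checkpoint cannot rescale the resumed schedule.
    for group in optimizer.param_groups:
        group["initial_lr"] = lr_schedule.start_lr

    def _lr_lambda(step):
        # LambdaLR multiplies initial_lr by this factor; dividing by start_lr
        # makes the product equal lr_schedule.value(step).
        return lr_schedule.value(step) / lr_schedule.start_lr

    # last_epoch=start_step-1 makes LambdaLR hand the absolute step to the
    # lambda on resume, so no extra "+ start_step" offset is needed inside it.
    return LambdaLR(optimizer, _lr_lambda, last_epoch=start_step - 1)
```

Note that PyTorch requires initial_lr to be present in each param group whenever last_epoch is not -1, which is exactly what the loop above guarantees.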
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           master    #5434    +/-   ##
=======================================
  Coverage   82.47%   82.47%
=======================================
  Files         825      825
  Lines       87721    87727     +6
  Branches     4206     4206
=======================================
+ Hits        72344    72355    +11
+ Misses      14093    14089     -4
+ Partials     1284     1283     -1
```

☔ View full report in Codecov by Sentry.
wanghan-iapcm left a comment:
No regression test. Better to provide a test that:

- constructs a Trainer, runs N steps, and saves a checkpoint;
- resumes from that checkpoint and runs M more steps;
- asserts that scheduler.get_last_lr()[0] at step N+k matches lr_schedule.value(N+k) to high tolerance (a sketch of such a test follows this list).
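A self-contained sketch of the requested regression test, using a toy exponential schedule and a simulated restart in place of the real Trainer; every name here is illustrative, not the project's actual test code:

```python
import math
import unittest

import torch
from torch.optim.lr_scheduler import LambdaLR


class ToySchedule:
    """Stand-in for lr_schedule; the exponential decay form is assumed."""

    def __init__(self, start_lr=1e-3, stop_lr=1e-5, stop_steps=100):
        self.start_lr = start_lr
        self.decay = math.log(stop_lr / start_lr) / stop_steps

    def value(self, step):
        return self.start_lr * math.exp(self.decay * step)


class TestRestartMatchesSchedule(unittest.TestCase):
    def test_resume(self):
        sched = ToySchedule()
        n, m = 10, 5  # steps before / after the simulated restart

        def make_scheduler(start_step):
            opt = torch.optim.SGD(
                [torch.nn.Parameter(torch.zeros(1))], lr=sched.start_lr
            )
            for group in opt.param_groups:
                group["initial_lr"] = sched.start_lr
            return opt, LambdaLR(
                opt,
                lambda s: sched.value(s) / sched.start_lr,
                last_epoch=start_step - 1,
            )

        opt, scheduler = make_scheduler(start_step=0)
        for _ in range(n):
            opt.step()
            scheduler.step()

        # Simulated restart at step n: the resumed scheduler must agree with
        # the analytic schedule at every subsequent step.
        opt2, resumed = make_scheduler(start_step=n)
        for k in range(1, m + 1):
            opt2.step()
            resumed.step()
            self.assertAlmostEqual(
                resumed.get_last_lr()[0], sched.value(n + k), places=12
            )


if __name__ == "__main__":
    unittest.main()
```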
🧹 Nitpick comments (1)
source/tests/pt/test_training.py (1)
774-834: ⚡ Quick win: add a 60s timeout guard for this training test.
This test exercises real train/restart paths and should be bounded to avoid CI hangs.
⏱️ Proposed patch:

```diff
 import numpy as np
 import torch
+import pytest
@@
-class TestLearningRateRestart(unittest.TestCase):
+@pytest.mark.timeout(60)
+class TestLearningRateRestart(unittest.TestCase):
```

As per coding guidelines (`**/tests/**/*training*.py`): set training test timeouts to 60 seconds maximum for validation purposes, as real training takes hours or days.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@source/tests/pt/test_training.py` around lines 774-834: add a 60s timeout to the test method so the real training/restart path cannot hang CI. Import pytest at the top of the file and decorate TestLearningRateRestart.test_restart_scheduler_matches_lr_schedule with `@pytest.mark.timeout(60)` (ensuring pytest is available in the test run). This keeps the test implementation (calls to get_trainer, trainer.run, restart_trainer.run, etc.) unchanged but bounds its total execution to 60 seconds.
ℹ️ Review info: configuration from repository UI, review profile CHILL, plan Pro (run ID 8bbf5b25-33cb-4d55-8308-f0d8c6a1bb8c).
📒 Files selected for processing (3):

- deepmd/dpmodel/utils/learning_rate.py
- source/tests/pt/test_training.py
- source/tests/universal/dpmodel/utils/test_learning_rate.py
njzjz-bot left a comment:
The scheduler resume logic now looks consistent to me:

- LambdaLR is anchored on lr_schedule.start_lr, so resumed training no longer inherits stale checkpoint initial_lr scaling.
- The lambda receives the absolute scheduler step via last_epoch=start_step-1, avoiding the previous double start_step offset (see the sketch after this list).
- The new restart test covers both the stale checkpoint initial_lr and post-resume LR alignment.
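A minimal demonstration of that last_epoch anchoring, with a toy schedule (the value function and constants here are assumptions, not the project's code):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

start_lr, start_step = 1e-3, 100            # illustrative values
value = lambda step: start_lr * 0.99**step  # toy stand-in for lr_schedule.value

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=start_lr)
for group in opt.param_groups:
    group["initial_lr"] = start_lr  # overwrite any stale checkpoint value

# With last_epoch=start_step-1, LambdaLR immediately evaluates the lambda at
# the absolute step start_step, so the lambda needs no "+ start_step" offset.
scheduler = LambdaLR(opt, lambda s: value(s) / start_lr, last_epoch=start_step - 1)
assert abs(scheduler.get_last_lr()[0] - value(start_step)) < 1e-15
```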
I also ran git diff --check and a lightweight isolated validation of the new invalid-start_lr path. I could not run the full project pytest locally because this checkout's uv dependency resolution currently hits a CPU/GPU PyTorch group conflict, but CI is green.
— OpenClaw 2026.4.22 (model: gpt-5.5)
Summary by CodeRabbit

- Use start_lr as LambdaLR's base LR.
- Reset initial_lr after loading restart checkpoints so stale checkpoint values do not scale resumed training.
- Avoid double-counting start_step by letting last_epoch handle the scheduler resume position.