
fix(pt): base LambdaLR on configured start_lr #5434

Queued
OutisLi wants to merge 3 commits into deepmodeling:master from OutisLi:pr/lrfix

Conversation

Collaborator

@OutisLi OutisLi commented May 7, 2026

  • Fix PyTorch LR scheduler construction to use the configured start_lr as LambdaLR's base LR.
  • Reset optimizer param group initial_lr after loading restart checkpoints so stale checkpoint values do not scale resumed training.
  • Avoid double-counting start_step by letting last_epoch handle scheduler resume position.
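In sketch form, the new wiring looks roughly like the following (a simplified illustration, not the exact code in deepmd/pt/train/training.py; there the helper is a Trainer method and its signature may differ):

import torch

def _create_lr_scheduler(optimizer, lr_schedule, start_step):
    # Anchor every param group on the configured start_lr so stale
    # checkpoint values cannot rescale the resumed schedule.
    base_lr = lr_schedule.start_lr
    for group in optimizer.param_groups:
        group["initial_lr"] = base_lr
    # LambdaLR multiplies initial_lr by the returned factor, so dividing by
    # base_lr reproduces lr_schedule.value(step). last_epoch=start_step-1
    # resumes the scheduler at start_step without adding start_step again
    # inside the lambda.
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda step: lr_schedule.value(step) / base_lr,
        last_epoch=start_step - 1,
    )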

Summary by CodeRabbit

  • Refactor
    • Improved internal learning-rate scheduler construction for more robust and maintainable training behavior.
  • Bug Fixes
    • Added runtime validation to reject invalid starting learning-rate values, preventing training misconfiguration.
  • Tests
    • Added a resume/restart learning-rate test and a unit test ensuring invalid start values are rejected; fixed a corrupted test.

Copilot AI review requested due to automatic review settings May 7, 2026 03:50
@github-actions github-actions Bot added the Python label May 7, 2026
@dosubot dosubot Bot added the bug label May 7, 2026
Contributor

coderabbitai Bot commented May 7, 2026


📝 Walkthrough


Adds start_lr positivity/finiteness validation and reorders BaseLR initialization, extracts Trainer's LR-scheduler creation into _create_lr_scheduler (sets per-group initial_lr and returns a LambdaLR), updates Trainer to use it, and adds/repairs tests for restart and invalid start_lr values.

Changes

Learning rate: validation, factory, and tests

  • BaseLR validation & init (deepmd/dpmodel/utils/learning_rate.py): BaseLR.__init__ now checks that start_lr is positive and finite, derives stop_lr, computes warmup_steps/warmup_start_lr, validates step ranges, and stores the core schedule fields.
  • WSD constructor (deepmd/dpmodel/utils/learning_rate.py): LearningRateWSD.__init__ no longer rechecks start_lr; it relies on BaseLR's validation while keeping the WSD-specific validations.
  • Scheduler factory method (deepmd/pt/train/training.py): The new Trainer._create_lr_scheduler sets initial_lr on each optimizer param group from lr_schedule.start_lr and returns a LambdaLR using lr_schedule.value(step) / base_lr, with last_epoch derived from start_step.
  • Scheduler integration (deepmd/pt/train/training.py): The Trainer replaces the inline LambdaLR construction with a call to _create_lr_scheduler(optimizer, lr_schedule, start_step).
  • Integration test, restart (source/tests/pt/test_training.py): Adds TestLearningRateRestart, which runs a short training, edits the saved checkpoint's initial_lr, resumes, and asserts that the scheduler's get_last_lr() matches lr_schedule.value(...) at restart and after extended steps.
  • Unit tests, validation & fixes (source/tests/universal/dpmodel/utils/test_learning_rate.py): Fixes corrupted test lines and adds test_rejects_nonpositive_or_nonfinite_start_lr, asserting that LearningRateExp raises ValueError mentioning "start_lr" for invalid values.
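The validation itself amounts to an early check in the constructor; a minimal sketch (the constructor arguments other than start_lr are placeholders, not the actual BaseLR signature):

import math

class BaseLR:
    def __init__(self, start_lr, stop_lr, decay_steps, stop_steps):
        # Fail fast on a misconfigured start_lr (zero, negative, NaN, inf)
        # before it can propagate into the schedule.
        if not math.isfinite(start_lr) or start_lr <= 0.0:
            raise ValueError(
                f"start_lr must be a positive finite number, got {start_lr!r}"
            )
        self.start_lr = start_lr
        self.stop_lr = stop_lr
        self.decay_steps = decay_steps
        self.stop_steps = stop_steps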

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 44.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title accurately summarizes the main change: refactoring the LambdaLR scheduler to use the configured start_lr as its base, which is the core fix across all modified files.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.






Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts PyTorch learning-rate scheduler initialization during (re)start and finetune so that LambdaLR scaling is anchored to the configured start_lr, and checkpoint-resume behavior does not inadvertently rescale training due to stale initial_lr values or double-counted start_step.

Changes:

  • Replace inline LambdaLR construction with a helper that bases LambdaLR on lr_schedule.start_lr.
  • Reset optimizer param-group initial_lr after loading optimizer state, preventing stale checkpoint values from influencing resumed LR scaling.
  • Use last_epoch=start_step-1 and remove extra + start_step in the lambda to avoid double-counting resume position.


Comment thread: deepmd/pt/train/training.py

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.47%. Comparing base (0a481de) to head (aad461e).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5434   +/-   ##
=======================================
  Coverage   82.47%   82.47%           
=======================================
  Files         825      825           
  Lines       87721    87727    +6     
  Branches     4206     4206           
=======================================
+ Hits        72344    72355   +11     
+ Misses      14093    14089    -4     
+ Partials     1284     1283    -1     


Collaborator

@wanghan-iapcm wanghan-iapcm left a comment


No regression test. It would be better to provide a test that:

  • Constructs a Trainer, runs N steps, saves checkpoint.
  • Resumes from that checkpoint, runs M more steps.
  • Asserts scheduler.get_last_lr()[0] at step N+k matches lr_schedule.value(N+k) to high tolerance.
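The numeric core of that assertion can be illustrated standalone, without a full Trainer run (using a stand-in exponential schedule here rather than deepmd's LearningRateExp):

import torch

def schedule(step, start_lr=1e-3, stop_lr=1e-5, decay_steps=10, stop_steps=100):
    # Stand-in for lr_schedule.value(step).
    decay_rate = (stop_lr / start_lr) ** (decay_steps / stop_steps)
    return start_lr * decay_rate ** (step // decay_steps)

N, M = 20, 30  # steps before the "checkpoint" and after resuming
base_lr = 1e-3
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=base_lr)
opt.param_groups[0]["initial_lr"] = base_lr

# Resume the scheduler at step N via last_epoch, then run M more steps.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: schedule(step) / base_lr, last_epoch=N - 1
)
for _ in range(M):
    opt.step()
    sched.step()

assert abs(sched.get_last_lr()[0] - schedule(N + M)) < 1e-12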

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
source/tests/pt/test_training.py (1)

774-834: ⚡ Quick win

Add a 60s timeout guard for this training test.

This test exercises real train/restart paths and should be bounded to avoid CI hangs.

⏱️ Proposed patch
 import numpy as np
 import torch
+import pytest
@@
-class TestLearningRateRestart(unittest.TestCase):
+@pytest.mark.timeout(60)
+class TestLearningRateRestart(unittest.TestCase):

As per the coding guidelines for **/tests/**/*training*.py: set training-test timeouts to 60 seconds maximum for validation purposes, as real training takes hours or days.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt/test_training.py` around lines 774 - 834, Add a 60s timeout
to the test method so the real training/restart path cannot hang CI: import
pytest at top of the file and decorate
TestLearningRateRestart.test_restart_scheduler_matches_lr_schedule with
`@pytest.mark.timeout`(60) (ensuring pytest is available in the test run). This
keeps the test implementation (calls to get_trainer, trainer.run,
restart_trainer.run, etc.) unchanged but bounds its total execution to 60
seconds.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 66b088c and aad461e.

📒 Files selected for processing (3)
  • deepmd/dpmodel/utils/learning_rate.py
  • source/tests/pt/test_training.py
  • source/tests/universal/dpmodel/utils/test_learning_rate.py

@OutisLi OutisLi requested a review from wanghan-iapcm May 7, 2026 15:02
Member

@iProzd iProzd left a comment


The double-offset bug was introduced in #5154, which added last_epoch=start_step-1 but kept the + start_step in the lambda.

So the unit test for restart training was actually missing in #5154. The old version used last_epoch=-1 together with + start_step in the lambda, which was correct.
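Concretely, with a stand-in schedule (a standalone illustration, not the literal lines from training.py):

import math
import torch

start_step, base_lr = 100, 1e-3

def value(step):
    # Stand-in for lr_schedule.value(step).
    return base_lr * 0.95 ** (step // 10)

def make_opt():
    opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    opt.param_groups[0]["initial_lr"] = base_lr
    return opt

# #5154 wiring: last_epoch already resumes at start_step AND the lambda adds
# start_step again, so the restart LR corresponds to 2 * start_step.
buggy = torch.optim.lr_scheduler.LambdaLR(
    make_opt(), lambda step: value(step + start_step) / base_lr,
    last_epoch=start_step - 1,
)
assert math.isclose(buggy.get_last_lr()[0], value(2 * start_step))

# This PR: the lambda takes the absolute step and last_epoch alone carries
# the resume position, so the restart LR matches the configured schedule.
fixed = torch.optim.lr_scheduler.LambdaLR(
    make_opt(), lambda step: value(step) / base_lr, last_epoch=start_step - 1
)
assert math.isclose(fixed.get_last_lr()[0], value(start_step))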

Contributor

@njzjz-bot njzjz-bot left a comment


The scheduler resume logic now looks consistent to me:

  • LambdaLR is anchored on lr_schedule.start_lr, so resumed training no longer inherits stale checkpoint initial_lr scaling.
  • The lambda receives the absolute scheduler step via last_epoch=start_step-1, avoiding the previous double start_step offset.
  • The new restart test covers both stale checkpoint initial_lr and post-resume LR alignment.

I also checked git diff --check and a lightweight isolated validation of the new invalid-start_lr path. I could not run the full project pytest locally because this checkout's uv dependency resolution currently hits a CPU/GPU PyTorch group conflict, but CI is green.

— OpenClaw 2026.4.22 (model: gpt-5.5)

@njzjz njzjz added this pull request to the merge queue May 8, 2026
