
fix(pt): base LambdaLR on configured start_lr #5434

Queued
OutisLi wants to merge 3 commits into deepmodeling:master from OutisLi:pr/lrfix

Conversation

Collaborator

@OutisLi OutisLi commented May 7, 2026

  • Fix PyTorch LR scheduler construction to use the configured start_lr as LambdaLR's base LR.
  • Reset optimizer param group initial_lr after loading restart checkpoints so stale checkpoint values do not scale resumed training.
  • Avoid double-counting start_step by letting last_epoch handle scheduler resume position.
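In sketch form, the new wiring looks roughly like the following (a simplified illustration, not the exact code in deepmd/pt/train/training.py; there the helper is a Trainer method and its signature may differ):

import torch

def _create_lr_scheduler(optimizer, lr_schedule, start_step):
    # Anchor every param group on the configured start_lr so stale
    # checkpoint values cannot rescale the resumed schedule.
    base_lr = lr_schedule.start_lr
    for group in optimizer.param_groups:
        group["initial_lr"] = base_lr
    # LambdaLR multiplies initial_lr by the returned factor, so dividing by
    # base_lr reproduces lr_schedule.value(step). last_epoch=start_step-1
    # resumes the scheduler at start_step without adding start_step again
    # inside the lambda.
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda step: lr_schedule.value(step) / base_lr,
        last_epoch=start_step - 1,
    )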

Summary by CodeRabbit

  • Refactor
    • Improved internal learning-rate scheduler construction for more robust and maintainable training behavior.
  • Bug Fixes
    • Added runtime validation to reject invalid starting learning-rate values, preventing training misconfiguration.
  • Tests
    • Added a resume/restart learning-rate test and a unit test ensuring invalid start values are rejected; fixed a corrupted test.

Copilot AI review requested due to automatic review settings May 7, 2026 03:50
@github-actions github-actions Bot added the Python label May 7, 2026
@dosubot dosubot Bot added the bug label May 7, 2026
Contributor

coderabbitai Bot commented May 7, 2026


📝 Walkthrough


Adds start_lr positivity/finiteness validation and reorders BaseLR initialization, extracts Trainer's LR-scheduler creation into _create_lr_scheduler (sets per-group initial_lr and returns a LambdaLR), updates Trainer to use it, and adds/repairs tests for restart and invalid start_lr values.

Changes

Learning rate: validation, factory, and tests

  • BaseLR validation & init (deepmd/dpmodel/utils/learning_rate.py): BaseLR.__init__ now checks that start_lr is positive and finite, derives stop_lr, computes warmup_steps/warmup_start_lr, validates step ranges, and stores the core schedule fields.
  • WSD constructor (deepmd/dpmodel/utils/learning_rate.py): LearningRateWSD.__init__ no longer rechecks start_lr; it relies on BaseLR's validation while keeping the WSD-specific validations.
  • Scheduler factory method (deepmd/pt/train/training.py): The new Trainer._create_lr_scheduler sets initial_lr on each optimizer param group from lr_schedule.start_lr and returns a LambdaLR using lr_schedule.value(step) / base_lr, with last_epoch derived from start_step.
  • Scheduler integration (deepmd/pt/train/training.py): The Trainer replaces the inline LambdaLR construction with a call to _create_lr_scheduler(optimizer, lr_schedule, start_step).
  • Integration test, restart (source/tests/pt/test_training.py): Adds TestLearningRateRestart, which runs a short training, edits the saved checkpoint's initial_lr, resumes, and asserts that the scheduler's get_last_lr() matches lr_schedule.value(...) at restart and after extended steps.
  • Unit tests, validation & fixes (source/tests/universal/dpmodel/utils/test_learning_rate.py): Fixes corrupted test lines and adds test_rejects_nonpositive_or_nonfinite_start_lr, asserting that LearningRateExp raises ValueError mentioning "start_lr" for invalid values.
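The validation itself amounts to an early check in the constructor; a minimal sketch (the constructor arguments other than start_lr are placeholders, not the actual BaseLR signature):

import math

class BaseLR:
    def __init__(self, start_lr, stop_lr, decay_steps, stop_steps):
        # Fail fast on a misconfigured start_lr (zero, negative, NaN, inf)
        # before it can propagate into the schedule.
        if not math.isfinite(start_lr) or start_lr <= 0.0:
            raise ValueError(
                f"start_lr must be a positive finite number, got {start_lr!r}"
            )
        self.start_lr = start_lr
        self.stop_lr = stop_lr
        self.decay_steps = decay_steps
        self.stop_steps = stop_steps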

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 44.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title accurately summarizes the main change: refactoring the LambdaLR scheduler to use the configured start_lr as its base, which is the core fix across all modified files.
  • Linked Issues Check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ Passed: Check skipped because no linked issues were found for this pull request.






Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts PyTorch learning-rate scheduler initialization during (re)start and finetune so that LambdaLR scaling is anchored to the configured start_lr, and checkpoint-resume behavior does not inadvertently rescale training due to stale initial_lr values or double-counted start_step.

Changes:

  • Replace inline LambdaLR construction with a helper that bases LambdaLR on lr_schedule.start_lr.
  • Reset optimizer param-group initial_lr after loading optimizer state, preventing stale checkpoint values from influencing resumed LR scaling.
  • Use last_epoch=start_step-1 and remove extra + start_step in the lambda to avoid double-counting resume position.


Comment thread: deepmd/pt/train/training.py

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.47%. Comparing base (0a481de) to head (aad461e).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5434   +/-   ##
=======================================
  Coverage   82.47%   82.47%           
=======================================
  Files         825      825           
  Lines       87721    87727    +6     
  Branches     4206     4206           
=======================================
+ Hits        72344    72355   +11     
+ Misses      14093    14089    -4     
+ Partials     1284     1283    -1     


Collaborator

@wanghan-iapcm wanghan-iapcm left a comment


No regression test. It would be better to provide a test that:

  • Constructs a Trainer, runs N steps, saves checkpoint.
  • Resumes from that checkpoint, runs M more steps.
  • Asserts scheduler.get_last_lr()[0] at step N+k matches lr_schedule.value(N+k) to high tolerance.
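The numeric core of that assertion can be illustrated standalone, without a full Trainer run (using a stand-in exponential schedule here rather than deepmd's LearningRateExp):

import torch

def schedule(step, start_lr=1e-3, stop_lr=1e-5, decay_steps=10, stop_steps=100):
    # Stand-in for lr_schedule.value(step).
    decay_rate = (stop_lr / start_lr) ** (decay_steps / stop_steps)
    return start_lr * decay_rate ** (step // decay_steps)

N, M = 20, 30  # steps before the "checkpoint" and after resuming
base_lr = 1e-3
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=base_lr)
opt.param_groups[0]["initial_lr"] = base_lr

# Resume the scheduler at step N via last_epoch, then run M more steps.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: schedule(step) / base_lr, last_epoch=N - 1
)
for _ in range(M):
    opt.step()
    sched.step()

assert abs(sched.get_last_lr()[0] - schedule(N + M)) < 1e-12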

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
source/tests/pt/test_training.py (1)

774-834: ⚡ Quick win

Add a 60s timeout guard for this training test.

This test exercises real train/restart paths and should be bounded to avoid CI hangs.

⏱️ Proposed patch
 import numpy as np
 import torch
+import pytest
@@
-class TestLearningRateRestart(unittest.TestCase):
+@pytest.mark.timeout(60)
+class TestLearningRateRestart(unittest.TestCase):

As per the coding guidelines for **/tests/**/*training*.py: set training-test timeouts to 60 seconds maximum for validation purposes, as real training takes hours or days.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@source/tests/pt/test_training.py` around lines 774 - 834, Add a 60s timeout
to the test method so the real training/restart path cannot hang CI: import
pytest at top of the file and decorate
TestLearningRateRestart.test_restart_scheduler_matches_lr_schedule with
`@pytest.mark.timeout`(60) (ensuring pytest is available in the test run). This
keeps the test implementation (calls to get_trainer, trainer.run,
restart_trainer.run, etc.) unchanged but bounds its total execution to 60
seconds.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 66b088c and aad461e.

📒 Files selected for processing (3)
  • deepmd/dpmodel/utils/learning_rate.py
  • source/tests/pt/test_training.py
  • source/tests/universal/dpmodel/utils/test_learning_rate.py

@OutisLi OutisLi requested a review from wanghan-iapcm May 7, 2026 15:02
Member

@iProzd iProzd left a comment


The double-offset bug was introduced in #5154, which added last_epoch=start_step-1 but kept the + start_step in the lambda.

So the unit test for restart training was actually missing in #5154. The old version used last_epoch=-1 together with + start_step in the lambda, which was correct.
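Concretely, with a stand-in schedule (a standalone illustration, not the literal lines from training.py):

import math
import torch

start_step, base_lr = 100, 1e-3

def value(step):
    # Stand-in for lr_schedule.value(step).
    return base_lr * 0.95 ** (step // 10)

def make_opt():
    opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    opt.param_groups[0]["initial_lr"] = base_lr
    return opt

# #5154 wiring: last_epoch already resumes at start_step AND the lambda adds
# start_step again, so the restart LR corresponds to 2 * start_step.
buggy = torch.optim.lr_scheduler.LambdaLR(
    make_opt(), lambda step: value(step + start_step) / base_lr,
    last_epoch=start_step - 1,
)
assert math.isclose(buggy.get_last_lr()[0], value(2 * start_step))

# This PR: the lambda takes the absolute step and last_epoch alone carries
# the resume position, so the restart LR matches the configured schedule.
fixed = torch.optim.lr_scheduler.LambdaLR(
    make_opt(), lambda step: value(step) / base_lr, last_epoch=start_step - 1
)
assert math.isclose(fixed.get_last_lr()[0], value(start_step))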

Contributor

@njzjz-bot njzjz-bot left a comment


The scheduler resume logic now looks consistent to me:

  • LambdaLR is anchored on lr_schedule.start_lr, so resumed training no longer inherits stale checkpoint initial_lr scaling.
  • The lambda receives the absolute scheduler step via last_epoch=start_step-1, avoiding the previous double start_step offset.
  • The new restart test covers both stale checkpoint initial_lr and post-resume LR alignment.

I also checked git diff --check and a lightweight isolated validation of the new invalid-start_lr path. I could not run the full project pytest locally because this checkout's uv dependency resolution currently hits a CPU/GPU PyTorch group conflict, but CI is green.

— OpenClaw 2026.4.22 (model: gpt-5.5)

@njzjz njzjz added this pull request to the merge queue May 8, 2026
