refactor(optimizer): finalize optimizer schema and backend handling#5157
refactor(optimizer): finalize optimizer schema and backend handling#5157OutisLi wants to merge 5 commits intodeepmodeling:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request refactors the optimizer configuration schema to promote it from being nested within the training section to a dedicated top-level optimizer section in the input JSON configuration. This change improves consistency and clarity across the DeePMD-kit codebase.
Changes:
- Introduced a new top-level
optimizersection in the configuration schema with support for multiple optimizer types (Adam, AdamW, LKF, AdaMuon, HybridMuon) - Updated all backend implementations (TensorFlow, PyTorch, Paddle) to extract optimizer parameters from the new location
- Updated documentation to reference the new
optimizersection across multiple training guides
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
doc/train/training-advanced.md |
Added comprehensive documentation for the new optimizer section with examples and backend-specific support information |
doc/model/train-fitting-tensor.md |
Updated to reference the new optimizer section in the configuration structure |
doc/model/train-fitting-property.md |
Updated to reference the new optimizer section in the configuration structure |
doc/model/train-fitting-dos.md |
Updated to reference the new optimizer section in the configuration structure |
deepmd/utils/argcheck.py |
Defined optimizer schema with ArgsPlugin pattern, moved optimizer definitions from training_args to standalone optimizer_args, added support for all optimizer types with proper defaults |
deepmd/tf/train/trainer.py |
Updated to extract optimizer parameters from new optimizer section, added early validation for TensorFlow-unsupported features (weight_decay, non-Adam optimizers), implemented beta1/beta2 parameters |
deepmd/pt/train/training.py |
Updated to extract optimizer parameters from new optimizer section, refactored get_opt_param to include optimizer-specific default values, expanded optimizer initialization with explicit beta parameters |
deepmd/pd/train/training.py |
Updated to extract optimizer parameters from new optimizer section, added beta1/beta2/weight_decay parameters to Adam optimizer initialization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
📝 WalkthroughWalkthroughThis PR introduces a consolidated optimizer configuration system across all training backends (TensorFlow, PyTorch, Paddle) by relocating optimizer parameters from the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant ConfigLoader
participant CompatLayer as Compat Layer<br/>(convert_optimizer)
participant Argcheck
participant Trainer
User->>ConfigLoader: Load config.json
ConfigLoader->>CompatLayer: is_deepmd_v1_input()?
alt Legacy Format (opt_type in training)
CompatLayer->>CompatLayer: convert_optimizer_to_new_format()
CompatLayer->>CompatLayer: Migrate opt_type→type,<br/>extract params
CompatLayer-->>ConfigLoader: Return updated config
end
ConfigLoader->>Argcheck: normalize() config
Argcheck->>Argcheck: Validate optimizer section<br/>via optimizer_args()
Argcheck->>Argcheck: Apply defaults per<br/>optimizer type
Argcheck-->>Trainer: Return normalized config
Trainer->>Trainer: Extract optimizer params<br/>from config["optimizer"]
Trainer->>Trainer: Initialize optimizer<br/>with type-specific params
Trainer-->>User: Ready for training
sequenceDiagram
participant PyTorchTrainer as PyTorch Trainer<br/>(multi-task)
participant ConfigParser
participant ModelLoop
ConfigParser->>PyTorchTrainer: Load config with optimizer
alt Per-Model optim_dict Present
PyTorchTrainer->>ModelLoop: Use per-model optimizer
else Global optimizer_params
PyTorchTrainer->>ConfigParser: Extract optimizer_params
ConfigParser->>PyTorchTrainer: Return unified optimizer_params
PyTorchTrainer->>ModelLoop: Apply to all models
end
ModelLoop->>ModelLoop: get_opt_param(optimizer_params)
ModelLoop->>ModelLoop: Initialize optimizer<br/>with type-specific params<br/>(beta1, beta2, weight_decay, etc.)
ModelLoop-->>PyTorchTrainer: Model with optimizer ready
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@doc/model/train-fitting-dos.md`:
- Line 19: The README uses non-descriptive link text "here"; change it to a
descriptive phrase and update the inline reference so readers know what they’re
clicking. Replace "here" with something like "training parameters reference" or
"training keyword definitions" while keeping the same target file
(train-se-e2-a.md), and ensure the sentence still mentions the relevant keywords
{ref}`model[standard]/fitting_net <model[standard]/fitting_net>` and {ref}`loss
<loss>` so readers can find the training parameter explanations for fitting the
dos.
In `@doc/model/train-fitting-property.md`:
- Line 17: Replace the ambiguous link text "here" with a descriptive phrase so
the sentence reads clearly and improves accessibility; specifically update the
sentence referencing train-se-atten.md to use descriptive link text such as "SE
attention training documentation" (or similar) instead of "here", keeping the
surrounding refs ({ref}`model`, {ref}`learning_rate`, {ref}`optimizer`,
{ref}`loss`, {ref}`training`) and the callouts to
{ref}`model[standard]/fitting_net` and {ref}`loss` intact.
In `@doc/model/train-fitting-tensor.md`:
- Line 33: Replace the non-descriptive link text "here" in the sentence
describing training JSON (the line referencing `input.json`, `ener` mode and the
sections {ref}`model <model>`, {ref}`learning_rate <learning_rate>`,
{ref}`optimizer <optimizer>`, {ref}`loss <loss>` and {ref}`training <training>`)
with a descriptive link label that reflects the target content (e.g., "training
keywords and meanings" or "training keyword reference") and point it to
train-se-e2-a.md so the link text clearly conveys what the reader will find.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
deepmd/pd/train/training.pydeepmd/pt/train/training.pydeepmd/tf/train/trainer.pydeepmd/utils/argcheck.pydoc/model/train-fitting-dos.mddoc/model/train-fitting-property.mddoc/model/train-fitting-tensor.mddoc/train/training-advanced.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-12-12T13:40:14.334Z
Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-12T13:40:14.334Z
Learning: Test PyTorch training with `cd examples/water/se_e2_a && dp --pt train input_torch.json --skip-neighbor-stat` and allow timeout of 60+ seconds for validation
Applied to files:
doc/model/train-fitting-tensor.md
📚 Learning: 2025-12-12T13:40:14.334Z
Learnt from: CR
Repo: deepmodeling/deepmd-kit PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-12T13:40:14.334Z
Learning: Test TensorFlow training with `cd examples/water/se_e2_a && dp train input.json --skip-neighbor-stat` and allow timeout of 60+ seconds for validation
Applied to files:
doc/model/train-fitting-tensor.md
📚 Learning: 2026-01-10T04:28:18.703Z
Learnt from: OutisLi
Repo: deepmodeling/deepmd-kit PR: 5130
File: deepmd/pt/optimizer/adamuon.py:40-109
Timestamp: 2026-01-10T04:28:18.703Z
Learning: The coefficients (3.4445, -4.7750, 2.0315) in the `zeropower_via_newtonschulz5` function in `deepmd/pt/optimizer/adamuon.py` are standard across AdaMuon implementations and should not be modified for consistency with the reference implementations.
Applied to files:
deepmd/pt/train/training.py
🧬 Code graph analysis (2)
deepmd/utils/argcheck.py (1)
deepmd/utils/plugin.py (2)
register(41-59)register(122-141)
deepmd/pd/train/training.py (1)
deepmd/pt/train/training.py (1)
get_opt_param(160-197)
🪛 markdownlint-cli2 (0.18.1)
doc/model/train-fitting-property.md
17-17: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.14.11)
deepmd/pt/train/training.py
178-178: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/tf/train/trainer.py
130-132: Avoid specifying long messages outside the exception class
(TRY003)
134-137: Avoid specifying long messages outside the exception class
(TRY003)
325-327: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (41)
- GitHub Check: Agent
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda120, cuda)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Test C++ (true, false, false, true)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Test C++ (true, true, true, false)
- GitHub Check: Test Python (8, 3.13)
- GitHub Check: Test C++ (false, false, false, true)
- GitHub Check: Test C++ (false, true, true, false)
- GitHub Check: Test Python (11, 3.10)
- GitHub Check: Test Python (11, 3.13)
- GitHub Check: Test Python (10, 3.10)
- GitHub Check: Test Python (9, 3.13)
- GitHub Check: Test Python (10, 3.13)
- GitHub Check: Test Python (8, 3.10)
- GitHub Check: Test Python (3, 3.13)
- GitHub Check: Test Python (9, 3.10)
- GitHub Check: Test Python (4, 3.10)
- GitHub Check: Test Python (7, 3.13)
- GitHub Check: Test Python (6, 3.13)
- GitHub Check: Test Python (12, 3.10)
- GitHub Check: Test Python (12, 3.13)
- GitHub Check: Test Python (1, 3.13)
- GitHub Check: Test Python (3, 3.10)
- GitHub Check: Test Python (6, 3.10)
- GitHub Check: Test Python (5, 3.13)
- GitHub Check: Test Python (7, 3.10)
- GitHub Check: Test Python (4, 3.13)
- GitHub Check: Test Python (2, 3.13)
- GitHub Check: Test Python (2, 3.10)
- GitHub Check: Test Python (5, 3.10)
- GitHub Check: Test Python (1, 3.10)
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (12)
deepmd/pt/train/training.py (4)
126-129: Verify legacy optimizer keys are handled.The trainer now reads only the top-level
optimizersection whenoptim_dictis absent. If existing configs still settraining.opt_type/weight_decay/adam_beta2, those values will be ignored. Please confirm this is an intentional breaking change or add a fallback/deprecation error to make it explicit.Also applies to: 319-320
160-197: LGTM: per-optimizer defaults centralized.Nice consolidation of default hyperparameters and validation in
get_opt_param.
775-793: Nice: AdaMuon now restores optimizer state + scheduler.Keeps resume behavior consistent with Adam/AdamW.
733-764: No changes needed —fusedparameter is guaranteed in all supported PyTorch versions.The project requires PyTorch ≥2.7 (GPU) or ≥2.8 (CPU) per
pyproject.toml. Bothtorch.optim.Adam(fused since 1.13) andtorch.optim.AdamW(fused since 2.0) fully support thefusedparameter across all pinned versions. The current implementation is correct and doesn't require guards.Likely an incorrect or invalid review comment.
deepmd/utils/argcheck.py (3)
2568-2852: LGTM: optimizer schema centralized underoptimizer.Clear variant-based definitions and docstrings.
3683-3684: No review comment needed.
3754-3768: LGTM: optimizer args wired intogen_argsfor single/multi-task.deepmd/pd/train/training.py (2)
116-163: LGTM: optimizer params now sourced from the top-level config.Parsing the
optimizerblock once keeps the trainer aligned with the new schema.Also applies to: 250-266
595-606: LGTM: optimizer hyperparams are now plumbed through.deepmd/tf/train/trainer.py (2)
123-137: Optimizer parsing/validation looks solid.
323-348: Optimizer construction is consistent across branches.doc/train/training-advanced.md (1)
48-63: Optimizer section is clear and well-scoped.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5157 +/- ##
==========================================
- Coverage 82.10% 82.10% -0.01%
==========================================
Files 745 745
Lines 74875 74951 +76
Branches 3615 3615
==========================================
+ Hits 61478 61539 +61
- Misses 12234 12249 +15
Partials 1163 1163 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@deepmd/pt/train/training.py`:
- Around line 160-189: get_opt_param currently silently defaults to "Adam" when
callers use the legacy key "opt_type" or omit "type"; change get_opt_param to
check for a legacy "opt_type" key first and if present copy it into the
canonical "type" with a deprecation warning (use warnings.warn(...,
DeprecationWarning) or the module logger) before validating opt_type, and if
neither key is present raise a ValueError instead of silently defaulting to
"Adam"; reference the function get_opt_param and the symbols opt_type,
params["type"], and params["opt_type"] when implementing this guard and warning.
In `@deepmd/utils/compat.py`:
- Around line 426-477: The conversion currently only runs when "opt_type" is
present so legacy optimizer keys left in training_cfg (e.g., adam_beta1,
weight_decay) are ignored; update the logic in the block around
training_cfg/optimizer_keys (symbols: training_cfg, optimizer_keys,
extracted_cfg, optimizer_cfg) to trigger extraction when any legacy key is
present (not just opt_type), include the legacy alias "muon_min_2d_dim" by
adding it to optimizer_keys and mapping it to "min_2d_dim" when moving keys,
then proceed to convert opt_type->type if present and merge into optimizer_cfg
(preserving conversion precedence) and keep the existing defaulting to
default_optimizer/type behavior.
🧹 Nitpick comments (2)
deepmd/pd/train/training.py (1)
152-173: Consider removing unused kf_ parameters from extraction.*The Paddle backend only supports the Adam optimizer (validated at line 161-162), but
get_opt_paramstill extracts Kalman Filter parameters (kf_blocksize,kf_start_pref_e, etc.) that are only relevant for the LKF optimizer. While this doesn't cause functional issues, removing these unused extractions would improve code clarity and align better with the backend's actual capabilities.The early validation for unsupported optimizer types is good (fail-fast approach).
♻️ Optional: Remove unused kf_* parameters
opt_param = { - "kf_blocksize": params.get("kf_blocksize"), - "kf_start_pref_e": params.get("kf_start_pref_e"), - "kf_limit_pref_e": params.get("kf_limit_pref_e"), - "kf_start_pref_f": params.get("kf_start_pref_f"), - "kf_limit_pref_f": params.get("kf_limit_pref_f"), "adam_beta1": params.get("adam_beta1"), "adam_beta2": params.get("adam_beta2"), "weight_decay": params.get("weight_decay"), }deepmd/tf/train/trainer.py (1)
324-349: Minor: Duplicate optimizer type validation.The optimizer type check at lines 325-328 is redundant since the same validation already occurs in
_init_param(lines 130-133), which runs before_build_optimizeris called. The second check could be removed for cleaner code, though keeping it as a defensive guard is also acceptable.♻️ Optional: Remove redundant validation
def _build_optimizer(self): - if self.optimizer_type != "Adam": - raise RuntimeError( - f"Unsupported optimizer type {self.optimizer_type} for TensorFlow backend." - ) if self.run_opt.is_distrib:
129b456 to
7f73f71
Compare
de34fe0 to
5331422
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deepmd/utils/argcheck.py (1)
4035-4044:⚠️ Potential issue | 🔴 CriticalVerify migration runs before strict normalization in all entrypoints.
While all main training entrypoints (TensorFlow train/compress/change_bias, PaddlePaddle, PyTorch) correctly call
update_deepmd_input()beforenormalize(), the NVNMD entrypoint atdeepmd/tf/nvnmd/entrypoints/mapt.pycallsnormalize()without prior migration. Since optimizer keys are now schema-driven, any legacy configs will fail strict validation if not migrated first. This applies to all code paths that validate configs withstrict=True.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/utils/argcheck.py` around lines 4035 - 4044, The NVNMD entrypoint (mapt) calls normalize() before migration, causing strict schema validation failures for legacy optimizer keys; update the entrypoint to call update_deepmd_input(config) (the same migration used by other training entrypoints) before calling normalize(config, strict=True), and audit any other NVNMD-related code paths to ensure update_deepmd_input() runs prior to normalize() when strict validation is used.
♻️ Duplicate comments (1)
deepmd/pt/train/training.py (1)
195-200:⚠️ Potential issue | 🟠 MajorDo not silently ignore legacy
opt_type.At Line 195, configs that still provide
opt_typebut nottypesilently become Adam, which can change training behavior without warning.🔧 Suggested guard
- opt_type = params.get("type", "Adam") + opt_type = params.get("type") + if opt_type is None and "opt_type" in params: + log.warning( + "optimizer.opt_type is deprecated; use optimizer.type or run convert_optimizer_to_new_format." + ) + opt_type = params["opt_type"] + if opt_type is None: + opt_type = "Adam"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 195 - 200, The current code silently falls back to "Adam" when configs include a legacy "opt_type" key but no "type", so update the opt selection logic in training.py (the block that computes opt_type and opt_param) to detect a legacy params.get("opt_type") when "type" is absent and refuse silent fallback: either map/migrate it explicitly (e.g., set params["type"]=params.pop("opt_type") for supported legacy names) or raise a clear ValueError telling the user to rename "opt_type" to "type" (include the offending value in the message); ensure opt_param still pops "type" afterwards and that the error/message references the original opt_type value to aid debugging.
🧹 Nitpick comments (2)
deepmd/tf/train/trainer.py (1)
347-350: Consolidate duplicated optimizer-type validation.The same
optimizer_type != "Adam"guard is already enforced in_init_param; keeping one source of truth reduces drift.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/tf/train/trainer.py` around lines 347 - 350, Remove the duplicated optimizer-type guard in the shown block and rely on the single validation in _init_param: delete the conditional that raises for self.optimizer_type != "Adam" (the redundant RuntimeError) so the optimizer validation lives only in _init_param; if you want safety, replace the block with a simple assertion or a call to the existing _init_param validation helper (e.g., call self._init_param() or assert self.optimizer_type == "Adam") rather than duplicating the error handling.deepmd/pt/train/training.py (1)
799-802: Document fused optimizer device support or add explicit device validation.Line 801 enables
fused=Trueon all non-CPU devices. While PyTorch's fused optimizer officially supports CUDA, CPU, MPS, XPU, HPU, and MTIA, it will fail at runtime on unsupported backends (e.g., TPU/XLA). Consider either:
- Adding a comment documenting the supported device assumption, or
- Explicitly validating the device type:
if DEVICE.type in ("cuda", "mps", "xpu", "hpu", "mtia"): extra["fused"] = True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 799 - 802, The code sets extra = {"betas": adam_betas, "fused": DEVICE.type != "cpu"} inside the Adam/AdamW branch (when self.opt_type in ("Adam", "AdamW")), which will enable fused optimizers on unsupported backends; update this by validating DEVICE.type explicitly before setting fused (e.g., only set fused=True when DEVICE.type is one of the supported backends like "cuda","mps","xpu","hpu","mtia") or add a clear comment documenting the supported devices; modify the extra construction in the Adam/AdamW handling (the extra dict created near the Adam/AdamW branch) to perform this device whitelist check and set fused accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/pd/train/training.py`:
- Line 123: Validate the optimizer config immediately after reading it instead
of defaulting to {}, i.e., check that config contains an "optimizer" mapping and
required fields for the chosen optimizer (for Adam ensure keys like "lr",
"betas", "eps", "weight_decay" or whatever keys your Adam usage at lines 686-688
expects are present); if missing, raise a clear ValueError with a message
stating which key is missing. Update the code around the optimizer_params
assignment (the variable optimizer_params in deepmd/pd/train/training.py) to
perform this validation (and reuse the same validation for other code paths
noted at lines ~161-172 and ~278), so the later Adam construction code that
reads those keys cannot raise a late KeyError.
- Around line 167-169: The current Paddle trainer only allows "Adam" via a
runtime check on opt_type (params.get("type")) in training.py; update validation
so optimizer types are filtered at config-load time instead of only at trainer
instantiation: add a backend-aware allowed list (e.g., PADDLE_ALLOWED_OPTS =
{"Adam"} and PT_ALLOWED_OPTS = {"Adam","AdamW","LKF","AdaMuon","HybridMuon"})
and validate params["type"] against the Paddle set when the backend is Paddle,
or better yet add schema-level backend-specific allowed values for the optimizer
type so configs for non-Paddle backends are rejected earlier; ensure the error
message references the backend and lists allowed values instead of the generic
raise ValueError currently using opt_type.
In `@deepmd/tf/train/trainer.py`:
- Around line 141-145: The code in trainer.py that sets self.optimizer_beta1,
self.optimizer_beta2, and self.optimizer_weight_decay calls float(...) on values
that may be None (optimizer_param.get("adam_beta1"), etc.), which raises
TypeError; fix by providing safe defaults or guarding None before conversion
(e.g., use optimizer_param.get("adam_beta1", <default>) and
optimizer_param.get("adam_beta2", <default>) and
optimizer_param.get("weight_decay", 0.0) or check for None and only cast when
present), so that self.optimizer_beta1, self.optimizer_beta2, and
self.optimizer_weight_decay always receive valid float values while keeping
self.optimizer_type parsing unchanged.
---
Outside diff comments:
In `@deepmd/utils/argcheck.py`:
- Around line 4035-4044: The NVNMD entrypoint (mapt) calls normalize() before
migration, causing strict schema validation failures for legacy optimizer keys;
update the entrypoint to call update_deepmd_input(config) (the same migration
used by other training entrypoints) before calling normalize(config,
strict=True), and audit any other NVNMD-related code paths to ensure
update_deepmd_input() runs prior to normalize() when strict validation is used.
---
Duplicate comments:
In `@deepmd/pt/train/training.py`:
- Around line 195-200: The current code silently falls back to "Adam" when
configs include a legacy "opt_type" key but no "type", so update the opt
selection logic in training.py (the block that computes opt_type and opt_param)
to detect a legacy params.get("opt_type") when "type" is absent and refuse
silent fallback: either map/migrate it explicitly (e.g., set
params["type"]=params.pop("opt_type") for supported legacy names) or raise a
clear ValueError telling the user to rename "opt_type" to "type" (include the
offending value in the message); ensure opt_param still pops "type" afterwards
and that the error/message references the original opt_type value to aid
debugging.
---
Nitpick comments:
In `@deepmd/pt/train/training.py`:
- Around line 799-802: The code sets extra = {"betas": adam_betas, "fused":
DEVICE.type != "cpu"} inside the Adam/AdamW branch (when self.opt_type in
("Adam", "AdamW")), which will enable fused optimizers on unsupported backends;
update this by validating DEVICE.type explicitly before setting fused (e.g.,
only set fused=True when DEVICE.type is one of the supported backends like
"cuda","mps","xpu","hpu","mtia") or add a clear comment documenting the
supported devices; modify the extra construction in the Adam/AdamW handling (the
extra dict created near the Adam/AdamW branch) to perform this device whitelist
check and set fused accordingly.
In `@deepmd/tf/train/trainer.py`:
- Around line 347-350: Remove the duplicated optimizer-type guard in the shown
block and rely on the single validation in _init_param: delete the conditional
that raises for self.optimizer_type != "Adam" (the redundant RuntimeError) so
the optimizer validation lives only in _init_param; if you want safety, replace
the block with a simple assertion or a call to the existing _init_param
validation helper (e.g., call self._init_param() or assert self.optimizer_type
== "Adam") rather than duplicating the error handling.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
deepmd/pd/train/training.pydeepmd/pt/train/training.pydeepmd/tf/entrypoints/train.pydeepmd/tf/train/trainer.pydeepmd/tf/utils/compat.pydeepmd/utils/argcheck.pydeepmd/utils/compat.pydoc/model/train-fitting-dos.mddoc/model/train-fitting-property.mddoc/model/train-fitting-tensor.md
💤 Files with no reviewable changes (1)
- deepmd/tf/entrypoints/train.py
🚧 Files skipped from review as they are similar to previous changes (3)
- doc/model/train-fitting-dos.md
- deepmd/utils/compat.py
- doc/model/train-fitting-property.md
| self.restart_training = restart_model is not None | ||
| model_params = config["model"] | ||
| training_params = config["training"] | ||
| optimizer_params = config.get("optimizer", {}) |
There was a problem hiding this comment.
Fail fast on missing/incomplete optimizer config.
Line 123 falls back to {}, then Lines 686-688 assume Adam keys exist and can raise a late KeyError. Validate optimizer presence and required Adam fields earlier for a clear config error.
Proposed fix
- optimizer_params = config.get("optimizer", {})
+ optimizer_params = config.get("optimizer")
+ if optimizer_params is None:
+ raise ValueError(
+ "Missing top-level 'optimizer' section. "
+ "Please run config migration/normalization before training."
+ )
def get_opt_param(params: dict[str, Any]) -> tuple[str, dict[str, Any]]:
@@
opt_type = params.get("type", "Adam")
if opt_type != "Adam":
raise ValueError(f"Not supported optimizer type '{opt_type}'")
opt_param = dict(params)
opt_param.pop("type", None)
+ required_adam_keys = ("adam_beta1", "adam_beta2", "weight_decay")
+ missing = [k for k in required_adam_keys if k not in opt_param]
+ if missing:
+ raise ValueError(
+ f"Missing Adam optimizer parameters: {missing}. "
+ "Please run argcheck.normalize() on the config."
+ )
return opt_type, opt_paramAlso applies to: 161-172, 278-278, 684-688
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deepmd/pd/train/training.py` at line 123, Validate the optimizer config
immediately after reading it instead of defaulting to {}, i.e., check that
config contains an "optimizer" mapping and required fields for the chosen
optimizer (for Adam ensure keys like "lr", "betas", "eps", "weight_decay" or
whatever keys your Adam usage at lines 686-688 expects are present); if missing,
raise a clear ValueError with a message stating which key is missing. Update the
code around the optimizer_params assignment (the variable optimizer_params in
deepmd/pd/train/training.py) to perform this validation (and reuse the same
validation for other code paths noted at lines ~161-172 and ~278), so the later
Adam construction code that reads those keys cannot raise a late KeyError.
| opt_type = params.get("type", "Adam") | ||
| if opt_type != "Adam": | ||
| raise ValueError(f"Not supported optimizer type '{opt_type}'") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
rg -n --type=py 'Not supported optimizer type|opt_type|optimizer.*type|AdamW|LKF|AdaMuon|HybridMuon' \
deepmd/utils/argcheck.py \
deepmd/utils/compat.py \
deepmd/pd/train/training.py \
deepmd/pt/train/training.pyRepository: deepmodeling/deepmd-kit
Length of output: 6542
🏁 Script executed:
rg -n --type=py 'doc_only_pt_supported|backend.*optimizer|pytorch.*only' deepmd/utils/argcheck.py | head -20Repository: deepmodeling/deepmd-kit
Length of output: 1743
🏁 Script executed:
# Check the definition of doc_only_pt_supported and how it's enforced
rg -n --type=py -B2 -A2 'doc_only_pt_supported\s*=' deepmd/utils/argcheck.py | head -40Repository: deepmodeling/deepmd-kit
Length of output: 266
🏁 Script executed:
# Search for where schema validation happens and how backend type is handled
rg -n --type=py 'def check_training_config|backend.*type|validate.*optimizer' deepmd/utils/argcheck.py | head -30Repository: deepmodeling/deepmd-kit
Length of output: 888
🏁 Script executed:
# Search for how the schema validation works and if backend is considered
rg -n --type=py 'class.*Variant|def.*check|backend.*validate' deepmd/utils/argcheck.py | head -30Repository: deepmodeling/deepmd-kit
Length of output: 511
🏁 Script executed:
# Check if there's a check function that validates optimizer types based on backend
rg -n --type=py 'def check_' deepmd/utils/argcheck.py | grep -i 'train\|opt'Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# See if there's any backend-aware validation in the training section
rg -n --type=py -A10 'def get_training_' deepmd/utils/argcheck.py | head -50Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Check the training_extra_check function to see if it validates based on backend
sed -n '3916,3980p' deepmd/utils/argcheck.pyRepository: deepmodeling/deepmd-kit
Length of output: 2435
🏁 Script executed:
# Look at the optimizer variant type args definition
sed -n '3014,3040p' deepmd/utils/argcheck.pyRepository: deepmodeling/deepmd-kit
Length of output: 783
🏁 Script executed:
# Check if validators for optimizer types check backend
rg -n --type=py -B5 -A10 'register.*Adam.*doc=' deepmd/utils/argcheck.py | head -60Repository: deepmodeling/deepmd-kit
Length of output: 633
🏁 Script executed:
# Check if backend is passed to schema validation and used to filter optimizer options
rg -n --type=py 'backend.*optimizer|get_all_argument|opt_args_plugin' deepmd/utils/argcheck.py | head -40Repository: deepmodeling/deepmd-kit
Length of output: 977
🏁 Script executed:
# Look at the opt_args_plugin definition and how it registers optimizers
rg -n --type=py -B5 'opt_args_plugin\s*=' deepmd/utils/argcheck.pyRepository: deepmodeling/deepmd-kit
Length of output: 210
🏁 Script executed:
# Check how schema validation is called with backend info
rg -n --type=py 'check_' deepmd/utils/argcheck.py | grep -v 'extra_check\|_check' | head -20Repository: deepmodeling/deepmd-kit
Length of output: 156
🏁 Script executed:
# Find where backend is available during schema validation
rg -n --type=py 'def check_value|backend.*check|strict=True' deepmd/utils/argcheck.py | head -30Repository: deepmodeling/deepmd-kit
Length of output: 111
🏁 Script executed:
# Check how the schema is used in the actual training initialization
rg -n --type=py -B5 -A5 'optimizer_params.*get\|get.*optimizer' deepmd/pd/train/training.py | head -50Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Look at where the config is loaded and validated before being passed to trainers
rg -n --type=py 'check_value|load_config|validate' deepmd/train/trainer.py | head -30Repository: deepmodeling/deepmd-kit
Length of output: 130
🏁 Script executed:
# Find where backend is determined and how it's used
rg -n --type=py 'backend.*=' deepmd/pd/train/training.py | head -20Repository: deepmodeling/deepmd-kit
Length of output: 250
🏁 Script executed:
# Search for where config validation happens before trainer init
fd -e py -path "*/deepmd/*" -type f | xargs rg -l 'check_value|validate.*config' | head -20Repository: deepmodeling/deepmd-kit
Length of output: 356
🏁 Script executed:
# Look at the actual flow in Paddle trainer init
sed -n '1,100p' deepmd/pd/train/training.pyRepository: deepmodeling/deepmd-kit
Length of output: 2053
🏁 Script executed:
# Look at the Paddle Trainer __init__ to see where optimizer params are handled
sed -n '100,200p' deepmd/pd/train/training.pyRepository: deepmodeling/deepmd-kit
Length of output: 4558
🏁 Script executed:
# Check if there's validation that happens before get_opt_param is called
rg -n --type=py -B10 'get_opt_param' deepmd/pd/train/training.pyRepository: deepmodeling/deepmd-kit
Length of output: 1438
🏁 Script executed:
# Look for where the schema validation happens in the training pipeline
rg -n --type=py 'argcheck.*check|validate' deepmd/*.py deepmd/**/*.py 2>/dev/null | grep -v test | head -20Repository: deepmodeling/deepmd-kit
Length of output: 145
🏁 Script executed:
# Find where the config is loaded and validated
rg -n --type=py 'check_value|from deepmd.utils.argcheck' deepmd/pd/train/training.pyRepository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Look for the main entry point that calls Trainer with validated config
fd -type f -name "*.py" | xargs rg -l 'argcheck.*check_value|Trainer.*config' | head -10Repository: deepmodeling/deepmd-kit
Length of output: 237
🏁 Script executed:
# Check what happens in the training script/entrypoint
fd -type f -name "train.py" -o -name "main.py" | head -5Repository: deepmodeling/deepmd-kit
Length of output: 237
🏁 Script executed:
# Find entry points for training
fd -name "train.py" deepmd/Repository: deepmodeling/deepmd-kit
Length of output: 297
🏁 Script executed:
# Check if argcheck validation happens in a CLI wrapper or main entry
rg -n 'argcheck.*check_value' deepmd/Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Look at how config flows from CLI to Trainer
rg -n 'def main|Trainer\(' deepmd/train/ 2>/dev/null | head -20Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Find training entry points
find deepmd -name "train.py" -o -name "*cli*.py" | head -10Repository: deepmodeling/deepmd-kit
Length of output: 134
🏁 Script executed:
# Search for where argcheck.check_value is called
rg 'argcheck.check_value|from deepmd.utils.argcheck import' deepmd/ | head -30Repository: deepmodeling/deepmd-kit
Length of output: 324
🏁 Script executed:
# Look for where config is validated before Trainer init
rg -A5 'check_value' deepmd/ | head -50Repository: deepmodeling/deepmd-kit
Length of output: 302
🏁 Script executed:
# Check Paddle entrypoint where validation might happen
cat deepmd/pd/entrypoints/main.py | head -100Repository: deepmodeling/deepmd-kit
Length of output: 2188
🏁 Script executed:
# See the full check_value function and how it validates
rg -B5 -A15 'def check_value' deepmd/utils/argcheck.pyRepository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Search for check_value function differently
rg 'def check_value' deepmd/Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Look at the base module which was imported
rg -A20 'base.check_value' deepmd/utils/argcheck.pyRepository: deepmodeling/deepmd-kit
Length of output: 166
🏁 Script executed:
# Check if there's backend-aware validation in a normalize function
rg -B5 -A25 'def normalize' deepmd/utils/argcheck.py | head -80Repository: deepmodeling/deepmd-kit
Length of output: 469
🏁 Script executed:
# Check where normalize is called in Paddle training
rg -B10 -A5 'normalize\(' deepmd/pd/entrypoints/main.pyRepository: deepmodeling/deepmd-kit
Length of output: 678
🏁 Script executed:
# Check if the config dict has backend info when normalized
rg 'config.*=' deepmd/pd/entrypoints/main.py | head -20Repository: deepmodeling/deepmd-kit
Length of output: 656
🏁 Script executed:
# Let's see the full picture of where training config is loaded and validated
sed -n '65,150p' deepmd/pd/entrypoints/main.pyRepository: deepmodeling/deepmd-kit
Length of output: 3237
🏁 Script executed:
# Check if check_value in base.Argument filters based on variant registration
rg -B10 -A30 'class Argument' deepmd/utils/argcheck.py | grep -A30 'def check_value'Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Look at how Variant works in the schema
rg -B5 -A30 'class Variant' deepmd/utils/argcheck.py | head -80Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Key question: does schema validation actually enforce PyTorch-only markers?
rg 'doc_only_pt_supported' deepmd/utils/argcheck.py | wc -lRepository: deepmodeling/deepmd-kit
Length of output: 69
🏁 Script executed:
# Let's verify the actual implementation of check_value in the base Argument class
rg -A50 'def check_value.*self' deepmd/utils/argcheck_base.py 2>/dev/null | head -80Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Check what happens when validate against a Variant - does it restrict options?
find deepmd -name "argcheck_base.py" -o -name "argcheck.py" | xargs cat | rg -A40 'class Variant' | head -100Repository: deepmodeling/deepmd-kit
Length of output: 49
🏁 Script executed:
# Simpler approach: does the schema prevent you from setting an unsupported type?
# Let's check if there's any filtering when normalize is called
rg -B5 -A20 'def normalize_value' deepmd/utils/argcheck.py | head -80Repository: deepmodeling/deepmd-kit
Length of output: 49
Paddle optimizer type validation is limited to Adam while PyTorch accepts multiple types; consider schema-level backend filtering.
The inconsistency is confirmed: Paddle trainer accepts only Adam (lines 167-169), while PyTorch accepts Adam, AdamW, LKF, AdaMuon, HybridMuon. The argcheck schema documentation explicitly documents this difference (supported types: "TensorFlow/Paddle: Adam; PyTorch: Adam, AdamW, LKF, AdaMuon, HybridMuon"), and doc_only_pt_supported markers are applied to unsupported optimizer registrations. However, the schema validation does not appear to filter optimizer types by backend at config-load time; instead, unsupported types are caught at trainer instantiation via the runtime check on line 168. While explicit Paddle-side validation prevents crashes, adding schema-level backend-aware filtering would catch configuration errors earlier and prevent confusing error messages deep in trainer initialization.
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 169-169: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deepmd/pd/train/training.py` around lines 167 - 169, The current Paddle
trainer only allows "Adam" via a runtime check on opt_type (params.get("type"))
in training.py; update validation so optimizer types are filtered at config-load
time instead of only at trainer instantiation: add a backend-aware allowed list
(e.g., PADDLE_ALLOWED_OPTS = {"Adam"} and PT_ALLOWED_OPTS =
{"Adam","AdamW","LKF","AdaMuon","HybridMuon"}) and validate params["type"]
against the Paddle set when the backend is Paddle, or better yet add
schema-level backend-specific allowed values for the optimizer type so configs
for non-Paddle backends are rejected earlier; ensure the error message
references the backend and lists allowed values instead of the generic raise
ValueError currently using opt_type.
| optimizer_param = jdata.get("optimizer", {}) | ||
| self.optimizer_type = optimizer_param.get("type", "Adam") | ||
| self.optimizer_beta1 = float(optimizer_param.get("adam_beta1")) | ||
| self.optimizer_beta2 = float(optimizer_param.get("adam_beta2")) | ||
| self.optimizer_weight_decay = float(optimizer_param.get("weight_decay")) |
There was a problem hiding this comment.
Avoid float(None) in optimizer parsing.
At Line 143-Line 145, missing optimizer keys will raise TypeError before your backend-specific errors are reached.
🛠️ Suggested hardening
- self.optimizer_beta1 = float(optimizer_param.get("adam_beta1"))
- self.optimizer_beta2 = float(optimizer_param.get("adam_beta2"))
- self.optimizer_weight_decay = float(optimizer_param.get("weight_decay"))
+ self.optimizer_beta1 = float(optimizer_param.get("adam_beta1", 0.9))
+ self.optimizer_beta2 = float(optimizer_param.get("adam_beta2", 0.999))
+ self.optimizer_weight_decay = float(optimizer_param.get("weight_decay", 0.0))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deepmd/tf/train/trainer.py` around lines 141 - 145, The code in trainer.py
that sets self.optimizer_beta1, self.optimizer_beta2, and
self.optimizer_weight_decay calls float(...) on values that may be None
(optimizer_param.get("adam_beta1"), etc.), which raises TypeError; fix by
providing safe defaults or guarding None before conversion (e.g., use
optimizer_param.get("adam_beta1", <default>) and
optimizer_param.get("adam_beta2", <default>) and
optimizer_param.get("weight_decay", 0.0) or check for None and only cast when
present), so that self.optimizer_beta1, self.optimizer_beta2, and
self.optimizer_weight_decay always receive valid float values while keeping
self.optimizer_type parsing unchanged.
Summary by CodeRabbit
New Features
optimizerconfiguration section.beta1,beta2, andweight_decayfor Adam-based optimizers.Documentation
Backward Compatibility