fix(tf): dispatch --init-model to checkpoint pre-inspection#5718
fix(tf): dispatch --init-model to checkpoint pre-inspection#5718wanghan-iapcm wants to merge 1 commit into
Conversation
RunOptions records `dp train --init-model` as init_mode == "init_from_model", but DPTrainer.build() dispatched on the literal "init_model", so the branch never matched and _init_from_ckpt was skipped for --init-model. That pre-inspection is what imports the source checkpoint's meta graph and sets self.ckpt_meta when the checkpoint is a compressed_model, before the graph is built with ckpt_meta. With the mismatch, compressed-checkpoint --init-model builds the graph without its checkpoint metadata. Uncompressed --init-model still worked because variables are restored later in _init_session (which uses the correct "init_from_model" literal) and needs no ckpt_meta, which masked the bug. Fix the literal to "init_from_model". The 4-way init dispatch is extracted from the heavyweight build() into a small _init_from_run_opt() helper so it can be unit-tested in isolation; this is why the mismatch went uncaught. Adds a regression test that drives the dispatch with a stub trainer and mocked initializers: it fails on the old literal (init_from_model routes nowhere) and passes with the fix, and also covers restart, init_from_frz_model, finetune, and scratch. Fix deepmodeling#5679
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe inline ChangesTrainer init-mode refactor
Estimated code review effort: 2 (Simple) | ~10 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| """ | ||
|
|
||
| import types | ||
| import unittest |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5718 +/- ##
==========================================
- Coverage 81.26% 81.14% -0.13%
==========================================
Files 988 988
Lines 110877 110877
Branches 4234 4234
==========================================
- Hits 90103 89966 -137
- Misses 19249 19383 +134
- Partials 1525 1528 +3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
njzjz-bot
left a comment
There was a problem hiding this comment.
Reviewed the RunOptions/DPTrainer dispatch path and the new regression test. The fix changes the pre-build checkpoint inspection dispatch to the RunOptions literal init_from_model, which matches the documented bug in #5679 and keeps the existing restart/frozen/finetune branches behaviorally unchanged.
The added test covers all supported init modes plus the scratch no-op, and it would fail on the previous init_model literal. CI is green, including the Python test matrix and CodeQL. I do not see a blocking issue.
One non-blocking cleanup: GitHub Advanced Security noted the mixed import unittest / from unittest import mock style in the new test file. It can be cleaned up, but I would not block this fix on it.
Reviewed by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5).
Problem
Fixes #5679.
RunOptionsrecordsdp train --init-modelasinit_mode == "init_from_model"(deepmd/tf/train/run_options.py), butDPTrainer.build()dispatched the init step on the literal"init_model". The two strings never matched, so_init_from_ckpt(...)was skipped for--init-model. That pre-inspection is the step that imports the source checkpoint's meta graph and, when the checkpoint is acompressed_model, setsself.ckpt_metabefore the graph is built (the graph build later consumesckpt_meta). With the mismatch, a compressed-checkpoint--init-modelrun builds the graph without its checkpoint metadata.The bug was masked for the common case because uncompressed
--init-modelstill works: the variables are restored later in_init_session, which uses the correct"init_from_model"literal and does not needckpt_meta. So only compressed-checkpoint initialization was actually exposed, and no test exercised it.Fix
Correct the dispatch literal to
"init_from_model". To make the dispatch unit-testable — the reason the mismatch went uncaught is that it lived inline in the heavyweightbuild()— the four-way init dispatch is extracted into a small_init_from_run_opt()helper. The restart, init-from-frozen-model, and finetune branches already used the correct literals and are unchanged in behavior.Test
Adds
source/tests/tf/test_trainer_init_mode.py, which drives_init_from_run_opton a stub trainer with the three concrete initializers mocked and asserts eachinit_moderoutes correctly. On the old literal,init_from_modelroutes to nothing (the test fails); with the fix it reaches_init_from_ckpt(init_model). The test also coversrestart,init_from_frz_model,finetune, and the scratch no-op. This dispatch previously had no coverage.Summary by CodeRabbit
Bug Fixes
Tests