fix(tf): dispatch --init-model to checkpoint pre-inspection #5718

Open
wanghan-iapcm wants to merge 1 commit into deepmodeling:master from wanghan-iapcm:fix-tf-init-model-mode

@wanghan-iapcm wanghan-iapcm commented Jul 3, 2026

Collaborator

Problem

Fixes #5679. RunOptions records dp train --init-model as init_mode == "init_from_model" (deepmd/tf/train/run_options.py), but DPTrainer.build() dispatched the init step on the literal "init_model". The two strings never matched, so _init_from_ckpt(...) was skipped for --init-model. That pre-inspection is the step that imports the source checkpoint's meta graph and, when the checkpoint is a compressed_model, sets self.ckpt_meta before the graph is built (the graph build later consumes ckpt_meta). With the mismatch, a compressed-checkpoint --init-model run builds the graph without its checkpoint metadata.

The bug was masked in the common case because uncompressed --init-model still works: the variables are restored later in _init_session, which uses the correct "init_from_model" literal and does not need ckpt_meta. Only compressed-checkpoint initialization was actually affected, and no test exercised it.
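The ordering problem can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyTrainer`, the `ckpt_is_compressed` flag, and the `dispatch_literal` parameter are hypothetical and exist purely to show the effect of the literal; none of this is the real DPTrainer code.

```python
from types import SimpleNamespace


class ToyTrainer:
    """Toy stand-in for the trainer; not the real DPTrainer."""

    def __init__(self, run_opt, ckpt_is_compressed):
        self.run_opt = run_opt
        self.ckpt_is_compressed = ckpt_is_compressed  # hypothetical flag
        self.ckpt_meta = None
        self.graph_saw_meta = None

    def _init_from_ckpt(self, ckpt_path):
        # Pre-inspection: record metadata only for a compressed checkpoint.
        if self.ckpt_is_compressed:
            self.ckpt_meta = f"meta graph of {ckpt_path}"

    def build(self, dispatch_literal):
        # dispatch_literal is the string that build() compares init_mode against.
        if self.run_opt.init_mode == dispatch_literal:
            self._init_from_ckpt(self.run_opt.init_model)
        # The graph build runs afterwards and consumes ckpt_meta.
        self.graph_saw_meta = self.ckpt_meta is not None


def graph_saw_meta(dispatch_literal, compressed=True):
    # RunOptions records --init-model as "init_from_model".
    run_opt = SimpleNamespace(init_mode="init_from_model", init_model="model.ckpt")
    trainer = ToyTrainer(run_opt, ckpt_is_compressed=compressed)
    trainer.build(dispatch_literal)
    return trainer.graph_saw_meta
```

In this toy, `graph_saw_meta("init_model")` reports that the graph was built without metadata, while `graph_saw_meta("init_from_model")` reports that the metadata was available in time.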

Fix

Correct the dispatch literal to "init_from_model". The mismatch went uncaught because the dispatch lived inline in the heavyweight build(), where it could not be unit-tested. The four-way init dispatch is therefore extracted into a small _init_from_run_opt() helper. The restart, init-from-frozen-model, and finetune branches already used the correct literals and are unchanged in behavior.
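A minimal sketch of what such a helper could look like follows. Only `_init_from_run_opt`, `_init_from_ckpt`, and the four mode strings come from this PR; `DispatchSketch`, `route`, the names `_init_from_frz_model` and `_init_from_pretrained_model`, the checkpoint paths, and the argument passed for `restart` are assumptions made for illustration.

```python
from types import SimpleNamespace


class DispatchSketch:
    """Minimal stand-in for the trainer; records calls instead of touching TensorFlow."""

    def __init__(self, init_mode):
        # init_model / restart hold checkpoint paths (values are made up).
        self.run_opt = SimpleNamespace(
            init_mode=init_mode, init_model="init.ckpt", restart="restart.ckpt"
        )
        self.calls = []

    # The initializers are reduced to call recorders.
    def _init_from_frz_model(self):  # hypothetical name
        self.calls.append(("_init_from_frz_model", None))

    def _init_from_ckpt(self, ckpt):
        self.calls.append(("_init_from_ckpt", ckpt))

    def _init_from_pretrained_model(self):  # hypothetical name
        self.calls.append(("_init_from_pretrained_model", None))

    def _init_from_run_opt(self):
        mode = self.run_opt.init_mode
        if mode == "init_from_frz_model":
            self._init_from_frz_model()
        elif mode == "init_from_model":  # was "init_model" before the fix
            self._init_from_ckpt(self.run_opt.init_model)
        elif mode == "restart":
            self._init_from_ckpt(self.run_opt.restart)  # assumed argument
        elif mode == "finetune":
            self._init_from_pretrained_model()
        # Any other mode, e.g. "init_from_scratch", falls through as a no-op.


def route(init_mode):
    """Return the recorded initializer calls for one init_mode."""
    sketch = DispatchSketch(init_mode)
    sketch._init_from_run_opt()
    return sketch.calls
```

Keeping the dispatch in a method that only reads `run_opt` and calls other methods is what makes it testable without building a graph: `route("init_from_model")` can be checked directly against the expected initializer call.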

Test

Adds source/tests/tf/test_trainer_init_mode.py, which drives _init_from_run_opt on a stub trainer with the three concrete initializers mocked and asserts each init_mode routes correctly. On the old literal, init_from_model routes to nothing (the test fails); with the fix it reaches _init_from_ckpt(init_model). The test also covers restart, init_from_frz_model, finetune, and the scratch no-op. This dispatch previously had no coverage.
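The testing approach can be sketched as below. This is not the contents of test_trainer_init_mode.py: `init_from_run_opt` is a free-function stand-in for the helper, and `make_stub`, the initializer names other than `_init_from_ckpt`, and the checkpoint paths are hypothetical. A `mock.Mock` object plays the stub trainer, so every initializer is mocked automatically.

```python
from types import SimpleNamespace
from unittest import TestCase, mock


def init_from_run_opt(trainer):
    """Stand-in for the extracted dispatch helper (assumed shape)."""
    mode = trainer.run_opt.init_mode
    if mode == "init_from_frz_model":
        trainer._init_from_frz_model()
    elif mode == "init_from_model":
        trainer._init_from_ckpt(trainer.run_opt.init_model)
    elif mode == "restart":
        trainer._init_from_ckpt(trainer.run_opt.restart)
    elif mode == "finetune":
        trainer._init_from_pretrained_model()


def make_stub(mode):
    # A Mock records every method call, so no real trainer is constructed.
    trainer = mock.Mock()
    trainer.run_opt = SimpleNamespace(
        init_mode=mode, init_model="init.ckpt", restart="restart.ckpt"
    )
    return trainer


class TestInitDispatch(TestCase):
    def test_init_from_model_reaches_ckpt(self):
        trainer = make_stub("init_from_model")
        init_from_run_opt(trainer)
        # With the old "init_model" literal this assertion would fail.
        trainer._init_from_ckpt.assert_called_once_with("init.ckpt")
        trainer._init_from_frz_model.assert_not_called()
        trainer._init_from_pretrained_model.assert_not_called()

    def test_scratch_is_noop(self):
        trainer = make_stub("init_from_scratch")
        init_from_run_opt(trainer)
        trainer._init_from_ckpt.assert_not_called()
        trainer._init_from_frz_model.assert_not_called()
        trainer._init_from_pretrained_model.assert_not_called()
```

Asserting on the mocks rather than on trainer state keeps the test independent of TensorFlow and of any checkpoint files.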

Summary by CodeRabbit

  • Bug Fixes

    • Improved training startup so model initialization now follows the selected setup mode more consistently, including restore, restart, fine-tuning, and frozen-model workflows.
    • If no initialization mode is selected, training now proceeds without extra initialization steps.
  • Tests

    • Added regression coverage for all supported initialization paths to help prevent future startup regressions.

RunOptions records `dp train --init-model` as init_mode == "init_from_model",
but DPTrainer.build() dispatched on the literal "init_model", so the branch
never matched and _init_from_ckpt was skipped for --init-model. That
pre-inspection is what imports the source checkpoint's meta graph and sets
self.ckpt_meta when the checkpoint is a compressed_model, before the graph is
built with ckpt_meta. With the mismatch, compressed-checkpoint --init-model
builds the graph without its checkpoint metadata. Uncompressed --init-model
still worked because variables are restored later in _init_session (which uses
the correct "init_from_model" literal) and needs no ckpt_meta, which masked the
bug.

Fix the literal to "init_from_model". The mismatch went uncaught because the
4-way init dispatch lived inline in the heavyweight build(); it is extracted
into a small _init_from_run_opt() helper so it can be unit-tested in isolation.

Adds a regression test that drives the dispatch with a stub trainer and mocked
initializers: it fails on the old literal (init_from_model routes nowhere) and
passes with the fix, and also covers restart, init_from_frz_model, finetune,
and scratch.

Fix deepmodeling#5679
@dosubot dosubot Bot added the bug label Jul 3, 2026
@github-actions github-actions Bot added the Python label Jul 3, 2026
@wanghan-iapcm wanghan-iapcm requested a review from njzjz July 3, 2026 06:24

coderabbitai Bot commented Jul 3, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d5e94e3f-4fdf-4477-86ff-c5cd321dfe42

📥 Commits

Reviewing files that changed from the base of the PR and between dd38b35 and bb30614.

📒 Files selected for processing (2)
  • deepmd/tf/train/trainer.py
  • source/tests/tf/test_trainer_init_mode.py

Walkthrough

The inline run_opt.init_mode dispatch logic in DPTrainer.build() was extracted into a new _init_from_run_opt() method that routes to existing initializer methods based on init mode, with no fallback for unrecognized modes. A new test module validates the dispatch routing for all supported init modes.

Changes

Trainer init-mode refactor

| Layer / File(s) | Summary |
|---|---|
| Extract init-mode dispatch into helper method (deepmd/tf/train/trainer.py) | build() now calls a new _init_from_run_opt() method that routes init_from_frz_model, init_from_model, restart, and finetune modes to their respective initializers, silently skipping unrecognized modes. |
| Regression tests for dispatch routing (source/tests/tf/test_trainer_init_mode.py) | New test module bypasses DPTrainer construction, patches initializer methods, and asserts correct routing per init mode, including a no-op check for init_from_scratch. |

Estimated code review effort: 2 (Simple) | ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main fix: routing --init-model to checkpoint pre-inspection in TensorFlow training. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

codecov Bot commented Jul 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.14%. Comparing base (dd38b35) to head (bb30614).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5718      +/-   ##
==========================================
- Coverage   81.26%   81.14%   -0.13%     
==========================================
  Files         988      988              
  Lines      110877   110877              
  Branches     4234     4234              
==========================================
- Hits        90103    89966     -137     
- Misses      19249    19383     +134     
- Partials     1525     1528       +3     


@njzjz-bot njzjz-bot left a comment

Reviewed the RunOptions/DPTrainer dispatch path and the new regression test. The fix changes the pre-build checkpoint inspection dispatch to the RunOptions literal init_from_model, which matches the documented bug in #5679 and keeps the existing restart/frozen/finetune branches behaviorally unchanged.

The added test covers all supported init modes plus the scratch no-op, and it would fail on the previous init_model literal. CI is green, including the Python test matrix and CodeQL. I do not see a blocking issue.

One non-blocking cleanup: GitHub Advanced Security noted the mixed import unittest / from unittest import mock style in the new test file. It can be cleaned up, but I would not block this fix on it.

Reviewed by OpenClaw 2026.6.8 (844f405) (model: custom-chat-jinzhezeng-group/gpt-5.5).

@njzjz njzjz enabled auto-merge July 4, 2026 01:45

Development

Successfully merging this pull request may close these issues.

[Code scan] Inspect TensorFlow --init-model checkpoints before graph build

4 participants