fix(jax): write the checkpoint pointer beside save_ckpt #5726

Open
wanghan-iapcm wants to merge 1 commit into deepmodeling:master from wanghan-iapcm:fix-jax-ckpt-pointer
Conversation

@wanghan-iapcm

Collaborator

Problem

Fixes #5678. JAX training writes checkpoint directories and the stable .jax link relative to save_ckpt (which may include a directory), but always wrote the checkpoint pointer file to the current working directory with a value that still carried the directory prefix, e.g. runs/water/model.ckpt.jax. The freeze entrypoint looks for the pointer inside the folder it is given and resolves the pointer's value relative to that folder (checkpoint_folder / pointer). So for save_ckpt = runs/water/model.ckpt, the pointer was written to ./checkpoint (not runs/water/checkpoint) and, even if relocated, its value would have double-prefixed to runs/water/runs/water/model.ckpt.jax. Passing runs/water to freeze or restart-style tooling could not find or resolve the checkpoint, even though the matching checkpoint directory and .jax link were written there.
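The double prefix can be reproduced with pathlib alone. The snippet below is a minimal sketch, not the project's code: `resolve_pointer` is a hypothetical stand-in for the "checkpoint_folder / pointer" rule described above, and the paths are the example values from this description.

```python
from pathlib import Path


def resolve_pointer(checkpoint_folder: str, pointer_value: str) -> Path:
    # Hypothetical helper: joins the pointer's value onto the folder
    # the reader was given, as the description says freeze does.
    return Path(checkpoint_folder) / pointer_value


save_ckpt = "runs/water/model.ckpt"

# Value the old writer stored: it still carries the directory prefix.
old_value = f"{save_ckpt}.jax"
double_prefixed = resolve_pointer("runs/water", old_value)

# Value relative to the pointer's own directory (basename only).
new_value = Path(old_value).name
resolved = resolve_pointer("runs/water", new_value)

print(double_prefixed.as_posix())  # runs/water/runs/water/model.ckpt.jax
print(resolved.as_posix())  # runs/water/model.ckpt.jax
```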

Fix

Write the pointer into Path(save_ckpt).parent and store a value relative to that directory (the basename only). For the default bare save_ckpt (parent is .) the pointer stays in the CWD with the same value, so existing behavior is unchanged; only directory-valued save_ckpt is affected.
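As a rough illustration of the rule above, here is a sketch under stated assumptions. `pointer_location` and `write_pointer` are hypothetical names, not functions from the trainer, and creating the parent directory and writing the value without a trailing newline are choices made only to keep the example runnable.

```python
import tempfile
from pathlib import Path


def pointer_location(save_ckpt: str) -> tuple[Path, str]:
    """Return (pointer file path, pointer value) for a given save_ckpt."""
    ckpt = Path(save_ckpt)
    # The pointer file is named "checkpoint" and sits beside the checkpoint.
    pointer_path = ckpt.parent / "checkpoint"
    # The stored value is relative to that directory: basename only.
    value = f"{ckpt.name}.jax"
    return pointer_path, value


def write_pointer(save_ckpt: str) -> Path:
    pointer_path, value = pointer_location(save_ckpt)
    # Assumption for this sketch: make sure the directory exists first.
    pointer_path.parent.mkdir(parents=True, exist_ok=True)
    pointer_path.write_text(value)
    return pointer_path


# Directory-valued save_ckpt: the pointer lands in runs/water.
with tempfile.TemporaryDirectory() as tmp:
    written = write_pointer(str(Path(tmp) / "runs" / "water" / "model.ckpt"))
    written_value = written.read_text()
    written_parent = written.parent.relative_to(tmp).as_posix()

# Bare save_ckpt: the parent is ".", so the pointer path is just "checkpoint".
bare_path, bare_value = pointer_location("model.ckpt")
```

For the bare name, `Path("model.ckpt").parent` is `Path(".")`, so the computed location and value match the old behavior.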

Test

Adds source/tests/jax/test_checkpoint_pointer.py, which drives _save_checkpoint with the checkpoint I/O mocked. The directory case asserts that the pointer lands beside the checkpoint (subdir/checkpoint) with a basename value (model.ckpt.jax) and not in the CWD; this case fails on master. A bare-name control asserts that the pointer stays in the CWD. The trainer's pointer writing previously had no coverage: the existing freeze test hand-wrote a correct pointer and never exercised the writer.
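The shape of the two cases might look roughly like the sketch below. It is not the test added in this PR: it does not call _save_checkpoint or mock any checkpoint I/O, and `write_pointer` is a hypothetical stand-in for the trainer's pointer-writing step. It only shows the directory case and the bare-name control running from a temporary working directory.

```python
import os
import tempfile
import unittest
from pathlib import Path


def write_pointer(save_ckpt: str) -> None:
    # Hypothetical stand-in for the trainer's pointer-writing step.
    ckpt = Path(save_ckpt)
    ckpt.parent.mkdir(parents=True, exist_ok=True)
    (ckpt.parent / "checkpoint").write_text(f"{ckpt.name}.jax")


class PointerPlacementSketch(unittest.TestCase):
    def setUp(self) -> None:
        # Run each case from a fresh temporary CWD.
        self._old_cwd = os.getcwd()
        self._tmp = tempfile.TemporaryDirectory()
        os.chdir(self._tmp.name)

    def tearDown(self) -> None:
        os.chdir(self._old_cwd)
        self._tmp.cleanup()

    def test_directory_save_ckpt(self) -> None:
        write_pointer("subdir/model.ckpt")
        # Pointer sits beside the checkpoint with a basename value...
        self.assertEqual(Path("subdir/checkpoint").read_text(), "model.ckpt.jax")
        # ...and nothing is written to the CWD.
        self.assertFalse(Path("checkpoint").exists())

    def test_bare_save_ckpt(self) -> None:
        write_pointer("model.ckpt")
        # Control: a bare name keeps the pointer in the CWD.
        self.assertEqual(Path("checkpoint").read_text(), "model.ckpt.jax")


def run_sketch() -> unittest.TestResult:
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(PointerPlacementSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```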

JAX training writes checkpoint directories and the stable .jax link relative to
save_ckpt (which may include a directory), but always wrote the "checkpoint"
pointer file to the current working directory with a value that still carried
the directory prefix (e.g. "runs/water/model.ckpt.jax"). The freeze entrypoint
looks for the pointer inside the folder it is given and resolves the value
relative to that folder, so a directory-valued save_ckpt both misplaced the
pointer and double-prefixed the resolved path, breaking freeze and restart-style
tooling.

Write the pointer into Path(save_ckpt).parent and store a value relative to that
directory (the basename only). For the default bare save_ckpt (parent == "."),
the pointer stays in the CWD with the same value, so existing behavior is
unchanged.

Adds source/tests/jax/test_checkpoint_pointer.py, which drives _save_checkpoint
with the checkpoint I/O mocked: the directory case asserts the pointer lands
beside the checkpoint with a basename value and not in the CWD (fails on master),
and a bare-name control asserts the pointer stays in the CWD. The trainer's
pointer writing previously had no test; the existing freeze test hand-wrote a
correct pointer and never exercised it.

Fix deepmodeling#5678
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Contributor

Warning

Review limit reached

@wanghan-iapcm, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 18 minutes

Review details
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e8d05c5b-a26a-4997-aabf-54f872a5c3f6

📥 Commits

Reviewing files that changed from the base of the PR and between bbc3908 and e8c36e0.

📒 Files selected for processing (2)
  • deepmd/jax/train/trainer.py
  • source/tests/jax/test_checkpoint_pointer.py

@dosubot dosubot Bot added the bug label Jul 3, 2026
@wanghan-iapcm wanghan-iapcm requested a review from njzjz July 3, 2026 17:36
@github-actions github-actions Bot added the Python label Jul 3, 2026

import os
import tempfile
import unittest
@codecov

codecov Bot commented Jul 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.78%. Comparing base (dd38b35) to head (e8c36e0).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5726      +/-   ##
==========================================
- Coverage   81.26%   80.78%   -0.48%     
==========================================
  Files         988      988              
  Lines      110877   110887      +10     
  Branches     4234     4232       -2     
==========================================
- Hits        90103    89580     -523     
- Misses      19249    19782     +533     
  Partials     1525     1525              

Development

Successfully merging this pull request may close these issues.

[Code scan] Write JAX checkpoint pointers beside save_ckpt

2 participants