Add log-spaced checkpoint schedule #113
Open
amazloumi wants to merge 1 commit into
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Naeemkh
reviewed
May 27, 2026
Member
Naeemkh
left a comment
Please update the CHANGELOG.md file.
Naeemkh
requested changes
May 27, 2026
Member
Naeemkh
left a comment
There are three related parameters, and it is not clear to the user how to use them.
- In `schedule`, without the comment, the user cannot tell whether `log` means base 2 or base 10. Maybe rename `log` to `log2`.
- `log_until` gives the impression that checkpointing runs only up to this point, but based on the comment it only applies to the dynamic logging.
- If you use `keep_last_n`, regardless of the previous parameters, all logs will be deleted except the last 3 of them.
Please take a look at them.
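To make the third point concrete, here is a minimal sketch of the interaction. It assumes retention simply keeps the newest `keep_last_n` saved steps and that `keep_last_n <= 0` keeps everything, as the PR description states. `surviving_checkpoints` is a hypothetical helper for illustration, not the project's `CheckpointManager._cleanup()`, and the `interval` of 1000 is an assumed value.

```python
def surviving_checkpoints(saved_steps, keep_last_n):
    """Hypothetical retention model, not the project's cleanup code.

    Keeps the newest `keep_last_n` saved steps; `keep_last_n <= 0`
    is treated as "retain all checkpoints".
    """
    steps = sorted(saved_steps)
    if keep_last_n <= 0:
        return steps
    return steps[-keep_last_n:]


# Steps a log schedule would save with log_until=512 and an assumed
# interval of 1000, over a 3000-step run.
saved_steps = [0] + [2**k for k in range(10)] + [1000, 2000, 3000]

print(surviving_checkpoints(saved_steps, keep_last_n=3))
# [1000, 2000, 3000]
print(len(surviving_checkpoints(saved_steps, keep_last_n=0)))
# 14
```

Under these assumptions, `keep_last_n=3` discards every early power-of-two checkpoint, so a dynamics study would need `keep_last_n <= 0`.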
Summary
- `CheckpointConfig.schedule` (`"interval" | "log"`, default `"interval"`) and `log_until` (default 512). In `log` mode, `should_save()` saves at step 0 and each power of two up to `log_until`, then every `interval` steps thereafter. This is the Pythia/PolyPythias cadence for studying early-training dynamics.
- `CheckpointConfig.should_save(step)`, with the save decision in `scripts/train.py` routed through it (was inline `step % interval == 0`). Default `interval` mode is byte-for-byte unchanged.
- `keep_last_n` guard to allow `<= 0` = "retain all checkpoints". `CheckpointManager._cleanup()` already early-returns on `keep_last_n <= 0`; this exposes it via config so dense early checkpoints survive retention during a dynamics study.
- `log_until` validation.

Testing
- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v --timeout=60` passes
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v`
- `uv run pytest tests/e2e/ --e2e -v`

Closes #111
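The `log`-mode cadence described in the Summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `CheckpointConfigSketch` is a hypothetical stand-in for `CheckpointConfig`, the `interval` default of 1000 is assumed, and the sketch assumes that at or below `log_until` only step 0 and powers of two are saved.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfigSketch:
    """Hypothetical stand-in for CheckpointConfig; field names follow the Summary."""

    interval: int = 1000        # assumed value; the PR text does not state the default
    schedule: str = "interval"  # "interval" | "log"
    log_until: int = 512

    def should_save(self, step: int) -> bool:
        if self.schedule == "log" and step <= self.log_until:
            # Step 0, plus every power of two (1, 2, 4, ...) up to log_until.
            # For step > 0, step & (step - 1) == 0 holds exactly for powers of two.
            return step == 0 or (step & (step - 1)) == 0
        # Interval mode, and log mode past log_until: plain interval cadence.
        return step % self.interval == 0


cfg = CheckpointConfigSketch(interval=1000, schedule="log", log_until=512)
saved = [s for s in range(3001) if cfg.should_save(s)]
print(saved)
# [0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, 2000, 3000]
```

With `schedule="interval"` the method reduces to the original inline check `step % interval == 0`.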