Add log-spaced checkpoint schedule #113

Open
amazloumi wants to merge 1 commit into main from feat/log-checkpoint-schedule

Conversation

@amazloumi
Member

Summary

  • Add CheckpointConfig.schedule ("interval" | "log", default "interval") and log_until (default 512). In log mode, should_save() saves at step 0 and each power of two up to log_until, then every interval step thereafter — the Pythia/PolyPythias cadence for studying early-training dynamics.
  • Add CheckpointConfig.should_save(step) and route the save decision in scripts/train.py through it (was inline step % interval == 0). Default interval mode is byte-for-byte unchanged.
  • Relax the keep_last_n guard to allow <= 0 = "retain all checkpoints". CheckpointManager._cleanup() already early-returns on keep_last_n <= 0; this exposes it via config so dense early checkpoints survive retention during a dynamics study.
  • Add unit tests for both schedules, keep-all retention, and log_until validation.
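
The save decision described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name `CheckpointConfigSketch` and the `interval` default of 1000 are hypothetical, and the sketch assumes that while `step <= log_until` only step 0 and powers of two are saved.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfigSketch:
    """Hypothetical stand-in for CheckpointConfig (not the real class)."""

    interval: int = 1000        # assumed default, not taken from the PR
    schedule: str = "interval"  # "interval" | "log"
    log_until: int = 512

    def should_save(self, step: int) -> bool:
        if self.schedule == "log" and step <= self.log_until:
            # Step 0 plus every power of two (1, 2, 4, ...) up to log_until.
            # A positive integer is a power of two iff exactly one bit is set.
            return step == 0 or (step & (step - 1)) == 0
        # Interval mode, or log mode once past log_until.
        return step % self.interval == 0


log_cfg = CheckpointConfigSketch(schedule="log")
log_steps = [s for s in range(3001) if log_cfg.should_save(s)]
print(log_steps)

interval_cfg = CheckpointConfigSketch()
interval_steps = [s for s in range(3001) if interval_cfg.should_save(s)]
print(interval_steps)
```

Under these assumptions the log schedule yields 11 checkpoints in the first 512 steps and then falls back to the regular interval, while the interval schedule behaves like the previous inline `step % interval == 0` check.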

Testing

  • uv run ruff check kempnerforge/ tests/ passes
  • uv run ruff format --check kempnerforge/ tests/ scripts/ passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v --timeout=60 passes
  • If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v
  • If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v

Closes #111

@amazloumi amazloumi requested review from Naeemkh and mmshad May 27, 2026 00:52
@codecov

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
|---|---|
| kempnerforge/config/checkpoint.py | 100.00% <100.00%> (+8.69%) ⬆️ |

Member

@Naeemkh Naeemkh left a comment

Please update the CHANGELOG.md file.

Member

@Naeemkh Naeemkh left a comment

There are three related parameters whose usage is not clear to the user.

  • For schedule, without the comment the user cannot tell whether "log" means base 2 or base 10. Maybe rename log to log2.
  • The name log_until gives the impression that checkpoints are saved only up to this step, but according to the comment it applies only to the log schedule.
  • If keep_last_n is used, then regardless of the previous parameters, all checkpoints are deleted except the last 3.

Please take a look at them.
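
To make the third point concrete, here is a minimal sketch of how retention interacts with a dense early schedule. The helper `surviving_steps` is hypothetical and only models which steps remain on disk. It assumes retention keeps the most recent `keep_last_n` checkpoints and, as the PR summary states for `CheckpointManager._cleanup()`, deletes nothing when `keep_last_n <= 0`.

```python
def surviving_steps(saved_steps: list[int], keep_last_n: int) -> list[int]:
    """Hypothetical model of retention: which checkpoint steps remain."""
    if keep_last_n <= 0:
        # Mirrors the early return described for _cleanup(): keep everything.
        return list(saved_steps)
    # Otherwise only the most recent keep_last_n checkpoints survive.
    return list(saved_steps)[-keep_last_n:]


early = [0, 1, 2, 4, 8, 16, 32]  # steps saved so far under the log schedule
kept_default = surviving_steps(early, keep_last_n=3)
kept_all = surviving_steps(early, keep_last_n=0)
print(kept_default)
print(kept_all)
```

With `keep_last_n=3` the early checkpoints that the log schedule exists to capture are pruned as training advances, so a dynamics study would likely need `keep_last_n <= 0` alongside `schedule="log"`.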
