
Conversation

@pszemraj (Owner)

  • ability to start from existing .pt checkpoint from previous run
  • optionally save optimizer/rng states with output checkpoints, can load those later
  • ability to start from hf transformers format weights, convert to .pt, then train
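
For reference, a minimal sketch of what a resumable checkpoint payload along these lines could look like. The helper names and dictionary keys here are illustrative, not the script's actual API:

```python
import torch


def save_resumable_checkpoint(path, model, optimizer, update_step, best_loss):
    """Bundle everything needed to resume: weights, optimizer, RNG state, progress."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "update_step": update_step,
            "best_loss": best_loss,
            "rng_state": torch.get_rng_state(),
            "cuda_rng_state": torch.cuda.get_rng_state_all()
            if torch.cuda.is_available()
            else None,
        },
        path,
    )


def load_resumable_checkpoint(path, model, optimizer):
    """Restore model/optimizer/RNG state and return the step and best loss to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["rng_state"])
    if ckpt.get("cuda_rng_state") is not None and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(ckpt["cuda_rng_state"])
    return ckpt["update_step"], ckpt["best_loss"]
```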

Peter Szemraj and others added 7 commits May 18, 2025 13:26
- Add functionality to save and load optimizer states
- Implement continuous checkpointing with RNG state
- Support resuming training from specific checkpoint
- Add stub for HuggingFace model loading
- Add command-line arguments for controlling resumable training

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
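
As context for the HuggingFace-loading stub above, a hedged sketch of converting HF-format MPNet weights into a plain .pt state dict before training. The function name and output layout are illustrative; a real implementation would likely also need to remap parameter key names to match the training script's modules:

```python
import torch
from transformers import MPNetForMaskedLM


def convert_hf_to_pt(hf_model_path: str, output_path: str) -> None:
    """Load HF-format MPNet weights and re-save them as a .pt state dict."""
    hf_model = MPNetForMaskedLM.from_pretrained(hf_model_path)
    torch.save({"model": hf_model.state_dict()}, output_path)


# e.g. convert_hf_to_pt("BEE-spoke-data/tiny-random-MPNetForMaskedLM", "mpnet_init.pt")
```
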
@pszemraj pszemraj self-assigned this May 18, 2025
@pszemraj pszemraj added the enhancement New feature or request label May 18, 2025
@pszemraj pszemraj marked this pull request as draft May 18, 2025 18:37
@pszemraj (Owner, Author)

PR is still WIP - need to test/improve starting from HF-format weights.

There is some weird, unrelated bug where the PyTorch weights for https://huggingface.co/BEE-spoke-data/tiny-random-MPNetForMaskedLM (and any other MPNet model) are not recognized locally. It works fine on Colab (versions of things are the same ...), so I will test there.

@pszemraj (Owner, Author)

@amazingvince here are the tests I was going to run locally for starting from existing HF weights, but I hit a strange bug where my WSL environment would not recognize any PyTorch/safetensors files in model repos.

https://gist.github.com/pszemraj/30d1a6995d4365ef92bbe71ee10e8c91

The ones for resuming training (starting from random weights) worked.

Peter Szemraj and others added 3 commits May 23, 2025 08:44
@pszemraj pszemraj marked this pull request as ready for review June 7, 2025 18:02

pszemraj commented Jun 7, 2025

Jules testing status

Test Summary for pretrain-mpnet Script


1. Loading from Existing Hugging Face Model Weights

Goal:
Verify that the script can initialize the model using weights from a pre-existing Hugging Face MPNet model.

Actions:

  • Downloaded a tiny random MPNet model (BEE-spoke-data/tiny-random-MPNetForMaskedLM) from Hugging Face.
  • Ran pretrain-mpnet with the --hf-model-path argument pointing to the downloaded model files.
  • Used the "wikitext" dataset (specifically "wikitext-2-raw-v1") for this test.

Result:

  • The script successfully loaded the weights from the downloaded Hugging Face model.
  • The model architecture was correctly initialized, and training commenced.
  • Encountered a challenge with handling newly created directories when downloading model files. Workaround: downloaded individual model files into an existing directory.
  • The "wikitext" dataset loading failed initially because pretrain-mpnet lacks a direct way to pass the required dataset_config_name (e.g., "wikitext-2-raw-v1") to the load_dataset function. This prevented full training with this specific configuration, but the core goal of testing weight loading was achieved prior to the dataset loading error.

2. Resumable Training

Goal:
Verify the script’s ability to save training checkpoints and correctly resume training from a specified checkpoint, including optimizer state.

Actions:

  • Used the "ptb_text_only" dataset, which required a temporary modification to pretrain_mpnet.py to pass trust_remote_code=True to the load_dataset function.
  • Initial run: ran pretrain-mpnet with a tiny model configuration for 3 updates, with --checkpoint-interval 2 and --save-optimizer-state.
  • Resumed run: ran pretrain-mpnet again with --resume and --resume-checkpoint pointing to the checkpoint from step 2. Set total updates to 5.

Result:

  • After the temporary modification, the "ptb_text_only" dataset loaded successfully.
  • Initial run created checkpoints as expected, including optimizer states.
  • Resumed run correctly loaded the checkpoint, restored model and optimizer states, and continued training from the specified step. Loss progression was consistent with resumed training.
  • Reverted the temporary modification to pretrain_mpnet.py after the test.
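
The temporary modification was essentially a one-line change along these lines; the actual call site and surrounding arguments in pretrain_mpnet.py may differ:

```python
from datasets import load_dataset

# "ptb_text_only" relies on a loading script hosted on the Hub, so datasets
# refuses to run it unless the caller opts in explicitly.
ds = load_dataset("ptb_text_only", trust_remote_code=True, split="train")
```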

Observations and Limitations

  • Model Downloading Workaround:
    Downloading entire model repositories that create new directories can cause issues. Downloading individual model files into a pre-existing directory works reliably (see the sketch after this list).

  • Dataset Configuration Handling:
    The script currently lacks command-line support for passing a specific dataset_config_name to the Hugging Face load_dataset function. This is problematic for datasets like "wikitext" that have multiple configurations (e.g., "wikitext-2-raw-v1", "wikitext-103-raw-v1"). The default behavior may fail or load an unintended configuration.

  • Lack of trust_remote_code Support:
    The script does not provide a generic CLI argument to pass trust_remote_code=True to load_dataset. This is necessary for datasets like "ptb_text_only" that rely on remote code from the Hugging Face Hub. Temporary code edits were required to enable this.

  • Environment Variable Persistence:
    Environment variables such as MPNET_CPU_OVERRIDE do not persist automatically across different runs or script invocations. They must be explicitly set for each operation, which is standard behavior but important for test script design.
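
For the model-downloading workaround above, a minimal sketch using huggingface_hub; the target directory and the exact filenames in the repo are illustrative assumptions:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

target_dir = Path("tiny-random-mpnet")  # pre-existing directory
target_dir.mkdir(exist_ok=True)

# Fetch individual files into the existing directory instead of letting a
# full snapshot download create a new one.
for filename in ["config.json", "pytorch_model.bin", "tokenizer.json"]:
    hf_hub_download(
        repo_id="BEE-spoke-data/tiny-random-MPNetForMaskedLM",
        filename=filename,
        local_dir=target_dir,
    )
```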


Summary

The core functionalities of loading pretrained weights and resumable training work correctly with some manual workarounds and temporary script modifications. The script would benefit from enhanced flexibility around Hugging Face dataset loading options, including explicit support for dataset configurations and trust_remote_code flags.
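
A hedged sketch of how those options could be exposed; the flag names --dataset-config and --trust-remote-code are suggestions, not existing arguments of the script:

```python
import argparse

from datasets import load_dataset

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", required=True)
parser.add_argument("--dataset-config", default=None,
                    help="Config name forwarded to load_dataset, e.g. wikitext-2-raw-v1")
parser.add_argument("--trust-remote-code", action="store_true",
                    help="Allow datasets that rely on remote loading scripts")
args = parser.parse_args()

ds = load_dataset(
    args.dataset,
    args.dataset_config,
    trust_remote_code=args.trust_remote_code,
)
```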


- Fix accuracy calculation showing 0 in early steps by normalizing by total tokens
- Fix RNG state restoration with proper ByteTensor conversion
- Add weights_only=False to torch.load() for PyTorch 2.6 compatibility
- Add ruff.toml and .isort.cfg to configure linting/formatting
- Configure to skip __init__.py files to avoid circular import issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
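
Roughly what the torch.load, RNG-restoration, and accuracy fixes from the commit above look like in isolation; function and variable names here are illustrative:

```python
import torch


def load_full_checkpoint(path: str) -> dict:
    # PyTorch 2.6 changed the default of weights_only to True, which rejects
    # checkpoints containing non-tensor objects (optimizer state, RNG state),
    # so opt out explicitly for trusted, self-produced checkpoints.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)

    # torch.set_rng_state expects a CPU ByteTensor; after serialization the
    # stored state can come back with a different dtype/device, so convert.
    torch.set_rng_state(ckpt["rng_state"].cpu().to(torch.uint8))
    return ckpt


def masked_lm_accuracy(num_correct: int, total_tokens: int) -> float:
    # Normalize by the total number of tokens so early steps report a
    # meaningful value instead of 0.
    return num_correct / max(total_tokens, 1)
```
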
@pszemraj (Owner, Author)

@codex review?

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 553 to 554
# Create meters for all the relevant logging statistics using the Meters module
meters = {


P1: Preserve best_loss value when resuming

After loading a checkpoint the code restores best_loss from the file, but immediately afterwards it reassigns best_loss = 10e6 before entering the training loop. This wipes out the loaded metric, so a resumed run always considers the next validation pass a “new best” and overwrites best_checkpoint.pt even if the loss increased. Drop the second assignment or only initialize when no checkpoint was loaded.

Useful? React with 👍 / 👎.
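
A minimal sketch of the suggested fix: only use the sentinel when no checkpoint was loaded, so the restored value survives. Names here follow the review text but are otherwise illustrative of the control flow, not the script's actual code:

```python
def initial_best_loss(ckpt):
    """Return the restored best_loss when resuming, else the fresh-run sentinel."""
    if ckpt is not None and "best_loss" in ckpt:
        return ckpt["best_loss"]
    return 10e6


# fresh run starts from the sentinel
assert initial_best_loss(None) == 10e6
# resumed run keeps the restored value instead of resetting it
assert initial_best_loss({"best_loss": 3.21}) == 3.21
```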
