
Conversation

@pszemraj (Owner)

  • ability to start from existing .pt checkpoint from previous run
  • optionally save optimizer/rng states with output checkpoints, can load those later
  • ability to start from hf transformers format weights, convert to .pt, then train
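
For reference, a minimal sketch of what a resumable checkpoint payload along these lines could look like. The helper names and dictionary keys here are illustrative, not the script's actual API:

```python
import torch


def save_resumable_checkpoint(path, model, optimizer, update_step, best_loss):
    """Bundle everything needed to resume: weights, optimizer, RNG state, progress."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "update_step": update_step,
            "best_loss": best_loss,
            "rng_state": torch.get_rng_state(),
            "cuda_rng_state": torch.cuda.get_rng_state_all()
            if torch.cuda.is_available()
            else None,
        },
        path,
    )


def load_resumable_checkpoint(path, model, optimizer):
    """Restore model/optimizer/RNG state and return the step and best loss to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    torch.set_rng_state(ckpt["rng_state"])
    if ckpt.get("cuda_rng_state") is not None and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(ckpt["cuda_rng_state"])
    return ckpt["update_step"], ckpt["best_loss"]
```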

Peter Szemraj and others added 7 commits May 18, 2025 13:26
- Add functionality to save and load optimizer states
- Implement continuous checkpointing with RNG state
- Support resuming training from specific checkpoint
- Add stub for HuggingFace model loading
- Add command-line arguments for controlling resumable training

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
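
As context for the HuggingFace-loading stub above, a hedged sketch of converting HF-format MPNet weights into a plain .pt state dict before training. The function name and output layout are illustrative; a real implementation would likely also need to remap parameter key names to match the training script's modules:

```python
import torch
from transformers import MPNetForMaskedLM


def convert_hf_to_pt(hf_model_path: str, output_path: str) -> None:
    """Load HF-format MPNet weights and re-save them as a .pt state dict."""
    hf_model = MPNetForMaskedLM.from_pretrained(hf_model_path)
    torch.save({"model": hf_model.state_dict()}, output_path)


# e.g. convert_hf_to_pt("BEE-spoke-data/tiny-random-MPNetForMaskedLM", "mpnet_init.pt")
```
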
@pszemraj pszemraj self-assigned this May 18, 2025
@pszemraj pszemraj added the enhancement New feature or request label May 18, 2025
@pszemraj pszemraj marked this pull request as draft May 18, 2025 18:37
@pszemraj (Owner, Author)

PR is still WIP - need to test/improve starting from HF-format weights.

There is some weird, unrelated bug where the PyTorch weights for https://huggingface.co/BEE-spoke-data/tiny-random-MPNetForMaskedLM (and any other MPNet model) are not recognized locally. It works fine on Colab (versions of things are the same ...), so I will test there.

@pszemraj (Owner, Author)

@amazingvince here are the tests I was going to run locally for starting from existing HF weights, but I hit a strange bug where my WSL environment would not recognize any PyTorch/safetensors files in model repos.

https://gist.github.com/pszemraj/30d1a6995d4365ef92bbe71ee10e8c91

The ones for resuming training (starting from random weights) worked.

Peter Szemraj and others added 3 commits May 23, 2025 08:44
@pszemraj pszemraj marked this pull request as ready for review June 7, 2025 18:02

pszemraj commented Jun 7, 2025

Jules testing status

Test Summary for pretrain-mpnet Script


1. Loading from Existing Hugging Face Model Weights

Goal:
Verify that the script can initialize the model using weights from a pre-existing Hugging Face MPNet model.

Actions:

  • Downloaded a tiny random MPNet model (BEE-spoke-data/tiny-random-MPNetForMaskedLM) from Hugging Face.
  • Ran pretrain-mpnet with the --hf-model-path argument pointing to the downloaded model files.
  • Used the "wikitext" dataset (specifically "wikitext-2-raw-v1") for this test.

Result:

  • The script successfully loaded the weights from the downloaded Hugging Face model.
  • The model architecture was correctly initialized, and training commenced.
  • Encountered a challenge with handling newly created directories when downloading model files. Workaround: downloaded individual model files into an existing directory.
  • The "wikitext" dataset loading failed initially because pretrain-mpnet lacks a direct way to pass the required dataset_config_name (e.g., "wikitext-2-raw-v1") to the load_dataset function. This prevented full training with this specific configuration, but the core goal of testing weight loading was achieved prior to the dataset loading error.

2. Resumable Training

Goal:
Verify the script’s ability to save training checkpoints and correctly resume training from a specified checkpoint, including optimizer state.

Actions:

  • Used the "ptb_text_only" dataset, which required a temporary modification to pretrain_mpnet.py to pass trust_remote_code=True to the load_dataset function.
  • Initial run: ran pretrain-mpnet with a tiny model configuration for 3 updates, with --checkpoint-interval 2 and --save-optimizer-state.
  • Resumed run: ran pretrain-mpnet again with --resume and --resume-checkpoint pointing to the checkpoint from step 2. Set total updates to 5.

Result:

  • After the temporary modification, the "ptb_text_only" dataset loaded successfully.
  • Initial run created checkpoints as expected, including optimizer states.
  • Resumed run correctly loaded the checkpoint, restored model and optimizer states, and continued training from the specified step. Loss progression was consistent with resumed training.
  • Reverted the temporary modification to pretrain_mpnet.py after the test.
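
The temporary modification was essentially a one-line change along these lines; the actual call site and surrounding arguments in pretrain_mpnet.py may differ:

```python
from datasets import load_dataset

# "ptb_text_only" relies on a loading script hosted on the Hub, so datasets
# refuses to run it unless the caller opts in explicitly.
ds = load_dataset("ptb_text_only", trust_remote_code=True, split="train")
```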

Observations and Limitations

  • Model Downloading Workaround:
    Downloading entire model repositories that create new directories can cause issues. Downloading individual model files into a pre-existing directory works reliably (see the sketch after this list).

  • Dataset Configuration Handling:
    The script currently lacks command-line support for passing a specific dataset_config_name to the Hugging Face load_dataset function. This is problematic for datasets like "wikitext" that have multiple configurations (e.g., "wikitext-2-raw-v1", "wikitext-103-raw-v1"). The default behavior may fail or load an unintended configuration.

  • Lack of trust_remote_code Support:
    The script does not provide a generic CLI argument to pass trust_remote_code=True to load_dataset. This is necessary for datasets like "ptb_text_only" that rely on remote code from the Hugging Face Hub. Temporary code edits were required to enable this.

  • Environment Variable Persistence:
    Environment variables such as MPNET_CPU_OVERRIDE do not persist automatically across different runs or script invocations. They must be explicitly set for each operation, which is standard behavior but important for test script design.
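
For the model-downloading workaround above, a minimal sketch using huggingface_hub; the target directory and the exact filenames in the repo are illustrative assumptions:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download

target_dir = Path("tiny-random-mpnet")  # pre-existing directory
target_dir.mkdir(exist_ok=True)

# Fetch individual files into the existing directory instead of letting a
# full snapshot download create a new one.
for filename in ["config.json", "pytorch_model.bin", "tokenizer.json"]:
    hf_hub_download(
        repo_id="BEE-spoke-data/tiny-random-MPNetForMaskedLM",
        filename=filename,
        local_dir=target_dir,
    )
```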


Summary

The core functionalities of loading pretrained weights and resumable training work correctly with some manual workarounds and temporary script modifications. The script would benefit from enhanced flexibility around Hugging Face dataset loading options, including explicit support for dataset configurations and trust_remote_code flags.
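
A hedged sketch of how those options could be exposed; the flag names --dataset-config and --trust-remote-code are suggestions, not existing arguments of the script:

```python
import argparse

from datasets import load_dataset

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", required=True)
parser.add_argument("--dataset-config", default=None,
                    help="Config name forwarded to load_dataset, e.g. wikitext-2-raw-v1")
parser.add_argument("--trust-remote-code", action="store_true",
                    help="Allow datasets that rely on remote loading scripts")
args = parser.parse_args()

ds = load_dataset(
    args.dataset,
    args.dataset_config,
    trust_remote_code=args.trust_remote_code,
)
```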


- Fix accuracy calculation showing 0 in early steps by normalizing by total tokens
- Fix RNG state restoration with proper ByteTensor conversion
- Add weights_only=False to torch.load() for PyTorch 2.6 compatibility
- Add ruff.toml and .isort.cfg to configure linting/formatting
- Configure to skip __init__.py files to avoid circular import issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
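
Roughly what the torch.load, RNG-restoration, and accuracy fixes from the commit above look like in isolation; function and variable names here are illustrative:

```python
import torch


def load_full_checkpoint(path: str) -> dict:
    # PyTorch 2.6 changed the default of weights_only to True, which rejects
    # checkpoints containing non-tensor objects (optimizer state, RNG state),
    # so opt out explicitly for trusted, self-produced checkpoints.
    ckpt = torch.load(path, map_location="cpu", weights_only=False)

    # torch.set_rng_state expects a CPU ByteTensor; after serialization the
    # stored state can come back with a different dtype/device, so convert.
    torch.set_rng_state(ckpt["rng_state"].cpu().to(torch.uint8))
    return ckpt


def masked_lm_accuracy(num_correct: int, total_tokens: int) -> float:
    # Normalize by the total number of tokens so early steps report a
    # meaningful value instead of 0.
    return num_correct / max(total_tokens, 1)
```
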
@pszemraj (Owner, Author)

@codex review?

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 553 to 554
# Create meters for all the relevant logging statistics using the Meters module
meters = {


P1: Preserve best_loss value when resuming

After loading a checkpoint the code restores best_loss from the file, but immediately afterwards it reassigns best_loss = 10e6 before entering the training loop. This wipes out the loaded metric, so a resumed run always considers the next validation pass a “new best” and overwrites best_checkpoint.pt even if the loss increased. Drop the second assignment or only initialize when no checkpoint was loaded.

Useful? React with 👍 / 👎.
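
A minimal sketch of the suggested fix: only use the sentinel when no checkpoint was loaded, so the restored value survives. Names here follow the review text but are otherwise illustrative of the control flow, not the script's actual code:

```python
def initial_best_loss(ckpt):
    """Return the restored best_loss when resuming, else the fresh-run sentinel."""
    if ckpt is not None and "best_loss" in ckpt:
        return ckpt["best_loss"]
    return 10e6


# fresh run starts from the sentinel
assert initial_best_loss(None) == 10e6
# resumed run keeps the restored value instead of resetting it
assert initial_best_loss({"best_loss": 3.21}) == 3.21
```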
