
Conversation

@adamskrodzki

First, congratulations on this release. It is a very interesting architecture.

I've taken the liberty of tinkering with this repository and made a few improvements:

  1. To my understanding, inference should be O(1) as a function of sequence length, but I believe the original code was O(n). I attempted to fix it, hopefully correctly. I've tried to generate sequences longer than the trained sequence length and it works quite well for, say, 2-3x generation, but beyond that there is evident degradation. I wonder if that is something I've messed up in this implementation. (See the sketch after this list.)

  2. Added support for evaluation. On a small dataset the model trains very quickly but also quickly overfits.

  3. Quality-of-life improvements like checkpoint saving, effective batch size (to be able to run the code easily with any amount of VRAM), and some others.
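
A rough sketch of the cached, stateful generation loop described in point 1, assuming the model's forward pass was changed to accept and return a per-layer past state. The `past_state` argument and the two-value return are illustrative names, not the actual API of this PR:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0, top_k=None):
    model.eval()
    # Process the whole prompt in parallel once and keep the returned state.
    logits, state = model(prompt_ids, past_state=None)   # assumed signature
    out = prompt_ids
    for _ in range(max_new_tokens):
        next_logits = logits[:, -1, :] / max(temperature, 1e-8)
        if top_k is not None:
            kth = torch.topk(next_logits, top_k).values[:, -1, None]
            next_logits = next_logits.masked_fill(next_logits < kth, float("-inf"))
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        out = torch.cat([out, next_id], dim=1)
        # Feed only the newest token plus the cached state: constant work per
        # step instead of re-running the whole sequence (O(1) vs O(n) per token).
        logits, state = model(next_id, past_state=state)
    return out
```

The RoPE time offset mentioned in the commit summary below is what keeps positional encoding consistent when only a single new token is fed per step.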

…training

Major enhancements to the BDH model architecture and training pipeline:

Model Architecture:
- Replace byte-level tokenization with dynamic character-level vocabulary
- Implement stateful attention mechanism for efficient generation
- Add support for past states in forward pass to enable incremental inference
- Refactor Attention class to support both training and generation modes
- Add RoPE (Rotary Position Embedding) time offset handling
- Create separate attention modules per layer for state management

Training Improvements:
- Add comprehensive vocabulary building from input text
- Implement checkpoint saving and loading functionality
- Add evaluation loss tracking with configurable frequency
- Support for effective batch size with gradient accumulation (sketched after this summary)
- Add automatic optimal batch size calculation based on VRAM
- Enhanced logging with timing information for generation

Generation Optimization:
- Implement efficient stateful generation with O(1) per-token complexity
- Add generation timing and progress reporting
- Support for temperature and top-k sampling parameters
- Process initial prompt in parallel, then generate tokens sequentially

Configuration Updates:
- Adjust mlp_internal_dim_multiplier from 128 to 64
- Add support for training modes: scratch, continue, evaluate
- Configurable generation parameters and evaluation settings

Infrastructure:
- Add __pycache__ and *.pyc to .gitignore
- Enable torch._dynamo error suppression for compilation
- Add comprehensive error handling and memory management
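
As a rough illustration of the "effective batch size" and VRAM-based batch sizing bullets above, here is a generic sketch. The heuristic numbers, the `get_batch` helper, and the `(logits, loss)` forward signature are assumptions for the example, not the PR's actual code:

```python
import torch

def suggest_micro_batch(effective_batch_size, approx_bytes_per_sample=8 * 1024 * 1024):
    # Very rough VRAM-based estimate; the bytes-per-sample figure is a guess
    # you would measure or tune for the actual model, not a real constant.
    if not torch.cuda.is_available():
        return min(4, effective_batch_size)
    total_vram = torch.cuda.get_device_properties(0).total_memory
    fits = int(total_vram * 0.5 // approx_bytes_per_sample)  # leave headroom
    return max(1, min(effective_batch_size, fits))

def train_step_accumulated(model, optimizer, get_batch, micro_batch, accum_steps):
    # One optimizer step that effectively "sees" micro_batch * accum_steps samples.
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x, y = get_batch(micro_batch)      # assumed nanoGPT-style batch helper
        _, loss = model(x, targets=y)      # assumed (logits, loss) signature
        (loss / accum_steps).backward()    # scale so gradients average correctly
    optimizer.step()
    return loss.item()
```

The effective batch size is then micro_batch * accum_steps, so a GPU with little VRAM simply uses a smaller micro-batch and more accumulation steps to reach the same optimizer-step batch.
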
dxtrous (Member) commented Oct 8, 2025

@adamskrodzki thank you so much for this very nice contribution - it's likely to be helpful for many people experimenting with the model!

We are considering keeping this repo super compact (as a nanoGPT-like stub for further development), and encouraging more advanced repos/forks - which would be described and linked in our README. Would you be open to such an approach? We can follow up on this in another channel.

> To my understanding, inference should be O(1) as a function of sequence length, but I believe the original code was O(n).

Correct for both! Again, this repo is really minimal.

> I attempted to fix it, hopefully correctly. I've tried to generate sequences longer than the trained sequence length and it works quite well for, say, 2-3x generation, but beyond that there is evident degradation. I wonder if that is something I've messed up in this implementation.

That's probably fine. There are two tests you may want to take a look at:

  • As a general rule, larger models should stay "non-rambling" (consistent, closer to in-sample data) for a longer generation time than smaller ones.
  • If you are using top-k sampling in generation, see the impact of k.
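
On the second point, here is a generic sketch of where k enters a standard top-k sampling step (plain PyTorch, not this repo's code):

```python
import torch

def sample_top_k(logits, k, temperature=1.0):
    # logits: (batch, vocab). Smaller k restricts sampling to the most likely
    # tokens (more consistent, less rambling); larger k allows more diversity.
    logits = logits / max(temperature, 1e-8)
    if k is not None and k < logits.size(-1):
        kth = torch.topk(logits, k).values[:, -1, None]  # k-th largest per row
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # (batch, 1) token ids
```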

Still, the pay-off here is limited: other than not going off rambling so soon, a model trained on a "prose-only" pre-training corpus, without any of the data/training-harness elements that make SOTA reasoning models work, is very unlikely to have picked up any longer-term reasoning pattern.

mosure commented Oct 8, 2025

big aspirations welcome: https://github.com/mosure/burn_dragon_hatchling

jploski commented Nov 25, 2025

FYI - I uploaded a Hugging Face transformers-compatible implementation of this model: https://github.com/jploski/bdh-transformers

dxtrous (Member) commented Nov 25, 2025

> FYI - I uploaded a Hugging Face transformers-compatible implementation of this model: https://github.com/jploski/bdh-transformers

Thank you @jploski!

I would have a small request on a formal point. At Pathway we usually use the MIT license for all of our public repos. So:

  • if you consider that your repo is a fork (derived from our repo), could you please change the license of your repo back to MIT?
  • if you consider that your repo is not a fork, could you please exclude us from the copyright notice, and (if you so wish) mention us in an acknowledgment?

jploski commented Nov 25, 2025

> > FYI - I uploaded a Hugging Face transformers-compatible implementation of this model: https://github.com/jploski/bdh-transformers
>
> Thank you @jploski!
>
> I would have a small request on a formal point. At Pathway we usually use the MIT license for all of our public repos. So:
>
> • if you consider that your repo is a fork (derived from our repo), could you please change the license of your repo back to MIT?

I picked Apache 2.0 mainly because the HF transformers library is released under it. So I suppose it should make it easier for them to merge this code into the official library if they choose to do so.

> • if you consider that your repo is not a fork, could you please exclude us from the copyright notice, and (if you so wish) mention us in an acknowledgment?

The standard version of the MIT license requires me to mention you in a copyright notice in any derived work. But if you explicitly waive that obligation, I can remove that mention.

My suggestion: open a PR against my repository which removes the copyright notices that you wish removed. That way there is a clear record of the explicit waiver.
