feat: implement dynamic vocabulary, stateful attention, and enhanced … #1
Conversation
…training

Major enhancements to the BDH model architecture and training pipeline:

Model Architecture:
- Replace byte-level tokenization with dynamic character-level vocabulary (sketched below)
- Implement stateful attention mechanism for efficient generation
- Add support for past states in forward pass to enable incremental inference
- Refactor Attention class to support both training and generation modes
- Add RoPE (Rotary Position Embedding) time offset handling
- Create separate attention modules per layer for state management

Training Improvements:
- Add comprehensive vocabulary building from input text
- Implement checkpoint saving and loading functionality
- Add evaluation loss tracking with configurable frequency
- Support for effective batch size with gradient accumulation
- Add automatic optimal batch size calculation based on VRAM
- Enhanced logging with timing information for generation

Generation Optimization:
- Implement efficient stateful generation with O(1) per-token complexity
- Add generation timing and progress reporting
- Support for temperature and top-k sampling parameters
- Process initial prompt in parallel, then generate tokens sequentially

Configuration Updates:
- Adjust mlp_internal_dim_multiplier from 128 to 64
- Add support for training modes: scratch, continue, evaluate
- Configurable generation parameters and evaluation settings

Infrastructure:
- Add __pycache__ and *.pyc to .gitignore
- Enable torch._dynamo error suppression for compilation
- Add comprehensive error handling and memory management
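The tokenization change in the description above is easiest to picture with a small sketch: a character-level vocabulary built dynamically from the training text, replacing the fixed 256-entry byte table. This is only an illustration; the names (build_vocab, encode, decode) are placeholders, not the PR's actual identifiers.

```python
# Illustrative sketch of a dynamic character-level vocabulary built from the
# training corpus. Function names are placeholders, not the PR's actual code.

def build_vocab(text: str):
    """Collect every distinct character in the corpus into a stable mapping."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
    itos = {i: ch for ch, i in stoi.items()}       # token id -> char
    return stoi, itos

def encode(text: str, stoi: dict) -> list[int]:
    return [stoi[ch] for ch in text]

def decode(ids: list[int], itos: dict) -> str:
    return "".join(itos[i] for i in ids)

# Usage: vocab_size is derived from the input text rather than fixed at 256.
corpus = "hello dragon hatchling"          # stand-in for the real training text
stoi, itos = build_vocab(corpus)
vocab_size = len(stoi)
ids = encode("hello", stoi)
assert decode(ids, itos) == "hello"
```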
@adamskrodzki thank you so much for this very nice contribution - it's likely to be helpful for many people experimenting with the model! We are considering keeping this repo super compact (as a nanoGPT-like stub for further development) and encouraging more advanced repos/forks, which would be described and linked in our README. Would you be open to such an approach? We can follow up on this in another channel.
Correct for both! Again, this repo is really minimal.
That's probably fine. There are two tests you may want to take a look at:
Still, the pay-off here is limited: other than not going off rambling so soon, a model trained on a "prose-only" pre-train, without any of the data/training-harness elements that make SOTA reasoning models work, is very unlikely to have picked up any longer-term reasoning patterns.
big aspirations welcome: https://github.com/mosure/burn_dragon_hatchling
FYI - I uploaded a Hugging Face transformers-compatible implementation of this model: https://github.com/jploski/bdh-transformers
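For anyone who wants to try that port, loading it would presumably follow the usual transformers pattern sketched below. The model path is a placeholder and the need for trust_remote_code is an assumption on my part; check that repository's README for the actual instructions.

```python
# Hypothetical usage sketch for a transformers-compatible BDH port.
# The checkpoint path is a placeholder, and trust_remote_code is assumed to be
# needed for the custom architecture; see the linked repo for the real steps.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/bdh-checkpoint"  # placeholder, not a published model id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0]))
```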
Thank you @jploski! I have a small request on a formal point. At Pathway we usually use the MIT license for all of our public repos. So:
I picked Apache 2.0 mainly because the HF transformers library is released under it. So I suppose it should make it easier for them to merge this code into the official library if they choose to do so.
The standard version of the MIT license requires me to mention you in a copyright notice in any derived work. But if you explicitly waive that obligation, I can remove that mention. My proposal: create a PR against my repository that removes the copyright notices you wish removed. That way there is a clear record of the explicit waiver.
First, congratulations on this release. It is a very interesting architecture.

I took the liberty of tinkering with this repository and made a few improvements:

- To my understanding, inference should be O(1) as a function of sequence length, but I believe the original code was O(n). I attempted to fix it, hopefully correctly; a sketch of the idea follows this list. I've tried generating sequences longer than the trained sequence length, and it works quite well up to roughly 2-3x that length, but beyond that there is evident degradation. I wonder if that is something I messed up in this implementation.
- Added support for evaluation. On a small dataset the model trains very quickly, but it also overfits quickly.
- Quality-of-life improvements such as checkpoint saving, effective batch size (so the code runs easily with any amount of VRAM), and some others.
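To make the O(1) claim in the first item concrete, here is a minimal sketch of stateful generation: the prompt is processed in one parallel pass, and each subsequent token reuses the cached attention state instead of re-running the full prefix. It assumes a forward signature like model(idx, past_states=None) returning (logits, past_states); the actual interface in the PR may differ.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0, top_k=None):
    """Stateful generation sketch.

    Assumes model(idx, past_states=None) -> (logits, past_states); this
    signature is an illustration, not necessarily the PR's exact API."""
    idx = prompt_ids                       # (1, T) prompt token ids
    logits, state = model(idx)             # one parallel pass over the prompt
    for _ in range(max_new_tokens):
        logits_last = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits_last, top_k)
            logits_last[logits_last < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits_last, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # (1, 1)
        idx = torch.cat([idx, next_id], dim=1)
        # Only the new token is fed through the model; the cached state carries
        # the attention context, so each step is O(1) in the sequence length.
        logits, state = model(next_id, past_states=state)
    return idx
```

The temperature and top-k handling here follows the usual nanoGPT-style sampling loop; the only structural change is that the cached state, rather than the full token history, is passed back into the model each step.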