feat: implement dynamic vocabulary, stateful attention, and enhanced … #1
Conversation
…training

Major enhancements to the BDH model architecture and training pipeline:

Model Architecture:
- Replace byte-level tokenization with dynamic character-level vocabulary (sketched below)
- Implement stateful attention mechanism for efficient generation
- Add support for past states in forward pass to enable incremental inference
- Refactor Attention class to support both training and generation modes
- Add RoPE (Rotary Position Embedding) time offset handling
- Create separate attention modules per layer for state management

Training Improvements:
- Add comprehensive vocabulary building from input text
- Implement checkpoint saving and loading functionality
- Add evaluation loss tracking with configurable frequency
- Support for effective batch size with gradient accumulation
- Add automatic optimal batch size calculation based on VRAM
- Enhanced logging with timing information for generation

Generation Optimization:
- Implement efficient stateful generation with O(1) per-token complexity
- Add generation timing and progress reporting
- Support for temperature and top-k sampling parameters
- Process initial prompt in parallel, then generate tokens sequentially

Configuration Updates:
- Adjust mlp_internal_dim_multiplier from 128 to 64
- Add support for training modes: scratch, continue, evaluate
- Configurable generation parameters and evaluation settings

Infrastructure:
- Add __pycache__ and *.pyc to .gitignore
- Enable torch._dynamo error suppression for compilation
- Add comprehensive error handling and memory management
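The tokenization change in the description above is easiest to picture with a small sketch: a character-level vocabulary built dynamically from the training text, replacing the fixed 256-entry byte table. This is only an illustration; the names (build_vocab, encode, decode) are placeholders, not the PR's actual identifiers.

```python
# Illustrative sketch of a dynamic character-level vocabulary built from the
# training corpus. Function names are placeholders, not the PR's actual code.

def build_vocab(text: str):
    """Collect every distinct character in the corpus into a stable mapping."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
    itos = {i: ch for ch, i in stoi.items()}       # token id -> char
    return stoi, itos

def encode(text: str, stoi: dict) -> list[int]:
    return [stoi[ch] for ch in text]

def decode(ids: list[int], itos: dict) -> str:
    return "".join(itos[i] for i in ids)

# Usage: vocab_size is derived from the input text rather than fixed at 256.
corpus = "hello dragon hatchling"          # stand-in for the real training text
stoi, itos = build_vocab(corpus)
vocab_size = len(stoi)
ids = encode("hello", stoi)
assert decode(ids, itos) == "hello"
```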
@adamskrodzki thank you so much for this very nice contribution - it's likely to be helpful for many people experimenting with the model! We are considering keeping this repo super compact (as a nanoGPT-like stub for further development) and encouraging more advanced repos/forks, which would be described and linked in our README. Would you be open to such an approach? We can follow up on this in another channel.
Correct for both! Again, this repo is really minimal.
That's probably fine. There are two tests you may want to take a look at:
Still, the pay-off here is limited: other than not going off rambling so soon, a model trained on a "prose-only" pre-train, without any of the data/training-harness elements that make SOTA reasoning models work, is very unlikely to have picked up any longer-term reasoning patterns.
big aspirations welcome: https://github.com/mosure/burn_dragon_hatchling
FYI - I uploaded a Hugging Face transformers-compatible implementation of this model: https://github.com/jploski/bdh-transformers
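For anyone who wants to try that port, loading it would presumably follow the usual transformers pattern sketched below. The model path is a placeholder and the need for trust_remote_code is an assumption on my part; check that repository's README for the actual instructions.

```python
# Hypothetical usage sketch for a transformers-compatible BDH port.
# The checkpoint path is a placeholder, and trust_remote_code is assumed to be
# needed for the custom architecture; see the linked repo for the real steps.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/bdh-checkpoint"  # placeholder, not a published model id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0]))
```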
Thank you @jploski! I have a small request on a formal point. At Pathway we usually use the MIT license for all of our public repos. So:
I picked Apache 2.0 mainly because the HF transformers library is released under it. So I suppose it should make it easier for them to merge this code into the official library if they choose to do so.
The standard version of the MIT license requires me to mention you in a copyright notice in any derived work. But if you explicitly waive that obligation, I can remove that mention. My proposal: create a PR against my repository that removes the copyright notices you wish removed. That way there is a clear record of the explicit waiver.
First, congratulations on this release. It is a very interesting architecture.

I took the liberty of tinkering with this repository and made a few improvements:

- To my understanding, inference should be O(1) as a function of sequence length, but I believe the original code was O(n). I attempted to fix it, hopefully correctly; a sketch of the idea follows this list. I've tried generating sequences longer than the trained sequence length, and it works quite well up to roughly 2-3x that length, but beyond that there is evident degradation. I wonder if that is something I messed up in this implementation.
- Added support for evaluation. On a small dataset the model trains very quickly, but it also overfits quickly.
- Quality-of-life improvements such as checkpoint saving, effective batch size (so the code runs easily with any amount of VRAM), and some others.
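To make the O(1) claim in the first item concrete, here is a minimal sketch of stateful generation: the prompt is processed in one parallel pass, and each subsequent token reuses the cached attention state instead of re-running the full prefix. It assumes a forward signature like model(idx, past_states=None) returning (logits, past_states); the actual interface in the PR may differ.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0, top_k=None):
    """Stateful generation sketch.

    Assumes model(idx, past_states=None) -> (logits, past_states); this
    signature is an illustration, not necessarily the PR's exact API."""
    idx = prompt_ids                       # (1, T) prompt token ids
    logits, state = model(idx)             # one parallel pass over the prompt
    for _ in range(max_new_tokens):
        logits_last = logits[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits_last, top_k)
            logits_last[logits_last < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits_last, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # (1, 1)
        idx = torch.cat([idx, next_id], dim=1)
        # Only the new token is fed through the model; the cached state carries
        # the attention context, so each step is O(1) in the sequence length.
        logits, state = model(next_id, past_states=state)
    return idx
```

The temperature and top-k handling here follows the usual nanoGPT-style sampling loop; the only structural change is that the cached state, rather than the full token history, is passed back into the model each step.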