I hope to implement some acceleration technologies for Large Language Models (LLMs) because I enjoy doing this myself and love the challenge of bringing research papers into real-world applications.
If there are any technologies you'd like to develop or discuss, feel free to reach out. Thanks!
I'm excited to dive deeper into AI research!
- 2024/12/16: Add the `Medusa-1 Training Script v2`
- 2024/12/15: Add the `Medusa-1 Training Script`
- 2024/12/12: Update the KV Cache support for Speculative Decoding
- 2024/12/04: Add the `Kangaroo Training Script v2`
- 2024/11/26: Add the `Kangaroo Training Script`
- 2024/11/22: Update the `Target Model Keep Generation Mechanism` experiment
- 2024/11/18: Update the `Self-Speculative Decoding` experiment results of `google--gemma-2-9b-it`.
- 2024/11/12: Reviewing implementation challenges for `Self-Speculative Decoding` and evaluating model compatibility for improved efficiency.
- 2024/11/10: Initial setup for `Self-Speculative Decoding` completed; data pipeline in place for testing draft-and-verify.
- 2024/11/08: `Speculative Decoding` successfully implemented. Verified improved inference time with no noticeable accuracy degradation.
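The 2024/11/08 entry refers to the draft-and-verify loop at the heart of speculative decoding. The following is a minimal greedy sketch of that loop, not the repo's implementation: the "models", function names, and `gamma` default are all stand-ins I chose for illustration.

```python
# Toy sketch of greedy speculative decoding (draft-and-verify). The "models"
# here are deterministic functions mapping a token sequence to the next token;
# in practice they would be a small draft LLM and a large target LLM.

def greedy(model, prompt, max_new_tokens):
    """Baseline: plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tuple(tokens)))
    return tokens

def speculative_decode(target, draft, prompt, max_new_tokens, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them against the target model."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) Draft model proposes gamma tokens autoregressively (cheap).
        ctx, proposals = list(tokens), []
        for _ in range(gamma):
            proposals.append(draft(tuple(ctx)))
            ctx.append(proposals[-1])
        # 2) Target model verifies the proposals; a real implementation scores
        #    all gamma positions in one batched forward pass.
        for proposed in proposals:
            if len(tokens) >= limit:
                break
            expected = target(tuple(tokens))
            if proposed == expected:
                tokens.append(proposed)   # draft token accepted
            else:
                tokens.append(expected)   # rejected: keep the target's token
                break                     # and restart drafting from here
    return tokens

# Deterministic stand-ins: the draft agrees with the target most of the time.
def target_model(ctx):
    return (3 * sum(ctx) + len(ctx)) % 11

def draft_model(ctx):
    return target_model(ctx) if len(ctx) % 3 else 0
```

With greedy verification the output is token-for-token identical to running the target model alone — `speculative_decode(target_model, draft_model, [1, 2, 3], 10)` equals `greedy(target_model, [1, 2, 3], 10)` — so the speedup comes purely from checking the accepted draft tokens in parallel, which is why no accuracy degradation is expected.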
- Batched Speculative Decoding
- Prompt lookup decoding: Determine timeline after reviewing initial implementations.
- UAG Integration: Assess when to integrate after `Medusa` and `Kangaroo` are in place.
- 2024/11/08 | Complete `Speculative Decoding` following the paper Fast Inference from Transformers via Speculative Decoding
- 2024/11/15 | Implement `Self-Speculative Decoding` as per Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
    - LayerSkip model architecture
    - Bayesian Optimization for Layer Skip Selection (AR)
    - Adaptive Draft-Exiting Mechanism
    - Optimization
    - Bayesian Optimization for Layer Skip Selection (Speed)
    - `gemma-2-9b-it` experiment
- 2024/11/22 | Develop `Kangaroo` following Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
    - Kangaroo model
    - Training Script
    - Implement double early exits to improve speed.
- 2024/11/29 | Implement `Medusa` from Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    - Medusa model
    - Training Script (Medusa-1)
    - Testing
- 2025/03 | Implement `Hydra` from Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
- 2025/03 | Implement `Lookahead Decoding` from Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- 2025/04 | Implement `Eagle` from EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- 2025/04 | Implement `Eagle-2` from EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
- 2025/04 | Implement `Eagle-3` from EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
- TBD | Implement `Batched Speculative Decoding` from The Synergy of Speculative Decoding and Batching in Serving Large Language Models
- TBD | Implement `prompt lookup decoding` from the prompt-lookup-decoding GitHub repository
- TBD | Implement `UAG` (Universal Assisted Generation) from the Universal Assisted Generation blog
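Among the planned items, prompt lookup decoding is the simplest: its draft step needs no model at all, since candidate tokens are copied from text already in the context. Here is a minimal sketch of that draft step under my own assumptions — the function name and the `max_ngram`/`num_draft` defaults are illustrative, not taken from the linked repository.

```python
# Toy sketch of prompt lookup decoding's draft step. Instead of a draft model,
# candidate continuations are copied from earlier context: if the last few
# generated tokens also occur earlier, the tokens that followed them there are
# proposed as draft tokens for the target model to verify.

def prompt_lookup_draft(tokens, max_ngram=3, num_draft=5):
    """Return draft tokens found by matching the tail n-gram earlier in `tokens`."""
    for n in range(max_ngram, 0, -1):      # prefer longer (more specific) matches
        tail = tokens[-n:]
        if len(tail) < n:
            continue                       # context shorter than the n-gram
        # Scan earlier occurrences of the tail n-gram, most recent first;
        # the range excludes the tail's own position.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == tail:
                continuation = tokens[i + n:i + n + num_draft]
                if continuation:
                    return continuation    # draft tokens to verify in one pass
    return []                              # no match: fall back to normal decoding
```

For example, `prompt_lookup_draft("the cat sat on the mat . the cat".split())` matches the trailing bigram "the cat" at the start of the context and proposes `["sat", "on", "the", "mat", "."]`; the target model then accepts or rejects these in a single verification pass, exactly as in the other speculative methods above. This is why the technique shines on repetitive tasks such as summarization and code editing.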