Abstract
Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs using academic GPU resources by reusing the linear projection weights from their attention layers. The resulting hybrid model, which retains a quarter of the attention layers, achieves performance comparable to the original Transformer on chat benchmarks, and outperforms open-source hybrid Mamba models trained from scratch on trillions of tokens on both chat and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computational resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best instruction-tuned linear RNN model.
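To make the core idea concrete: reusing attention projection weights works because a softmax-free attention layer can be rewritten as a linear RNN whose state is a running sum of key-value outer products, read out by the query. The sketch below illustrates that linearized-attention view only; it is not the paper's exact Mamba parameterization, and all names (`linearized_attention_step`, `run_layer`) are hypothetical.

```python
import numpy as np

def linearized_attention_step(state, x_t, W_q, W_k, W_v, W_o):
    """One recurrent step of a linearized attention layer.

    W_q, W_k, W_v, W_o are the projection matrices reused from a
    pretrained attention layer; `state` is the (d_k, d_v) RNN state.
    """
    q = x_t @ W_q
    k = x_t @ W_k
    v = x_t @ W_v
    # Linear-RNN state update: accumulate the key-value outer product
    # (the unnormalized "memory" of a linearized attention layer).
    state = state + np.outer(k, v)
    # Read the state out with the query, then apply the output projection.
    y = (q @ state) @ W_o
    return state, y

def run_layer(X, W_q, W_k, W_v, W_o):
    """Process a (seq_len, d_model) sequence recurrently, token by token."""
    state = np.zeros((W_k.shape[1], W_v.shape[1]))
    outputs = []
    for x_t in X:
        state, y = linearized_attention_step(state, x_t, W_q, W_k, W_v, W_o)
        outputs.append(y)
    return np.stack(outputs)
```

Because the state has fixed size, generation cost per token is constant in sequence length, which is the deployment advantage the abstract refers to; the distilled models additionally learn Mamba-style gating and selectivity on top of the reused projections.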
Community
Funny coincidence - I was just tinkering with a similar idea yesterday. Swapped out the attention blocks in Phi-3.5-mini for RNNs. If anyone's curious, you can check out my experiment here: https://github.com/JosefAlbers/Phi-3-Vision-MLX/blob/main/assets/bytephi.py
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding (2024)
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations (2024)
- Longhorn: State Space Models are Amortized Online Learners (2024)
- S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models (2024)
- How Effective are State Space Models for Machine Translation? (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Hey, Amazing work :)
We've summarised this and a few other papers in our blog. Hope you like it:
- KTO: The infamous alignment algorithm
- OLMoE: Open Data, Weights, Code Mixture of Experts models
- Mamba in the LlaMA: Distilling from Transformers to Mamba
- PlanSearch: Improving Code Generation via Planning
https://datta0.substack.com/p/ai-unplugged-19-kto-for-model-alignment