ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Abstract
As Large Language Models (LLMs) continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.
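As described in the abstract, Block Influence scores a layer by how little it transforms its input: BI for layer i is taken as one minus the average cosine similarity between the hidden states entering and leaving that layer, so layers whose outputs are nearly parallel to their inputs get low scores and become removal candidates. Below is a minimal sketch of that scoring step on cached hidden states; it is an illustration rather than the authors' released code, and the helper names are placeholders.

```python
# Minimal sketch of Block Influence (BI) scoring on cached hidden states.
# BI for layer i is 1 - mean cosine similarity between the hidden states
# entering and leaving that layer; lower BI = more "redundant" layer.
import torch
import torch.nn.functional as F

def block_influence(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """hidden_in, hidden_out: (batch, seq_len, dim) activations before/after one layer."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # (batch, seq_len)
    return 1.0 - cos.mean().item()

def rank_layers_by_bi(hidden_states: list[torch.Tensor]) -> list[int]:
    """hidden_states: per-layer activations, e.g. from a forward pass with
    output_hidden_states=True (length = num_layers + 1, including the embedding output).
    Returns layer indices sorted from least to most influential."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```

Layer removal then simply deletes the lowest-ranked decoder layers before evaluating the pruned model.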
Community
Many papers taking a transformer block pruning approach have been released recently.
SLEB focuses on the advantages of block pruning over width pruning or 2:4 structured pruning from the perspective of speedup.
https://huggingface.co/papers/2402.09025
Shortened LLaMA reports the performance of block-pruned LLMs after parameter-efficient retraining: https://huggingface.co/papers/2402.02834
We are all contemporaries striving towards the same goals in this field.
Curious why they didn't compare to greedy or beam search over layer removal sequences (scored by downstream perplexity), or even block-influence-greedy (remove the layer with the lowest Block Influence, then recompute BI and repeat).
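For concreteness, a block-influence-greedy loop along those lines might look something like this (a rough, untested sketch; `score_layers_by_bi` and `drop_layer` are hypothetical placeholders for the BI scoring and layer-deletion steps):

```python
# Rough sketch of "block-influence-greedy": repeatedly remove the layer with the
# lowest BI, then rescore the pruned model before the next removal.
# score_layers_by_bi() and drop_layer() are hypothetical placeholders.

def block_influence_greedy(model, calibration_batch, n_remove: int) -> list[int]:
    removed = []
    for _ in range(n_remove):
        scores = score_layers_by_bi(model, calibration_batch)  # {layer_idx: BI}
        victim = min(scores, key=scores.get)                   # least influential layer
        drop_layer(model, victim)                              # delete it in place
        removed.append(victim)
    return removed
```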
Your suggestion is intriguing and definitely merits consideration. However, I have a couple of reservations:
Scoring removals against specific downstream tasks yields specialized models that may not generalize sufficiently, potentially compromising overall performance.
Additionally, the implementation of beam search and block-influence-greedy algorithms, while innovative, presents significant complexity and incurs considerable computational costs.
We have provided a detailed description of our layer removal setup in the appendix of our paper, where it can be thoroughly examined.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- LaCo: Large Language Model Pruning via Layer Collapse (2024)
- Head-wise Shareable Attention for Large Language Models (2024)
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation (2024)
- SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks (2024)
- Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward (2024)
I love the block importance concept and I think it's valuable for how we understand transformers. I'm not sure if or when I would use this in production, though, given the trade-offs.
What are the trade-offs? I haven't seen them mentioned anywhere explicitly.
Could you share which dataset is used to calculate Block Influence, along with other details (sequence length, whether the average over all tokens or a sampled one is used, ...)?
Although some layers look "redundant" for MMLU, they are not really redundant for all tasks. Removing any layer in the middle results in an increase in perplexity and has negative impacts on other tasks such as multilingual generation, math, and logical reasoning.
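A quick way to see this effect is to delete one middle decoder layer in place and re-measure perplexity on a small text sample. The sketch below assumes a LLaMA-style checkpoint loaded with the standard transformers API (where decoder blocks live in `model.model.layers`); the checkpoint name, layer index, and sample text are placeholders, and this shortcut skips any retraining or proper benchmark evaluation.

```python
# Quick sanity check: perplexity before and after deleting one middle decoder layer.
# Assumes a LLaMA-style model where decoder blocks live in model.model.layers
# (true for Hugging Face LlamaForCausalLM); the checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
model.eval()

def perplexity(text: str) -> float:
    enc = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # use_cache=False avoids KV-cache layer-index bookkeeping after layer deletion
        loss = model(**enc, labels=enc["input_ids"], use_cache=False).loss
    return torch.exp(loss).item()

sample = "The quick brown fox jumps over the lazy dog. " * 100
print("perplexity before:", perplexity(sample))

del model.model.layers[15]            # remove one middle layer in place
model.config.num_hidden_layers -= 1   # keep the config consistent
print("perplexity after :", perplexity(sample))
```

In practice one would rerun this for each candidate layer (or rank layers by BI first) and evaluate on real benchmarks rather than a toy string.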
ShortGPT: Redefining Efficiency in Large Language Models!
Models citing this paper: 11
Datasets citing this paper: 0