- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution (arXiv:2405.19325, published May 29, 2024)
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710, published Apr 25, 2024)
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (arXiv:2404.11912, published Apr 18, 2024)
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length (arXiv:2404.08801, published Apr 12, 2024)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (arXiv:2403.03507, published Mar 6, 2024)
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (arXiv:2310.17157, published Oct 26, 2023)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (arXiv:2303.06865, published Mar 13, 2023)