Prompt Cache: Modular Attention Reuse for Low-Latency Inference Paper • 2311.04934 • Published Nov 7, 2023 • 28 upvotes
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models Paper • 2311.08692 • Published Nov 15, 2023 • 12 upvotes
Memory Augmented Language Models through Mixture of Word Experts Paper • 2311.10768 • Published Nov 15, 2023 • 16 upvotes
Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models Paper • 2312.06149 • Published Dec 11, 2023 • 2 upvotes
Distributed Inference and Fine-tuning of Large Language Models Over The Internet Paper • 2312.08361 • Published Dec 13, 2023 • 25 upvotes
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU Paper • 2312.12456 • Published Dec 16, 2023 • 40 upvotes
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon Paper • 2401.03462 • Published Jan 7, 2024 • 27 upvotes
Supervised Knowledge Makes Large Language Models Better In-context Learners Paper • 2312.15918 • Published Dec 26, 2023 • 8 upvotes
Speculative Streaming: Fast LLM Inference without Auxiliary Models Paper • 2402.11131 • Published Feb 16, 2024 • 42 upvotes
Smaller Language Models Are Better Instruction Evolvers Paper • 2412.11231 • Published Dec 2024 • 24 upvotes