LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention (arXiv:2502.14866, Feb 2025)
Instruction Pre-Training: Language Models are Supervised Multitask Learners (arXiv:2406.14491, Jun 20, 2024)
Byte Latent Transformer: Patches Scale Better Than Tokens (arXiv:2412.09871, Dec 13, 2024)