Collections including paper arxiv:2404.16710

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 572
- Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
  Paper • 2404.16710 • Published • 56
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
  Paper • 2405.08707 • Published • 27
- Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
  Paper • 2308.06744 • Published • 1
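Two entries in this collection (2402.17764 and 2308.06744) center on ternary weights, where every weight is constrained to {-1, 0, +1}. As a concrete reference point, here is a minimal NumPy sketch of the absmean quantization scheme described in the 1.58-bit paper; the function names and the demo at the bottom are illustrative, not from the paper.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor scale.

    Follows the absmean scheme described in the BitNet b1.58 paper:
    divide by the mean absolute value, then round and clip to [-1, 1].
    """
    gamma = np.mean(np.abs(w)) + eps           # per-tensor absmean scale
    w_q = np.clip(np.round(w / gamma), -1, 1)  # ternary values {-1, 0, +1}
    return w_q.astype(np.int8), gamma

def ternary_matmul(x: np.ndarray, w_q: np.ndarray, gamma: float):
    """Matrix multiply with ternary weights; rescale by gamma afterwards."""
    return (x @ w_q.astype(np.float32)) * gamma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64)).astype(np.float32)
    x = rng.normal(size=(4, 64)).astype(np.float32)
    w_q, gamma = ternary_quantize(w)
    err = np.abs(x @ w - ternary_matmul(x, w_q, gamma)).mean()
    print(f"mean abs error of ternary matmul: {err:.3f}")
```

The point of the ternary constraint is that the matmul reduces to additions and subtractions; the float cast in `ternary_matmul` is only there to keep the sketch in plain NumPy.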
---

- Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
  Paper • 2404.18911 • Published • 29
- Accelerating LLM Inference with Staged Speculative Decoding
  Paper • 2308.04623 • Published • 21
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
  Paper • 2310.12962 • Published • 13
- The Curious Case of Neural Text Degeneration
  Paper • 1904.09751 • Published • 2
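The last entry, The Curious Case of Neural Text Degeneration, is the paper that introduced nucleus (top-p) sampling: sample from the smallest set of tokens whose cumulative probability exceeds p. A minimal sketch, assuming raw logits as input (the function name is mine):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose cumulative
    probability reaches p (nucleus / top-p sampling, Holtzman et al. 2019)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending prob
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, p)) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```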
---

- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  Paper • 2402.17764 • Published • 572
- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 94
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
  Paper • 2404.02258 • Published • 102
- TransformerFAM: Feedback attention is working memory
  Paper • 2404.09173 • Published • 42
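Mixture-of-Depths gives each block a learned router that sends only a fixed fraction of tokens through the block while the rest ride the residual stream. A rough NumPy sketch of one such routing step, assuming a linear router; `router_w` and `block_fn` are hypothetical stand-ins for the learned router and the transformer block:

```python
import numpy as np

def mixture_of_depths_block(x, router_w, block_fn, capacity: float = 0.5):
    """Route only the top-k scoring tokens through `block_fn`; the rest
    pass through unchanged on the residual path."""
    seq_len, _ = x.shape
    k = max(1, int(capacity * seq_len))  # per-sequence token compute budget
    scores = x @ router_w                # one scalar router score per token
    top_k = np.argsort(scores)[-k:]      # tokens that receive compute
    out = x.copy()                       # residual path for skipped tokens
    # Scale the block output by the router score so the routing decision
    # stays on the gradient path during training.
    out[top_k] = x[top_k] + scores[top_k, None] * block_fn(x[top_k])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 32)).astype(np.float32)
    router_w = rng.normal(size=(32,)).astype(np.float32)
    toy_block = lambda h: np.tanh(h)     # toy stand-in for a transformer block
    print(mixture_of_depths_block(x, router_w, toy_block).shape)
```

With `capacity = 0.5`, half the tokens skip the block entirely, which is where the compute savings come from.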
---

- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
  Paper • 2403.16990 • Published • 24
- ViTAR: Vision Transformer with Any Resolution
  Paper • 2403.18361 • Published • 48
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models
  Paper • 2404.01197 • Published • 29
- Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
  Paper • 2404.01367 • Published • 19
---

- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
  Paper • 2310.20587 • Published • 15
- JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
  Paper • 2310.00535 • Published • 2
- Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
  Paper • 2307.09458 • Published • 9
- The Impact of Depth and Width on Transformer Language Model Generalization
  Paper • 2310.19956 • Published • 9
---

- Accelerating LLM Inference with Staged Speculative Decoding
  Paper • 2308.04623 • Published • 21
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
  Paper • 2310.12962 • Published • 13
- The Curious Case of Neural Text Degeneration
  Paper • 1904.09751 • Published • 2
- On Speculative Decoding for Multimodal Large Language Models
  Paper • 2404.08856 • Published • 12
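The common thread in this collection is speculative decoding: a small draft model proposes several tokens and the large target model verifies them in a single forward pass. The papers above build on the classic accept/reject rule of Leviathan et al. (2023), sketched below under the assumption that both models' next-token distributions are already available as arrays (one distribution per draft position, plus one extra for the target):

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_tokens, rng):
    """One verification step of speculative decoding.

    `draft_tokens` were sampled from the draft distributions `q_draft`;
    `p_target` holds the target distributions for the same positions plus
    one extra. Accept draft token t with prob min(1, p(t)/q(t)); on the
    first rejection, resample from the residual max(p - q, 0)."""
    accepted = []
    for i, t in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(int(t))          # target agrees often enough
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            return accepted                  # stop at the first rejection
    # All drafts accepted: sample one bonus token from the target model.
    bonus = p_target[len(draft_tokens)]
    accepted.append(int(rng.choice(len(bonus), p=bonus)))
    return accepted
```

In practice `p_target` comes from one batched forward pass of the target model over all draft positions at once, which is where the wall-clock savings come from.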
---

- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
  Paper • 2403.09636 • Published • 2
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
  Paper • 2404.11912 • Published • 16
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  Paper • 2401.02669 • Published • 11
- Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
  Paper • 2404.16710 • Published • 56
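Layer Skip, which anchors this page, combines early-exit inference with self-speculative decoding. The toy sketch below shows only the early-exit half of that idea, exiting once an intermediate layer's prediction clears a confidence threshold; the layer and LM-head stand-ins and the exit rule are illustrative assumptions, not the paper's exact recipe (LayerSkip additionally trains with layer dropout and verifies early exits with the remaining layers).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(h, layers, lm_head, threshold: float = 0.9):
    """Run layers one at a time and exit as soon as the shared LM head
    is confident enough about the next token. Returns (token, depth)."""
    for depth, layer in enumerate(layers):
        h = layer(h)
        probs = softmax(lm_head(h))
        if probs.max() >= threshold:          # confident: skip the rest
            return int(np.argmax(probs)), depth + 1
    return int(np.argmax(probs)), len(layers) # fell through: full depth

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ws = [rng.normal(size=(32, 32)) * 0.1 for _ in range(8)]
    layers = [lambda h, w=w: np.tanh(h @ w) + h for w in ws]
    head_w = rng.normal(size=(32, 100))
    lm_head = lambda h: h @ head_w
    token, depth = early_exit_forward(rng.normal(size=32), layers, lm_head)
    print(f"exited after {depth} layers with token {token}")
```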