Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models • arXiv:2503.09573 (decoding-loop sketch after the list)
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • arXiv:2503.01743
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? • arXiv:2502.14502 (LoRA sketch after the list)
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention • arXiv:2502.11089 (block-selection sketch after the list)
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU • arXiv:2502.08910
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864 (latent-KV sketch after the list)
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling • arXiv:2502.06703 (best-of-N sketch after the list)
The Differences Between Direct Alignment Algorithms are a Blur • arXiv:2502.01237 (DPO-loss sketch after the list)
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model • arXiv:2502.02737
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models • arXiv:2502.01061
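Block Diffusion (arXiv:2503.09573) interpolates between autoregressive and diffusion decoding by generating blocks left to right while filling each block with a diffusion-style denoising loop. The toy skeleton below shows only that decoding pattern: the denoiser is a hypothetical stub, and the unmasking schedule, vocabulary, and block sizes are illustrative choices, not the paper's.

```python
import math
import random

random.seed(0)
MASK = "<m>"
VOCAB = list("abcdefgh")

def denoise_step(prefix, block):
    """Hypothetical stand-in for a denoiser that proposes tokens for the
    masked positions of the current block, conditioned on earlier blocks."""
    return [random.choice(VOCAB) if t == MASK else t for t in block]

def sample_block_diffusion(num_blocks=3, block_size=4, steps=4):
    """Autoregressive across blocks, diffusion-style within a block:
    each block starts fully masked and is unmasked over several steps."""
    seq = []
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for _ in range(steps):
            proposal = denoise_step(seq, block)
            masked = [i for i, t in enumerate(block) if t == MASK]
            quota = min(len(masked), math.ceil(block_size / steps))
            for i in random.sample(masked, quota):
                block[i] = proposal[i]  # commit a few positions per step
        seq.extend(block)
    return "".join(seq)

print(sample_block_diffusion())  # a 12-character toy sample
```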
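The LoRA entry (arXiv:2502.14502) asks how much knowledge a low-rank adapter can absorb. For reference, here is the standard LoRA update that work builds on, y = Wx + (α/r)·BAx with W frozen; a minimal sketch with illustrative dimensions, not the paper's experimental setup. Phi-4-Mini's "Mixture-of-LoRAs" (arXiv:2503.01743) stacks several such adapters and routes between them.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (standard LoRA)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.scale = alpha / r
        # A starts as small noise, B as zeros, so the adapter is a no-op
        # at initialization and training moves it away from zero.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 512])
```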
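Native Sparse Attention (arXiv:2502.11089) combines compressed, selected, and sliding-window attention paths. The sketch below shows just the block-selection idea in a simplified single-query, single-head form with made-up shapes; it is not NSA's full design or its hardware-aligned kernel.

```python
import torch
import torch.nn.functional as F

def topk_block_attention(q, K, V, block_size=16, k_blocks=4):
    """Attend only over the k key blocks whose pooled keys score highest
    against the query, one component of block-sparse attention schemes.
    q: (d,)   K, V: (T, d) with T divisible by block_size."""
    d = q.shape[-1]
    Kb = K.view(-1, block_size, d)        # (num_blocks, block_size, d)
    Vb = V.view(-1, block_size, d)
    block_scores = Kb.mean(dim=1) @ q     # coarse relevance score per block
    top = block_scores.topk(min(k_blocks, Kb.shape[0])).indices
    K_sel = Kb[top].reshape(-1, d)        # gather only the selected blocks
    V_sel = Vb[top].reshape(-1, d)
    attn = F.softmax(K_sel @ q / d**0.5, dim=-1)
    return attn @ V_sel                   # (d,)

q = torch.randn(64)
K, V = torch.randn(128, 64), torch.randn(128, 64)
print(topk_block_attention(q, K, V).shape)  # torch.Size([64])
```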
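TransMLA (arXiv:2502.07864) converts GQA-style checkpoints to Multi-head Latent Attention. MLA's core trick, as introduced in DeepSeek-V2, is to cache one small per-token latent instead of full per-head keys and values. The sketch below shows only that compression path, with illustrative dimensions; RoPE handling and the actual conversion procedure are omitted.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Cache a low-dimensional latent per token; expand it to per-head
    keys/values on the fly. Illustrative sketch of the MLA KV cache."""
    def __init__(self, d_model=512, n_heads=8, d_head=64, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down = nn.Linear(d_model, d_latent)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head)  # expand to V

    def forward(self, h):                 # h: (batch, seq, d_model)
        c = self.down(h)                  # (batch, seq, d_latent): the KV cache
        B, T, _ = c.shape
        k = self.up_k(c).view(B, T, self.n_heads, self.d_head)
        v = self.up_v(c).view(B, T, self.n_heads, self.d_head)
        return c, k, v                    # cache c; k, v are recomputed from it

m = LatentKV()
c, k, v = m(torch.randn(2, 10, 512))
print(c.shape, k.shape)  # cache (2,10,64) vs. (2,10,8,64) for each of K and V
```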
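The test-time scaling paper (arXiv:2502.06703) studies how a small model plus search-time compute can beat a much larger one. The simplest strategy in that family is best-of-N sampling against a reward model; both `generate` and `reward` below are hypothetical stubs standing in for real model calls.

```python
import random

random.seed(0)

def generate(prompt: str) -> str:
    """Stand-in for sampling one completion from a small LLM (hypothetical)."""
    return f"{prompt} -> draft#{random.randint(0, 999)}"

def reward(completion: str) -> float:
    """Stand-in for an outcome reward model's score (hypothetical)."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    """Best-of-N: spend more test-time compute by sampling n candidates
    and keeping the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

print(best_of_n("Prove that sqrt(2) is irrational.", n=8))
```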
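The direct-alignment paper (arXiv:2502.01237) argues that DAA variants (DPO, IPO, and relatives) differ less than commonly assumed. For orientation, here is the DPO objective they all orbit, L = −log σ(β(Δ_chosen − Δ_rejected)), where Δ is the policy-minus-reference log-probability of a response; a self-contained sketch, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss: maximize the margin by which
    the policy prefers the chosen response over the rejected one, measured
    relative to a frozen reference model. Inputs are summed per-sequence
    token log-probs, shape (batch,)."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy check: when the policy matches the reference, the loss is log(2).
z = torch.zeros(4)
print(dpo_loss(z, z, z, z))  # tensor(0.6931)
```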