Balancing Pipeline Parallelism with Vocabulary Parallelism Paper • 2411.05288 • Published Nov 2024 • 18
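This paper targets the pipeline-bubble imbalance caused by the vocabulary layers. Its scheduling scheme is not reproduced here, but the core idea of vocabulary parallelism, sharding the LM head's vocabulary dimension and combining partial softmax statistics, can be sketched in a single process as below; the sizes and shard count are illustrative only.

```python
import torch

# Minimal single-process sketch of vocabulary parallelism: the LM head's
# vocabulary dimension is sharded across "workers", each computes partial
# logits, and the softmax normalizer is combined across shards.
# Dimensions and shard count are illustrative, not from the paper.
torch.manual_seed(0)
hidden, vocab, shards = 64, 1000, 4

x = torch.randn(8, hidden)                      # hidden states for 8 tokens
lm_head = torch.randn(vocab, hidden) / hidden**0.5

# Each shard holds a contiguous slice of the vocabulary rows.
shard_weights = lm_head.chunk(shards, dim=0)

# Per-shard partial logits and partial log-sum-exp of the softmax denominator.
partial_logits = [x @ w.t() for w in shard_weights]            # [8, vocab/shards] each
partial_lse = [p.logsumexp(dim=-1) for p in partial_logits]    # [8] each

# Combine: the global normalizer is the log-sum-exp of the per-shard ones
# (the reduction a real implementation would do with an all-reduce).
global_lse = torch.stack(partial_lse, dim=0).logsumexp(dim=0)

# Reference: unsharded log-probabilities match the sharded computation.
ref_logits = x @ lm_head.t()
ref_logprobs = ref_logits - ref_logits.logsumexp(dim=-1, keepdim=True)
shard0_logprobs = partial_logits[0] - global_lse[:, None]
assert torch.allclose(ref_logprobs[:, : vocab // shards], shard0_logprobs, atol=1e-5)
```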
LoLCATS Collection Linearizing LLMs with high quality and efficiency. We linearize the full Llama 3.1 model family (8B, 70B, and 405B) for the first time! • 4 items • Updated Oct 14 • 14
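Linearizing here means replacing softmax attention with a linear-attention form; LoLCATS learns feature maps (followed by low-rank fine-tuning) so the approximation stays faithful. The sketch below uses a fixed ELU+1 feature map as a stand-in for the learned maps, just to show the O(T) recurrent form that linearization buys.

```python
import torch

def causal_linear_attention(q, k, v):
    """Causal linear attention with a fixed ELU+1 feature map.

    LoLCATS learns its feature maps to mimic softmax attention; the fixed
    map here is only a stand-in to illustrate the O(T) recurrent form.
    """
    phi_q = torch.nn.functional.elu(q) + 1          # [T, d]
    phi_k = torch.nn.functional.elu(k) + 1          # [T, d]
    T, d = q.shape
    state = torch.zeros(d, v.shape[-1])             # running sum of phi(k)^T v
    norm = torch.zeros(d)                           # running sum of phi(k)
    out = torch.empty_like(v)
    for t in range(T):                              # O(T) instead of O(T^2)
        state = state + phi_k[t, :, None] * v[t, None, :]
        norm = norm + phi_k[t]
        out[t] = (phi_q[t] @ state) / (phi_q[t] @ norm + 1e-6)
    return out

q, k, v = torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 32)
print(causal_linear_attention(q, k, v).shape)       # torch.Size([16, 32])
```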
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning Paper • 2409.12183 • Published Sep 18 • 36
Enhancing Training Efficiency Using Packing with Flash Attention Paper • 2407.09105 • Published Jul 12 • 14
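Packing concatenates several short documents into one row and relies on FlashAttention's variable-length interface so attention never crosses a document boundary. The sketch below only builds the packing metadata (cumulative sequence lengths, per-document position ids, and the equivalent block-diagonal mask); the actual flash_attn_varlen_func call and the Hugging Face integration described in the paper are omitted, and the token ids are made up.

```python
import torch

# Sketch of sequence packing as used with FlashAttention's variable-length
# kernels: documents are concatenated into one row, and cumulative sequence
# lengths (cu_seqlens) plus per-document position ids tell the kernel where
# each document starts. Token ids and lengths below are illustrative.
docs = [[5, 8, 2], [7, 7, 7, 7, 9], [3, 1]]

input_ids = torch.tensor([t for d in docs for t in d])
seqlens = torch.tensor([len(d) for d in docs])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seqlens.cumsum(0)])
position_ids = torch.cat([torch.arange(n) for n in seqlens])

print(cu_seqlens.tolist())     # [0, 3, 8, 10] -> document boundaries
print(position_ids.tolist())   # [0, 1, 2, 0, 1, 2, 3, 4, 0, 1]

# Equivalent dense mask (what the varlen kernel avoids materializing):
# causal within each document, blocked across documents.
T = input_ids.numel()
doc_id = torch.repeat_interleave(torch.arange(len(docs)), seqlens)
mask = (doc_id[:, None] == doc_id[None, :]) & \
       (torch.arange(T)[:, None] >= torch.arange(T)[None, :])
```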
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler Paper • 2408.13359 • Published Aug 23 • 22
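The exact schedule derived in the paper is not reproduced here; as a rough, hedged sketch of the general shape (linear warmup followed by a power-law decay clamped at a maximum rate), something like the following conveys the idea. All constants are placeholders, not the paper's values.

```python
def power_lr(step: int, lr_max: float = 3e-3, warmup: int = 1000,
             a: float = 0.1, b: float = 0.5) -> float:
    """Hedged sketch of a power-law learning-rate schedule.

    Linear warmup, then a decay proportional to step**(-b), clamped at
    lr_max. The constants a, b, lr_max, and warmup are placeholders; the
    paper derives its own parameterization (agnostic to batch size and
    total token count), so consult it for the exact formula.
    """
    if step < warmup:
        return lr_max * (step + 1) / warmup
    return min(lr_max, a * (step + 1) ** (-b))

for s in [0, 500, 1000, 10_000, 100_000]:
    print(s, f"{power_lr(s):.6f}")
```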
Power-LM Collection Dense & MoE LLMs trained with power learning rate scheduler. • 4 items • Updated Oct 17 • 15
Article Introducing AuraFace: Open-Source Face Recognition and Identity Preservation Models By isidentical • Aug 26 • 36
To Code, or Not To Code? Exploring Impact of Code in Pre-training Paper • 2408.10914 • Published Aug 20 • 41
💻 Local SmolLMs Collection SmolLM models in MLC, ONNX and GGUF format for local applications + in-browser demos • 14 items • Updated Aug 20 • 45
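A hedged example of running one of the GGUF builds locally with llama-cpp-python; the file name is a placeholder for whichever quantization you download from the collection.

```python
# Hedged sketch: running a GGUF SmolLM build locally with llama-cpp-python.
# The model_path is a placeholder; download a GGUF file from the collection
# first and point it there.
from llama_cpp import Llama

llm = Llama(model_path="./smollm-360m-instruct.q4_k_m.gguf", n_ctx=2048)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me one sentence about the Moon."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```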
🪐 SmolLM Collection A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos • 12 items • Updated Aug 18 • 198
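A minimal sketch of loading one of the SmolLM checkpoints with transformers; the repo id is an assumption based on the collection's naming, so substitute the size and variant you actually want.

```python
# Assumed repo id; any of the base or Instruct SmolLM checkpoints works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "HuggingFaceTB/SmolLM-360M-Instruct"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

messages = [{"role": "user", "content": "What is gravity?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```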
Article A failed experiment: Infini-Attention, and why we should keep trying? • Aug 14 • 50
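For context, Infini-Attention augments local attention with a compressive memory that is read with a linear-attention style lookup and then updated with the current segment's keys and values. The sketch below roughly follows that formulation and omits details such as the delta-rule update and the learned gate that mixes memory output with local attention.

```python
import torch

def elu1(x):
    return torch.nn.functional.elu(x) + 1

def infini_memory_step(q, k, v, M, z):
    """One segment of Infini-Attention's compressive memory (rough sketch).

    q, k, v: [T, d] for the current segment; M: [d, d_v] memory; z: [d] normalizer.
    Retrieval reads what earlier segments wrote; the update then folds the
    current segment into the memory.
    """
    sq, sk = elu1(q), elu1(k)
    retrieved = (sq @ M) / (sq @ z + 1e-6)[:, None]   # read from memory
    M = M + sk.t() @ v                                # write current keys/values
    z = z + sk.sum(dim=0)
    return retrieved, M, z

d, dv = 32, 32
M, z = torch.zeros(d, dv), torch.zeros(d)
for _ in range(3):  # stream three segments through the compressive memory
    q, k, v = (torch.randn(16, d) for _ in range(3))
    retrieved, M, z = infini_memory_step(q, k, v, M, z)
print(retrieved.shape)  # torch.Size([16, 32])
```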
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs Paper • 2408.07055 • Published Aug 13 • 65
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters Paper • 2408.04093 • Published Aug 7 • 4
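The idea is that each device computes attention over its own slice of keys and values plus a log-sum-exp statistic, and because the combine step is associative the partials can be reduced in a tree across the cluster. A single-query NumPy sketch (no distributed plumbing) is below.

```python
import numpy as np

# Each "chunk" stands in for one device's slice of keys/values. The partial
# outputs carry a log-sum-exp statistic, and the merge is associative, so the
# reduction can be done pairwise in a tree. Single-query (decoding) case.
def chunk_partial(q, K, V):
    s = K @ q                                        # scores for this chunk
    m = s.max()
    w = np.exp(s - m)
    return (w @ V) / w.sum(), m + np.log(w.sum())    # (partial output, lse)

def merge(a, b):
    (oa, la), (ob, lb) = a, b
    l = np.logaddexp(la, lb)
    return np.exp(la - l) * oa + np.exp(lb - l) * ob, l

rng = np.random.default_rng(0)
d, T, n_chunks = 16, 128, 4
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))

parts = [chunk_partial(q, Kc, Vc)
         for Kc, Vc in zip(np.split(K, n_chunks), np.split(V, n_chunks))]
while len(parts) > 1:                                # pairwise tree reduction
    parts = [merge(parts[i], parts[i + 1]) for i in range(0, len(parts), 2)]

# Reference: full softmax attention over all keys at once.
s = K @ q
ref = (np.exp(s - s.max()) @ V) / np.exp(s - s.max()).sum()
assert np.allclose(parts[0][0], ref)
```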
Article The case for specialized pre-training: ultra-fast foundation models for dedicated tasks By Pclanglais • Aug 4 • 26