Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention • Paper 2312.08618 • Published Dec 14, 2023
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention • Paper 2312.07987 • Published Dec 13, 2023
Cached Transformers: Improving Transformers with Differentiable Memory Cache • Paper 2312.12742 • Published Dec 20, 2023
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling • Paper 2312.15166 • Published Dec 23, 2023
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models • Paper 2401.04658 • Published Jan 9, 2024
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens • Paper 2401.17377 • Published Jan 30, 2024
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey • Paper 2311.12351 • Published Nov 21, 2023
Learning and Leveraging World Models in Visual Representation Learning • Paper 2403.00504 • Published Mar 1, 2024