arXiv:2406.16747

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Published on Jun 24
· Submitted by zlzheng on Jun 25

Abstract

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
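To make the selection step concrete, here is a minimal PyTorch sketch of top-k sparse attention driven by a scoring network. It uses a hard top-k in place of the paper's differentiable SPARSEK mask operator, omits causal masking, and the names (`topk_sparse_attention`, `key_scores`, `scorer`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def topk_sparse_attention(q, k, v, key_scores, top_k):
    """Attend each query only to the top_k highest-scoring keys.

    q: (B, Tq, D), k/v: (B, Tk, D), key_scores: (B, Tk) from a scoring network.
    Hard top-k stands in for the paper's differentiable SPARSEK mask operator;
    causal masking is omitted for brevity.
    """
    B, Tk, D = k.shape
    top_k = min(top_k, Tk)

    # Select the top_k keys per sequence according to the scoring network.
    _, idx = key_scores.topk(top_k, dim=-1)                  # (B, top_k)
    idx_kv = idx.unsqueeze(-1).expand(-1, -1, D)             # (B, top_k, D)
    k_sel = k.gather(1, idx_kv)
    v_sel = v.gather(1, idx_kv)

    # Standard scaled dot-product attention over the selected KV pairs only.
    logits = torch.einsum("bqd,bkd->bqk", q, k_sel) / D ** 0.5
    weights = F.softmax(logits, dim=-1)
    return torch.einsum("bqk,bkd->bqd", weights, v_sel)      # (B, Tq, D)


# Example usage with a tiny linear scoring network (illustrative only).
B, T, D, k_budget = 2, 128, 64, 16
scorer = nn.Linear(D, 1)
q, k, v = (torch.randn(B, T, D) for _ in range(3))
scores = scorer(k).squeeze(-1)            # (B, T) importance score per key
out = topk_sparse_attention(q, k, v, scores, k_budget)
print(out.shape)                          # torch.Size([2, 128, 64])
```

Because each query attends to a constant number of KV pairs, the attention cost grows linearly with sequence length rather than quadratically, which is the efficiency argument the abstract makes.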

Community

Paper author and submitter

We introduce SparseK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles while maintaining performance.

  • Incremental KV Selection. The SparseK operator supports incremental evaluation and thus offers linear time complexity and a constant memory footprint during generation (see the sketch after this list).
  • Computational and Memory Efficiency. SparseK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference.
  • Extension with IO-awareness. SparseK Attention can be integrated with IO-aware mechanisms, such as FlashAttention, resulting in increased speed and improved memory efficiency.
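As a rough illustration of the constant-memory generation path described in the first bullet, the sketch below keeps a fixed-budget KV cache and evicts the lowest-scoring entry at each decoding step. The class name `FixedSizeKVCache`, the `budget` parameter, and the hard eviction rule are assumptions made for illustration; the paper's incremental SPARSEK evaluation is differentiable and more involved.

```python
import torch
import torch.nn.functional as F


class FixedSizeKVCache:
    """Keep at most `budget` KV pairs by evicting the lowest-scoring entry.

    Illustrative only: scores come from an external scoring network, and
    hard eviction stands in for the paper's incremental SPARSEK evaluation.
    """

    def __init__(self, budget):
        self.budget = budget
        self.keys, self.values, self.scores = [], [], []

    def insert(self, k_t, v_t, score_t):
        # k_t, v_t: (D,); score_t: scalar tensor for the new token.
        self.keys.append(k_t)
        self.values.append(v_t)
        self.scores.append(score_t)
        if len(self.keys) > self.budget:
            # Evict the least useful KV pair so memory stays constant.
            worst = torch.stack(self.scores).argmin().item()
            for buf in (self.keys, self.values, self.scores):
                buf.pop(worst)

    def attend(self, q_t):
        # q_t: (D,) -> attention output over the cached KV pairs.
        K = torch.stack(self.keys)                    # (<=budget, D)
        V = torch.stack(self.values)                  # (<=budget, D)
        w = F.softmax(q_t @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return w @ V
```

Under this scheme each decoding step touches at most `budget` KV pairs, so per-token cost and memory do not grow with the generated length, which mirrors the linear-time, constant-memory claim in the bullet above.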
