When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Abstract
Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with the BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.
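The relative-position property mentioned in the abstract can be checked numerically on a toy case. The sketch below is an illustration only, not the paper's analysis: it emulates BFloat16 in pure Python by rounding float32 bit patterns to their upper 16 bits, applies a single 2-D RoPE rotation with frequency 1.0, and measures how much the query-key logit changes when both positions shift by the same amount. The function names (`to_bf16`, `rotate`, `score`, `max_drift`), the choice of which intermediates get rounded, and the toy vectors are all assumptions made for this example.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BFloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rotate(vec, pos, theta, cast):
    """Apply a 2-D RoPE rotation by angle pos * theta, rounding with `cast`."""
    x, y = vec
    c, s = cast(math.cos(pos * theta)), cast(math.sin(pos * theta))
    return (cast(cast(x * c) - cast(y * s)), cast(cast(x * s) + cast(y * c)))


def score(m, n, cast, theta=1.0, q=(1.0, 0.0), k=(1.0, 0.0)):
    """Logit between a query at position m and a key at position n.

    The final dot product is left in float64; only the rotation is rounded.
    """
    qm, kn = rotate(q, m, theta, cast), rotate(k, n, theta, cast)
    return qm[0] * kn[0] + qm[1] * kn[1]


def max_drift(cast, offset=1, max_shift=4096):
    """Largest change in the logit when both positions shift together.

    With exact arithmetic this is zero, because the logit depends only on
    the relative distance `offset`.
    """
    ref = score(offset, 0, cast)
    return max(abs(score(offset + s, s, cast) - ref) for s in range(max_shift))


print("float64 drift:", max_drift(lambda v: v))
print("bf16 drift:   ", max_drift(to_bf16))
```

In this toy setting the float64 drift stays at rounding-noise level, while the emulated BFloat16 drift is orders of magnitude larger, which is the kind of deviation from purely relative encoding that the abstract describes. Real training kernels round at different points, so the magnitudes here should not be read as measurements.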
Community
Long-context language models using Rotary Position Embedding (RoPE) with BFloat16 precision face broken relative positional encoding due to numerical errors, especially as context length increases. This breakdown is exacerbated by the first token's role in positional deviations. To address this, the paper introduces AnchorAttention, an attention mechanism that treats the first token as a shared anchor across documents, ensuring consistency in position IDs and reducing cumulative errors. AnchorAttention improves long-context performance, accelerates training, and requires minimal changes to existing models, outperforming traditional attention methods on benchmarks like RULER and LongBench.
The implementation of AnchorAttention supports several popular models and builds on FlashAttention2 and FlexAttention. Code is at: https://github.com/haonan3/AnchorContext.
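The shared-anchor idea described above can be pictured as an attention mask. The following is a minimal sketch under stated assumptions, not the code from the linked repository: it assumes a packed training sequence whose index 0 holds the anchor token, followed by documents laid out back to back, with causal attention inside each document and no attention across documents. The helper name `anchor_attention_mask` is hypothetical, and position-ID handling is left out.

```python
def anchor_attention_mask(doc_lengths):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout: index 0 is the shared anchor token, then the documents
    follow one after another with the given lengths.
    """
    doc_id = [-1]  # the anchor belongs to no document
    for d, length in enumerate(doc_lengths):
        doc_id.extend([d] * length)
    n = len(doc_id)
    return [
        [
            # causal, and the key is either the anchor or in the same document
            j <= i and (j == 0 or doc_id[i] == doc_id[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


mask = anchor_attention_mask([2, 3])
n = len(mask)
attended = sum(sum(row) for row in mask)
full_causal = n * (n + 1) // 2
print(f"attended pairs: {attended} of {full_causal} in full causal attention")
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid shows every row keeping column 0 (the anchor) while the blocks between different documents are empty, which is where the reduction in attention computation comes from in this simplified picture.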
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Why Does the Effective Context Length of LLMs Fall Short? (2024)
- On the token distance modeling ability of higher RoPE attention dimension (2024)
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (2024)
- A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts (2024)
- How to Train Long-Context Language Models (Effectively) (2024)