Forgetting Transformer: Softmax Attention with a Forget Gate
Abstract
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
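To make the mechanism concrete, below is a minimal, quadratic-memory sketch of Forgetting Attention in PyTorch. It assumes each head has a per-timestep forget gate in (0, 1), supplied here as its logarithm `log_f`, whose cumulative sum is added as a bias to the attention logits before the softmax. The function name, tensor shapes, and gate parameterization are illustrative assumptions rather than the repository's API; the released code avoids materializing the full bias matrix by using a FlashAttention-style Triton kernel.

```python
import torch
import torch.nn.functional as F

def forgetting_attention_sketch(q, k, v, log_f):
    """Naive Forgetting Attention (illustrative sketch, not the official kernel).

    q, k, v: (batch, heads, seq_len, head_dim)
    log_f:   (batch, heads, seq_len) -- log of per-timestep forget gates in (0, 1),
             e.g. log_f = F.logsigmoid(x @ w_f) for some learned projection w_f (assumed).
    """
    B, H, T, D = q.shape
    # Cumulative log-forget values: c[t] = sum_{l <= t} log f_l
    c = torch.cumsum(log_f, dim=-1)                    # (B, H, T)
    # Bias term c[i] - c[j] = sum_{l = j+1 .. i} log f_l, i.e. how much
    # key j has been "forgotten" by the time query i attends to it.
    bias = c[..., :, None] - c[..., None, :]           # (B, H, T, T)
    # Down-weight the unnormalized scores by adding the bias to the logits.
    logits = q @ k.transpose(-1, -2) / D ** 0.5 + bias
    # Causal mask: query i only attends to keys j <= i.
    causal = torch.tril(torch.ones(T, T, device=q.device)).bool()
    logits = logits.masked_fill(~causal, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

When every gate equals 1 (`log_f` is all zeros), the bias vanishes and this reduces to standard causal softmax attention, which is how FoX can learn not to forget when long-range retrieval is required; note also that no positional embeddings appear anywhere in the computation.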
Community
The core method is summarized above. Highlights:
• No need for RoPE
• Hyperparameter-free
• FlashAttention-compatible
• Consistently better than or on par with the (RoPE-based) Transformer
• Great long-context capabilities, similar to the standard Transformer (yes, it learns not to forget if necessary!)
You can also see our post on X for an extended summary of our work. The code is available at https://github.com/zhixuan-lin/forgetting-transformer. We provide a plug-and-play Triton kernel with minimal dependencies. Try it today!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders (2025)
- Sliding Window Attention Training for Efficient Large Language Models (2025)
- Linear Attention for Efficient Bidirectional Sequence Modeling (2025)
- ReGLA: Refining Gated Linear Attention (2025)
- Liger: Linearizing Large Language Models to Gated Recurrent Structures (2025)
- MoM: Linear Sequence Modeling with Mixture-of-Memories (2025)
- Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures (2025)