KV Shifting Attention Enhances Language Modeling
Abstract
Current large language models are mainly based on decoder-only transformers, which have strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two layers of attention. To implement the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the depth and width the model needs to realize the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial for learning induction heads and for language modeling, leading to better performance or faster convergence, from toy models to pre-trained models with more than 10B parameters.
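The abstract names the mechanism without spelling it out. As a rough illustration only, here is a minimal sketch of one natural reading of "KV shifting": each position's key and value are mixed with those of the immediately preceding position via learnable per-head scalars before standard causal attention. The parameterization below (the `w_*` scalars and the single-step shift) is an assumption based on the name and the induction-heads motivation, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def kv_shift(x, w_cur, w_prev):
    """Mix each position's vectors with the previous position's.

    x:      (batch, heads, seq, head_dim)
    w_cur:  (heads,) learnable weight on the current token (hypothetical)
    w_prev: (heads,) learnable weight on the previous token (hypothetical)
    """
    # Shift the sequence right by one position; position 0 sees zeros.
    x_prev = F.pad(x, (0, 0, 1, 0))[:, :, :-1, :]
    return w_cur.view(1, -1, 1, 1) * x + w_prev.view(1, -1, 1, 1) * x_prev

def kv_shifting_attention(q, k, v, w_k_cur, w_k_prev, w_v_cur, w_v_prev):
    """Standard causal attention computed over shifted keys and values."""
    k = kv_shift(k, w_k_cur, w_k_prev)
    v = kv_shift(v, w_v_cur, w_v_prev)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Intuitively, because the shifted key and value let position t be matched and read out using information about token t-1, a single head can play a role that in standard attention requires a previous-token head in one layer composing with a matching head in a later layer.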
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers (2024)
- AntLM: Bridging Causal and Masked Language Models (2024)
- Recycled Attention: Efficient inference for long-context language models (2024)
- Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning (2024)
- Hymba: A Hybrid-head Architecture for Small Language Models (2024)
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity (2024)
- Can bidirectional encoder become the ultimate winner for downstream applications of foundation models? (2024)
Our analysis shows that the induction heads mechanism places certain requirements on the width and depth of the transformer.
Therefore, we propose KV shifting attention. In theory, it needs only half the depth and width of standard attention to implement the induction heads mechanism. In practice, KV shifting attention performs significantly better than standard transformers at learning induction heads, multi-hop reasoning, and math.
We also conducted large-scale text pre-training with 2.9B and 19B parameter models, and achieved faster convergence and better benchmark performance with KV shifting attention. This may be one way to improve model reasoning during the pre-training phase.
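For readers unfamiliar with the induction heads benchmark mentioned above, the task is: given a sequence that contains a pattern [A][B] ... [A], predict [B]. A toy generator for such sequences (purely illustrative, not the paper's evaluation code; all names are hypothetical) might look like:

```python
import random

def make_induction_example(vocab_size=64, length=32, seed=None):
    """Generate one toy induction-heads example.

    The sequence contains a bigram (a, b) early on and ends with a second
    occurrence of a; a model with working induction heads should predict b.
    """
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(length)]
    a, b = rng.randrange(vocab_size), rng.randrange(vocab_size)
    i = rng.randrange(1, length - 2)      # where the bigram (a, b) is planted
    tokens[i], tokens[i + 1] = a, b
    tokens[-1] = a                        # repeat a at the end of the prompt
    return tokens, b                      # target: the token that followed a

tokens, target = make_induction_example(seed=0)
print(tokens, "->", target)
```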
Cool work Mingyu!
I found an open-source implementation of KV shifting attention.
If you want to get started quickly, you can use 8 A100 GPUs and verify it in about 2 hours: https://github.com/erogol/BlaGPT