Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Abstract
Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by their context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show that our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. Experiments on a synthetic dataset show that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for tasks requiring broad knowledge.
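To make the idea concrete, here is a minimal sketch of task-aware KV cache pruning. This is an illustrative assumption, not the paper's implementation: cached key-value entries for the external documents are scored by the attention they receive from the task-description tokens, and only the highest-scoring fraction is kept. The function name compress_kv_cache, the tensor shapes, and the keep_ratio parameter are all hypothetical.

```python
import torch

def compress_kv_cache(keys, values, task_query_states, keep_ratio=1 / 30):
    """Hypothetical sketch of task-aware KV cache compression.

    keys, values:       [num_heads, seq_len, head_dim]   cached document KV entries
    task_query_states:  [num_heads, task_len, head_dim]  query states of the task prompt
    keep_ratio:         fraction of cache entries to retain (~1/30 for 30x compression)
    """
    head_dim = keys.shape[-1]
    # Attention of each task token over every cached document position
    scores = torch.einsum("htd,hsd->hts", task_query_states, keys) / head_dim**0.5
    attn = scores.softmax(dim=-1)                  # [num_heads, task_len, seq_len]
    # Aggregate attention mass per cached position across heads and task tokens
    importance = attn.sum(dim=(0, 1))              # [seq_len]
    k = max(1, int(keys.shape[1] * keep_ratio))
    keep = importance.topk(k).indices.sort().values  # keep original token order
    return keys[:, keep], values[:, keep]
```

Scoring by attention from the task description, rather than by generic recency or norm heuristics, is what would make such a scheme task-aware: the same document yields a different compressed cache depending on the question being asked.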
Community
Nice work
I tried the demo on the "Large Language Diffusion Models" arXiv paper (2502.09992v2) with 30x compression and it failed (it started hallucinating about LLaMA); at 4x compression it worked perfectly.
But the paper claims 30x compression can match or surpass RAG performance. Are there other factors at play that are needed to make this work well?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference (2025)
- Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs (2025)
- Can LLMs Maintain Fundamental Abilities under KV Cache Compression? (2025)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference (2025)
- Long-Context Inference with Retrieval-Augmented Speculative Decoding (2025)
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (2025)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (2025)