Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
Abstract
Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents but are computationally expensive and limited by their context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show that our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. Experiments on a synthetic dataset show that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for tasks requiring broad knowledge.
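To make the idea concrete, here is a minimal sketch of task-aware KV cache pruning. This is an illustrative assumption, not the paper's implementation: cached key-value entries for the external documents are scored by the attention they receive from the task-description tokens, and only the highest-scoring fraction is kept. The function name compress_kv_cache, the tensor shapes, and the keep_ratio parameter are all hypothetical.

```python
import torch

def compress_kv_cache(keys, values, task_query_states, keep_ratio=1 / 30):
    """Hypothetical sketch of task-aware KV cache compression.

    keys, values:       [num_heads, seq_len, head_dim]   cached document KV entries
    task_query_states:  [num_heads, task_len, head_dim]  query states of the task prompt
    keep_ratio:         fraction of cache entries to retain (~1/30 for 30x compression)
    """
    head_dim = keys.shape[-1]
    # Attention of each task token over every cached document position
    scores = torch.einsum("htd,hsd->hts", task_query_states, keys) / head_dim**0.5
    attn = scores.softmax(dim=-1)                  # [num_heads, task_len, seq_len]
    # Aggregate attention mass per cached position across heads and task tokens
    importance = attn.sum(dim=(0, 1))              # [seq_len]
    k = max(1, int(keys.shape[1] * keep_ratio))
    keep = importance.topk(k).indices.sort().values  # keep original token order
    return keys[:, keep], values[:, keep]
```

Scoring by attention from the task description, rather than by generic recency or norm heuristics, is what would make such a scheme task-aware: the same document yields a different compressed cache depending on the question being asked.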
Community
Nice work
I tried the demo on the "Large Language Diffusion Models" arXiv paper (2502.09992v2) with 30x compression and it failed (it started hallucinating about LLaMA); at 4x compression it worked perfectly.
But the paper claims 30x compression can match or surpass RAG performance. Are there other factors at play that are needed to make this work well?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference (2025)
- Vendi-RAG: Adaptively Trading-Off Diversity And Quality Significantly Improves Retrieval Augmented Generation With LLMs (2025)
- Can LLMs Maintain Fundamental Abilities under KV Cache Compression? (2025)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference (2025)
- Long-Context Inference with Retrieval-Augmented Speculative Decoding (2025)
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (2025)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (2025)