SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Abstract
Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single-request scenarios, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark that evaluates long-context methods from a KV cache-centric perspective across four stages: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on eight long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. Code and data: https://aka.ms/SCBench.
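The multi-request KV cache reuse the abstract refers to can be sketched with a Hugging Face transformers-style prefix cache. The snippet below is illustrative only: the model name, `long_document`, and `questions` are placeholders, and serving frameworks such as vLLM and SGLang implement the same idea via automatic prefix caching rather than explicit cache handling.

```python
# Minimal sketch of shared-context KV cache reuse across requests
# (placeholder model/data; servers like vLLM/SGLang do this automatically).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

long_document = "<very long shared document>"     # placeholder shared context
questions = ["Question 1?", "Question 2?"]        # placeholder follow-up requests

# 1) KV cache generation: pre-fill the shared context once.
ctx = tok(long_document, return_tensors="pt").to(model.device)
with torch.no_grad():
    shared_cache = model(**ctx, use_cache=True).past_key_values

# 2) KV cache loading/reuse: each request reuses the cached prefix
#    instead of re-encoding the long shared context.
for q in questions:
    q_ids = tok(q, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    full_ids = torch.cat([ctx.input_ids, q_ids], dim=-1)
    out = model.generate(
        full_ids,
        past_key_values=copy.deepcopy(shared_cache),  # copy: generate mutates the cache
        max_new_tokens=64,
    )
    print(tok.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True))
```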
Community
🖼️ SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Previous long-context benchmarks focus only on single-turn interactions, but most real-world long-context scenarios are multi-turn and rely on KV cache reuse. We propose SCBench, which takes a KV cache-centric perspective to analyze different long-context methods across KV cache generation, compression, retrieval, and loading. SCBench comprises 12 tasks spanning four long-context capabilities, with two shared-context modes (Multi-Turn and Multi-Request); an illustrative example format is sketched after the list below. Based on this analysis, we draw the following insights:
- Sub-O(n) memory methods are almost infeasible in multi-turn decoding;
- Performance decline trends vary across tasks;
- All long-context methods experience performance degradation as the KV cache budget decreases;
- Long-generation scenarios exhibit attention distribution shift issues.
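For concreteness, a shared-context test example in these two modes could be represented roughly as below; the `SharedContextExample` class and its field names are illustrative, not SCBench's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class SharedContextExample:
    """Illustrative (not SCBench's real schema) shared-context example:
    one long context queried by several follow-up requests."""
    shared_context: str              # the long document, encoded/pre-filled once
    turns: list[tuple[str, str]]     # (query, expected answer) pairs reusing the cache
    mode: str = "multi-turn"         # "multi-turn": one session reusing its KV cache
                                     # "multi-request": separate requests sharing a cached prefix

example = SharedContextExample(
    shared_context="<long report text>",
    turns=[
        ("Who wrote section 3?", "Alice"),             # e.g. string/semantic retrieval
        ("Summarize the key findings.", "<summary>"),  # e.g. global information
    ],
    mode="multi-turn",
)
```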
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance (2024)
- Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression (2024)
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity (2024)
- Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning (2024)
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (2024)
- TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs (2024)
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (2024)