SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Abstract
Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single-request scenarios, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks such as vLLM and SGLang, as well as by LLM providers including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark that evaluates long-context methods from a KV cache-centric perspective across four stages: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on eight long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios. Code and data: https://aka.ms/SCBench.
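The multi-request KV cache reuse the abstract refers to can be sketched with a Hugging Face transformers-style prefix cache. The snippet below is illustrative only: the model name, `long_document`, and `questions` are placeholders, and serving frameworks such as vLLM and SGLang implement the same idea via automatic prefix caching rather than explicit cache handling.

```python
# Minimal sketch of shared-context KV cache reuse across requests
# (placeholder model/data; servers like vLLM/SGLang do this automatically).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

long_document = "<very long shared document>"     # placeholder shared context
questions = ["Question 1?", "Question 2?"]        # placeholder follow-up requests

# 1) KV cache generation: pre-fill the shared context once.
ctx = tok(long_document, return_tensors="pt").to(model.device)
with torch.no_grad():
    shared_cache = model(**ctx, use_cache=True).past_key_values

# 2) KV cache loading/reuse: each request reuses the cached prefix
#    instead of re-encoding the long shared context.
for q in questions:
    q_ids = tok(q, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
    full_ids = torch.cat([ctx.input_ids, q_ids], dim=-1)
    out = model.generate(
        full_ids,
        past_key_values=copy.deepcopy(shared_cache),  # copy: generate mutates the cache
        max_new_tokens=64,
    )
    print(tok.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True))
```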
Community
🖼️ SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Previous long-context benchmarks focus only on single-turn interactions, but most real-world long-context scenarios are multi-turn and rely on KV cache reuse. We propose SCBench, which takes a KV cache-centric perspective to analyze different long-context methods across KV cache generation, compression, retrieval, and loading. SCBench comprises 12 tasks spanning four long-context capabilities, with two shared-context modes (Multi-Turn and Multi-Request); an illustrative example format is sketched after the list below. Based on this analysis, we draw the following insights:
- Sub-O(n) memory methods are almost infeasible in multi-turn decoding;
- Performance decline trends vary across tasks;
- All long-context methods experience performance degradation as the KV cache budget decreases;
- Long-generation scenarios exhibit attention distribution shift issues.
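For concreteness, a shared-context test example in these two modes could be represented roughly as below; the `SharedContextExample` class and its field names are illustrative, not SCBench's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class SharedContextExample:
    """Illustrative (not SCBench's real schema) shared-context example:
    one long context queried by several follow-up requests."""
    shared_context: str              # the long document, encoded/pre-filled once
    turns: list[tuple[str, str]]     # (query, expected answer) pairs reusing the cache
    mode: str = "multi-turn"         # "multi-turn": one session reusing its KV cache
                                     # "multi-request": separate requests sharing a cached prefix

example = SharedContextExample(
    shared_context="<long report text>",
    turns=[
        ("Who wrote section 3?", "Alice"),             # e.g. string/semantic retrieval
        ("Summarize the key findings.", "<summary>"),  # e.g. global information
    ],
    mode="multi-turn",
)
```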
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance (2024)
- Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression (2024)
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity (2024)
- Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning (2024)
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (2024)
- TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs (2024)
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (2024)