Papers
arxiv:2407.00066

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Published on Jun 17, 2024
Authors:
,
,
,
,
,

Abstract

Fine-tuning large language models (LLMs) with low-rank adapters (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRA adapters. We consider compressing adapters individually via SVD and propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 75% of the throughput of serving a single LoRA.

Community

Sign up or log in to comment

Models citing this paper 1,000

Browse 1,000+ models citing this paper

Datasets citing this paper 1,102

Browse 1,102 datasets citing this paper

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.00066 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.