squeeze-ai-lab/sq-llama-65b-w4-s5

SqueezeLLM is a post-training quantization framework that incorporates a new method called Dense-and-Sparse Quantization to enable efficient LLM serving.

TL;DR: Deploying LLMs is difficult because of their large memory footprint. Reduced-precision quantization addresses this, but a naive method hurts model performance. We address this with a new Dense-and-Sparse Quantization method. Dense-and-Sparse splits each weight matrix into two components: a dense component that can be heavily quantized without affecting model performance, and a sparse component that preserves the sensitive and outlier values of the weight matrix. With this approach, we can serve larger models with a smaller memory footprint, the same latency, and higher accuracy and quality. For more details, please check out our paper.
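The decomposition described above can be illustrated with a minimal NumPy sketch. This is not the SqueezeLLM implementation: it assumes a simple magnitude-based outlier rule and plain per-tensor uniform quantization as stand-ins for whatever criteria the paper uses. The function names (`dense_and_sparse_split`, `quantize_uniform`, `demo`) and the synthetic weight matrix are hypothetical and exist only for illustration.

```python
import numpy as np


def dense_and_sparse_split(W, sparsity=0.0005):
    """Split W so that W == dense + sparse.

    Assumption: the sparse part keeps the `sparsity` fraction of entries
    with the largest magnitude (0.0005 corresponds to 0.05%).
    """
    k = max(1, int(round(W.size * sparsity)))
    flat_abs = np.abs(W).ravel()
    top_idx = np.argpartition(flat_abs, -k)[-k:]  # indices of the k largest magnitudes
    mask = np.zeros(W.size, dtype=bool)
    mask[top_idx] = True
    mask = mask.reshape(W.shape)
    sparse = np.where(mask, W, 0.0)  # kept in full precision
    dense = np.where(mask, 0.0, W)   # remaining values, to be quantized
    return dense, sparse, mask


def quantize_uniform(x, bits=4):
    """Uniform per-tensor quantization; returns the dequantized values."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo


def demo(seed=0):
    """Compare naive 4-bit quantization with the dense-and-sparse variant."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(256, 256))
    # Inject a few synthetic outliers that stretch the value range.
    W.flat[:20] = 25.0 * np.where(np.arange(20) % 2 == 0, 1.0, -1.0)

    naive = quantize_uniform(W)
    dense, sparse, mask = dense_and_sparse_split(W)
    W_hat = np.where(mask, sparse, quantize_uniform(dense))

    def mse(A):
        return float(np.mean((A - W) ** 2))

    return mse(naive), mse(W_hat)


if __name__ == "__main__":
    naive_err, ds_err = demo()
    print(f"naive 4-bit MSE:        {naive_err:.4f}")
    print(f"dense-and-sparse MSE:   {ds_err:.4f}")
```

In this toy setup, removing the outliers narrows the range the 4-bit grid has to cover, so the dense part is reconstructed with lower error than when the whole matrix is quantized at once.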

Model description

LLaMA 65B quantized to 4 bits with SqueezeLLM. More details can be found in the paper.

Base Model: LLaMA 65B
Bitwidth: 4-bit
Sparsity Level: 0.05%
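As a rough back-of-the-envelope check of what these settings imply for weight storage, the sketch below works through the arithmetic. It assumes a nominal 65 billion parameters, an FP16 baseline, and a hypothetical sparse layout that stores each kept value as a 2-byte float plus a 4-byte index. It ignores quantization lookup tables, row pointers, and any layers left unquantized, so the figures are estimates rather than the actual checkpoint size.

```python
# Assumed inputs (nominal values, not measured from the checkpoint).
NUM_PARAMS = 65e9          # nominal parameter count of LLaMA 65B
BITS_PER_WEIGHT = 4        # bitwidth listed above
SPARSITY = 0.05 / 100      # sparsity level listed above (0.05%)
BYTES_PER_SPARSE_ENTRY = 2 + 4  # assumption: FP16 value + int32 index

fp16_gb = NUM_PARAMS * 2 / 1e9                       # 2 bytes per FP16 weight
dense_gb = NUM_PARAMS * BITS_PER_WEIGHT / 8 / 1e9    # 4 bits = 0.5 bytes per weight
sparse_entries = NUM_PARAMS * SPARSITY
sparse_gb = sparse_entries * BYTES_PER_SPARSE_ENTRY / 1e9

print(f"FP16 baseline:      {fp16_gb:.1f} GB")
print(f"4-bit dense part:   {dense_gb:.1f} GB")
print(f"sparse part:        {sparse_gb:.3f} GB")
print(f"compression ratio:  {fp16_gb / (dense_gb + sparse_gb):.2f}x")
```

Under these assumptions the sparse component adds well under 1% to the size of the 4-bit dense weights.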


license: other