MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Abstract
Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that the scale and zero point are shared within each quantization group, we propose performing dequantization after the per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE uses efficient shift-and-add operations for multiplication, improving both computational and energy efficiency. Our experimental results demonstrate that MixPE surpasses state-of-the-art quantization accelerators, achieving a 2.6× speedup and 1.4× energy reduction.
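The per-group dequantization idea described above follows from a simple algebraic identity: because every weight in a quantization group shares one scale and zero point, the dequantization step factors out of the inner product. The sketch below illustrates this reordering, plus a shift-and-add style multiply, in plain NumPy/Python; the names group_size, s, z, and shift_add_mul are illustrative choices for this sketch, not from the paper, and the code is a numerical demonstration rather than MixPE's actual hardware datapath.

```python
import numpy as np

rng = np.random.default_rng(0)
group_size = 128                                  # illustrative group size

q = rng.integers(0, 16, size=group_size)          # 4-bit quantized weights
a = rng.standard_normal(group_size)               # higher-precision activations
s, z = 0.05, 7                                    # scale and zero point shared by the group

# Baseline order: dequantize every weight before the multiply-accumulate,
# i.e. sum_i s * (q_i - z) * a_i.
ref = np.dot(s * (q - z), a)

# mpGEMM-style order: accumulate the low-precision products first, then apply
# the shared scale and zero point once per group, using
#   s * (sum_i q_i * a_i  -  z * sum_i a_i).
acc = np.dot(q, a)
out = s * (acc - z * a.sum())

assert np.allclose(ref, out)                      # both orderings agree

# Shift-and-add flavor of a low-bit multiply on integer/fixed-point operands:
# each set bit of a 4-bit weight contributes one shifted add, so no full
# multiplier array is needed.
def shift_add_mul(x_int: int, q4: int) -> int:
    acc = 0
    for b in range(4):                            # iterate over the 4 weight bits
        if (q4 >> b) & 1:
            acc += x_int << b                     # add the shifted operand
    return acc

assert shift_add_mul(37, 13) == 37 * 13
```

The reordering means the expensive scale/zero-point arithmetic runs once per group rather than once per element, which is the overhead reduction the abstract attributes to performing dequantization after the per-group mpGEMM.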