INT4 Weight-only Quantization and Deployment (W4A16)

LMDeploy adopts the AWQ algorithm for 4-bit weight-only quantization. With its highly optimized CUDA kernels, 4-bit quantized model inference is up to 2.4x faster than FP16.

LMDeploy supports the following NVIDIA GPUs for W4A16 inference:

  • Turing(sm75): 20 series, T4

  • Ampere(sm80,sm86): 30 series, A10, A16, A30, A100

  • Ada Lovelace(sm89): 40 series

Before proceeding with quantization and inference, please ensure that lmdeploy is installed:

pip install lmdeploy[all]

This article comprises the following sections:

  • Inference

  • Evaluation

  • Service

Inference

With the following code, you can perform batched offline inference with the quantized model:

from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
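
Sampling behavior can also be adjusted through a generation config. The snippet below is a minimal sketch assuming LMDeploy's GenerationConfig class and its gen_config argument to the pipeline call; the chosen sampling values are illustrative only.

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Load the 4-bit AWQ weights with the TurboMind backend, as above.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit", backend_config=engine_config)

# Illustrative sampling settings; tune them to your needs.
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)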

For more information about the pipeline parameters, please refer to the pipeline documentation.

Evaluation

Please refer to this guide for model evaluation with LMDeploy.

Service

LMDeploy's api_server enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

lmdeploy serve api_server internlm/internlm2_5-7b-chat-4bit --backend turbomind --model-format awq

The default port of api_server is 23333. After the server is launched, you can communicate with the server in the terminal through api_client:

lmdeploy serve api_client http://0.0.0.0:23333

You can browse and try out the api_server APIs online via the Swagger UI at http://0.0.0.0:23333, or read the API specification here.
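
Because the RESTful APIs are OpenAI-compatible, the server can also be queried programmatically. The snippet below is a minimal sketch using the openai Python client; the base_url assumes the default port 23333, and the api_key value is a placeholder since the local server does not require one by default.

from openai import OpenAI

# Point the OpenAI client at the local api_server (default port 23333).
client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="not-needed")

# Discover the served model name via the models endpoint.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
)
print(response.choices[0].message.content)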
