---
license: apache-2.0
pipeline_tag: text-generation
---
<div align="center">
<img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>
# INT4 Weight-only Quantization and Deployment (W4A16)
LMDeploy adopts the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4-bit weight-only quantization. With its high-performance CUDA kernels, inference with the 4-bit quantized model is up to 2.4x faster than FP16.
LMDeploy supports the following NVIDIA GPUs for W4A16 inference:
- Turing(sm75): 20 series, T4
- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace(sm89): 40 series
Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy[all]
```
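To confirm the installation succeeded, a quick sanity check is to print the installed version (a minimal sketch; the exact version string depends on your environment):
```python
import lmdeploy

# Print the installed lmdeploy version to verify the package is importable
print(lmdeploy.__version__)
```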
This article comprises the following sections:
<!-- toc -->
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Service](#service)
<!-- tocstop -->
## Inference
With the following code, you can perform batched offline inference with the quantized model:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2-chat-20b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
For more information about the pipeline parameters, please refer to [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md).
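As a sketch of what those parameters look like in practice, the engine and the sampling behavior can be tuned through `TurbomindEngineConfig` and `GenerationConfig`. The specific values below are illustrative assumptions, not recommended settings:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Engine settings: AWQ weight format and an example context window
engine_config = TurbomindEngineConfig(model_format='awq', session_len=4096)

# Sampling settings applied at generation time (values chosen for illustration)
gen_config = GenerationConfig(top_p=0.8, temperature=0.7, max_new_tokens=512)

pipe = pipeline("internlm/internlm2-chat-20b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)
```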
## Evaluation
Please overview [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) about model evaluation with LMDeploy.
## Service
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
```shell
lmdeploy serve api_server internlm/internlm2-chat-20b-4bits --backend turbomind --model-format awq
```
The default port of `api_server` is `23333`. After the server is launched, you can communicate with the server in the terminal through `api_client`:
```shell
lmdeploy serve api_client http://0.0.0.0:23333
```
You can overview and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or you can read the API specification from [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/restful_api.md).
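Because the RESTful APIs follow OpenAI's interfaces, you can also query the server with the `openai` Python client. This is a minimal sketch, assuming the server is running on the default port and the `openai` package is installed:
```python
from openai import OpenAI

# Point the OpenAI client at the local api_server; the key is unused but required by the client
client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")

# Ask the server which model it is serving
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hi, pls intro yourself"}],
    temperature=0.8,
    top_p=0.95,
)
print(response.choices[0].message.content)
```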