|
--- |
|
license: apache-2.0 |
|
inference: false |
|
--- |
|
|
|
# MistralLite-AWQ Model |
|
|
|
MistralLite-AWQ is a version of the [MistralLite](https://huggingface.co/amazon/MistralLite) model that was
quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
The MistralLite-AWQ models are approximately **70% smaller** than the original MistralLite model while maintaining comparable performance.
|
|
|
Please refer to the [original MistralLite model card](https://huggingface.co/amazon/MistralLite) for details about the model
preparation and training processes.
|
|
|
## MistralLite-AWQ Variants |
|
|
|
| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|-------------------:|---------------:|--------:|-----------|
| [main](https://huggingface.co/amazon/MistralLite-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MistralLite-AWQ-64g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MistralLite-AWQ-32g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
|
|
|
## Dependencies |
|
- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MistralLite model (see the sketch after this list).
|
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking. |
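
For reference, quantization along these lines can be reproduced with AutoAWQ. The sketch below is not the exact script used to produce these weights: the `q_group_size`, `w_bit`, and `version` values come from the variants table above, while `zero_point` (AutoAWQ's common default) and the output path are assumptions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "amazon/MistralLite"
output_dir = "MistralLite-AWQ"  # assumed local output path

# Settings matching the `main` branch variant; use q_group_size 64 or 32
# for the other branches. zero_point=True is an assumed default.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 base model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Run AWQ calibration and quantization, then save the quantized weights.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)
```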
|
|
|
## Evaluations |
|
|
|
### Long Context |
|
|
|
The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise. |
|
|
|
#### Topic Retrieval |
|
|
|
See https://lmsys.org/blog/2023-06-29-longchat/ |
|
|
|
| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
| _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MistralLite | 100 | 100 | 100 | 100 | 98 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |
|
|
|
#### Line Retrieval
|
|
|
See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results |
|
|
|
| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
| MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
| **MistralLite-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
| **MistralLite-AWQ-64g-4b-GEMM** | **96** | **96** | **90** | **70** | **72** | **60** |
| **MistralLite-AWQ-32g-4b-GEMM** | **98** | **96** | **84** | **76** | **70** | **62** |
| Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
| Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
|
|
|
#### Pass Key Retrieval |
|
|
|
See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101 |
|
|
|
| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
| **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
| Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |
|
|
|
|
|
#### QuALITY (Question Answering with Long Input Texts, Yes!)
|
|
|
See https://nyu-mll.github.io/quality/ |
|
|
|
| Model Name | Test set Accuracy | Hard subset Accuracy |
|:----------|-------------:|-------------:|
| MistralLite | 56.8 | 74.5 |
| **MistralLite-AWQ** | **55.3** | **71.8** |
| **MistralLite-AWQ-64g-4b-GEMM** | **55.2** | **72.9** |
| **MistralLite-AWQ-32g-4b-GEMM** | **56.6** | **72.8** |
| Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
| Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
| Mixtral-8x7B-v0.1 | 75 | 74.1 |
| Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |
|
|
|
## Usage |
|
|
|
### Inference via vLLM HTTP Host
|
|
|
#### Launch Host

```bash
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --quantization awq
```
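
The command above serves the weights on the `main` branch. To host one of the other variants from the table above, the branch name can be passed as the model revision. A sketch, assuming the `--revision` flag of your vLLM version:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model amazon/MistralLite-AWQ \
    --revision MistralLite-AWQ-32g-4b-GEMM \
    --quantization awq
```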
|
|
|
#### Query Host

```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "amazon/MistralLite-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
        }'
```
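
Because the vLLM host exposes an OpenAI-compatible API, it can also be queried from Python. A minimal sketch, assuming the `openai` Python package (v1 or later) is installed and the host is running on the default port:

```python
from openai import OpenAI

# vLLM does not validate the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="amazon/MistralLite-AWQ",
    prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    temperature=0,
    max_tokens=100,
)
print(completion.choices[0].text)
```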
|
|
|
### Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
|
```python
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="amazon/MistralLite-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
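
Both examples above use MistralLite's prompt template, which wraps the instruction in `<|prompter|>` and `<|assistant|>` markers with a closing `</s>` token. A small helper (the `format_prompt` name is hypothetical) keeps that formatting in one place:

```python
def format_prompt(instruction: str) -> str:
    """Wrap a user instruction in the MistralLite prompt template."""
    return f"<|prompter|>{instruction}</s><|assistant|>"

# Example: build the prompt used in the snippets above.
prompt = format_prompt("What are the main challenges to support a long context for LLM?")
print(prompt)
```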
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|
|
## Limitations |
|
|
|
Before using the MistralLite-AWQ model, it is important to perform your own
independent assessment, and take measures to ensure that your use would comply
with your own specific quality control practices and standards, and that your
use would comply with the local rules, laws, regulations, licenses and terms
that apply to you, and your content.
|
|