---
pipeline_tag: text-generation
---

## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ

This is a version of the <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1</a> model quantized with a mix of 4-bit and 3-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.

Contrary to the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ">2bitgs8 model</a>, which was designed to use less GPU memory, this one uses about 22 GB, for those who want better quality and can use the maximum VRAM available on 24 GB GPUs.
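
To verify the memory footprint on your own setup, a minimal sketch using PyTorch's CUDA memory statistics (this assumes the quantized model is already loaded on the GPU and has generated a few tokens):

```Python
import torch

# Peak VRAM actually allocated on the current CUDA device so far.
torch.cuda.synchronize()
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```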

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
| Metric              | Mixtral Original | This model (HQQ) |
|---------------------|------------------|------------------|
| Runtime VRAM        | 94 GB            | <b>22.3 GB</b>   |
| ARC (25-shot)       | 70.22            | 69.62            |
| Hellaswag (10-shot) | 87.63            |                  |
| MMLU (5-shot)       | 71.16            | 69.46            |
| TruthfulQA-MC2      | 64.58            | 62.63            |
| Winogrande (5-shot) | 81.37            | 81.06            |
| GSM8K (5-shot)      | 60.73            | 57.77            |
| Average             | 72.62            |                  |

### Basic Usage

```Python
import transformers
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id  = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

# Optional: faster inference via the compiled ATen backend
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
    # ...

outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
```
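
The `chat_processor` above builds on `transformers.TextIteratorStreamer`, whose surrounding generation logic is elided in the block. As a sketch of the usual pattern for driving it, not the card's exact code (the Mixtral `[INST]` prompt template and the reuse of `model`/`tokenizer` from the block above are assumptions):

```Python
import threading
import transformers

def stream_chat(chat, max_new_tokens=100, do_sample=True):
    # Stream decoded tokens as they are produced.
    streamer = transformers.TextIteratorStreamer(
        tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True
    )
    inputs = tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to("cuda")
    generate_kwargs = dict(inputs, streamer=streamer,
                           max_new_tokens=max_new_tokens, do_sample=do_sample)

    # generate() blocks, so it runs in a worker thread while the
    # main thread consumes the stream.
    thread = threading.Thread(target=model.generate, kwargs=generate_kwargs)
    thread.start()

    pieces = []
    for text in streamer:  # yields text chunks as soon as they are decoded
        print(text, end="", flush=True)
        pieces.append(text)
    thread.join()
    return "".join(pieces)
```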

### Quantization

You can reproduce the model using the following quant configs:
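
As a sketch of the shape such configs take with hqq's `BaseQuantizeConfig`, with 4-bit attention and 3-bit experts per the description above (the `group_size` and `quant_zero`/`quant_scale` values here are illustrative assumptions, not the card's exact settings):

```Python
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

# 4-bit for the attention projections, 3-bit for the MoE expert weights.
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=3, group_size=64, quant_zero=True, quant_scale=True)

quant_config = {
    # attention -> 4-bit
    'self_attn.q_proj': attn_params,
    'self_attn.k_proj': attn_params,
    'self_attn.v_proj': attn_params,
    'self_attn.o_proj': attn_params,
    # experts -> 3-bit
    'block_sparse_moe.experts.w1': experts_params,
    'block_sparse_moe.experts.w2': experts_params,
    'block_sparse_moe.experts.w3': experts_params,
}

model = HQQModelForCausalLM.from_pretrained('mistralai/Mixtral-8x7B-Instruct-v0.1')
model.quantize_model(quant_config=quant_config)
```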