mobicham committed
Commit 2f7303b
1 Parent(s): 454e767

Update README.md

Files changed (1):
  1. README.md +4 -8
README.md CHANGED
@@ -8,11 +8,9 @@ pipeline_tag: text-generation
 ---
 ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
 This is a version of the
- <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 3-bit via Half-Quadratic Quantization (HQQ).
+ <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 3-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.
 
- More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 3-bit.
-
- Contrary to the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ"> 2bitgs8 model </a> that was designed to use less GPU memory, this one uses about 22GB for the folks who want to get better quality and use the maximum VRAM available on 24GB GPUs.
+ Unlike the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ"> 2bitgs8 model </a>, which was designed to use less GPU memory, this one uses about 22 GB, for those who want better quality and can use the maximum VRAM available on 24 GB GPUs.
 
 ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
 
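(The ~22 GB figure above corresponds to the 22.3 GB runtime VRAM reported in the benchmark table below. As a quick sanity check on a 24 GB GPU, peak allocation can be read back with standard `torch.cuda` calls; this is a generic sketch, not part of the model card.)

```python
# Generic sketch (not from the model card): report peak GPU memory after loading
# the quantized model and running a prompt, to compare against the ~22.3 GB figure.
import torch

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```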
@@ -25,10 +23,10 @@ Contrary to the <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Inst
 | Runtime VRAM | 94 GB | <b>22.3 GB</b> |
 | ARC (25-shot) | 70.22 | 69.62 |
 | Hellaswag (10-shot)| 87.63 | |
- | MMLU (5-shot) | 71.16 | |
+ | MMLU (5-shot) | 71.16 | 69.46 |
 | TruthfulQA-MC2 | 64.58 | 62.63 |
 | Winogrande (5-shot)| 81.37 | 81.06 |
- | GSM8K (5-shot)| 60.73 | |
+ | GSM8K (5-shot)| 60.73 | 57.77 |
 | Average| 72.62 | |
 
 ### Basic Usage
@@ -50,7 +48,6 @@ model = HQQModelForCausalLM.from_quantized(model_id)
 from hqq.core.quantize import *
 HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)
 
-
 def chat_processor(chat, max_new_tokens=100, do_sample=True):
     tokenizer.use_default_system_prompt = False
     streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
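The hunk above shows only the top of the README's usage example (backend setup and the start of `chat_processor`). For orientation, a minimal end-to-end sketch of loading this model and generating without streaming could look as follows; the chat-template call and generation settings are illustrative assumptions rather than the card's exact code.

```python
# Minimal sketch, assuming the hqq engine wrappers referenced in the README;
# the prompt handling below is illustrative, not the card's chat_processor.
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import HQQLinear, HQQBackend

model_id  = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

# Build a single-turn chat prompt and generate greedily (do_sample=False), as in the README example.
chat   = [{"role": "user", "content": "How do I build a car?"}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))
```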
@@ -81,7 +78,6 @@ def chat_processor(chat, max_new_tokens=100, do_sample=True):
 outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
 ```
 
-
 ### Quantization
 
 You can reproduce the model using the following quant configs:
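The quant configs themselves fall outside this diff excerpt. As a hedged sketch of what a mixed 4-bit-attention / 3-bit-experts HQQ setup looks like with `BaseQuantizeConfig` (the group sizes, `offload_meta` flag, and per-layer keys below are assumptions, not necessarily the card's exact values):

```python
# Illustrative sketch only: group sizes, offload_meta, and the per-layer keys are
# assumptions; the model card's actual quant configs may differ.
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

# Attention layers -> 4-bit, MoE experts -> 3-bit (the mix described above)
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=3, group_size=64, offload_meta=True)

quant_config = {}
for key in ['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj']:
    quant_config[key] = attn_params
for key in ['block_sparse_moe.experts.w1', 'block_sparse_moe.experts.w2', 'block_sparse_moe.experts.w3']:
    quant_config[key] = experts_params

# Load the fp16 base model and quantize it in place (assumed HQQ entry points).
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)
```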
 