Commit d7bde7a by ProphetOfBostrom • Parent: f37cbe5
Commit message: "readme notice but i'm very sleepy please correct my mistakes for me thanks"
README.md CHANGED
@@ -9,7 +9,38 @@ tags:
- nsfw
- mergekit
- merge
- HQQ
- 2bit
library_name: transformers
---
## BagelMix-8x7B branch 2g16-4g64-HQQ
By [Undi95](https://huggingface.co/Undi95/BagelMix-8x7B)

#### (this readme has been written by a sleepy person. /disclaimer)
---

[The main branch uses the same quant config as last time, the reference one from mobius](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ).

The label refers to 2-bit linear (expert) layers with a group size of 16 (each group's quantisation metadata stored in 8 bits), and 4-bit attention layers with a group size of 64.
The actual bits per weight (bpw) therefore comes out higher than 2, in no small part because every group of 16 two-bit weights (4 bytes of payload) carries, I think, roughly another byte or two of per-group metadata.
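For a rough sense of that overhead, here is a back-of-the-envelope estimate (my own arithmetic, not something from the hqq source: it assumes one 8-bit zero point and one 8-bit scale per group and ignores any second-level grouping of that metadata, so treat the numbers as approximate upper bounds):

```python
# Rough bits-per-weight estimate for group-wise quantisation where each group
# stores an 8-bit zero point and an 8-bit scale (quant_zero=True, quant_scale=True).
# Second-level grouping of that metadata would shave a little more off.
def approx_bpw(nbits: int, group_size: int, meta_bits: int = 8 + 8) -> float:
    return nbits + meta_bits / group_size

print(approx_bpw(2, 16))   # experts, 2g16   -> ~3.0 bpw
print(approx_bpw(4, 64))   # attention, 4g64 -> ~4.25 bpw
```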
From what I can gather of hqq's source code, the gate/router network isn't quantised at all (because it's tiny and very important).

That reasoning has led me to experiment with taking bits away from the expert/linear layers and spending them on the attention layers instead.

I've already got a slightly heavier model with 2g16 experts and 8g512 attention (not sure how meaningful a group size of 512 is at 8 bits, but whatever).
Its config looks like this, and it is *not the model on the main branch*:
```python
from hqq.core.quantize import BaseQuantizeConfig

# Experimental config. NOTE: THE MAIN BRANCH USES nbits=4, group_size=64 FOR ATTENTION!
attn_prams = BaseQuantizeConfig(nbits=8, group_size=512, quant_zero=True, quant_scale=True)
attn_prams['scale_quant_params']['group_size'] = 512  # was 256; as I understand it, the group size used when the scales themselves are quantised
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)
```
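For context, in the mobiuslabs reference script the two configs get attached to Mixtral's layers via hqq's per-layer tags, roughly like this (a sketch based on the public hqq/mobiuslabs examples, shown here with the main-branch 4g64/2g16 values; the tag names are from memory, so double-check the included script for the exact calls):

```python
from hqq.core.quantize import BaseQuantizeConfig

# Main-branch settings: 4-bit g64 attention, 2-bit g16 experts.
attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {}
# Attention projections get the higher-bit config...
for proj in ('q_proj', 'k_proj', 'v_proj', 'o_proj'):
    quant_config[f'self_attn.{proj}'] = attn_prams
# ...and the expert MLP weights get the 2-bit config; the MoE gate/router is left unquantised.
for w in ('w1', 'w2', 'w3'):
    quant_config[f'block_sparse_moe.experts.{w}'] = experts_params

# quant_config is then handed to hqq's quantisation step in the script.
```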
Again, this is not what you're downloading if you grab this repo right now: I want to see whether I can actually keep the bpw down first.
These will be uploaded as alternate branches of this repo if they seem worth doing.
I might also fiddle with 2g32 or even 3g128 or such for the experts, given their most delectable sparseness.
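Using the same rough arithmetic as above, those would land somewhere around:

```python
# nbits + (8-bit zero + 8-bit scale) / group_size, same caveats as before.
print(2 + 16 / 32)    # 2g32 experts  -> 2.5 bpw
print(3 + 16 / 128)   # 3g128 experts -> 3.125 bpw
```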
### You could also use the included Python script (and a big swap partition) to make these yourself; it's just the one from mobiuslabs.
### PS: read Sleeper Agents (2024/01) :-)
---
# BagelMix