Text Generation
Transformers
mixtral
Not-For-All-Audiences
nsfw
mergekit
Merge
HQQ
2bit
conversational
Inference Endpoints
ProphetOfBostrom committed
Commit 2b60801 • 1 parent: 2c6818b
utility: maximised.
README.md CHANGED
@@ -12,19 +12,19 @@ tags:
 - HQQ
 - 2bit
 ---
-## BagelMix-8x7B branch 2g16-4g64-HQQ
+## BagelMix-8x7B - main branch 2g16-4g64-HQQ
+Under 20 GB
 By [Undi95](https://huggingface.co/Undi95/BagelMix-8x7B)
 
-#### (this readme has been written by a sleepy person.
+#### (this readme has been written by a sleepy person. the link above takes you to the original model, the link below to the Mixtral HQQ reference. the rest is rambling)
 ---
 
 [main branch is the same quant config as last time, the reference one from mobius here](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ)
 
 the label i've chosen here refers to 2 bit linear layers with a 16 param group size (per 8 bit group weight), and 4 bits in groups of 64 for the attention layers
 thus the actual bpw is higher than 2 in no small part because we're adding another byte every 4 bytes (i think??) for the linear layers.
-
-
-
+from what I can gather of hqq's source code, the gate ('expert' selection) network isn't quantised (because it's tiny and very important)
+this is the reason we quantise the attention layers at 4 bits too - in a MoE the attention is small (shared between all the 'experts'), which means it would quantise like a 2 bpw mistral
 such reasoning has led me to try experimenting with taking more bits away from the expert/linear layers and putting them in the attention layers.
 
 i've currently got a slightly heavier 2g16 experts with 8g512 attention (not really sure how meaningful groups of 512 are but w/e) model already,
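For orientation, here is a rough back-of-the-envelope for what the 2g16 / 4g64 labels above work out to, and for why 4-bit attention is cheap in a MoE. It is not taken from this repo: it assumes roughly one extra byte of quantised zero/scale metadata per weight group (the "another byte every 4 bytes" above) and the published Mixtral-8x7B shapes.

```python
# Back-of-the-envelope only. Assumes ~8 bits of quantised zero/scale metadata per
# weight group (the "another byte every 4 bytes" above); HQQ's real overhead differs.

def effective_bpw(nbits: int, group_size: int, meta_bits_per_group: int = 8) -> float:
    """Bits per weight including per-group metadata."""
    return nbits + meta_bits_per_group / group_size

print(effective_bpw(2, 16))    # experts at 2g16      -> 2.5 bpw
print(effective_bpw(4, 64))    # attention at 4g64    -> 4.125 bpw
print(effective_bpw(3, 128))   # the mooted 3g128     -> ~3.06 bpw
print(effective_bpw(8, 512))   # the mooted 8g512     -> ~8.02 bpw

# Why the attention layers barely matter for size: published Mixtral-8x7B shapes.
hidden, inter, layers, n_experts = 4096, 14336, 32, 8
kv_heads, head_dim = 8, 128

attn_per_layer = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim  # q,o + k,v (GQA)
experts_per_layer = 3 * hidden * inter * n_experts                       # w1,w2,w3 per expert

share = attn_per_layer / (attn_per_layer + experts_per_layer)
print(f"attention is ~{share:.1%} of the block weights")  # roughly 3%
```

Under those assumptions, bumping the attention from 2-bit to 4g64 costs well under 0.1 bpw overall, which is the trade the readme is describing.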
@@ -36,10 +36,16 @@ experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, qua
 ```
 again this is not what you're downloading if you get this right now: I want to see if I can actually keep the bpw down.
 these will be uploaded as alternate branches to this repo if they seem worth doing.
-might fiddle with 2g32 or even 3g128 or such for experts.
+might fiddle with 2g32 or even 3g128 or such for experts. or try to stop HQQ from casting BF16 to FP16 for no reason.
 
-
-
+#### you could also use the included/linked python script (and a big swap partition) to make them yourself.
+```
+for mixtral, using hqq 0.1.2.post:
+you will need >180 gigabytes of physically addressable memory - but it doesn't need to be RAM. Set yourself up with a ~160GB swap partition.
+the VRAM requirement is initially zero and never much larger than the emerging model. thus you can make any quant you can run.
+```
+#### this takes about 10 minutes with the current optimizer - it takes me all day to upload an ~18 GiB file.
+## ps read Sleeper Agents (2024/01) :-)
 ---
 # BagelMix
 
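The readme's actual config block falls mostly outside the diff: only its closing fence and the `experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, qua` fragment in the second hunk header are visible. As a hedged sketch of what a 2g16-experts / 4g64-attention HQQ config usually looks like, in the style of the linked mobius reference card; the arguments past the truncated `qua` and the exact layer keys are assumptions, not this repo's settings.

```python
# Sketch only: 2-bit/g16 experts + 4-bit/g64 attention in HQQ, following the layout
# of the linked mobiuslabs Mixtral card. Everything after the visible
# "quant_zero=True, qua" fragment is a guess, not this repo's actual config.
from hqq.core.quantize import BaseQuantizeConfig

attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {
    # attention projections: shared by all experts, so 4 bits here is cheap
    'self_attn.q_proj': attn_params,
    'self_attn.k_proj': attn_params,
    'self_attn.v_proj': attn_params,
    'self_attn.o_proj': attn_params,
    # expert MLPs: this is where almost all of the weights (and the 2g16) live
    'block_sparse_moe.experts.w1': experts_params,
    'block_sparse_moe.experts.w2': experts_params,
    'block_sparse_moe.experts.w3': experts_params,
    # the gate/router is left out, i.e. unquantised, as noted above
}
```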
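The "included/linked python script" is not part of this diff. Below is a rough sketch of the workflow it describes (materialise the full model in system memory, where swap can stand in for RAM, quantise onto the GPU, then save), using the HQQModelForCausalLM wrapper from hqq 0.1.x; the class and method names are recalled from that API and the config values repeat the guesses above, so treat it as a starting point rather than the repo's script.

```python
# NOT the repo's script: a sketch of quantising BagelMix-8x7B with hqq 0.1.x.
# API names (HQQModelForCausalLM, quantize_model, save_quantized) are from memory
# of that release and may differ; the quant config repeats the assumptions above.
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM

model_id = "Undi95/BagelMix-8x7B"
save_dir = "BagelMix-8x7B-HQQ-2g16-4g64"

attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)
quant_config = {f'self_attn.{p}_proj': attn_params for p in ('q', 'k', 'v', 'o')}
quant_config.update({f'block_sparse_moe.experts.w{i}': experts_params for i in (1, 2, 3)})

# The unquantised model is loaded into system memory first: this is the
# ">180 GB physically addressable, swap is fine" step from the readme.
model = HQQModelForCausalLM.from_pretrained(model_id)

# Layers are quantised one at a time onto the GPU, so VRAM stays close to the
# size of the finished quant rather than the full-precision model.
model.quantize_model(quant_config=quant_config)
model.save_quantized(save_dir)
```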