ProphetOfBostrom committed
Commit 2b60801
1 Parent(s): 2c6818b

utility: maximised.

Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -12,19 +12,19 @@ tags:
  - HQQ
  - 2bit
  ---
- ## BagelMix-8x7B branch 2g16-4g64-HQQ
+ ## BagelMix-8x7B - main branch 2g16-4g64-HQQ
+ Under 20 GB
  By [Undi95](https://huggingface.co/Undi95/BagelMix-8x7B)

- #### (this readme has been written by a sleepy person. /disclaimer)
+ #### (this readme has been written by a sleepy person. the link above takes you to the original model, the link below to the Mixtral HQQ reference. the rest is rambling)
  ---

  [main branch is the same quant config as last time, the reference one from mobius here](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ)

  the label I've chosen refers to 2-bit linear layers with a group size of 16 params (each group carrying an 8-bit group weight), and 4 bits in groups of 64 for the attention layers;
  the actual bpw is therefore higher than 2, in no small part because we're adding another byte for every 4 bytes of packed weights (I think??) in the linear layers.
- 
- from what I can gather of hqq's source code, the gate network isn't quantised (because it's tiny and very important)
- 
+ from what I can gather of hqq's source code, the gate ('expert' selection) network isn't quantised (because it's tiny and very important).
+ this is also the reason the attention layers get 4 bits rather than 2 - in a MoE the attention is small (shared between all the 'experts'), and at 2 bits it would quantise about as badly as a 2 bpw Mistral.
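To make the "another byte every 4 bytes" estimate concrete: a group of 16 two-bit weights packs into 4 bytes, and the 8-bit group weight adds one more byte per group. A rough sketch of that arithmetic (it ignores the zero-points and HQQ's meta-quantisation of the scales, so the real bpw is a little higher):

```python
# Back-of-the-envelope bits-per-weight for the configs named above.
# Assumes one 8-bit group weight per group; zero-points and the quantised
# scale/zero metadata are ignored, so the real figures are slightly higher.

def effective_bpw(nbits: int, group_size: int, meta_bits: int = 8) -> float:
    """Packed weight bits plus per-group metadata, amortised over the group."""
    return nbits + meta_bits / group_size

print(effective_bpw(2, 16))  # experts "2g16"   -> 2.5  (one extra byte per 4 bytes of packed weights)
print(effective_bpw(4, 64))  # attention "4g64" -> 4.125
```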
  such reasoning has led me to experiment with taking more bits away from the expert/linear layers and putting them into the attention layers.

  I've currently got a slightly heavier 2g16 experts with 8g512 attention (not really sure how meaningful groups of 512 are, but w/e) model already,
@@ -36,10 +36,16 @@ experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, qua
  ```
  again, this is not what you're downloading if you get this right now: I want to see if I can actually keep the bpw down.
  these will be uploaded as alternate branches to this repo if they seem worth doing.
- might fiddle with 2g32 or even 3g128 or such for experts. given their most delectable sparseness
+ might fiddle with 2g32 or even 3g128 or such for experts. or try to stop HQQ from casting BF16 to FP16 for no reason.
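For concreteness, a minimal sketch of what those candidate settings would look like as hqq `BaseQuantizeConfig` objects (the same call that appears in the config block above); the group sizes are just the ones floated in the text, not configs that have been built or tested:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Possible expert (MoE linear) settings floated above - untested guesses:
experts_2g32  = BaseQuantizeConfig(nbits=2, group_size=32,  quant_zero=True, quant_scale=True)
experts_3g128 = BaseQuantizeConfig(nbits=3, group_size=128, quant_zero=True, quant_scale=True)

# The heavier experiment mentioned earlier: 2g16 experts with 8g512 attention.
experts_2g16 = BaseQuantizeConfig(nbits=2, group_size=16,  quant_zero=True, quant_scale=True)
attn_8g512   = BaseQuantizeConfig(nbits=8, group_size=512, quant_zero=True, quant_scale=True)
```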
 
- ### you could also use the included python script (and a big swap partition) to make them yourself. again it's just the one from mobiuslabs themselves
- ### ps read Sleeper Agents (2024/01) :-)
+ #### you could also use the included/linked python script (and a big swap partition) to make them yourself.
+ ```
+ for Mixtral, using hqq 0.1.2.post:
+ you will need >180 gigabytes of physically addressable memory - but it doesn't need to be RAM. Set yourself up with a ~160 GB swap partition.
+ the VRAM requirement is initially zero and never much larger than the emerging quantised model, so you can make any quant you can run.
+ ```
+ #### this takes about 10 minutes with the current optimizer - it takes me all day to upload an ~18 GiB file.
+ ## ps read Sleeper Agents (2024/01) :-)
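For anyone making their own quant, a minimal sketch of what such a script does, modelled on the mobiuslabs reference linked above and the `HQQModelForCausalLM` wrapper in hqq 0.1.2; the Mixtral module names and the save call are assumptions taken from that reference rather than from the script in this repo, so check them against the hqq version you actually install:

```python
from transformers import AutoTokenizer
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "Undi95/BagelMix-8x7B"  # any Mixtral-layout MoE works the same way

# The full-precision model is loaded on CPU first: this is where the ~180 GB of
# (swap-backed) system memory goes. VRAM is only consumed as layers are quantised.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_pretrained(model_id)

# Main-branch config described above: 4g64 attention, 2g16 experts.
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {
    # Attention projections: shared between all experts, so 4 bits costs little overall.
    "self_attn.q_proj": attn_params,
    "self_attn.k_proj": attn_params,
    "self_attn.v_proj": attn_params,
    "self_attn.o_proj": attn_params,
    # MoE expert linears get the aggressive 2-bit treatment.
    "block_sparse_moe.experts.w1": experts_params,
    "block_sparse_moe.experts.w2": experts_params,
    "block_sparse_moe.experts.w3": experts_params,
    # The gate (block_sparse_moe.gate) is deliberately left out, so it stays unquantised.
}

model.quantize_model(quant_config=quant_config)
model.save_quantized("BagelMix-8x7B-2g16-4g64-HQQ")
```

Quantisation runs layer by layer, which is why (as noted above) peak VRAM tracks the emerging quantised model rather than the full-precision one.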
  ---
  # BagelMix