ProphetOfBostrom committed
Commit 2b60801
1 Parent(s): 2c6818b

utility: maximised.

Files changed (1)
  1. README.md +14 -8
README.md CHANGED
@@ -12,19 +12,19 @@ tags:
  - HQQ
  - 2bit
  ---
- ## BagelMix-8x7B branch 2g16-4g64-HQQ
+ ## BagelMix-8x7B - main branch 2g16-4g64-HQQ
+ Under 20 GB
  By [Undi95](https://huggingface.co/Undi95/BagelMix-8x7B)

- #### (this readme has been written by a sleepy person. /disclaimer)
+ #### (this readme has been written by a sleepy person. the link above takes you to the original model, the link below to the Mixtral HQQ reference. the rest is rambling)
  ---

  [main branch is the same quant config as last time, the reference one from mobius here](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-v0.1-hf-attn-4bit-moe-2bit-HQQ)

  the label I've chosen refers to 2-bit linear layers with a group size of 16 params (each group carrying an 8-bit group weight), and 4 bits in groups of 64 for the attention layers;
  the actual bpw is therefore higher than 2, in no small part because we're adding another byte for every 4 bytes of packed weights (I think??) in the linear layers.
- 
- from what I can gather of hqq's source code, the gate network isn't quantised (because it's tiny and very important)
- 
+ from what I can gather of hqq's source code, the gate ('expert' selection) network isn't quantised (because it's tiny and very important).
+ this is also the reason the attention layers get 4 bits rather than 2 - in a MoE the attention is small (shared between all the 'experts'), and at 2 bits it would quantise about as badly as a 2 bpw Mistral.
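To make the "another byte every 4 bytes" estimate concrete: a group of 16 two-bit weights packs into 4 bytes, and the 8-bit group weight adds one more byte per group. A rough sketch of that arithmetic (it ignores the zero-points and HQQ's meta-quantisation of the scales, so the real bpw is a little higher):

```python
# Back-of-the-envelope bits-per-weight for the configs named above.
# Assumes one 8-bit group weight per group; zero-points and the quantised
# scale/zero metadata are ignored, so the real figures are slightly higher.

def effective_bpw(nbits: int, group_size: int, meta_bits: int = 8) -> float:
    """Packed weight bits plus per-group metadata, amortised over the group."""
    return nbits + meta_bits / group_size

print(effective_bpw(2, 16))  # experts "2g16"   -> 2.5  (one extra byte per 4 bytes of packed weights)
print(effective_bpw(4, 64))  # attention "4g64" -> 4.125
```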
  such reasoning has led me to experiment with taking more bits away from the expert/linear layers and putting them into the attention layers.

  I've currently got a slightly heavier 2g16 experts with 8g512 attention (not really sure how meaningful groups of 512 are, but w/e) model already,
@@ -36,10 +36,16 @@ experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, qua
  ```
  again, this is not what you're downloading if you get this right now: I want to see if I can actually keep the bpw down.
  these will be uploaded as alternate branches to this repo if they seem worth doing.
- might fiddle with 2g32 or even 3g128 or such for experts. given their most delectable sparseness
+ might fiddle with 2g32 or even 3g128 or such for experts. or try to stop HQQ from casting BF16 to FP16 for no reason.
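For concreteness, a minimal sketch of what those candidate settings would look like as hqq `BaseQuantizeConfig` objects (the same call that appears in the config block above); the group sizes are just the ones floated in the text, not configs that have been built or tested:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Possible expert (MoE linear) settings floated above - untested guesses:
experts_2g32  = BaseQuantizeConfig(nbits=2, group_size=32,  quant_zero=True, quant_scale=True)
experts_3g128 = BaseQuantizeConfig(nbits=3, group_size=128, quant_zero=True, quant_scale=True)

# The heavier experiment mentioned earlier: 2g16 experts with 8g512 attention.
experts_2g16 = BaseQuantizeConfig(nbits=2, group_size=16,  quant_zero=True, quant_scale=True)
attn_8g512   = BaseQuantizeConfig(nbits=8, group_size=512, quant_zero=True, quant_scale=True)
```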
 
- ### you could also use the included python script (and a big swap partition) to make them yourself. again it's just the one from mobiuslabs themselves
- ### ps read Sleeper Agents (2024/01) :-)
+ #### you could also use the included/linked python script (and a big swap partition) to make them yourself.
+ ```
+ for Mixtral, using hqq 0.1.2.post:
+ you will need >180 gigabytes of physically addressable memory - but it doesn't need to be RAM. Set yourself up with a ~160 GB swap partition.
+ the VRAM requirement is initially zero and never much larger than the emerging quantised model, so you can make any quant you can run.
+ ```
+ #### this takes about 10 minutes with the current optimizer - it takes me all day to upload an ~18 GiB file.
+ ## ps read Sleeper Agents (2024/01) :-)
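For anyone making their own quant, a minimal sketch of what such a script does, modelled on the mobiuslabs reference linked above and the `HQQModelForCausalLM` wrapper in hqq 0.1.2; the Mixtral module names and the save call are assumptions taken from that reference rather than from the script in this repo, so check them against the hqq version you actually install:

```python
from transformers import AutoTokenizer
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "Undi95/BagelMix-8x7B"  # any Mixtral-layout MoE works the same way

# The full-precision model is loaded on CPU first: this is where the ~180 GB of
# (swap-backed) system memory goes. VRAM is only consumed as layers are quantised.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_pretrained(model_id)

# Main-branch config described above: 4g64 attention, 2g16 experts.
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, quant_zero=True, quant_scale=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, quant_zero=True, quant_scale=True)

quant_config = {
    # Attention projections: shared between all experts, so 4 bits costs little overall.
    "self_attn.q_proj": attn_params,
    "self_attn.k_proj": attn_params,
    "self_attn.v_proj": attn_params,
    "self_attn.o_proj": attn_params,
    # MoE expert linears get the aggressive 2-bit treatment.
    "block_sparse_moe.experts.w1": experts_params,
    "block_sparse_moe.experts.w2": experts_params,
    "block_sparse_moe.experts.w3": experts_params,
    # The gate (block_sparse_moe.gate) is deliberately left out, so it stays unquantised.
}

model.quantize_model(quant_config=quant_config)
model.save_quantized("BagelMix-8x7B-2g16-4g64-HQQ")
```

Quantisation runs layer by layer, which is why (as noted above) peak VRAM tracks the emerging quantised model rather than the full-precision one.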
  ---
  # BagelMix