MarsupialAI committed
Commit a383725
1 Parent(s): c3cd2ef

Create README.md

Files changed (1)
  1. README.md +21 -0
README.md ADDED
@@ -0,0 +1,21 @@
+ Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
+
+ Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
+ of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for
+ how that could possibly matter.
+
+ So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways (commands sketched after the list):
+ - fp16 specifically with `--outtype f16`
+ - fp32 specifically with `--outtype f32`
+ - "Auto" with no outtype specified
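+
+ Roughly, the three conversions look like this with llama.cpp's convert script (script name, paths, and output filenames here are illustrative and may differ depending on your llama.cpp checkout):
+
+ ```sh
+ # fp16 intermediate
+ python convert.py ./Meta-Llama-3-8B-Instruct-hf --outtype f16 --outfile llama3-8b-f16.gguf
+
+ # fp32 intermediate
+ python convert.py ./Meta-Llama-3-8B-Instruct-hf --outtype f32 --outfile llama3-8b-f32.gguf
+
+ # "auto": no --outtype, let the converter choose
+ python convert.py ./Meta-Llama-3-8B-Instruct-hf --outfile llama3-8b-auto.gguf
+ ```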
+
+ I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw.
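+
+ For each of the three conversions, that step looks roughly like this (the binaries below are llama.cpp's `quantize` and `perplexity` tools; adjust names and paths for your build):
+
+ ```sh
+ # quantize the f16 conversion down to Q4_K_M
+ ./quantize llama3-8b-f16.gguf llama3-8b-f16.Q4_K_M.gguf Q4_K_M
+
+ # perplexity on both the unquantized conversion and the quant
+ ./perplexity -m llama3-8b-f16.gguf -f wiki.short.raw
+ ./perplexity -m llama3-8b-f16.Q4_K_M.gguf -f wiki.short.raw
+ ```
+
+ Same again for the f32 and auto conversions, giving six PPL numbers in total.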
+
+ The results:
+
+ As you can see, converting to fp32 has no meaningful effect on PPL. There will no doubt be some people who will claim
+ "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to do more extensive
+ testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.