MarsupialAI committed on
Commit 9a9cf2e
1 Parent(s): 6376837

Update README.md

Files changed (1)
  1. README.md +33 -30
README.md CHANGED
@@ -1,32 +1,3 @@
- # Initial Testing 2024-04-25
-
- Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
-
- Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
- of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for
- how that could possibly matter.
-
- So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:
- - fp16 specifically with `--outtype f16`
- - fp32 specifically with `--outtype f32`
- - "Auto" with no outtype specified
-
- I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw
- text file. The results:
-
- ````
- FP16 specified: size 14.9GB PPL @ fp16 9.5158 +/- 0.15418 PPL @ Q4km 9.6414 +/- 0.15494
- FP32 specified: size 29.9GB PPL @ fp32 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
- None specified: size 29.9GB PPL @ ???? 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
- ````
-
-
- As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight,
- and the minuscule loss shown at Q4km (a difference of about 0.014 against error bars of roughly ±0.155) is well within the margin of error. There will no doubt be some people who will claim
- "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do
- more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
-
-
 # Continued Experiments 2024-05-11
 
 As an imatrix enjoyer, it has been bugging me whether the precision of the quant used to generate the imatrix actually
@@ -70,4 +41,36 @@ the fp32 GGUF and the fp32-generated imatrix was best, it was by such a miniscul
 difference between that (11.9314) and the Q4km made from the fp16 GGUF with the Q4_0-generated imatrix (11.9355) could be detected
 under normal usage. The only counterintuitive result here is that the Q4_0-imat quants outperformed the Q8_0-imat quants. I cannot
 think of a reason why this should be the case. But as it seemingly *is* the case, I will be using Q4_0 as my intermediate step for
- generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.
+ generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.
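+
+ (For anyone who wants to reproduce this, the Q4_0-intermediate workflow looks roughly like the sketch below, using llama.cpp's quantize and imatrix tools. File names and the calibration text are placeholders, not my exact invocations.)
+
+ ````
+ # make a small intermediate quant that the imatrix run can fit in VRAM
+ ./quantize model-f16.gguf model-q4_0.gguf Q4_0
+ # compute the importance matrix from the Q4_0 quant over some calibration text
+ ./imatrix -m model-q4_0.gguf -f calibration.txt -o imatrix.dat
+ # feed that imatrix back in when making the final quant from the full-precision GGUF
+ ./quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf Q4_K_M
+ ````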
+
+
+
+
+ # Initial Testing 2024-04-25
+
+ Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
+
+ Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
+ of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for
+ how that could possibly matter.
+
+ So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:
+ - fp16 specifically with `--outtype f16`
+ - fp32 specifically with `--outtype f32`
+ - "Auto" with no outtype specified
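+
+ (For reference, the conversions look like this with llama.cpp's convert script. This is a sketch rather than my exact invocations: convert.py is shown, convert-hf-to-gguf.py takes the same --outtype flag, and the output file names are placeholders.)
+
+ ````
+ # bf16 HF checkpoint -> GGUF at three output precisions
+ # (depending on the script version, Llama 3's BPE tokenizer may also need --vocab-type bpe)
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f16 --outfile llama3-8b-f16.gguf
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f32 --outfile llama3-8b-f32.gguf
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outfile llama3-8b-auto.gguf
+ ````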
+
+ I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw
+ text file. The results:
+
+ ````
+ FP16 specified: size 14.9GB PPL @ fp16 9.5158 +/- 0.15418 PPL @ Q4km 9.6414 +/- 0.15494
+ FP32 specified: size 29.9GB PPL @ fp32 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
+ None specified: size 29.9GB PPL @ ???? 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
+ ````
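+
+ (Those numbers came from runs of roughly this shape — a sketch using llama.cpp's quantize and perplexity binaries; paths and file names are placeholders, not my exact invocations.)
+
+ ````
+ # quantize one of the conversions down to Q4_K_M
+ ./quantize llama3-8b-f16.gguf llama3-8b-f16-q4_k_m.gguf Q4_K_M
+ # perplexity over the abbreviated wiki text, at full precision and at Q4_K_M
+ ./perplexity -m llama3-8b-f16.gguf -f wiki.short.raw
+ ./perplexity -m llama3-8b-f16-q4_k_m.gguf -f wiki.short.raw
+ ````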
+
+
+ As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight,
+ and the minuscule loss shown at Q4km (a difference of about 0.014 against error bars of roughly ±0.155) is well within the margin of error. There will no doubt be some people who will claim
+ "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do
+ more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
+