Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models.  I don't disagree.

Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
of converting to fp32 as an intermediate step.  I think that's bollocks.  There is no logical or mathematical justification for
how that could possibly matter.
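
For what it's worth, there's a simple reason to expect exactly that: fp16 carries 10 mantissa bits to bf16's 7, so any bf16 weight whose magnitude lands in fp16's normal range converts to fp16 exactly, and an 8B model's weights essentially all do.  If you want to sanity-check that numerically without downloading anything, here's a small sketch (it assumes PyTorch and uses synthetic values at a typical weight scale, not the actual Llama tensors):

````
import torch

torch.manual_seed(0)
# Synthetic "weights" at a typical LLM scale (std ~0.02), stored as bf16 like the HF checkpoint.
w_bf16 = (torch.randn(1_000_000) * 0.02).to(torch.bfloat16)

via_fp16 = w_bf16.to(torch.float16).to(torch.float32)   # bf16 -> fp16 -> fp32
via_fp32 = w_bf16.to(torch.float32)                     # bf16 -> fp32 directly

# Only bf16 values fp16 can't hold exactly can differ: magnitudes above 65504 (none here)
# or tiny ones below ~1e-5 that land in fp16's subnormal range, where the error is capped
# at fp16's smallest step, 2**-24 ~= 6e-8 -- noise next to any 4-bit quantization error.
diff = (via_fp16 - via_fp32).abs()
print(f"elements that differ: {(diff > 0).sum().item()} of {w_bf16.numel()}")
print(f"largest difference:   {diff.max().item():.3g}")
````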

So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:
- fp16 specifically with `--outtype f16`
- fp32 specifically with `--outtype f32`
- "Auto" with no outtype specified

I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw
text file, along the lines of the sketch below.
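
(Again a sketch, not the literal commands: the quantize and perplexity tools live in different places depending on the build, and newer llama.cpp versions rename them to llama-quantize and llama-perplexity, so adjust the paths to whatever yours uses.)

````
import subprocess

GGUFS = ["llama3-8b-f16.gguf", "llama3-8b-f32.gguf", "llama3-8b-auto.gguf"]

# Quantize each conversion to Q4_K_M...
quantized = []
for gguf in GGUFS:
    q = gguf.replace(".gguf", "-Q4_K_M.gguf")
    subprocess.run(["llama.cpp/quantize", gguf, q, "Q4_K_M"], check=True)
    quantized.append(q)

# ...then run perplexity on every file, unquantized and quantized alike.
for gguf in GGUFS + quantized:
    subprocess.run(["llama.cpp/perplexity", "-m", gguf, "-f", "wiki.short.raw"], check=True)
````

The results: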

````
FP16 specified:  size 14.9GB    PPL @ fp16 9.5158 +/- 0.15418    PPL @ Q4km 9.6414 +/- 0.15494
FP32 specified:  size 29.9GB    PPL @ fp32 9.5158 +/- 0.15418    PPL @ Q4km 9.6278 +/- 0.15466
None specified:  size 29.9GB    PPL @ ???? 9.5158 +/- 0.15418    PPL @ Q4km 9.6278 +/- 0.15466
````


As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16: the full-precision conversions
score identically, and the 0.0136 gap between their Q4_K_M quants is buried well inside the ±0.155 error bars.  There will no
doubt be some people who will claim "PpL iSn'T gOoD eNoUgH!!1!".  For those people, I have uploaded all of the GGUFs used in this
test.  Feel free to use those files to do more extensive testing on your own time.  I consider the matter resolved until somebody
can conclusively demonstrate otherwise.