Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.

Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead of converting to fp32 as an intermediate step. I think that's bollocks. fp16 actually carries more mantissa bits than bf16 (10 vs. 7), so any bf16 weight inside fp16's normal range (magnitudes from roughly 6e-5 up to 65504) converts to fp16 exactly, and the stray weights tiny enough to fall below that range are numerically irrelevant noise. There is no logical or mathematical justification for how that could possibly matter.
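Don't want to take my word for it? Here's a quick sanity check you can run yourself (it needs PyTorch and has nothing to do with the GGUF tooling; it just builds a pile of random bf16 values inside fp16's normal range and confirms that bf16 -> fp16 gives bit-for-bit the same numbers as bf16 -> fp32):

````
python3 -c "
import torch
w = torch.empty(10**6).uniform_(-1, 1).bfloat16()  # stand-in for bf16 weights
w = w[w.abs() >= 2**-14]                           # keep values in fp16's normal range
print(torch.equal(w.half().float(), w.float()))    # True: the fp16 detour is lossless
"
````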
So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways (rough commands are sketched below the list):

- fp16 specifically with `--outtype f16`
- fp32 specifically with `--outtype f32`
- "Auto" with no outtype specified
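For anyone who wants to reproduce the conversions, they looked roughly like this. Treat it as a sketch, not a copy-paste recipe: depending on the age of your llama.cpp checkout the script may be convert.py or convert-hf-to-gguf.py, and the output filenames here are just placeholders.

````
# from a local clone of Undi95/Meta-Llama-3-8B-Instruct-hf, inside a llama.cpp checkout
python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f16 --outfile llama3-8b-f16.gguf
python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f32 --outfile llama3-8b-f32.gguf
python convert.py Meta-Llama-3-8B-Instruct-hf               --outfile llama3-8b-auto.gguf  # no outtype
````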
I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw text file (the exact quantize/perplexity commands are sketched after the results). The results:
````
FP16 specified:  size 14.9GB  PPL @ fp16 9.5158 +/- 0.15418  PPL @ Q4km 9.6414 +/- 0.15494
FP32 specified:  size 29.9GB  PPL @ fp32 9.5158 +/- 0.15418  PPL @ Q4km 9.6278 +/- 0.15466
None specified:  size 29.9GB  PPL @ ???? 9.5158 +/- 0.15418  PPL @ Q4km 9.6278 +/- 0.15466
````
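For reference, the quantize and perplexity runs were along these lines (again a sketch: the filenames match the placeholders above, newer llama.cpp builds name the binaries llama-quantize and llama-perplexity, and wiki.short.raw is my own truncated wikitext file):

````
# quantize each conversion to Q4_K_M
./quantize llama3-8b-f16.gguf  llama3-8b-f16.Q4_K_M.gguf  Q4_K_M
./quantize llama3-8b-f32.gguf  llama3-8b-f32.Q4_K_M.gguf  Q4_K_M
./quantize llama3-8b-auto.gguf llama3-8b-auto.Q4_K_M.gguf Q4_K_M

# perplexity over the abbreviated wiki text (repeat for each GGUF above)
./perplexity -m llama3-8b-f16.gguf        -f wiki.short.raw
./perplexity -m llama3-8b-f16.Q4_K_M.gguf -f wiki.short.raw
````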
As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16: the unquantized perplexities are identical, and the Q4_K_M difference (9.6414 vs. 9.6278, a gap of 0.0136) is buried well inside the +/- 0.155 error bars. There will no doubt be some people who will claim "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.