
Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.

Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for how that could possibly matter.
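
For anyone who wants to poke at the number formats directly, here is a minimal sketch using only Python's standard library. It is an illustration, not part of the test itself: bf16 is emulated by truncating an fp32 bit pattern to its top 16 bits, and the helper names (`to_bf16`, `via_fp16`, `via_fp32`) and sample values are made up for the example. fp16 carries more mantissa bits than bf16, so any value inside fp16's exponent range should come through unchanged; only extremely tiny or extremely large magnitudes can differ between the two paths.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the top 16 bits of the fp32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def via_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision (fp16)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]

def via_fp32(x: float) -> float:
    """Round-trip a value through IEEE single precision (fp32)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Hypothetical sample values in a plausible weight-like range.
samples = [to_bf16(v) for v in (0.0123, -0.5, 1.75, 3.1415, 1e-4, -250.0)]

lossless_fp32 = all(via_fp32(v) == v for v in samples)
lossless_fp16 = all(via_fp16(v) == v for v in samples)
print("bf16 -> fp32 exact:", lossless_fp32)
print("bf16 -> fp16 exact:", lossless_fp16)

# Edge case: a magnitude far below fp16's range does get rounded.
tiny = to_bf16(1e-7)
print("tiny value survives fp16:", via_fp16(tiny) == tiny)
```

Both conversions are exact for the in-range samples, and only the deliberately tiny value changes on the fp16 path. Whether a real checkpoint contains enough out-of-range values to matter is exactly what the perplexity comparison is for.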

So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:

- fp16 specifically with --outtype f16
- fp32 specifically with --outtype f32
- "Auto" with no outtype specified

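The three conversions above can be sketched as command lines. This is a hedged reconstruction, not the exact invocation used: the README does not name the conversion script, so `CONVERT_SCRIPT`, `MODEL_DIR`, and the output file names below are placeholders, and the script name in particular varies between llama.cpp versions. Only `--outtype` comes from the text above. The snippet just builds and prints the commands; it does not run them.

```python
import shlex

# Assumptions, not taken from the README: script name, model path, output names.
CONVERT_SCRIPT = "convert-hf-to-gguf.py"
MODEL_DIR = "Meta-Llama-3-8B-Instruct-hf"

def convert_command(outtype=None):
    """Build one conversion command; outtype=None leaves the choice to the script."""
    label = outtype or "auto"
    cmd = ["python", CONVERT_SCRIPT, MODEL_DIR,
           "--outfile", f"llama3-8b-{label}.gguf"]
    if outtype is not None:
        cmd += ["--outtype", outtype]
    return cmd

# One command per conversion: explicit f16, explicit f32, and no outtype at all.
conversion_commands = [convert_command(t) for t in ("f16", "f32", None)]
for cmd in conversion_commands:
    print(shlex.join(cmd))
```
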
I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw.
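
A rough sketch of that quantize-and-measure loop, again as printed commands rather than the exact ones used. The binary names (`./quantize`, `./perplexity`) and the GGUF file names are assumptions; newer llama.cpp builds ship the same tools as `llama-quantize` and `llama-perplexity`. "Everything" is read here as both the unquantized and the Q4_K_M file for each conversion.

```python
import shlex

# Assumptions, not taken from the README: binary names and GGUF file names.
QUANTIZE_BIN = "./quantize"
PERPLEXITY_BIN = "./perplexity"
TEST_FILE = "wiki.short.raw"

def quantize_command(src, dst, qtype="Q4_K_M"):
    """Quantize a full-precision GGUF down to the given type."""
    return [QUANTIZE_BIN, src, dst, qtype]

def perplexity_command(model):
    """Measure perplexity of one GGUF against the test text."""
    return [PERPLEXITY_BIN, "-m", model, "-f", TEST_FILE]

def plan(labels=("f16", "f32", "auto")):
    """For each conversion: quantize it, then test both the original and the quant."""
    steps = []
    for label in labels:
        full = f"llama3-8b-{label}.gguf"
        quant = f"llama3-8b-{label}-Q4_K_M.gguf"
        steps.append(quantize_command(full, quant))
        steps.append(perplexity_command(full))
        steps.append(perplexity_command(quant))
    return steps

for step in plan():
    print(shlex.join(step))
```

That gives three quantize runs and six perplexity runs, so the fp16, fp32, and auto paths can be compared both before and after quantization.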

The results:

As you can see, converting to fp32 has no meaningful effect on PPL. There will no doubt be some people who will claim "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to do more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.