MarsupialAI committed on
Commit 9a9cf2e
1 Parent(s): 6376837

Update README.md

Files changed (1)
  1. README.md +33 -30
README.md CHANGED
@@ -1,32 +1,3 @@
- # Initial Testing 2024-04-25
-
- Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
-
- Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
- of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for
- how that could possibly matter.
-
- So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:
- - fp16 specifically with `--outtype f16`
- - fp32 specifically with `--outtype f32`
- - "Auto" with no outtype specified
-
- I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw
- text file. The results:
-
- ````
- FP16 specified: size 14.9GB PPL @ fp16 9.5158 +/- 0.15418 PPL @ Q4km 9.6414 +/- 0.15494
- FP32 specified: size 29.9GB PPL @ fp32 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
- None specified: size 29.9GB PPL @ ???? 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
- ````
-
-
- As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight,
- and the minuscule loss shown at Q4km (a difference of about 0.014 against error bars of roughly ±0.155) is well within the margin of error. There will no doubt be some people who will claim
- "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do
- more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
-
-
 # Continued Experiments 2024-05-11
 
 As an imatrix enjoyer, it has been bugging me whether the precision of the quant used to generate the imatrix actually
@@ -70,4 +41,36 @@ the fp32 GGUF and the fp32-generated imatrix was best, it was by such a miniscul
 difference between that (11.9314) and the Q4km made from the fp16 GGUF with the Q4_0-generated imatrix (11.9355) could be detected
 under normal usage. The only counterintuitive result here is that the Q4_0-imat quants outperformed the Q8_0-imat quants. I cannot
 think of a reason why this should be the case. But as it seemingly *is* the case, I will be using Q4_0 as my intermediate step for
- generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.
+ generating imatrices in the future when the full fp16 model is too big for my measly 72GB of VRAM.
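+
+ (For anyone who wants to reproduce this, the Q4_0-intermediate workflow looks roughly like the sketch below, using llama.cpp's quantize and imatrix tools. File names and the calibration text are placeholders, not my exact invocations.)
+
+ ````
+ # make a small intermediate quant that the imatrix run can fit in VRAM
+ ./quantize model-f16.gguf model-q4_0.gguf Q4_0
+ # compute the importance matrix from the Q4_0 quant over some calibration text
+ ./imatrix -m model-q4_0.gguf -f calibration.txt -o imatrix.dat
+ # feed that imatrix back in when making the final quant from the full-precision GGUF
+ ./quantize --imatrix imatrix.dat model-f16.gguf model-q4_k_m.gguf Q4_K_M
+ ````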
+
+
+
+
+ # Initial Testing 2024-04-25
+
+ Some folks are claiming there's something funky going on with GGUF quanting for Llama 3 models. I don't disagree.
+
+ Some of those people are speculating that it has something to do with converting the raw weights from bf16 to fp16 instead
+ of converting to fp32 as an intermediate step. I think that's bollocks. There is no logical or mathematical justification for
+ how that could possibly matter.
+
+ So to test this crazy theory, I downloaded Undi95/Meta-Llama-3-8B-Instruct-hf and converted it to GGUF three ways:
+ - fp16 specifically with `--outtype f16`
+ - fp32 specifically with `--outtype f32`
+ - "Auto" with no outtype specified
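+
+ (For reference, the conversions look like this with llama.cpp's convert script. This is a sketch rather than my exact invocations: convert.py is shown, convert-hf-to-gguf.py takes the same --outtype flag, and the output file names are placeholders.)
+
+ ````
+ # bf16 HF checkpoint -> GGUF at three output precisions
+ # (depending on the script version, Llama 3's BPE tokenizer may also need --vocab-type bpe)
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f16 --outfile llama3-8b-f16.gguf
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outtype f32 --outfile llama3-8b-f32.gguf
+ python convert.py Meta-Llama-3-8B-Instruct-hf --outfile llama3-8b-auto.gguf
+ ````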
+
+ I then quantized each of these conversions to Q4_K_M and ran perplexity tests on everything using my abbreviated wiki.short.raw
+ text file. The results:
+
+ ````
+ FP16 specified: size 14.9GB PPL @ fp16 9.5158 +/- 0.15418 PPL @ Q4km 9.6414 +/- 0.15494
+ FP32 specified: size 29.9GB PPL @ fp32 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
+ None specified: size 29.9GB PPL @ ???? 9.5158 +/- 0.15418 PPL @ Q4km 9.6278 +/- 0.15466
+ ````
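+
+ (Those numbers came from runs of roughly this shape — a sketch using llama.cpp's quantize and perplexity binaries; paths and file names are placeholders, not my exact invocations.)
+
+ ````
+ # quantize one of the conversions down to Q4_K_M
+ ./quantize llama3-8b-f16.gguf llama3-8b-f16-q4_k_m.gguf Q4_K_M
+ # perplexity over the abbreviated wiki text, at full precision and at Q4_K_M
+ ./perplexity -m llama3-8b-f16.gguf -f wiki.short.raw
+ ./perplexity -m llama3-8b-f16-q4_k_m.gguf -f wiki.short.raw
+ ````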
+
+
+ As you can see, converting to fp32 has no meaningful effect on PPL compared to converting to fp16. PPL is identical at full weight,
+ and the minuscule loss shown at Q4km (a difference of about 0.014 against error bars of roughly ±0.155) is well within the margin of error. There will no doubt be some people who will claim
+ "PpL iSn'T gOoD eNoUgH!!1!". For those people, I have uploaded all GGUFs used in this test. Feel free to use those files to do
+ more extensive testing on your own time. I consider the matter resolved until somebody can conclusively demonstrate otherwise.
+