8-bit quantization

#2 · opened by Kernel

Any plans for an 8-bit quantized model? I see that you don't make such models - why is that? I think 8-bit is the best for GPU usage.

GGML models are CPU only, so the GPU isn't involved.

I've never bothered with q8 because q5_1 is already so incredibly close to fp16 that there didn't seem to be any point.
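(For anyone wondering what q8 means here: GGML's q8_0 format stores each block of 32 weights as one scale plus 32 signed 8-bit integers. Here's a rough Python sketch of that block-wise scheme - illustrative only, not the actual ggml.c code:)

```python
import numpy as np

BLOCK = 32  # GGML quantises weights in blocks of 32

def q8_0_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantise to int8 and back, block-wise, with one scale per block.
    Illustrative sketch in the spirit of q8_0; the real code is in ggml.c."""
    out = np.empty_like(x)
    for i in range(0, len(x), BLOCK):
        block = x[i:i + BLOCK]
        scale = float(np.max(np.abs(block))) / 127.0
        if scale == 0.0:
            scale = 1.0  # avoid division by zero on an all-zero block
        q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
        out[i:i + BLOCK] = q.astype(np.float32) * scale  # dequantised values
    return out

weights = np.random.randn(4096).astype(np.float32)
print("max abs error:", np.max(np.abs(weights - q8_0_roundtrip(weights))))
```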

Here's the quantisation table from the README of llama.cpp:

[Screenshot: the llama.cpp README perplexity table for each quantisation format]

On a 13B model like this, fp16 scores 5.2455 and q5_1 scores 5.2582. That's a difference of 0.24%. q8_0 scores 5.2458, which is only about 0.006% 'worse' than fp16. So it is better than q5_1, but is anyone ever really going to be able to spot the difference between 0.24% higher perplexity and 0.006% higher?
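(If you want to check those percentages yourself, it's just the relative increase over fp16, using the 13B numbers quoted above:)

```python
# Relative perplexity increase vs fp16, from the table above
fp16, q5_1, q8_0 = 5.2455, 5.2582, 5.2458

for name, ppl in [("q5_1", q5_1), ("q8_0", q8_0)]:
    print(f"{name}: +{100 * (ppl - fp16) / fp16:.3f}% vs fp16")

# q5_1: +0.242% vs fp16
# q8_0: +0.006% vs fp16
```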

So that's why I never bothered. Then again, I guess I could upload them just for completeness! It's not like they use any disk space for me once I've uploaded them - that's on HF :)

OK, next time I do a model I'll do q8 as well, and maybe I'll add some q8s for the last couple of models I did, too.
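(Something like this would handle the back-catalogue in bulk - the paths and filenames here are just placeholders, and it assumes llama.cpp's quantize tool is already built:)

```python
import subprocess
from pathlib import Path

# Hypothetical batch re-quantisation of existing fp16 GGML files to q8_0.
QUANTIZE = "./llama.cpp/quantize"  # assumes the quantize binary is built

for f16 in Path("models").glob("*f16.bin"):  # placeholder naming scheme
    q8 = f16.with_name(f16.name.replace("f16", "q8_0"))
    subprocess.run([QUANTIZE, str(f16), str(q8), "q8_0"], check=True)
```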
