Can we go even lower?

#1
by MotherSoraka - opened

Your quantization method seems superior and smarter than every other method so far.
If you can achieve such high output quality from a hybrid(?) of Q5_K and F16 quantization,
how do you think a Q2 or Q3 would do for 13-34B models?
Have you tried it?

Not extensively, but below Q5 I usually see signs of model degradation.
From some feedback I received, it also seems that larger models are more affected by quantization, but that has yet to be proven.

If you want to try, just tell me which model you want to quantize and I can provide the quants.

Or you can do the quantization yourself, starting from the F16 GGUF file with:

quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q5.gguf q5_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q6.gguf q6_k
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q8.gguf q8_0
quantize.exe --allow-requantize --pure model.f16.gguf model.f16.q8_p.gguf q8_0
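For the Q2/Q3 levels you asked about, the same flags should work in the same way; this is just a sketch following the pattern above (q2_k and q3_k_m are standard llama.cpp quant types, the output file names are illustrative, and I have not verified the quality of the results):

quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q3.gguf q3_k_m
quantize.exe --allow-requantize --output-tensor-type f16 --token-embedding-type f16 model.f16.gguf model.f16.q2.gguf q2_k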
