Quantization of output weight

#2
by Nexesenex - opened

Hello @mradermacher,

I noticed something unexpected in your quants: they are a bit bigger than usual.
When loading them with llama.cpp, I saw that one tensor is not quantized as it should be, and it's probably the output tensor: it remains in F16 instead of being quantized to Q6_K. That's a 500 MB (several percent) bump in the size of your model, and an output tensor in F16 brings at most a ~0.1% quality improvement, if any.

That mismatch is likely present in many of your quants, on several models.

[Screenshot: Screenshot 2024-04-05 031209.png]
The GGUF inspector that HF has says output.weight is in F16; I'm not sure if it's the same tensor you mentioned.
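For anyone wanting to check this locally rather than through the HF viewer, here is a minimal sketch using the gguf Python package's GGUFReader (the model path is a placeholder):

```python
# Minimal sketch: list the quantization type of each tensor in a GGUF file.
# Assumes the `gguf` Python package (pip install gguf) and its GGUFReader API.
from gguf import GGUFReader

reader = GGUFReader("model-Q4_K_M.gguf")  # placeholder path

for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum, e.g. F16, Q6_K, Q4_K
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")

# A non-quantized output tensor would show up as something like:
#   output.weight                            F16
```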

Yes, output.weight.
It should at least be quantized to Q6_K to shrink it without much quality loss.

I use --leave-output-tensor (which seems to be recommended), because quantizing the output tensor, especially at low bpp, often has a very detrimental effect. I don't know of a way to quantize the output tensor to a higher-precision type than the rest. But I am very open to discussing this further and potentially changing it.
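For context, this is roughly how the two approaches differ when invoking llama.cpp's quantize tool; a sketch driven from Python, assuming the binary is available as ./quantize (file names and the Q4_K_M type are placeholders):

```python
# Sketch of the two quantize invocations being discussed. Run one or the other.
# Assumes llama.cpp is built and its quantize binary sits at ./quantize.
import subprocess

SRC = "model-f16.gguf"  # placeholder: the unquantized source model

# Current approach: keep output.weight in F16 (bigger file, tiny quality gain).
subprocess.run(
    ["./quantize", "--leave-output-tensor", SRC, "model-Q4_K_M-f16-out.gguf", "Q4_K_M"],
    check=True,
)

# Alternative: drop the flag and let quantize pick the default type for
# output.weight (Q6_K for most mixes, per the discussion below).
subprocess.run(
    ["./quantize", SRC, "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```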

Hmm, I switched it off. Can't remember which models caused issues, but it indeed seems excessive, and I noticed this in the beginning when I was still rather green. I wonder if there is a way just to requantize those tensors in an already-quantized file.
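On the requantizing question: llama.cpp's quantize tool has an --allow-requantize flag that accepts an already-quantized GGUF as input, but as far as I know it operates on the whole file rather than on individual tensors, and the tool itself warns that quality suffers compared to quantizing from the original F16/F32. A sketch under those assumptions (binary path and file names are placeholders):

```python
# Hedged sketch: requantizing an existing quantized GGUF with --allow-requantize.
# This re-quantizes the whole file, not a single tensor, and quality is worse
# than quantizing from the original F16 source.
import subprocess

subprocess.run(
    [
        "./quantize",
        "--allow-requantize",            # accept an already-quantized GGUF as input
        "old-Q4_K_M-f16-output.gguf",    # placeholder: existing quant with F16 output
        "new-Q4_K_M.gguf",               # placeholder: output file
        "Q4_K_M",
    ],
    check=True,
)
```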

Not sure where I got it from, but it is mentioned often, e.g. in https://github.com/ggerganov/llama.cpp/discussions/2937

That's because quantizing this tensor was shown to have a pretty significant effect on model quality and one may choose to make the fairly small size/quality tradeoff of leaving that tensor alone.

Anyway, I let quantize decide for the time being. Clearly, "fairly small trade-off" and "it was shown" are all very fuzzy.

Ikawrakow already defined the defaults: starting from the 3-bit quantization strategies, 6 bits are used for the output tensor, and 5 bits below that.

From my various tests with quantization strategies, I can confirm that even on small models, the standard quantization of the output tensor is optimal, perfectly fine, and in line with what's expected: perplexity won't budge more than 0.1% (and often less than that) going from an F16 output tensor to Q6_K. With Q5_K, the drop is more significant (0.2%? :D) but irrelevant on 1- and 2-bit quants.

So, you can let quantize decide for the output tensor without a shadow of a doubt.
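To make the 0.1% figure concrete, here is the arithmetic with hypothetical perplexity values (made up for illustration, not measurements):

```python
# Hypothetical illustration of what a "0.1% perplexity change" means.
# The perplexity values below are invented, not measured.
ppl_f16_output = 6.2340   # e.g. a quant with output.weight left in F16
ppl_q6k_output = 6.2390   # e.g. the same quant with output.weight in Q6_K

relative_change = (ppl_q6k_output - ppl_f16_output) / ppl_f16_output * 100
print(f"relative perplexity change: {relative_change:+.3f}%")  # about +0.08%
```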

Your confirmation is enough for me then. Thanks for investigating!

mradermacher changed discussion status to closed
