Q4_1 requests
Can't say I am thrilled to download and convert these big models just for one quant, but you definitely earned it :) They should be there already.
I found something strange about Midnight-Miqu-70B-v1.5.i1-Q4_1.gguf. The perplexity appears to be higher than Q4_0 on wiki.test.raw. I have never seen this before; usually Q4_1 has lower perplexity than Q4_0 on every value. What's happening?
Midnight-Miqu-70B-v1.5.i1-Q4_0.gguf [1]3.9332, [2]4.2799, [3]4.8661, [4]4.6234, [5]4.8498, [6]4.7785, [7]4.8799, [8]4.8708, [9]5.0583, [10]5.2259, [11]5.3931, [12]5.4692
Midnight-Miqu-70B-v1.5.i1-Q4_1.gguf [1]4.0059, [2]4.3201, [3]4.9247, [4]4.6730, [5]4.8854, [6]4.7992, [7]4.8997, [8]4.8820, [9]5.0574, [10]5.2133, [11]5.3793, [12]5.4593
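For anyone who wants to reproduce running values like these, here is a minimal sketch of driving the runs from Python. It assumes the `llama-perplexity` binary from llama.cpp (older builds name it `perplexity`) is in the current directory and that the file paths are placeholders:

```python
import subprocess

# Placeholder paths -- adjust to wherever the GGUF files and wiki.test.raw live.
MODELS = [
    "Midnight-Miqu-70B-v1.5.i1-Q4_0.gguf",
    "Midnight-Miqu-70B-v1.5.i1-Q4_1.gguf",
]
DATASET = "wiki.test.raw"

for model in MODELS:
    # The tool prints a running perplexity per chunk ([1]3.93, [2]4.28, ...),
    # which is the format quoted above.
    subprocess.run(
        ["./llama-perplexity", "-m", model, "-f", DATASET],
        check=True,
    )
```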
Also, the token probabilities of Q8_0 match Q4_0 more closely than Q4_1:
Q8_0 [( L 49.17%) ( El 37.96%) ( Eli 4.29%) ( A 3.26%)]
Q4_0 [( L 52.19%) ( El 40.28%) ( Eli 4.33%) ( A 3.21%)]
Q4_1 [( El 47.00%) ( L 43.36%) ( Eli 4.23%) ( A 2.91%)]
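To put a rough number on "matches closer", here is a small sketch that sums the absolute probability differences over just the four tokens quoted above. Since only the top four probabilities are shown, this is a proxy rather than a proper divergence over the full vocabulary:

```python
# Top-4 token probabilities (in %) quoted above, for the same position.
q8_0 = {"L": 49.17, "El": 37.96, "Eli": 4.29, "A": 3.26}
q4_0 = {"L": 52.19, "El": 40.28, "Eli": 4.33, "A": 3.21}
q4_1 = {"L": 43.36, "El": 47.00, "Eli": 4.23, "A": 2.91}

def total_abs_diff(ref, quant):
    """Sum of |p_ref - p_quant| over the listed tokens, in percentage points."""
    return sum(abs(ref[t] - quant[t]) for t in ref)

print("Q4_0 vs Q8_0:", round(total_abs_diff(q8_0, q4_0), 2))  # ~5.43
print("Q4_1 vs Q8_0:", round(total_abs_diff(q8_0, q4_1), 2))  # ~15.26
```

Q4_1 is roughly three times further from Q8_0 on these four tokens, and it also flips the ranking of the top two tokens (El above L).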
usually Q4_1 has lower perplexity than Q4_0 on every value.
Well, we've been telling you: Q4_1 is well known to be very unstable. That was one of the reasons it was abandoned: it is often larger and worse than Q4_0. I only added it because you convinced me of its usefulness in certain situations (speed with Metal). This is just a data point showing that the problems were not fixed in recent versions.
The Q4_1 quant was done with the same imatrix, but using the current version of llama.cpp.
It could also be because miqu was made from a Q5_K quant instead of the usual 16-bit weights, so there may be no bias information left for Q4_1's extra offset value to capture, which is what would normally make it worthwhile.
I tested more models; it seems like a miqu issue rather than a problem with Q4_1 as a whole.
We will have to accept that Q4_1 quants have the potential to turn out worse than Q4_0 quants. For static quants, my measurements so far even indicate that this is usually the case. While for weighted/imatrix quants Q4_1 is usually much better than Q4_0, you cannot rely on imatrix training working that well for every model. Some models/architectures see less improvement from weighted/imatrix quants, in which case Q4_0 might still beat Q4_1. I would expect such outliers to be quite rare, and to mainly occur with non-English models, since our imatrix dataset is English-focused.
In this specific case I wouldn't count too much on perplexity measurements. They are one of the worst measurements llama.cpp gives you, especially when comparing quants of almost the same quality. Instead, use KL-divergence and token-probability measurements and see if they lead you to the same conclusion.
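If I remember correctly, recent llama.cpp builds can produce these numbers directly: run the reference model once with `--kl-divergence-base <file>` to save its logits, then run each quant with `--kl-divergence` against that file. For a self-contained illustration of the metric itself, here is a minimal sketch with toy numbers (not the distributions quoted above, which only list the top four tokens):

```python
import math

def kl_divergence(p_ref, q_quant, eps=1e-12):
    """KL(P_ref || Q_quant) in nats over one token distribution.

    p_ref, q_quant: probabilities over the same vocabulary order,
    each summing to 1 (e.g. softmax over the full vocab at one position).
    """
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_ref, q_quant) if p > 0)

# Toy 4-token vocabulary; smaller KL means closer to the reference.
p       = [0.49, 0.38, 0.08, 0.05]   # reference (e.g. Q8_0)
q_close = [0.52, 0.40, 0.05, 0.03]   # quant that tracks the reference
q_far   = [0.40, 0.47, 0.08, 0.05]   # quant that flips the top two tokens

print(kl_divergence(p, q_close))  # smaller
print(kl_divergence(p, q_far))    # larger
```

Averaging this over many token positions gives a far more sensitive quality signal than the final perplexity number alone.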
You're right: comparing token-probability differences against Q8_0, Q4_1 is better at English but worse at other languages than Q4_0.