Heads up: Broken Mistral Nemo 12B quants / other sizes.
I was doing some testing of new Mistral Nemo "MOEs" today and ran into an "oh crap..." situation.
Here it is:
- Min viable quant for a Mistral Nemo (regardless of parameter count) seems to be IQ4_XS / Q4_K_S.
- Imatrix: IQ3_XXS minimum (IQ3_S is better).
- Q2K (non-imatrix and imatrix) are borked. Completely, totally BORKED.
- Q3s barely operate / are dumb.
- Suggest Q5 minimum for quality, or Q6.
This applies to 12B Mistral Nemo, MOEs of Mistral Nemo, and larger Mistral Nemos as far as I can tell.
I tested a number of them.
Maybe this is why they dropped max context to 128k for other, newer Mistrals?
NOTE:
Older Mistrals (7B / 32k context) operate at IQ1_S/M; same for MOEs of this Mistral type.
That's not good... Interesting how it affects the whole family.
Clarify:
By larger Mistral Nemos I mean stackers like my 23B / 23.5B Grand Gutenbergs.
Going to update my "MN" model cards tomorrow. ERRRR.
:) We're ready, although HF has severely limited our quantize capabilities :(
OOo... storage limits? other?
Early am here, brain not fully engaged.
Going to see if I can "hack" a solution for MNs.
I almost threw 3 perfectly viable MN MOEs in the trash because Q2K did not work - a case of: "It really is YOU, not ME".
This issue does affect Llama 3, 3.1, 3.2 - but to a WAY smaller, if not negligible, degree.
This tracks with my low quant investigations => the more context, the higher the quant you need to operate.
Makes sense
-> You need greater BPW for more nuance, and the model needs more nuanced understanding at higher context limits.
They rate limited the API because our repository creation was too fast, and blocked us for hours at a time - very little progress possible. The problem is that every upload tries to create the repository, and that call seems to have a very low rate limit. We have worked around it by uploading small quants in batches, which has improved the situation. But doing hack modifications in a running system with this throughput is not fun, especially if you want to go to sleep and have to wait for a few jobs to finish to see if it works or not. Sigh.
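For anyone hitting the same wall, the shape of the workaround is roughly this (a CLI sketch only - the repo names, paths and sleep time are placeholders, and our actual setup is a running pipeline rather than a shell loop):

```bash
# create the repo ONCE up front, so the individual uploads don't keep hitting
# the (apparently heavily rate-limited) repo-creation call
huggingface-cli repo create My-Model-GGUF --type model

# then push the quants in small batches, backing off between files
for f in quants/*.gguf; do
    huggingface-cli upload your-user/My-Model-GGUF "$f" "$(basename "$f")"
    sleep 120   # crude back-off; the real limit is unknown to us
done
```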
The only real disappointment is that I mailed huggingface about it, and they didn't even bother with a reply.
As for your problem, is it specifically a problem for MOEs? Would make sense, as quantisation affects smaller models worse.
Hmm ; mini nightmare at this end uploading source files ; really bad the last week or so.
Could be everyone shopping online!?!
Had HF api uploaders (for source) crash/break several times.
I have mine set with 3 duplicates per 10 source file uploads - always seemed to work... not this past week.
Sleep? What is that? ...
Seems to affect Mistral Nemos - MOE or non-moe.
Q2K quant is junk; been trying different ways to "raise it up" - helped a bit with more BITS for the output weights / embed weights
(--output-tensor-type Q8_0 --token-embedding-type Q8_0).
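For reference, that is just the standard llama.cpp quantize call with the two override flags added - along these lines (file names are placeholders):

```bash
# quantize to Q2_K but force the output and token-embedding tensors up to Q8_0
# (file names are placeholders)
./llama-quantize \
  --output-tensor-type Q8_0 \
  --token-embedding-type Q8_0 \
  Mistral-Nemo-12B-f16.gguf Mistral-Nemo-12B-Q2_K.gguf Q2_K
```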
Q3s not much better.
Basically it means the "floor" for a Mistral Nemo (12B model) quant is Q4_K_S as the "minimum", instead of the "recommended".
Ahh... "Q2K" is "freaking desperate, I am dying of thirst".
Issue is both instruction following / output generation.
Instruction: Not following instructions correctly / missing instructions entirely.
Output: Generation crashes -> paragraph repeats (maybe fixable with the "DRY" sampler), but worse - word repeats => BOOM.
Sentence structure is poor / awful too.
This is for MOE (2x, 4x) and regular 12B Mistral Nemos.
My MN stackers seem to fare a bit better; 23B, 23.5B, 29B.
I am still running tests at this point, going to try "older" llama.cpp too.
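On the DRY note: if you are testing in llama.cpp directly, recent builds expose DRY on the command line - something like the below (the values are just a starting point I'd try, not tuned settings):

```bash
# hedged example: enable the DRY sampler in llama-cli to tame paragraph/word repeats
# (requires a llama.cpp build with DRY support; values are untuned guesses)
./llama-cli -m Mistral-Nemo-12B-Q2_K.gguf -c 8192 \
  --dry-multiplier 0.8 \
  --dry-base 1.75 \
  --dry-allowed-length 2
```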
UPDATE:
It seems generating the "main.gguf" (used to quant all the others) at F32 (vs F16/BF16), with --output-tensor-type Q8_0 --token-embedding-type Q8_0 during quantize, works.
For reference, the default output tensor / token embed types for Q2K are "Q6_K" and "Q2_K" respectively.
Q2k isn't perfect, but it does operate fairly well.
Going to test on other quants and do some measurements.
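Roughly, the steps above (script/binary names are from current llama.cpp; model paths are placeholders):

```bash
# 1) build the master GGUF at F32 instead of F16/BF16
python convert_hf_to_gguf.py ./Mistral-Nemo-12B --outtype f32 --outfile MN-12B-f32.gguf

# 2) quantize from the F32 master, forcing output / token-embed tensors to Q8_0
./llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
  MN-12B-f32.gguf MN-12B-Q2_K.gguf Q2_K

# 3) rough "measurement": compare perplexity against a stock Q2_K of the same model
./llama-perplexity -m MN-12B-Q2_K.gguf -f wiki.test.raw -c 512
```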
> Had HF api uploaders (for source) crash/break several times.
Never seen anything else.
> Sleep? What is that? ...
People who don't sleep enough increasingly sound like you. Sufficient sleep is important.
Sleep has never been one of my strong suits.
That being said, I am aware the tin foil hat gets tighter on little sleep.
More testing today, plus working on revised quant options for MN models (and other archs too).
Did some testing on my "Multiverse MOE 4X7B" at Q2K (7B Mistral models, 32k context).
Some uptick in performance. However, the PPL and general stability of these "old 7Bs" are remarkable.
Looking at crafting "Super 6" and "Super 8" [F16/BF16] on newer archs / archs with higher context levels to measure effects.
Likewise looking at the Imatrix end too; upping the output/embed settings "just a bit" to see if it fixes some issues at the low end.
IE: Instruction following is the first issue that appears at low-end quants, and based on some test results, increasing the bits
for embed/output (most likely output!) addresses these issues to a degree.
Interestingly: q8/q8 for output / embed is not always the best choice depending on USE CASE.
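For the imatrix experiments, the idea is simply: build the imatrix as usual, then quantize with it while nudging output/embed up a notch - roughly like this (the calibration file and the Q6_K choice are examples, not recommendations):

```bash
# build an importance matrix from a calibration text file
./llama-imatrix -m MN-12B-f32.gguf -f calibration.txt -o MN-12B.imatrix

# imatrix quantize with output/embed bumped "just a bit"
./llama-quantize --imatrix MN-12B.imatrix \
  --output-tensor-type Q6_K --token-embedding-type Q6_K \
  MN-12B-f32.gguf MN-12B-IQ3_XXS.gguf IQ3_XXS
```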