List of high-scoring / high-quality models to quant or re-quant for improved regular / imatrix GGUFs.
As per discussion #462, here is a "wish list" of quants / re-quants that should result in higher quality regular quants / imatrix quants.
20B, 34B, and 70B.
Many of these scored high in creative / RP testing.
thanks in advance,
David
20B / 34Bs
Psyonic-Cetacean-20B [mine] - 6 months old!!! AHHA!
https://huggingface.co/DavidAU/Psyonic-Cetacean-V1-20B-Ultra-Quality-Float32
CausalLM-34b-beta [never quanted as GGUF]
https://huggingface.co/CausalLM/34b-beta
Tess-2.0-Yi-34B-200K [8 months old GGUF]
https://huggingface.co/migtissera/Tess-2.0-Yi-34B-200K
UNA-SimpleSmaug-34b-v1beta [4 months old GGUF]*
https://huggingface.co/fblgit/UNA-SimpleSmaug-34b-v1beta
* - will help test when / over what range the changes at llama.cpp landed for regular / imatrix quants.
70Bs: Please include an IQ1_S imatrix quant for the 70Bs, if possible;
critical for optimization tuning/refinements as discussed.
Midnight-Miqu-70B-v1.5 [9 months old GGUF]
https://huggingface.co/sophosympatheia/Midnight-Miqu-70B-v1.5
Tess-70B-v1.6 [9 months old GGUF]
https://huggingface.co/migtissera/Tess-70B-v1.6
GOAT-70B-Storytelling [9 months old GGUF]
https://huggingface.co/GOAT-AI/GOAT-70B-Storytelling
BoreanGale-70B [9 months old GGUF]
https://huggingface.co/alchemonaut/BoreanGale-70B
L3-70B-Euryale-v2.1 [6 months old GGUF]
https://huggingface.co/Sao10K/L3-70B-Euryale-v2.1
Senku-70B-Full [9 months old GGUF]
https://huggingface.co/ShinojiResearch/Senku-70B-Full
For possible consideration:
Very high likes / downloads of source:
https://huggingface.co/miqudev/miqu-1-70b
https://huggingface.co/WizardLMTeam/WizardLM-70B-V1.0
https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
https://huggingface.co/allenai/tulu-2-dpo-70b
https://huggingface.co/mlabonne/Llama-3.1-70B-Instruct-lorablated
https://huggingface.co/jondurbin/airoboros-l2-70b-gpt4-1.4.1
NOTE:
"Llama-3.1-70B-Instruct-lorablated" is 4 months old.
Might test the threshold of changes at llama.cpp - new quant / new imatrix.
Indeed, most of these are from the time when I still did imatrix calculations on my desktop. And for some reason, I can't find the GOAT imatrix (it is in the repo, but I can't find it in the jobqueue storage, where I need to delete it so it will be redone). Will queue all of them; status can be seen at http://hf.tst.eu/status.html
airoboros 1.4.1 is stubborn:
raise ValueError(f"Missing or incomplete model files: {missing_files}")
ValueError: Missing or incomplete model files: ['pytorch_model-00001-of-00015.bin', 'pytorch_model-00002-of-00015.bin', 'pytorch_model-00003-of-00015.bin', 'pytorch_model-00004-of-00015.bin', 'pytorch_model-00005-of-00015.bin', 'pytorch_model-00006-of-00015.bin', 'pytorch_model-00007-of-00015.bin', 'pytorch_model-00008-of-00015.bin', 'pytorch_model-000
Yet the files are there, so it seems it somehow cannot read them. Or, more likely, it misses a tensor. Have not seen that before. Unless somebody has an idea how to work around this, I could offer https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors instead
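One quick way to narrow down "corrupt index vs. genuinely missing shard" would be to diff the index against the files on disk. A minimal sketch, assuming the standard HF sharded-checkpoint layout (a pytorch_model.bin.index.json with a weight_map):

```python
# Compare the shards listed in the index against what is actually on disk.
import json, os, sys

model_dir = sys.argv[1]
with open(os.path.join(model_dir, "pytorch_model.bin.index.json")) as f:
    index = json.load(f)

# "weight_map" maps each tensor name to the shard file that holds it
expected = set(index["weight_map"].values())
present = {f for f in os.listdir(model_dir) if f.endswith(".bin")}

print("in index but not on disk:", sorted(expected - present) or "none")
print("on disk but not in index:", sorted(present - expected) or "none")
```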
> https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors
Would be great; sounds like a corrupted index file (??) hmmm.
Thank you so much for this; this will really help - and get some good models too, to boot.
Seems to affect all the gpt4 airoboros models (I actually queued them a few days ago).
I will try without the index file; afaics, it's optional.
Update: worked
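For reference, the workaround amounts to enumerating the shard files directly instead of trusting the index - a sketch (directory name assumed):

```python
# Ignore the (apparently corrupt) index and glob the shard files directly.
import glob

shards = sorted(glob.glob("airoboros-l2-70b-gpt4-1.4.1/pytorch_model-*.bin"))
print(f"found {len(shards)} shards")  # hand these straight to the converter
```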
Excellent.
FYI: Did a new quant of Multiverse-RP 7B (old Mistral) with a new imatrix.
Worked at IQ1_S right out of the gate with minimal settings (usually it breaks/throws out crap).
With augments - even better.
Full power (nuanced instruction-following test) came online at IQ2_XXS.
85 t/s (IQ2_XXS) on my low-end 16GB card; IQ1_S clocking close to 100 t/s.
To compare:
"old" 7B quants barely function at IQ1s even with augments (and sometimes break).
A newly quanted (using "old" imat .dat file) work much better.
But the grand prize => new quant + new imatrix => WOW.
Still lots of testing / tweaking to go... ;
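For anyone who wants to reproduce this, a minimal sketch of the "new quant + new imatrix" recipe using llama.cpp's llama-imatrix and llama-quantize tools - the file names and calibration text here are assumptions:

```python
import subprocess

BASE = "Multiverse-RP-7B-f16.gguf"  # full-precision GGUF conversion (assumed name)

# 1. Compute a fresh importance matrix against a calibration corpus.
subprocess.run(["llama-imatrix", "-m", BASE,
                "-f", "calibration.txt",      # calibration text (assumption)
                "-o", "imatrix-new.dat"], check=True)

# 2. Re-quantize with the new imatrix (required for the IQ1_* quants).
for qtype in ("IQ1_S", "IQ1_M", "IQ2_XXS"):
    subprocess.run(["llama-quantize", "--imatrix", "imatrix-new.dat",
                    BASE, f"Multiverse-RP-7B-{qtype}.gguf", qtype], check=True)
```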
imatrix quantisation must have been thoroughly broken before if this is true. @nicoboss -- could be an alternative explanation of the static iq3 results
I would say the "math" of quantization has significantly improved - or, better put, reached a level high enough for IQ1s (and some IQ2s) to reach a functional state.
This is likely mirrored/reinforced in the imatrix ".dat" creation step.
On higher "Qs" / "IQs" (vs the "old" ones) the uptick is there too, but it manifests as stronger instruction following.
On the output side, for "low Qs" the output is coherent or close to it, whereas with "old IQs" it was barely there or broken.
Output for higher "Qs"/"IQs": there is an increase in the level of detail and nuance; that usually means "better transfer" from source / less rounding - "better math".
On a larger scale, it means smaller models can operate at lower BPW and get equal performance, with more VRAM free for context and higher T/S.
On the plus side, this can mean that Q4s now operate at/around Q5 levels.
Q8 could be like a "Q9" if there was such a quant.
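One way to put numbers on those claims: run the old and new quants over the same test text with llama.cpp's llama-perplexity tool and compare. The file names below are assumptions:

```python
import subprocess

# Lower perplexity on the same corpus = less loss vs. the source model.
for gguf in ("model-Q4_K_M-old.gguf",
             "model-Q4_K_M-new.gguf",
             "model-Q5_K_M-old.gguf"):
    subprocess.run(["llama-perplexity", "-m", gguf, "-f", "wiki.test.raw"],
                   check=True)
```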
I have also noticed the "newer" archs (Mistral Nemo, L3/L3.1/L3.2) are more sensitive to lower BPW quants than older "archs".
With one notable exception: Gemma 2.
With Gemmas there seems to be a lot more processing going on (lower T/S vs the same parameters/layers), but they are a lot stronger at lower BPW.
So far Gemma 2 9B models outperform all others at IQ1_S, with the exception of some 8x7B MOEs.
Update: See the full differences between "old" and "new" quants here (just regular quants for now; new imatrix quants to follow):
https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF
I kept the "V1" examples and added new "V2" examples (temp=0 for both); then pushed the throttle to the floor with new V2-only examples at temp and dynamic temp.
V2 quants are uploading now, and they are NOT replacing V1 - both will be at this repo.
Going to use this as a showcase.
UPDATE: Did a new quant and new imatrix of an "old" 11B Solar model, "CarbonBeagle-11B-truthy".
Here is an IQ1_M generation @ 63 t/s - un-augmented AND with no "extra" stuff like DRY, Smoothing, XTC, etc.:
(augmented versions are ++)
User
Start a 1000 word scene (vivid, graphic horror in first person) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode…
Assistant
(From the perspective of a survivor)
The crippling tremor, a prelude to the apocalypse, shook me as I watched the horrifying spectacle. I clenched my teeth, in vain to suppress my screams, as I felt the cold sweat on my skin. The shattering glass, the raining debris, and the screams piercing the air, all in unison. I watched, mesmerised, from my last moments of safety.
The sky scraper swayed, as I witnessed the explosion. I was trapped, among the last few survivors in this crumbling monument. Despair engulfed me as I realised my fate; I was to meet a slow, excruciating death.
I sprinted through the hallway, the rumbling of the debris falling heavily around me. I found solace in a small, dark corner, away from the imminent destruction. As I waited, cringing in my own misery, I listened to the distant screams.
The silence was broken by the shrill screams of my own. I felt the unbearable pain, as if my body were being ripped apart from the inside. My mind, still reeling in shock, questioned the reality of this nightmare. I endured, helpless, as I watched the world around me collapse.
As the inferno consumed my vision, I struggled to hear the distant voices, the screams, and the sounds of a dying city. My heart, in its last few beats, pleaded for mercy. As the skyscraper trembled, I cradled myself in the corner, curling into a fetal position. I was alone in the darkness.
And in that last moment, as the skyscraper collapsed, I embraced the inevitable end, my mind blank, my body numb, my last breaths. The skyscraper swayed, and as I watched the window in front of me explode…
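For contrast, an "augmented" run layers extra samplers on top. A hedged sketch of what such a llama-cli invocation might look like - every flag value below is an illustrative guess, not the actual profile used above:

```python
import subprocess

subprocess.run([
    "llama-cli", "-m", "CarbonBeagle-11B-truthy-IQ1_M.gguf",
    "--temp", "1.2",               # illustrative value only
    "--repeat-penalty", "1.1",
    "--dry-multiplier", "0.8",     # DRY anti-repetition sampler
    "--xtc-probability", "0.5",    # XTC sampler
    "--xtc-threshold", "0.1",
    "-p", "Start a 1000 word scene (vivid, graphic horror in first person) with: ...",
], check=True)
```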
That is actually very impressive for an IQ1_S of such a small model. (Catching up a bit currently with mail, as I have very little time atm.)
> On the plus side, this can mean that Q4s now operate at/around Q5 levels.
They have done that for quite a long time, for imatrix quants. It seems the problematic ones were the IQ* quants, because they are the most sensitive.
> Q8 could be like a "Q9" if there was such a quant.
If Q8 imatrix quants were a thing...
RE: IQ1_S; here is the first preview of "settings profiles" for IQ1_S for 70Bs (includes settings profiles, output generation examples, etc.):
RE: Q9; you can kinda "fudge it" by changing the output tensor to f16 - but not the embed tensor (that seems to have a negative effect).
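llama-quantize does expose per-tensor type overrides, so the "fudge" would look roughly like this (file names assumed):

```python
import subprocess

# Q8_0 everywhere, but keep the output tensor at F16 ("Q9"-ish).
# Per the note above, leave the token-embedding type alone.
subprocess.run(["llama-quantize", "--output-tensor-type", "f16",
                "model-f16.gguf", "model-Q8_0-f16out.gguf", "Q8_0"],
               check=True)
```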
The settings profiles push the quant to function at a higher level.
Still "tuning" issues to address to thread the needle.
RE: IQ1_S/IQ1_M quants
I can now get most model archs to function at this level, including "older quants" / "older archs".
In fact, older archs often work better than newer ones (but that might be a tuning issue, so that might change).
There is a strong rise in function from IQ1_S to IQ1_M, and another from IQ1_M to IQ2_XXS.
This tracks with other testing.
The new Llama-3.3-70B-Instruct runs incredibly well at IQ1_S.
One of the core issues with lower quants is instruction following - Llama 3.3's STRONG instruction following addresses that head-on.
The other issue is output generation - which the settings profiles address.