List of high-scoring / high-quality quants and re-quants to improve GGUF / imatrix GGUFs.

#472
by DavidAU

As per discussion #462, here is a "wish list" of quants / re-quants that should result in higher-quality regular quants / imatrix quants.
20B, 34B, and 70B.

Many of these scored high in creative / RP testing.
Thanks in advance,
David


20B / 34Bs

Psyonic-Cetacean-20B [mine] - 6 months old!!! aHHA!
https://huggingface.co/DavidAU/Psyonic-Cetacean-V1-20B-Ultra-Quality-Float32

CausalLM-34b-beta [never quanted as GGUF]
https://huggingface.co/CausalLM/34b-beta

Tess-2.0-Yi-34B-200K [8 months old GGUF]
https://huggingface.co/migtissera/Tess-2.0-Yi-34B-200K

UNA-SimpleSmaug-34b-v1beta [4 months old GGUF]*
https://huggingface.co/fblgit/UNA-SimpleSmaug-34b-v1beta

  • helps test when/what range of changes at llama.cpp affect regular / imatrix quants.

70Bs: Please include an IQ1_S imatrix quant for the 70Bs, if possible;
critical for optimization tuning/refinements as discussed.

Midnight-Miqu-70B-v1.5 [9 months old GGUF]
https://huggingface.co/sophosympatheia/Midnight-Miqu-70B-v1.5

Tess-70B-v1.6 [9 months old GGUF]
https://huggingface.co/migtissera/Tess-70B-v1.6

GOAT-70B-Storytelling [9 months old GGUF]
https://huggingface.co/GOAT-AI/GOAT-70B-Storytelling

BoreanGale-70B [9 months old GGUF]
https://huggingface.co/alchemonaut/BoreanGale-70B

L3-70B-Euryale-v2.1 [6 months old GGUF]
https://huggingface.co/Sao10K/L3-70B-Euryale-v2.1

Senku-70B-Full [9 months old GGUF]
https://huggingface.co/ShinojiResearch/Senku-70B-Full


For possible consideration:

Very high likes / downloads of source:

https://huggingface.co/miqudev/miqu-1-70b
https://huggingface.co/WizardLMTeam/WizardLM-70B-V1.0
https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
https://huggingface.co/allenai/tulu-2-dpo-70b
https://huggingface.co/mlabonne/Llama-3.1-70B-Instruct-lorablated
https://huggingface.co/jondurbin/airoboros-l2-70b-gpt4-1.4.1

NOTE:
"Llama-3.1-70B-Instruct-lorablated" is 4 months old.
Might test the threshold of changes at llama.cpp - new quant / new imatrix.

Indeed, most of these are from the time when I still did imatrix calculations on my desktop. And for some reason, I can't find the GOAT imatrix (it is in the repo, but I can't find it in the job-queue storage where I need to delete it so it will be redone). Will queue all of them; status can be seen at http://hf.tst.eu/status.html

airoboros 1.4.1 is stubborn:

```
raise ValueError(f"Missing or incomplete model files: {missing_files}")
ValueError: Missing or incomplete model files: ['pytorch_model-00001-of-00015.bin', 'pytorch_model-00002-of-00015.bin', 'pytorch_model-00003-of-00015.bin', 'pytorch_model-00004-of-00015.bin', 'pytorch_model-00005-of-00015.bin', 'pytorch_model-00006-of-00015.bin', 'pytorch_model-00007-of-00015.bin', 'pytorch_model-00008-of-00015.bin', 'pytorch_model-000
```

Yet the files are there, so it seems it somehow cannot read them. Or, more likely, it misses a tensor. Have not seen that before. Unless somebody has an idea how to work around this, I could offer https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors instead
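
For reference, here is a minimal sketch of the kind of check that raises this error, assuming the converter validates every shard listed in pytorch_model.bin.index.json (the function and the size heuristic are illustrative, not the actual download script). A shard that is present on disk but truncated or unreadable would still be flagged:

```python
import json
import os

def check_shards(model_dir: str) -> None:
    # Hypothetical reconstruction of the failing check: every shard named
    # in the HF index must exist and be non-empty, so a file that exists
    # but is truncated (or unreadable) is still reported as missing.
    with open(os.path.join(model_dir, "pytorch_model.bin.index.json")) as f:
        index = json.load(f)
    shards = sorted(set(index["weight_map"].values()))
    missing_files = [
        s for s in shards
        if not os.path.isfile(os.path.join(model_dir, s))
        or os.path.getsize(os.path.join(model_dir, s)) == 0
    ]
    if missing_files:
        raise ValueError(f"Missing or incomplete model files: {missing_files}")
```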


Would be great; sounds like a corrupted index file (??) hmmm.

Thank you so much for this; this will really help - and get some good models too, to boot.

Seems to affect all the gpt4 airoboros models (I actually queued them a few days ago).

I will try without the index file; afaics, it's optional.
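
A sketch of why dropping the index can work, assuming the loader can enumerate the shards directly (illustrative, not the actual job-queue code); the index only maps tensor names to shard files, so it can be skipped or rebuilt:

```python
import glob
import os

def find_shards(model_dir: str) -> list[str]:
    # Fallback when pytorch_model.bin.index.json is absent or ignored:
    # glob the sharded checkpoint files themselves.
    return sorted(glob.glob(os.path.join(model_dir, "pytorch_model-*.bin")))
```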

Update: worked

Excellent.

FYI: Did a new quant of Multiverse-RP 7B (old Mistral) with a new imatrix.
Worked at IQ1_S right out of the gate with minimal settings (usually it breaks/throws out crap).
With augments - even better.

Full power (nuanced instruction following test) came online at IQ2_XXS.
85 t/s (IQ2_XXS) on my low-end 16 GB card; IQ1_S clocking close to 100 t/s.

To compare:
"old" 7B quants barely function at IQ1s even with augments (and sometimes break).
A newly quanted (using "old" imat .dat file) work much better.
But the grand prize => new quant + new imatrix => WOW.
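
For anyone wanting to reproduce the "new quant + new imatrix" path, here is a minimal sketch, assuming recent llama.cpp binary names (llama-imatrix / llama-quantize; older builds used ./imatrix and ./quantize) and a calibration.txt of your choosing:

```python
import subprocess

MODEL_F16 = "model-f16.gguf"  # fresh f16 GGUF conversion of the source model

# Step 1: compute a NEW imatrix with the current llama.cpp build,
# rather than reusing an "old" .dat file.
subprocess.run(["./llama-imatrix", "-m", MODEL_F16,
                "-f", "calibration.txt", "-o", "imatrix-new.dat"], check=True)

# Step 2: quantize with the fresh imatrix - the "grand prize" combination.
subprocess.run(["./llama-quantize", "--imatrix", "imatrix-new.dat",
                MODEL_F16, "model-IQ1_S.gguf", "IQ1_S"], check=True)
```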

Still lots of testing / tweaking to go... ;

imatrix quantisation must have been thoroughly broken before if this is true. @nicoboss -- could be an alternative explanation of the static iq3 results

I would say the "math" of quantization has significantly improved, or, better put, reached a level high enough for IQ1s (and some IQ2s) to reach a functional state.
This is likely mirrored/reinforced in the imatrix ".dat" creation step.

On higher "Qs" / "IQs" (vs the "old" ones) the uptick is there too, but it manifests as stronger instruction following.
On the output side, for "low Qs" the output is coherent or close to it, whereas with the "old" IQs it was barely there or broken.

Output for higher "Qs"/"IQs": there is an increase in the level of detail and nuance; that usually means "better transfer" from source / less rounding - "better math".

On a larger scale, it means smaller models can operate at lower BPW and get equal performance, with more VRAM free for context and higher T/S.
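
As a rough worked example of that trade-off (the BPW figures are approximate llama.cpp values and are my assumption: IQ1_S ~1.56, IQ2_XXS ~2.06, Q4_K_M ~4.85 bits per weight):

```python
# Back-of-the-envelope size math for the BPW/VRAM claim above.
def gguf_size_gb(params_billions: float, bpw: float) -> float:
    # bits per weight -> total bits -> bytes -> GB
    return params_billions * 1e9 * bpw / 8 / 1e9

for name, bpw in [("IQ1_S", 1.56), ("IQ2_XXS", 2.06), ("Q4_K_M", 4.85)]:
    print(f"70B @ {name}: ~{gguf_size_gb(70, bpw):.1f} GB")
# 70B @ IQ1_S:   ~13.7 GB  (fits a 16 GB card with room for context)
# 70B @ IQ2_XXS: ~18.0 GB
# 70B @ Q4_K_M:  ~42.4 GB
```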

On the plus side, this can mean that Q4s now operate at/around Q5 levels.
Q8 could be like a "Q9" if there was such a quant.

I have also noticed that the "newer" archs (Mistral Nemo, L3/L3.1/L3.2) are more sensitive to lower BPW quants than older archs.

With one notable exception: Gemma 2.
With Gemmas there seems to be a lot more processing going on (lower T/S vs. models with the same parameters/layers), but they are a lot stronger at lower BPW.
So far Gemma 2 9B models outperform all others at IQ1_S, with the exception of some 8x7B MoEs.

Update: See the full differences between "old" and "new" quants here (just regular quants; new imatrix to follow):

https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF

I kept the "V1" examples and added new "V2" ones (temp=0 for both); then pushed the throttle to the floor with V2-only examples at temp and dynamic temp.

V2 quants are uploading now, and NOT replacing V1 - both will be at this repo.

Going to use this as a showcase.

UPDATE: Did a new quant and new imatrix of an "old" 11B Solar model, "CarbonBeagle-11B-truthy".

Here is an IQ1_M generation @ 63 t/s - un-augmented AND with no "extra" stuff like DRY, smoothing, XTC, etc.:
(augmented versions are ++)
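
"Un-augmented" here means a bare run: plain temperature sampling, with DRY, XTC, and smoothing left at their off-by-default settings. A minimal sketch, assuming the current llama.cpp CLI binary name (llama-cli); the .gguf filename and temperature value are illustrative placeholders:

```python
import subprocess

# Bare, un-augmented run: no DRY, XTC, or smoothing flags are passed,
# so those samplers stay at their (off) defaults.
subprocess.run([
    "./llama-cli",
    "-m", "CarbonBeagle-11B-truthy.IQ1_M.gguf",  # hypothetical local file
    "--temp", "0.8",                             # illustrative value
    "-p", "Start a 1000 word scene (vivid, graphic horror in first person) "
          "with: The sky scraper swayed, as she watched the window in front "
          "of her on the 21 floor explode...",
], check=True)
```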

User

Start a 1000 word scene (vivid, graphic horror in first person) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode…

Assistant

(From the perspective of a survivor)

The crippling tremor, a prelude to the apocalypse, shook me as I watched the horrifying spectacle. I clenched my teeth, in vain to suppress my screams, as I felt the cold sweat on my skin. The shattering glass, the raining debris, and the screams piercing the air, all in unison. I watched, mesmerised, from my last moments of safety.

The sky scraper swayed, as I witnessed the explosion. I was trapped, among the last few survivors in this crumbling monument. Despair engulfed me as I realised my fate; I was to meet a slow, excruciating death.

I sprinted through the hallway, the rumbling of the debris falling heavily around me. I found solace in a small, dark corner, away from the imminent destruction. As I waited, cringing in my own misery, I listened to the distant screams.

The silence was broken by the shrill screams of my own. I felt the unbearable pain, as if my body were being ripped apart from the inside. My mind, still reeling in shock, questioned the reality of this nightmare. I endured, helpless, as I watched the world around me collapse.

As the inferno consumed my vision, I struggled to hear the distant voices, the screams, and the sounds of a dying city. My heart, in its last few beats, pleaded for mercy. As the skyscraper trembled, I cradled myself in the corner, curling into a fetal position. I was alone in the darkness.

And in that last moment, as the skyscraper collapsed, I embraced the inevitable end, my mind blank, my body numb, my last breaths. The skyscraper swayed, and as I watched the window in front of me explode…

That is actually very impressive for an IQ1_S of such a small model. (Catching up a bit currently with mail, as I have very little time atm.)

> On the plus side, this can mean that Q4s now operate at/around Q5 levels.

They have done that for quite a long time, for imatrix quants. It seems the problematic ones were the IQ* quants, because they are the most sensitive.

> Q8 could be like a "Q9" if there was such a quant.

If Q8 imatrix quants were a thing...

RE: IQ1_S; here is the first preview of "settings profiles" for IQ1_S 70Bs (includes settings profiles, output generation examples, etc.):

https://huggingface.co/DavidAU/Llama-3.3-70B-Instruct-How-To-Run-on-Low-BPW-IQ1_S-IQ1_M-at-maximum-speed-quality

RE: Q9; you can kinda "fudge it" by changing the output tensor to f16, but not the embed tensor (that seems to have a negative effect).
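
A sketch of that "fudge", assuming the --output-tensor-type / --token-embedding-type options of recent llama.cpp llama-quantize builds (availability depends on your build):

```python
import subprocess

# Re-quantize to Q8_0 but keep the output tensor at f16 - the "Q9" fudge.
# The embed tensor is deliberately left at the default, since upgrading it
# was reported above to have a negative effect (no --token-embedding-type).
subprocess.run([
    "./llama-quantize",
    "--output-tensor-type", "f16",
    "model-f16.gguf", "model-Q8_0-f16out.gguf", "Q8_0",
], check=True)
```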

The settings profiles push the quant to function at a higher level.
Still "tuning" issues to address to thread the needle.

RE: IQ1_S/IQ1_M quants

I can now get most model archs to function at this level, including "older quants" / "older archs".
In fact, older archs often work better than newer ones (but that might be a tuning issue, so that might change).
There is a strong rise in function from IQ1_S to IQ1_M, and another from IQ1_M to IQ2_XXS.
This tracks with other testing.

The new Llama-3.3-70B-Instruct runs incredibly well at IQ1_S.
One of the core issues with lower quants is instruction following - with Llama 3.3's STRONG instruction following, that core issue is addressed.
The other issue is output generation - which the settings profiles address.
