
No imatrix and/or no llama-3 support

#1
by mradermacher - opened

These quants have been made without an imatrix (they lack the headers), or using a very outdated llama.cpp version without support for llama 3.

You have to use a current llama.cpp version (no more than a few days old) together with convert-hf-to-gguf.py; otherwise quality is severely reduced, even though the result might superficially seem to work.

If llama.cpp was not too old, it's possible to work around this by using --override-kv tokenizer.ggml.pre=str:llama3 when loading them into a current version of llama.cpp.
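As a rough illustration of that workaround, here is a small sketch that assembles such a command line. Only the --override-kv argument comes from the text above; the binary name ./main, the -m model flag, and the helper name llama_cmd_with_pre_override are assumptions for illustration and may differ in your build.

```python
import shlex


def llama_cmd_with_pre_override(model_path, binary="./main", pre="llama3"):
    """Build an argument list that forces the pretokenizer at load time.

    Assumptions: `binary` is whatever your llama.cpp build calls its CLI,
    and `-m` selects the model file. Only the --override-kv value is taken
    from the discussion above.
    """
    return [
        binary,
        "-m", model_path,
        "--override-kv", f"tokenizer.ggml.pre=str:{pre}",
    ]


# Print a copy-pasteable command for a hypothetical file name.
print(shlex.join(llama_cmd_with_pre_override("model.gguf")))
```

This only patches the value at load time; the file on disk still lacks the field.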

I don't know whether the Faraday app can work around this; if not, it's better to redo the quants.

Backyard AI org

These quants have been made without an imatrix (they lack the headers), or using a very outdated llama.cpp version without support for llama 3.

I supply both kinds. Quants whose name contains ".IQ" are quantized with an imatrix; those whose name contains ".Q" or ".F16" are quantized without. All were made with a llama.cpp built from the most recent sources available at the time. In this case, it was:

version: 2781 (6ecf3189)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Any reason why you don't use the imatrix for the .Q quants? Especially the smaller ones would benefit enormously from an imatrix.

In any case, these quants are broken because they were done with the wrong converter that does not support llama-3, as I have explained.

Backyard AI org

Have you tried them? They are working GGUFs without the tokenizer issue. I’ve used the pre-fix broken quants and these are not the same.

There is no such thing as a fix to the tokenizer. If you doctored around with the model, you should clearly mark that you didn't quantize the original model but something else, and you can't fix the pretokenizer by doctoring around with the tokenizer. You can make hacks around the end tokens, but that has nothing to do with the pretokenizer.

For the pretokenizer to work correctly, it must be set in the GGUF, but in these files it is not.

In any case, I give up now. I am only the messenger, and you are just making up stuff (e.g. that Q2_K does not benefit from an imatrix, or that the wrong tokenizer can be fixed with a tokenizer change). The fact is that these quants are doubly low quality because they used the wrong converter AND failed to apply an imatrix to most quants. There is a real opportunity to learn something here instead of stubbornly defending a mistake.

Backyard AI org

Pre-fix as in the week where you could quantize a llama 3 model but the BPE pre-tokenizer wasn’t working correctly.

Our quants that are public are all made with a version of llama.cpp from after the pull request was merged.

Nobody said anything about doctoring the tokenizer.

Sigh. How can you be so stubborn: the BPE pretokenizer in your GGUFs is not set. Just have a look at your files! It should be set to llama3, but the field is missing, causing llama.cpp to fall back to 'default' (i.e. llama 2). That's why they are broken. They are not using the fix, regardless of the llama.cpp version.
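For anyone who wants to look at the files themselves, here is a minimal sketch that reads the key/value header of a GGUF file and reports the tokenizer.ggml.pre field. It assumes the little-endian GGUF v2/v3 layout (magic, version, tensor count, KV count, then typed key/value pairs); the helper names read_gguf_metadata, metadata_from_bytes and pretokenizer are made up for this example and are not part of llama.cpp.

```python
import io
import struct

# GGUF scalar value types mapped to struct codes (little-endian assumed).
_SCALARS = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
            6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format code."""
    size = struct.calcsize("<" + fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack("<" + fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "I")
        count = _read(f, "Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return the metadata key/value pairs from a binary file object."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "I")
    if version < 2:
        raise ValueError("GGUF v1 uses a different header layout")
    _read(f, "Q")             # tensor count, not needed here
    kv_count = _read(f, "Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "I")
        meta[key] = _read_value(f, vtype)
    return meta


def metadata_from_bytes(data):
    """Convenience wrapper for in-memory data."""
    return read_gguf_metadata(io.BytesIO(data))


def pretokenizer(meta):
    """Return the tokenizer.ggml.pre value, or None if the field is absent."""
    return meta.get("tokenizer.ggml.pre")
```

Opening a file in binary mode and passing it to read_gguf_metadata, then calling pretokenizer on the result, should give "llama3" for a correctly converted llama 3 model and None when the field was never written. Note that the tokenizer vocabulary is stored as arrays in the same header, so this reads a few megabytes before it finishes.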

The fix is not to use a newer version of llama.cpp with the wrong tool. The fix is to use a newer llama.cpp with the correct tool.

Anyway, I'll disengage here. If you want to fix it, I gave you all the info you need. If something is unclear, you can ask.

Backyard AI org

@mradermacher

So I've just had a look at NeverSleep's GGUFs (which you claim are done "correctly") and they are also done without using an imatrix. So can you please stop causing drama? Thank you.

I never claimed otherwise. How about addressing my claims instead of spreading lies about me?

Please note that I wasn't the one who started this drama by attacking me personally for spilling the facts.

Since your continued refusal to address the substance of my criticism (instead attacking my person) makes it clear you have no defense other than that, an apology would be in order, but I have little hope for honesty from your side at this point.

PS: I am not just claiming they are done correctly, I also explained exactly why, and what's wrong with yours. Funny how you selectively ignore the actual criticism. Again, the wrong pretokenizer and not using the imatrix on low-bpp quants both reduce quality, considerably so. Very easy to verify. Why don't you?
