IQ2_S model?
The old 3-70B model just about fit into a 24GB video card's memory with IQ2_S quantization. Would you consider making one of roughly that size?
I just realized I missed all the IQ* quants!!! Once I finish the 405B model, I will add the missing IQ* quants :)
@kerrmetric we have the new IQ models back, could you please have a look?
Hey thanks for these -- I'm having weird issues loading them on my 24 GB GPU. Either way, I suspect we'll need to regenerate after the Llama.cpp tokenization and RoPE fixes?
Not re-generating, but just a simple meta update to handle context larger than 8k with better quality.
@MaziyarPanahi I don't think that's true because there's a new tensor that gets added with the larger context
Oh damn! Last time we talked in that discussion, they said it was OK to handle with just a metadata change.
It’s ok, I’ll build a new Llama.cpp tonight and re-run this. 😆(thanks @bartowski ❤️)
careful tho cause there's another chat template fix incoming, and you can't update that with set metadata :')
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/63/
Then let's wait until it's all settled :D
Hey folks - given that I've got two of the leading quantizers in this chat, I'd love your thoughts on this idea. Do you think this will be useful as a quick project:
It's proving really hard to tell when a new LLM (it's actually much worse with vision LLMs) is correctly implemented by the various backends, the most prominent of which is, of course, Llama.cpp.
To solve this, we need an easy way for a user to pick up a quantized version of a newly released LLM and compare its output to the de facto transformers implementation. Typically, folks do this by running a few queries and comparing the output ad hoc, but that is expensive, slow, and uncertain, especially with large models. Instead, what we need to develop as a community is a set of golden inputs that test various pieces of functionality of a large language model (see the rough sketch after the list below):
- Basic chat (both single-turn and multi-turn): Even today, people get prompting Gemma/Phi wrong given their specific sets of chat special tokens, etc.
- Tool use tokens (if available): I'm still not sure we've got these implemented right in Llama.cpp
- Very long inputs (5K, 10K, 50K, 100K, 200K, 500K tokens): Every model seems to do something a little innovative with RoPE
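To make that concrete, here's a rough sketch of what a golden-input set could look like as data (everything here -- the variable name, the category labels, the example contents -- is made up for illustration, not an existing spec):

```python
# Hypothetical golden-input set (all names, fields, and contents are illustrative).
# Each entry is a chat-formatted input; the reference first-token distribution for
# each entry would be published alongside it by whoever runs the BF16 model once.
GOLDEN_INPUTS = [
    {
        "id": "chat-single-turn-01",
        "category": "basic_chat",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
        ],
    },
    {
        "id": "tool-use-01",
        "category": "tool_use",
        "messages": [
            {"role": "user", "content": "What's the weather in Paris right now?"},
        ],
        # Tool schema the chat template should render into the prompt.
        "tools": [
            {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {"city": {"type": "string"}},
            },
        ],
    },
    {
        "id": "long-context-100k",
        "category": "long_context",
        # In practice this would be a ~100K-token document plus a question,
        # to exercise RoPE scaling beyond the base context window.
        "messages": [
            {"role": "user", "content": "<~100K-token document goes here>\n\nSummarize the above."},
        ],
    },
]
```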
Instead of comparing the full output of Llama.cpp with the original implementation, it'll be easier and more accurate to just measure the KL divergence between the two implementations' distributions for the first generated token. Given a good mix of golden inputs, we should see significant differences in the first-token distribution between a correct and an incorrect implementation (or incorrect use of special tokens, etc.)
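Something like this is all I have in mind for the check itself (a minimal sketch, assuming the BF16 transformers model is the ground truth; `first_token_kl` and the way you'd pull first-token logits out of the quantized llama.cpp run are placeholders of mine, not existing tooling):

```python
# Minimal sketch of the first-token KL check (illustrative only).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

def reference_first_token_logits(messages):
    """Logits over the vocab for the first generated token, from the BF16 reference."""
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: [1, seq_len, vocab]
    return logits[0, -1]                  # next-token (i.e. first generated token) logits

def first_token_kl(ref_logits, test_logits):
    """KL(reference || test) between the two first-token distributions."""
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    test_logprobs = F.log_softmax(test_logits, dim=-1)
    return F.kl_div(test_logprobs, ref_logprobs, log_target=True, reduction="sum").item()

# test_logits would come from running the same golden input through the quantized
# llama.cpp build; a correct implementation should keep this number small (on the
# order of normal quantization noise) across all golden inputs.
```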
The advantage of this is that only a single member of the community (or ideally, the model developer) needs to run the golden inputs on the full correct implementation. Everybody else can relatively quickly validate their implementation against the first token distribution (so they don't even need to run extended inferences). Over time, we'll develop a sense of how much you can expect a correct Q4 or Q8 quantization to diverge from the true BF16 model and quickly be able to pick up issues and tell when we've got the implementation exactly right.
I hypothesize ~35 or so golden inputs are all that will be needed (15 basic chat, ~10 tool use examples, ~10 long inputs), so the first token distribution "signature" for a given model will be fairly small. This isn't adversarial, so it's fine if the golden inputs are well known and standardized.