Inference Issues
I saw you uploaded a Marlin packed version shortly after me. Are you running this on vLLM by any chance?
I am having real inference issues. I tried your version as well and I have the same issues. FP16 works fine though.
Hey @qeternity yes I ran this in vLLM, it seemed to be reasonable but I haven't run proper evaluations on it yet.
It seems to work alright at very short contexts, but breaks beyond that (same with my version).
I should say I am running via SGLang (which uses vLLM) and I opened a PR for the prompt templating tonight, so I may have gotten something wrong there (but I don't think so given fp16 is fine).
Ok - this is an issue with the chat template in their tokenizer config being wrong.
The correct one is here: https://github.com/meta-llama/llama3/blob/92a325ec9925557b5fd64202c91024231a428c08/llama/test_tokenizer.py#L67
EDIT: nevermind, after inspecting the tokenizer, it seems the above comments are wrong.
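For anyone who wants to double-check their own copy, here's a minimal sketch of that inspection using the standard transformers API (the repo id is just an example, substitute your local path):

```python
from transformers import AutoTokenizer

# Load the tokenizer the same way vLLM/SGLang does (local path or HF repo id).
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The raw Jinja template shipped in tokenizer_config.json:
print(tok.chat_template)

# Render a trivial conversation to see exactly what the server will feed the model:
messages = [{"role": "user", "content": "Hello"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```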
@qeternity it might still be a tokenization issue; check out this fix that just landed last night in vLLM: https://github.com/vllm-project/vllm/pull/4182
Alright, so I have almost exactly the same issues with your version as I do with my own. I suspect we are quantizing the same way. I also tried a bf16 -> fp32 cast pre-quant, but that did not change anything.
Weirdly, if I download another GPTQ model and repack it with Marlin, everything works fine. The first 8B desc_act=False quant I found is this one: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ
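In case it helps anyone reproduce the repack, a rough sketch of one route via auto-gptq (the `use_marlin` kwarg exists in 0.7.x; whether `save_quantized` round-trips the Marlin packing is version-dependent, so treat that part as an assumption). Newer vLLM builds can also repack GPTQ checkpoints to Marlin at load time (gptq_marlin), which avoids the explicit conversion entirely:

```python
from auto_gptq import AutoGPTQForCausalLM

# Load an existing desc_act=False GPTQ checkpoint; use_marlin=True (auto-gptq
# >= 0.7) repacks the 4-bit weights into the Marlin kernel layout on load.
model = AutoGPTQForCausalLM.from_quantized(
    "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    device="cuda:0",
    use_marlin=True,
)

# Assumption: recent auto-gptq versions serialize the repacked (Marlin-format)
# weights here; older versions may write plain GPTQ instead.
model.save_quantized("Meta-Llama-3-8B-Instruct-Marlin")
```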
You're not having any issues with this? I'm just trying to figure out why my quant script is not working now.
@qeternity
Any news on which Marlin version works correctly?
I see you uploaded a new version on Apr 28.
Does the latest version fix the mentioned issue?
I think the new version I uploaded was simply to handle the ever-changing quant_config formatting.
But no, I was never able to get this issue fixed. The only quants that work are ones that do not use the chat template in the calibration data, which is obviously going to result in a worse quant (how much of a difference that makes is not clear to me). My quant and all of the others are, afaict, simply using wikitext to do the quant.
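For what it's worth, a minimal sketch of what calibrating through the chat template could look like with AutoGPTQ. The two conversations are hypothetical placeholders (a real run would want a few hundred chat-formatted samples), and I haven't verified that this actually closes the gap:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)

# Placeholder calibration conversations -- swap in real chat data.
conversations = [
    [{"role": "user", "content": "Explain GPTQ quantization in two sentences."}],
    [{"role": "user", "content": "Write a haiku about CUDA kernels."}],
]

examples = []
for msgs in conversations:
    # Render through the chat template so calibration sees the same special
    # tokens (<|start_header_id|> etc.) that inference traffic will contain.
    text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    enc = tok(text, return_tensors="pt")
    examples.append({"input_ids": enc.input_ids, "attention_mask": enc.attention_mask})

# Marlin-compatible settings: 4-bit, symmetric, group_size 128, desc_act=False.
quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False, sym=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)
model.quantize(examples)
model.save_quantized("Meta-Llama-3-8B-Instruct-GPTQ-chat")
```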