Inference Issues
I saw you uploaded a Marlin packed version shortly after me. Are you running this on vLLM by any chance?
I am having real inference issues. I tried your version as well and I have the same issues. FP16 works fine though.
Hey @qeternity yes I ran this in vLLM, it seemed to be reasonable but I haven't run proper evaluations on it yet.
It seems to work alright at very short contexts, but breaks beyond that (same with my version).
I should say I am running via SGLang (which uses vLLM) and I opened a PR for the prompt templating tonight, so I may have gotten something wrong there (but I don't think so given fp16 is fine).
Ok - this is an issue with the chat template in their tokenizer config being wrong.
The correct one is here: https://github.com/meta-llama/llama3/blob/92a325ec9925557b5fd64202c91024231a428c08/llama/test_tokenizer.py#L67
EDIT: nevermind, after inspecting the tokenizer, it seems the above comments are wrong.
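For anyone who wants to double-check their own copy, here's a minimal sketch of that inspection using the standard transformers API (the repo id is just an example, substitute your local path):

```python
from transformers import AutoTokenizer

# Load the tokenizer the same way vLLM/SGLang does (local path or HF repo id).
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# The raw Jinja template shipped in tokenizer_config.json:
print(tok.chat_template)

# Render a trivial conversation to see exactly what the server will feed the model:
messages = [{"role": "user", "content": "Hello"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```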
@qeternity it might still be a tokenization issue; check out this fix that just landed last night in vLLM: https://github.com/vllm-project/vllm/pull/4182
Alright, so I have almost exactly the same issues with your version as I do with my own. I suspect we are quantizing the same way. I also tried a bf16 -> fp32 cast pre-quant, but that did not change anything.
Weirdly, if I download another GPTQ model and repack it with Marlin, everything works fine. The first 8B desc_act=False quant I found is this one: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ
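In case it helps anyone reproduce the repack, a rough sketch of one route via auto-gptq (the `use_marlin` kwarg exists in 0.7.x; whether `save_quantized` round-trips the Marlin packing is version-dependent, so treat that part as an assumption). Newer vLLM builds can also repack GPTQ checkpoints to Marlin at load time (gptq_marlin), which avoids the explicit conversion entirely:

```python
from auto_gptq import AutoGPTQForCausalLM

# Load an existing desc_act=False GPTQ checkpoint; use_marlin=True (auto-gptq
# >= 0.7) repacks the 4-bit weights into the Marlin kernel layout on load.
model = AutoGPTQForCausalLM.from_quantized(
    "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    device="cuda:0",
    use_marlin=True,
)

# Assumption: recent auto-gptq versions serialize the repacked (Marlin-format)
# weights here; older versions may write plain GPTQ instead.
model.save_quantized("Meta-Llama-3-8B-Instruct-Marlin")
```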
You're not having any issues with this? I'm just trying to figure out why my quant script is not working now.
@qeternity
Any news on which Marlin version works correctly?
I see you uploaded a new version on Apr 28.
Does the latest version fix the mentioned issue?
I think the new version I uploaded was simply to handle the ever-changing quant_config formatting.
But no, I was never able to get this issue fixed. The only quants that work are ones that do not use the chat template in the calibration data, which is obviously going to result in a worse quant (how much of a difference that makes is not clear to me). My quant and all of the others are, afaict, simply using wikitext to do the quant.
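For what it's worth, a minimal sketch of what calibrating through the chat template could look like with AutoGPTQ. The two conversations are hypothetical placeholders (a real run would want a few hundred chat-formatted samples), and I haven't verified that this actually closes the gap:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)

# Placeholder calibration conversations -- swap in real chat data.
conversations = [
    [{"role": "user", "content": "Explain GPTQ quantization in two sentences."}],
    [{"role": "user", "content": "Write a haiku about CUDA kernels."}],
]

examples = []
for msgs in conversations:
    # Render through the chat template so calibration sees the same special
    # tokens (<|start_header_id|> etc.) that inference traffic will contain.
    text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    enc = tok(text, return_tensors="pt")
    examples.append({"input_ids": enc.input_ids, "attention_mask": enc.attention_mask})

# Marlin-compatible settings: 4-bit, symmetric, group_size 128, desc_act=False.
quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False, sym=True)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)
model.quantize(examples)
model.save_quantized("Meta-Llama-3-8B-Instruct-GPTQ-chat")
```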