8-bit quantization is not working with this model on the latest oobabooga

#1
by Stilgar - opened

gptq-8bit-32g raised the following exception:
site-packages\exllama\cuda_ext.py", line 33, in ext_make_q4
return make_q4(qweight,
RuntimeError: qweight and qzeros have incompatible shapes

I'm using the "ExLlama_HF" loader; AutoGPTQ is no better.

I'm not clear on what it's trying to do in "ext_make_q4".

gptq-4bit-32g works without issue.

oobabooga is running in its own virtual env and matches the latest requirements.

ExLlama doesn't support 8-bit GPTQ, and AutoGPTQ doesn't currently support Mistral GPTQ.

Please try using Transformers as the Loader and see if that works. I've not personally tested it in Oobabooga, but I know Transformers works from Python code.
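For reference, here is a rough sketch of loading a GPTQ branch with Transformers from Python. The repo id and branch name below are placeholders for whichever model and 8-bit branch you downloaded, and depending on your Transformers version you may also need optimum and auto-gptq installed for the GPTQ kernels:

# Rough sketch, not tested in Oobabooga: load a GPTQ branch with Transformers.
# model_id and revision are placeholders for the actual repo and 8-bit branch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Some-Model-GPTQ"  # placeholder repo id
revision = "gptq-8bit-32g"             # placeholder branch name

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    device_map="auto",  # needs accelerate; places layers on the GPU
)

prompt = "Write a short story about a desert planet."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))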

It works with Transformers, but it's slow (3 to 4 tokens/s); the GGUF release with 6-bit quantization and llama.cpp runs at 8 tokens/s, and the 4-bit GPTQ with ExLlama_HF reaches 45 to 48 tokens/s.
I'm using a 4090, and the last case uses 11 GB on the GPU.
I cannot see a quality difference between the models, at least for storytelling, but I did not test for long.
Thank you for your work and your reply.

Stilgar changed discussion status to closed
