Quantized version possible?
Greetings,
I've been trying to quantize this down to 4 bits for the last day-ish, and it looks like the model loses touch with the tokens that are being output. (It outputs at a cadence that suggests it is predicting tokens, but the tokens are nonsense.) This is my first time trying to quantize a HuggingFace Transformers model. (I've done it in the past with the raw LLaMA model using llama.cpp.) Any pointers that I could use to figure out what's going wrong?
Thanks muchly for any advice,
-- Cypherfox
I've quantized it to 4bit-128g, and was able to use it in oobabooga/text-generation-webui without a problem.
Since it's a fast-moving space and everyone is doing different things, there are so many versions of GPTQ.
The trick is to make sure you run inference using the same version that you used to quantize.
I learned that the hard way with LLaMA after wasting time trying to quantize a bunch of times with different combinations of parameters.
Also make sure that whatever you're using for inference supports that particular GPTQ version. With a different combination of transformers versions, you might get an error as well.
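For example, one way to keep them matched is to pin both the quantization run and the inference setup to the same GPTQ-for-LLaMa checkout. A minimal sketch, assuming the commonly used qwopqwop200 repo, with the commit hash left as a placeholder for whatever you actually quantized with:
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa repositories/GPTQ-for-LLaMa
cd repositories/GPTQ-for-LLaMa
git rev-parse HEAD   # note this commit when you quantize
git checkout <same-commit>   # check out the exact same commit wherever you run inference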
Hope that helps and good luck!
Use ooba’s cmd script to open a command prompt and use the copy in text-generation-webui/repositories/gptq. Also, try without act-order; for me that messes things up, but otherwise it quantises fine.
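For reference, act-order is the --act-order flag on GPTQ-for-LLaMa's llama.py, so quantising without it just means leaving that flag off an otherwise normal run. A sketch, with the model and output paths as placeholders:
python repositories/GPTQ-for-LLaMa/llama.py <path-to-hf-model> c4 --wbits 4 --groupsize 128 --save_safetensors <output>.safetensors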
I'm doing it from the command-line locally, not in a colab, so I was having some other issues.
I think I must have been using the wrong GPTQ repo, as I seem to have three different ones on my system. I must have finally used the correct oobabooga one, but when I used the llama script to convert the post-xor model on my local system (3070Ti w/8GB), it ran out of GPU memory at layer 27. :( So I spun up a p3.2xlarge (V100 w/16GB) on EC2 and did all the downloading, xor'ing, and generating of the necessary HF files, and then did the quantization using the llama.py script. Downloaded that to my local system, and that worked!
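For anyone retracing the "generated the necessary HF files" step, a minimal sketch using the converter script that ships with transformers (assuming that's the one you need; the paths are placeholders for wherever your raw LLaMA weights live):
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir <raw-llama-dir> --model_size 7B --output_dir <hf-output-dir>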
Just for anyone who wants to know how, the command looked like this:
python repositories/GPTQ-for-LLaMa/llama.py models/pygmalion-7b c4 --wbits 4 --groupsize 128 --save_safetensors models/pygmalion-7b-4bits/pygmalion-7b-4bit-128g.safetensors
with the post-xor model in models/pygmalion-7b, and an empty directory in models/pygmalion-7b-4bits. After the quantized file was created, I downloaded that and all the associated (post-xor) *.json and the one tokenizer.model file from models/pygmalion-7b into models/pygmalion-7b-4bits on my local system, and launched oobabooga with:
python server.py --no-cache --threads 4 --chat --listen --model pygmalion-7b-4bits --wbits 4 --groupsize 128 --model_type llama
The iteration speed is pretty slow on my local system (around 0.66 tokens/s) but it works without running out of GPU memory, and the quality is actually really good so far.
Thanks very much for the encouragement and letting me know that it could work! That helped me figure out what I was doing wrong on my side.
-- Cypherfox
You've already closed the topic, but you could try quantising with --true-sequential to see if that speeds up inference. While you've got something running, though, definitely don't replace what you have. True sequential is meant to improve accuracy, but for me it felt like it ran faster too, though maybe that was just my imagination.
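A sketch of what that would look like, reusing your command from above with the extra flag (the separate output filename is just a placeholder so you don't overwrite the working file):
python repositories/GPTQ-for-LLaMa/llama.py models/pygmalion-7b c4 --wbits 4 --groupsize 128 --true-sequential --save_safetensors models/pygmalion-7b-4bits/pygmalion-7b-4bit-128g-ts.safetensors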