help me with a question

by RedXeol - opened

Hello, I love your work. I would like to ask whether you use AutoGPTQ, and what hardware I would need to be able to use it.
My goal is to create a GPTQ 4-bit 128g version of a GPT-J 6B model so I can use it in oobabooga.

Do you think that with my current PC configuration I can achieve something like this?
RAM: 32 GB
CPU: Intel i7-10700
GPU: NVIDIA RTX 3060 12 GB

If you know of any code that can make my life easier, I would appreciate it.

I do not currently use AutoGPTQ to make these models, because before I do I want to evaluate which dataset is best to use, and compare the results with GPTQ-for-LLaMa.

But yes, I am using AutoGPTQ very regularly now for testing GPTQ inference. And I am trying to help make AutoGPTQ the new standard for GPTQ, replacing GPTQ-for-LLaMa. You will see I am posting quite a lot in https://github.com/PanQiWei/AutoGPTQ at the moment.

Regarding GPT-J: this is something I've not looked at yet. I've not quantised any GPT-J models because GPTQ-for-LLaMa doesn't support them well. AutoGPTQ should work for this, I've just not tested it yet.

I think your system will be fine for both quantising and inference of a 6B model.
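As a rough back-of-envelope sketch (an estimate only; it ignores activation and framework overhead, which add a few more GB): the fp16 weights need to fit in CPU RAM for quantising, and the 4-bit weights in VRAM for inference.

params = 6e9  # approximate parameter count of GPT-J 6B

fp16_gb = params * 2 / 1024**3    # ~11.2 GB of CPU RAM to hold the fp16 model
int4_gb = params * 0.5 / 1024**3  # ~2.8 GB of VRAM for the 4-bit weights (plus group scales)
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")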

I will warn you that AutoGPTQ is still in quite an early state and there are bugs and issues at the moment. For example the example code in the README (which quantises an OPT model) currently produces bad output :) So you might need to wait a few more days for it to be stable.

But give it a go and see what happens and let me know if you encounter problems. And if you've found a bug, post it in https://github.com/PanQiWei/AutoGPTQ/issues

Thank you very much, you have given me encouragement; I thought it was impossible with my current PC. I will use AutoGPTQ and do my best to achieve my goal, and I will let you know if I succeed or run into an error. I hope it is not a bother for you. Thank you very much, I really admire you.

Sorry for the inconvenience, but do you know how, in the AutoGPTQ example file, I tell it that I need a version compatible with oobabooga, that is, compat.no-act-order.safetensors? I mean these lines:

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

So firstly, compat.no-act-order is just my own naming convention: 'compat' to indicate it's the most compatible, and 'no-act-order' to indicate it doesn't use the --act-order feature.

Act-order has been renamed desc_act in AutoGPTQ. So if you generate a model without desc_act, it should in theory be compatible with older GPTQ-for-LLaMa code. But I've not yet tested this.

The default is not to use desc_act, so you should be fine anyway.

Whether to use it or not is specified in the quantization configuration:

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples, use_triton=False)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

We specified desc_act=False so it won't be used. But desc_act=False is the default, so it also wouldn't be used even if we hadn't specifically added that to BaseQuantizeConfig().
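As a quick sanity check, a minimal sketch (assuming BaseQuantizeConfig exposes desc_act as an attribute, as described above):

from auto_gptq import BaseQuantizeConfig

cfg = BaseQuantizeConfig(bits=4, group_size=128)  # desc_act not passed
print(cfg.desc_act)  # expected: False, the default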

Note that there are currently bugs in quantisation. I tested OPT quantisation earlier and the result was unusable. That's being tracked in this bug: https://github.com/PanQiWei/AutoGPTQ/issues/52

That may be specific to OPT and maybe it works on other models. But I'm not sure yet.

By the way, do check out the example scripts. There's one called quant_with_alpaca that uses the Alpaca dataset as the quantisation examples. The dataset used may improve the quantisation quality. When I quantise with GPTQ-for-LLaMa I currently use c4 as the quantisation dataset.
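For illustration, here is a minimal sketch of building the examples list from a dataset rather than hard-coded strings (it assumes the Hugging Face datasets library and the tatsu-lab/alpaca dataset, and is not the actual quant_with_alpaca script):

from datasets import load_dataset
from transformers import AutoTokenizer

pretrained_model_dir = "EleutherAI/gpt-j-6b"  # hypothetical: your local path or Hub model id

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# take a small sample of instruction/output pairs as calibration examples
examples = [
    tokenizer(row["instruction"] + "\n" + row["output"])
    for row in dataset.select(range(128))
]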

I did it, thank you very much for your guide... the tool is very good, it did not fail me with a GPT-J 6B model.
First I had to uninstall CUDA since it was not compatible, and I also deleted all traces of torch on my PC.
Then I installed CUDA 11.8.
Then I installed a torch build compatible with this version: pip install torch==2.0.0+cu118 torchvision -f https://download.pytorch.org/whl/cu118/torch_stable.html
Then I downloaded the model locally.
And finally I modified the basic_usage.py code like this:
import os

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "A:/LLMs_LOCAL/caosgpt_j_6B_alpaca/"
quantized_model_dir = "caosgpt-j-6B-alpaca-4bit-128g"

os.makedirs(quantized_model_dir, exist_ok=True)

def main():
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    examples = [
        tokenizer("auto-gptq is an easy-to-use model quantification library with user-friendly APIs, based on the GPTQ algorithm."),
        tokenizer("Artificial intelligence has advanced significantly in recent years."),
        tokenizer("Model quantization can reduce model size and improve model efficiency."),
        tokenizer("Quantization algorithms can reduce the amount of memory and power required."),
        tokenizer("Deep learning is used in a variety of applications, from medicine to marketing."),
        tokenizer("The GPT-4 architecture is the foundation of many next-generation language models."),
        tokenizer("Natural language processing allows machines to understand and communicate in human languages."),
        tokenizer("Convolutional neural networks are used in computer vision."),
        tokenizer("Optimization algorithms are fundamental for training deep learning models."),
        tokenizer("Reinforcement learning is a machine learning technique in which agents learn through interaction with their environment.")
    ]

    quantize_config = BaseQuantizeConfig(
        bits=4,          # quantize the model to 4 bits
        group_size=128,  # it is recommended to set the value to 128
        desc_act=False
    )

    # load the un-quantized model; by default it is loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    # quantize the model; examples must be a list of dicts whose keys are
    # "input_ids" and "attention_mask", with values of type torch.LongTensor
    model.quantize(examples, use_triton=False)

    # save the quantized model
    model.save_quantized(quantized_model_dir)

    # save the quantized model using safetensors
    model.save_quantized(quantized_model_dir, use_safetensors=True)

    # load the quantized model; currently only CPU or a single GPU is supported
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False)

    # inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to("cuda:0"))[0]))

    # or you can also use the pipeline
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
    print(pipeline("auto-gptq is")[0]["generated_text"])

if name == "main":
Import registration

  record.basicConfig(
      format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M: %S"
  )

  major()

Finally I added the tokenizer files back and it works in oobabooga_windows.
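(For reference, a minimal sketch of one way to bundle the tokenizer with the quantised output so oobabooga can load everything from one folder; this is an assumption, not necessarily the exact step taken here.)

# hypothetical: save the tokenizer files into the quantized model directory
tokenizer.save_pretrained(quantized_model_dir)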

Output generated in 3.59 seconds (1.39 tokens/s, 5 tokens, context 25, seed 570147804)
Output generated in 10.13 seconds (9.08 tokens/s, 92 tokens, context 59, seed 771109588)
Output generated in 11.44 seconds (9.09 tokens/s, 104 tokens, context 59, seed 1814752661)
Output generated in 23.26 seconds (8.55 tokens/s, 199 tokens, context 197, seed 873787634)
Output generated in 8.74 seconds (3.89 tokens/s, 34 tokens, context 423, seed 1563385550)
Output generated in 11.40 seconds (4.91 tokens/s, 56 tokens, context 497, seed 390511124)
Output generated in 10.70 seconds (2.99 tokens/s, 32 tokens, context 641, seed 1136593747)
Output generated in 9.73 seconds (1.44 tokens/s, 14 tokens, context 702, seed 1881813284)
So far it works fine, I'll keep trying it... thank you very much

Great to hear!
