Error message when loading through HuggingFace Transformers/Langchain
First of all, thanks for all the work you are doing building these quantised models, it is greatly appreciated! I am trying to load this using a HuggingFacePipeline in Langchain, which ultimately just uses the transformers library. I get the following error:
OSError: TheBloke/wizardLM-7B-GPTQ does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
Is there something we need to do with the config files, model, or how we load it using transformers? I am trying to load this on a CUDA RTX GPU with 8GB of VRAM. The code to use as an LLM in Langchain is:
llm = HuggingFacePipeline.from_model_id(model_id="TheBloke/wizardLM-7B-GPTQ", task="text-generation", device=0, quantized=True, strict=False)
The above code will ultimately set up the model like this:
tokenizer = AutoTokenizer.from_pretrained("TheBloke/wizardLM-7B-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/wizardLM-7B-GPTQ")
I have had no luck loading any 7B 4bit model :(
Yeah I'm afraid you can't just load these models in normal Transformers code. They're quantised to use less VRAM, and therefore need special inference code.
You have two options:
- Don't load these quantised models, and instead load the original unquantised models in either fp16 or 8bit. This will use much more VRAM than 4bit quantised and will provide slower inference, but can be done immediately with at most one simple dependency.
You can load repos in HuggingFace format either in original 16bit, or in 8bit. You almost certainly want 8bit, as otherwise even a 7B model will use 13GB VRAM, and a 13B needs 26GB - and so can't be loaded in full on any consumer card. In 8bit those figures are halved.
8bit inference requires the bitsandbytes package to be installed: pip install bitsandbytes
I have unquantised versions of all the models I've uploaded. Their repo names end in -HF, eg TheBloke/wizardLM-7B-HF
To do 8bit inference, you simply add load_in_8bit=True to your model load line, eg:
tokenizer = AutoTokenizer.from_pretrained("TheBloke/wizardLM-7B-HF")
model = AutoModelForCausalLM.from_pretrained("TheBloke/wizardLM-7B-HF", load_in_8bit=True)
- The second option is to persevere with the 4bit quantised GPTQs, which will save on VRAM and therefore either provide faster inference, or allow you to load larger models. Doing inference on GPTQs requires some special code; it's not supported by the base Transformers library.
To date I have made all my GPTQs using code called GPTQ-for-LLaMa. Until recently this was the only method. Unfortunately, doing inference with this code is not straightforward. There is an example script in that repo that you can use, called llama_inference.py. So you could do tests with that and use its code as a base. But as you'll see, it's complicated and quite messy.
There should be a better solution: AutoGPTQ. AutoGPTQ is much newer but is developing fast and I'm hopeful that this will become the future of GPTQ.
As you'll see from the AutoGPTQ repo, its aim is to make using GPTQ models nearly as simple as using a standard Transformers model. It's named AutoGPTQ because it's intended to work alongside the Transformers auto methods like AutoModelForCausalLM.
Until a couple of days ago AutoGPTQ didn't immediately support loading models made with GPTQ-for-LLaMa. But there's just been a PR merged that should support it. It's so new I haven't even tried it out myself yet, so if you give it a go let me know!
Both GPTQ-for-LLaMa and AutoGPTQ can be run in one of two modes: Triton, or CUDA. Triton is recommended, but only works on Linux or in WSL2. CUDA requires compilation, meaning you may need to have a C/C++ compiler and the NVIDIA CUDA Toolkit installed. That said, the AutoGPTQ dev recently added a PyPI package (pip install auto-gptq), so it may be that he's providing already-compiled binaries now.
I recommend starting with the AutoGPTQ repo. It includes installation instructions and example inference code.
Let me know how you get on!
PS. Sorry, having written all that I completely forgot you're actually using LangChain.
So firstly there is no official support for GPTQ in LangChain as yet.
There should be support for 8bit, so that remains an option. A 7B model might just fit in 8GB in 8bit, though I don't know for sure.
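As a rough check on whether a model fits, here is a back-of-the-envelope sketch that estimates the memory taken by the weights alone. It assumes nominal parameter counts (7 and 13 billion) and ignores activations, the KV cache and CUDA overhead, so real usage will be higher; the function name is just illustrative.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare the nominal 7B and 13B sizes at 16, 8 and 4 bits per weight.
for size in (7, 13):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate the weights of a nominal 7B model in 8bit come to about 7 GB, which is why an 8GB card is a tight fit once runtime overhead is added.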
Does LangChain have an option to set up the model first, then pass it that model? If so you could likely initialise the model with AutoGPTQ and then pass it to LangChain afterwards, if it supports that?
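For what it's worth, LangChain's HuggingFacePipeline wrapper does accept a pre-built Transformers pipeline, so something along these lines may work. This is an untested sketch, not a confirmed recipe: build_gptq_llm is a made-up helper name, the from_quantized arguments mirror the call quoted elsewhere in this thread, and it assumes auto-gptq, transformers and langchain are installed with a CUDA GPU available.

```python
def build_gptq_llm(model_dir: str, use_triton: bool = False):
    """Load a GPTQ model with AutoGPTQ and wrap it for LangChain.

    Imports live inside the function so this file can still be imported
    on a machine where the GPU libraries are not installed.
    """
    from auto_gptq import AutoGPTQForCausalLM
    from langchain.llms import HuggingFacePipeline
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Assumption: model_dir is a local directory holding the quantised
    # weights in safetensors format plus a quantize_config.json.
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir,
        device="cuda:0",
        use_triton=use_triton,
        use_safetensors=True,
    )
    # Build the Transformers pipeline first, then hand it to LangChain.
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    return HuggingFacePipeline(pipeline=pipe)
```

Transformers may warn that the GPTQ model class is not in its list of supported text-generation models; that is a warning, and whether generation then works will depend on the library versions involved.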
Alternatively, I found this third party repo which says it has put GPTQ into LangChain for the purposes of making a chat bot: https://github.com/paolorechia/learn-langchain
I've not looked into how it does it exactly. My guess is it uses GPTQ-for-LLaMa, the original GPTQ code. But it'd definitely be worth a look and maybe you can use some of their code.
Or if neither of the above works for you, you could try making your own local edit of LangChain to incorporate AutoGPTQ, which shouldn't be too hard to do.
Good luck!
You can scrape my code to get it working with LangChain: https://github.com/cxfcxf/embeddings. It covers loading the model and using a vector store with qa_chain.
This model does seem better than vicuna-13b for use with embeddings.
❯ python embeddings.py --index-name state_of_the_union run --model-dir /home/siegfried/model-gptq --model-name wizardlm-7b-4bits --use-safetensors
INFO - Loading encoding model sentence-transformers/all-MiniLM-L6-v2...
INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO - Index already exists
INFO - Loading Tokenizer from /home/siegfried/model-gptq/wizardlm-7b-4bits...
INFO - Loading the model from /home/siegfried/model-gptq/wizardlm-7b-4bits...
INFO - Loading gptq quantized models...
WARNING - use_triton will force moving the hole model to GPU, make sure you have enough VRAM.
INFO - Found 3 unique KN Linear values.
INFO - Warming up autotune cache ...
100%|██████████| 12/12 [00:46<00:00, 3.84s/it]
INFO - creating transformer pipeline...
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO - creating chain...
INFO - Loading Q&A chain...
Running on local URL: http://127.0.0.1:7860
Thank you both for the thoughtful responses and help. I will look over both approaches.
@cxfcxf I looked at your repo and used it to adapt for my use case. Thanks for sharing. However I am still running into an error. FileNotFoundError: [Errno 2] No such file or directory: '/wizardLM-7B-GPTQ/quantize_config.json'.
This happens when I use this code: model = AutoGPTQForCausalLM.from_quantized(path_to_model, device="cuda:0", use_triton=True, use_safetensors=True), where path_to_model is the local model directory.
It seems it is looking for a special json file that stores some config information about the quantization. @TheBloke this is not in your repo here. Would one of you know where I might find this json, or where I can look up what needs to go into it so I can create it? I also looked at the auto_gptq repo, and I couldn't find any obvious reference to this json in their examples. @cxfcxf I am curious to know how you got the model to run without the aforementioned json, or, if you did use one, whether you might consider putting it in your repo?
Many Thanks!
Yup, you need this file because I'm using the AutoGPTQ class to load it.
❯ cat quantize_config.json
{
"bits": 4,
"damp_percent": 0.01,
"desc_act": true,
"group_size": 128
}
you can just place this file in
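If you'd rather create the file from Python, here is a minimal sketch. The values are copied from the quantize_config.json shown above; writing it into the local model directory is an assumption based on the path in the FileNotFoundError, and write_quantize_config is just an illustrative helper name.

```python
import json
from pathlib import Path

# Values copied from the quantize_config.json posted in this thread.
QUANTIZE_CONFIG = {
    "bits": 4,
    "damp_percent": 0.01,
    "desc_act": True,
    "group_size": 128,
}


def write_quantize_config(model_dir: str) -> Path:
    """Write quantize_config.json into model_dir and return its path."""
    target = Path(model_dir) / "quantize_config.json"
    target.write_text(json.dumps(QUANTIZE_CONFIG, indent=2) + "\n")
    return target


# Example call (hypothetical path, replace with your own model directory):
# write_quantize_config("/path/to/wizardLM-7B-GPTQ")
```

These settings describe how the weights were quantised, so they should match the model you downloaded rather than being chosen freely.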
Many Thanks!
@cxfcxf
How were you able to download the model? I can pull the files with git clone https://huggingface.co/TheBloke/wizardLM-7B-GPTQ, but when I run it I get this error: FileNotFoundError: Could not find model at /content/wizardLM-7B-GPTQ/gptq_model-4bit-128g.safetensors. I'm assuming I need to somehow combine the shards into one file?
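One thing worth checking before trying to combine anything: the file name in that error looks like it was built from the quantisation settings (4bit, group size 128), which may simply not match the name of the weights file in the cloned repo. A small sketch for listing what is actually in the directory; find_weight_basenames is a made-up helper, and the set of extensions is an assumption.

```python
from pathlib import Path
from typing import List


def find_weight_basenames(model_dir: str) -> List[str]:
    """Return the file stems of candidate weight files found in model_dir."""
    extensions = {".safetensors", ".pt", ".bin"}
    return sorted(
        path.stem
        for path in Path(model_dir).iterdir()
        if path.is_file() and path.suffix in extensions
    )


# Example call (path taken from the error message above):
# print(find_weight_basenames("/content/wizardLM-7B-GPTQ"))
```

If your installed AutoGPTQ version lets you pass the weights file's base name to from_quantized (something to confirm in its documentation), the stem printed here is the value to try; renaming the file to the expected name is another possible workaround.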
Thank you
If you're still working on this, I've added ooba support via the API to langchain. You can follow the example notebook here:
Or just use this code snippet:
from langchain.llms import TextGen
llm = TextGen(model_url="http://localhost:5000")