Chat interface
I am learning about Alpaca models. Can you please point me in the right direction on what to use to chat with the model using the GPU? Thank you.
Please take a look at the README for the two ways to run inference, including chat (option 2).
Got this error
CUDA SETUP: CUDA runtime path found: /opt/miniconda/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
File "/app/text-generation-webui/server.py", line 276, in
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/app/text-generation-webui/modules/models.py", line 102, in load_model
model = load_quantized(model_name)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 111, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
File "/app/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 228, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "/opt/miniconda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.mlp.down_proj.qzeros"
Please see https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-1-install-gptq-for-llama
There are breaking changes and you should use commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 in the cuda branch.
added an update here https://huggingface.co/elinas/alpaca-30b-lora-int4#update-2023-04-03
Hi, I think a6f36e3 is not a reference to qwopqwop200/GPTQ-for-LLaMa, right? Currently, I can use 468c47c of qwopqwop200/GPTQ-for-LLaMa with the old alpaca-30b-4bit.pt. But what version can I use with the safetensors checkpoints? I tried the latest version of GPTQ-for-LLaMa and it didn't work. Which one should I use to load safetensors?
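One way to tell which GPTQ format a checkpoint uses is to list its tensor names: the "Missing key(s) ... qzeros" error above means the loader expects qzeros tensors that the file doesn't contain. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON table of tensor names, so the keys can be read with the standard library alone. This is a hedged sketch: a tiny hand-built file stands in for the real 16 GB checkpoint, and the key name is taken from the traceback above.

```python
# Sketch: read the key table of a .safetensors file without loading the tensors.
import json, struct, tempfile, os

def safetensors_keys(path):
    """Return tensor names from a .safetensors header (8-byte LE length + JSON)."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(n))
    return [k for k in header if k != "__metadata__"]

# Build a stand-in checkpoint containing one qzeros entry; only the header
# matters for listing keys, the tensor data is a 4-byte placeholder.
header = {"model.layers.0.self_attn.q_proj.qzeros":
          {"dtype": "I32", "shape": [1, 1], "data_offsets": [0, 4]}}
blob = json.dumps(header).encode()
fd, path = tempfile.mkstemp(suffix=".safetensors")
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00\x00\x00\x00")

keys = safetensors_keys(path)
print(any(k.endswith(".qzeros") for k in keys))  # True -> new-format checkpoint
```

If the real checkpoint's keys contain no `.qzeros` entries, it was produced by an older GPTQ-for-LLaMa and needs a correspondingly old commit to load.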
By the way, a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work; I got the same error.
Hi, I think a6f36e3 is not a reference to qwopqwop200/GPTQ-for-LLaMa, right?
Yes it is, for the cuda branch. There is also a triton branch, but I haven't messed with it.
By the way, a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work; I got the same error.
Do git log and ensure you're on the correct commit. It works fine for me. If you are and it still does not work, try re-installing all of the requirements and run python setup_cuda.py install.
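The check above can be sketched as a script: pin the checkout to a known commit and verify it with `git rev-parse HEAD` before rebuilding the kernel. This is a hedged sketch demonstrated on a throwaway repo so the verification step itself is runnable; in the real repositories/GPTQ-for-LLaMa checkout the expected hash would be a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773.

```shell
# Sketch: pin a checkout to a specific commit and verify it before building.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=you@example.com -c user.name=you commit -q --allow-empty -m "pinned"
expected=$(git rev-parse HEAD)   # stand-in for a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773
git -c user.email=you@example.com -c user.name=you commit -q --allow-empty -m "newer"
git checkout -q "$expected"      # pin back to the known-good commit (detached HEAD)
actual=$(git rev-parse HEAD)
[ "$actual" = "$expected" ] && echo "on expected commit: $actual"
```

After the hash check passes in the real repo, re-run `pip install -r requirements.txt` and `python setup_cuda.py install` so the compiled kernel matches that commit.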
commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date: Fri Mar 31 00:31:06 2023 -0300
Move model saving back to the end
Sadly no luck. I tried with Python 3.9 and 3.10, and started from scratch with the right requirements.
(textgen) root@9b843d1d1b8e:/app/text-generation-webui# python server.py --model alpaca-30b-lora-int4 --wbits 4
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
File "/app/text-generation-webui/server.py", line 276, in
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/app/text-generation-webui/modules/models.py", line 102, in load_model
model = load_quantized(model_name)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
File "/app/text-generation-webui/modules/GPTQ_loader.py", line 45, in _load_quant
model.load_state_dict(torch.load(checkpoint))
File "/opt/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros"
(textgen) root@9b843d1d1b8e:/app/text-generation-webui/repositories/GPTQ-for-LLaMa# git log
commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga 112222186+oobabooga@users.noreply.github.com
Date: Fri Mar 31 00:31:06 2023 -0300
Are you using the old .pt model or one of the new safetensors models? The former will not work unless you're on a pretty old commit.
(textgen) root@9b843d1d1b8e:/app/text-generation-webui/models/alpaca-30b-lora-int4# ls -alh
total 49G
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .
drwxr-xr-x 1 root root 4.0K Apr 3 16:42 ..
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .git
-rw-r--r-- 1 root root 1.5K Apr 3 16:42 .gitattributes
-rw-r--r-- 1 root root 11K Apr 3 16:42 README.md
-rw-r--r-- 1 root root 17G Apr 3 18:10 alpaca-30b-4bit-128g.safetensors
-rw-r--r-- 1 root root 16G Apr 3 18:07 alpaca-30b-4bit.pt
-rw-r--r-- 1 root root 16G Apr 3 18:04 alpaca-30b-4bit.safetensors
-rw-r--r-- 1 root root 426 Apr 3 16:42 config.json
-rw-r--r-- 1 root root 124 Apr 3 16:42 generation_config.json
-rw-r--r-- 1 root root 47K Apr 3 16:42 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 2 Apr 3 16:42 special_tokens_map.json
-rw-r--r-- 1 root root 489K Apr 3 16:42 tokenizer.model
-rw-r--r-- 1 root root 141 Apr 3 16:42 tokenizer_config.json
Only keep one checkpoint in your directory: the one you plan to use.
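The reason this matters can be sketched as follows: the webui's GPTQ loader locates the checkpoint by globbing the model directory (the exact glob logic here is paraphrased, not quoted from the source), so with several .pt/.safetensors files present, as in the listing above, it can pick one you did not intend. A minimal reproduction of the ambiguity:

```python
# Sketch: recreate the model directory from this thread and show that a
# checkpoint glob matches three different files.
import tempfile
from pathlib import Path

d = Path(tempfile.mkdtemp())
for name in ["alpaca-30b-4bit-128g.safetensors",
             "alpaca-30b-4bit.pt",
             "alpaca-30b-4bit.safetensors"]:
    (d / name).touch()  # empty stand-ins for the multi-GB files

candidates = sorted(d.glob("*.pt")) + sorted(d.glob("*.safetensors"))
print([p.name for p in candidates])  # three matches -> ambiguous pick
```

With only one matching file in the directory, the glob is unambiguous and the loader cannot grab a checkpoint in the wrong format.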
Hi, I think a6f36e3 is not a reference to qwopqwop200/GPTQ-for-LLaMa, right?
Yes it is, for the cuda branch. There is also a triton branch, but I haven't messed with it.
I found the issue. a6f36e3 is on oobabooga/GPTQ-for-LLaMa and not on qwopqwop200's original repo.
That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
https://prnt.sc/a-NmywcPdKAa
That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
https://prnt.sc/a-NmywcPdKAa
Haha, I get this too. It seems to go away if I try the "example" character card, so I think it may be the default parameters. The model card seems to have some preferred params^^
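One common source of such "extra text" with Alpaca-style models is that the model keeps generating past its answer into the next prompt-template block, so trimming the output at the first template marker is a typical workaround. This is a hedged sketch: the `### Instruction:` / `### Input:` markers below are the standard Alpaca prompt template, assumed here rather than confirmed by the screenshot in this thread.

```python
# Sketch: cut a generated response at the first occurrence of a stop string.
def trim_response(text, stop_strings=("### Instruction:", "### Input:")):
    """Return text truncated before the earliest stop string, if any appears."""
    cut = len(text)
    for s in stop_strings:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut].rstrip()

raw = "Sure, here is the answer.\n### Instruction:\nWrite a poem"
print(trim_response(raw))  # -> "Sure, here is the answer."
```

The webui's character cards and custom stopping strings do the equivalent trimming, which would explain why the "example" character makes the problem go away.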