Chat interface

#11
by rkj45 - opened

I am learning about Alpaca models. Can you please point me in the right direction on what to use to chat with the model on a GPU? Thank you.

Please take a look at the README for the two ways to run inference, including chat (option 2).
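For context, a minimal sketch of how the chat option is typically launched from the command line (paths taken from the logs later in this thread; the exact flags may differ between text-generation-webui versions, so check the README for your install):

cd /app/text-generation-webui
python server.py --model alpaca-30b-lora-int4 --wbits 4 --chat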

elinas changed discussion status to closed

Got this error

CUDA SETUP: CUDA runtime path found: /opt/miniconda/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
  File "/app/text-generation-webui/server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/app/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/app/text-generation-webui/modules/GPTQ_loader.py", line 111, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/app/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 228, in load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/opt/miniconda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros", "model.layers.0.mlp.down_proj.qzeros"

rkj45 changed discussion status to open

Please see https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-1-install-gptq-for-llama

There are breaking changes, so you should use commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 on the cuda branch.
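A sketch of pinning the repository to that commit, assuming GPTQ-for-LLaMa is already cloned under repositories/ as described on the wiki page linked above:

cd /app/text-generation-webui/repositories/GPTQ-for-LLaMa
git fetch                                            # make sure the commit is available locally
git checkout a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773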

Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

Currently, I can use 468c47c of qwopqwop200/GPTQ-for-LLaMa with the old alpaca-30b-4bit.pt. But which version can I use with the safetensors checkpoints? I tried the latest version of GPTQ and it didn't work. Which one should I use to load safetensors?

btw a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work, I got the same error.

Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

Yes it is, for the cuda branch. There is also a triton branch but I haven't messed with it.

btw a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 didn't work, I got the same error.

Run git log and ensure you're on the correct commit; it works fine for me. If you are and it still does not work, try reinstalling all of the requirements and running python setup_cuda.py install.
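Roughly, the verification and rebuild steps look like this (paths assumed from the logs in this thread; the requirements file is the web UI's own):

cd /app/text-generation-webui/repositories/GPTQ-for-LLaMa
git log -1 --format=%H                                   # should print a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773
pip install -r /app/text-generation-webui/requirements.txt   # reinstall the web UI requirements
python setup_cuda.py install                              # rebuild and install the CUDA kernel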

commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 31 00:31:06 2023 -0300

    Move model saving back to the end

Sadly no luck. I tried with Python 3.9 and 3.10 and started from scratch with the right requirements.

(textgen) root@9b843d1d1b8e:/app/text-generation-webui# python server.py --model alpaca-30b-lora-int4 --wbits 4
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading alpaca-30b-lora-int4...
Loading model ...
Traceback (most recent call last):
  File "/app/text-generation-webui/server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/app/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/app/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "/app/text-generation-webui/modules/GPTQ_loader.py", line 45, in _load_quant
    model.load_state_dict(torch.load(checkpoint))
  File "/opt/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.qzeros", "model.layers.0.self_attn.o_proj.qzeros", "model.layers.0.self_attn.q_proj.qzeros", "model.layers.0.self_attn.v_proj.qzeros"

(textgen) root@9b843d1d1b8e:/app/text-generation-webui/repositories/GPTQ-for-LLaMa# git log
commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (HEAD -> cuda-stable)
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 31 00:31:06 2023 -0300

Are you using the old .pt model or one of the new safetensors models? The former will not work unless you're on a pretty old commit.

(textgen) root@9b843d1d1b8e:/app/text-generation-webui/models/alpaca-30b-lora-int4# ls -alh
total 49G
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .
drwxr-xr-x 1 root root 4.0K Apr 3 16:42 ..
drwxr-xr-x 1 root root 4.0K Apr 3 18:10 .git
-rw-r--r-- 1 root root 1.5K Apr 3 16:42 .gitattributes
-rw-r--r-- 1 root root 11K Apr 3 16:42 README.md
-rw-r--r-- 1 root root 17G Apr 3 18:10 alpaca-30b-4bit-128g.safetensors
-rw-r--r-- 1 root root 16G Apr 3 18:07 alpaca-30b-4bit.pt
-rw-r--r-- 1 root root 16G Apr 3 18:04 alpaca-30b-4bit.safetensors
-rw-r--r-- 1 root root 426 Apr 3 16:42 config.json
-rw-r--r-- 1 root root 124 Apr 3 16:42 generation_config.json
-rw-r--r-- 1 root root 47K Apr 3 16:42 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 2 Apr 3 16:42 special_tokens_map.json
-rw-r--r-- 1 root root 489K Apr 3 16:42 tokenizer.model
-rw-r--r-- 1 root root 141 Apr 3 16:42 tokenizer_config.json

Keep only the one checkpoint that you plan to use in the directory.
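For example, assuming you intend to load the group-size-128 safetensors file, a rough sketch of tidying the directory and relaunching (the holding directory name is hypothetical, and --groupsize 128 only applies to the 128g checkpoint):

cd /app/text-generation-webui/models/alpaca-30b-lora-int4
mkdir -p ../unused-checkpoints                    # hypothetical holding directory for the spare files
mv alpaca-30b-4bit.pt alpaca-30b-4bit.safetensors ../unused-checkpoints/
cd /app/text-generation-webui
python server.py --model alpaca-30b-lora-int4 --wbits 4 --groupsize 128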

Hi, I think a6f363e is not a commit in qwopqwop200/GPTQ-for-LLaMa, right?

Yes it is, for the cuda branch. There is also a triton branch but I haven't messed with it.

I found the issue. a6f363e is on oobabooga/GPTQ-for-LLaMa, not on qwopqwop200's original repo.
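For anyone hitting the same confusion, a sketch of re-cloning from that fork at the recommended commit (URL, branch, and commit taken from the discussion above):

cd /app/text-generation-webui/repositories
rm -rf GPTQ-for-LLaMa                             # remove the clone of the wrong repository
git clone -b cuda https://github.com/oobabooga/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
git checkout a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773
python setup_cuda.py install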

That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
https://prnt.sc/a-NmywcPdKAa

That worked, thank you! Now I am facing another issue: there seems to be extra text in every response. Do you know what it could be?
https://prnt.sc/a-NmywcPdKAa

Haha, I get this too. It seems to go away if I try the "example" character card, so I think it may be the default parameters.

The model card seems to list some preferred parameters ^^

elinas changed discussion status to closed
