FastChat Load Error

#4
by littledot - opened

python3 -m fastchat.serve.cli --model-path /home/littledot/.cache/modelscope/hub/qwen/Qwen-14B-Chat-Int4

Error info:

littledot@aiserver:~$ python3 -m fastchat.serve.cli --model-path /home/littledot/.cache/modelscope/hub/qwen/Qwen-14B-Chat-Int4
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 3.52it/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/cli.py", line 283, in
main(args)
File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/cli.py", line 208, in main
chat_loop(
File "/usr/local/lib/python3.10/dist-packages/fastchat/serve/inference.py", line 311, in chat_loop
model, tokenizer = load_model(
File "/usr/local/lib/python3.10/dist-packages/fastchat/model/model_adapter.py", line 288, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/usr/local/lib/python3.10/dist-packages/fastchat/model/model_adapter.py", line 1368, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/home/littledot/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/home/littledot/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3250, in from_pretrained
model = quantizer.post_init_model(model)
File "/home/littledot/.local/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 482, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object
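
(For reference, the "quantization config object" the error refers to is transformers' GPTQConfig. Below is a minimal sketch of the flag the message suggests, only relevant if you deliberately want CPU/disk offload; note that newer transformers releases renamed the parameter to use_exllama=False.)

from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "/home/littledot/.cache/modelscope/hub/qwen/Qwen-14B-Chat-Int4",
    quantization_config=GPTQConfig(bits=4, disable_exllama=True),  # the flag the error suggests
    trust_remote_code=True,
)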

Qwen org

Please check this issue https://github.com/QwenLM/Qwen/issues/411 and see if the answers would help. (in short, int4 models are not supported on CPUs)

I loaded the model on a GPU; the card is a 2080 Ti with 22 GB of VRAM.
This error message appeared while it was running.

The error message says that some modules are on the CPU or disk. There may be a configuration issue (the model was not loaded onto the GPU) or an environment issue (CUDA not found). I suggest you check those first.

Those are controlled by FastChat, and you could also try seeking help from them. The related file is here: https://github.com/lm-sys/FastChat/blob/0c37d989df96cd67464cfbb21fdbebe1bc59022a/fastchat/model/model_adapter.py#L376
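
If it helps, here is a quick environment sanity check (a minimal sketch, assuming PyTorch is installed, which FastChat requires) to confirm that CUDA is visible from Python before digging into FastChat itself:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes of free/total VRAM on device 0
    print(f"Free VRAM: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")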

After some searching, I think FastChat does not support GPTQ yet (not sure). Qwen-14B-Chat-Int4 is quantized using AutoGPTQ.
https://github.com/lm-sys/FastChat/issues/1671
https://github.com/lm-sys/FastChat/issues/1745
https://github.com/lm-sys/FastChat/issues/2215
https://github.com/lm-sys/FastChat/issues/2375

There is also a PR there that seems to describe the same problem as yours.
https://github.com/lm-sys/FastChat/pull/2365

I think you need to make changes to the FastChat code. I just debugged it: you need to change low_cpu_mem_usage=True to device_map="auto" in the Qwen adapter inside model_adapter.py. We'll send a PR to the official repo, but for now I suggest you make the change yourself.
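
A minimal sketch of what that edit amounts to (the helper name and surrounding code here are illustrative rather than FastChat's actual adapter; only the low_cpu_mem_usage=True to device_map="auto" swap comes from the suggestion above, and device_map="auto" requires accelerate to be installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_qwen_gptq(model_path: str, from_pretrained_kwargs: dict):
    # Illustrative stand-in for the Qwen adapter's load_model in model_adapter.py.
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        # was: low_cpu_mem_usage=True
        device_map="auto",       # let accelerate place every quantized module on the GPU
        trust_remote_code=True,  # Qwen checkpoints ship custom modeling code
        **from_pretrained_kwargs,
    )
    return model, tokenizer

With all modules placed on the GPU, the exllama check in optimum's post_init_model should no longer raise the error above.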

jklj077 changed discussion status to closed
