Error while loading

#1
by Nexesenex - opened

I tried the model you kindly shared, and this is what I get in Ooba with an RTX 3090:

127.0.0.1 - - [16/Sep/2023 06:27:45] "GET /api/v1/model HTTP/1.1" 200 -
2023-09-16 06:28:16 INFO:Loading Panchovix_airoboros-l2-70b-gpt4-1.4.1_2.5bpw-h6-exl2...
2023-09-16 06:28:31 ERROR:Failed to load the model.

Traceback (most recent call last):
  File "U:\oobabooga_windows\text-generation-webui\modules\ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "U:\oobabooga_windows\text-generation-webui\modules\models.py", line 77, in load_model
    output = load_func_map[loader](model_name)
  File "U:\oobabooga_windows\text-generation-webui\modules\models.py", line 338, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
  File "U:\oobabooga_windows\text-generation-webui\modules\exllamav2.py", line 40, in from_pretrained
    model.load(split)
  File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\model.py", line 233, in load
    for module in self.modules: module.load()
  File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\mlp.py", line 44, in load
    self.up_proj.load()
  File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\linear.py", line 37, in load
    if w is None: w = self.load_weight()
  File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\module.py", line 79, in load_weight
    qtensors = self.load_multi(["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm"])
  File "U:\oobabooga_windows\installer_files\env\lib\site-packages\exllamav2\module.py", line 69, in load_multi
    tensors[k] = st.get_tensor(self.key + "." + k).to(self.device())
RuntimeError: [enforce fail at …\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 88080384 bytes.

Meanwhile, this other quant loads fine:
2023-09-16 06:34:32 INFO:Loading turboderp_LLama2-70B-chat-2.55bpw-h6-exl2...
2023-09-16 06:35:01 INFO:Loaded the model in 28.95 seconds.

The DefaultCPUAllocator: not enough memory error means the system ran out of RAM while loading the model.
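For context on why system RAM matters at all here: judging from the last frame of the traceback, each quantized tensor appears to be read from the .safetensors file into CPU RAM first and only then copied to the GPU. A minimal sketch of that pattern (the file name and tensor key below are placeholders, not the actual ones from this model):

```python
# Minimal sketch of the load pattern seen in the traceback:
# st.get_tensor(...) materializes the tensor in CPU RAM,
# and .to(...) then copies it to VRAM.
import torch
from safetensors import safe_open

with safe_open("output.safetensors", framework="pt", device="cpu") as st:
    w = st.get_tensor("model.layers.0.mlp.up_proj.q_weight")  # CPU allocation happens here
    w = w.to("cuda:0")  # copied to the GPU afterwards
```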

You could try increasing your swap file (page file on Windows) and trying again.
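Not part of the original exchange, but if you want to check how much headroom you actually have before loading, something like this works (assumes psutil is installed):

```python
import psutil

# Report available RAM and free swap before attempting a model load.
vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM available: {vm.available / 1024**3:.1f} GiB")
print(f"Swap free:     {sw.free / 1024**3:.1f} GiB")
```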

I can load the model without any warning with 64 GB of RAM (and 200 GB of swap), though the RAM by itself should be enough for a model of this size.
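For a rough sense of why 64 GB "should be enough": at 2.5 bits per weight, a ~70B-parameter model's quantized weights come to roughly 20 GiB. A back-of-the-envelope calculation (it ignores quantization metadata and loader overhead, and the 70e9 parameter count is an approximation):

```python
# Back-of-the-envelope estimate, not an exact figure:
params = 70e9  # ~70B weight parameters (assumption)
bpw = 2.5      # bits per weight for this quant
size_gib = params * bpw / 8 / 1024**3
print(f"~{size_gib:.1f} GiB of quantized weights")  # ~20.4 GiB
```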

Thanks Panchovix, it works now. I'm surprised, though, that this quant needs swap while Turboderp's 2.55bpw (in 3 safetensors files) doesn't. But we're still at the alpha stage, so it's already great.
I'm eager to be able to quantize models myself. I'm using 3072 context right now.
Once again, thank you very much!

Edit: I was a bit optimistic. I can stay around 10 tokens/s with only 1792 ctx. But it works.
Edit 2: I guess I'll have to buy a second 3090 to get decent output! :D
