out of memory error when launching from oobabooga web ui

#15
by bhaveshNOm - opened

Hey, I just downloaded this model (from the Aitrepreneur tutorial), but whenever I start it, it throws this error: RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 13107200 bytes.
I have a 3060 (6GB) and 16 GB of RAM.
Please help 🥲

Having the same problem here, and I see many others are too. More people seem to be having problems with oobabooga and GPT x Alpaca than actually using it.

Hi guys! I got the same error and was able to move past it. What is happening is that the program is trying to allocate more memory than your GPU has available. To solve this, you can edit the start-webui.bat file and add this parameter: --gpu-memory 5
That "5" is the maximum amount of memory (in GB) the program will allocate on the GPU. Since I have a 6GB GPU, I put 5. For a more precise number you can do something like --gpu-memory 5300MiB, which would be around 5.3GB of memory.
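For example, assuming a start-webui.bat that launches server.py, the edited line might look something like this (everything besides --gpu-memory is just whatever flags you already pass):

rem cap GPU allocation at roughly 5 GB; the other flags are only illustrative
call python server.py --chat --wbits 4 --groupsize 128 --gpu-memory 5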

This will make the WebUI launch correctly. However, I failed to chat with the bot :( I now get this error: "KeyError: 'model.layers.28.self_attn.q_proj.wf1'"

Guys, report back with your results please.

I used the flag "--gpu-memory 7800MiB", which is 7.8 GB of GPU memory. I had also tried with less, and the error code is similar.

I'm reporting back with this:

"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 655380480 bytes."

@Detpircsni reading your error message, I believe you are running out of CPU memory, not GPU memory. You can use
--cpu-memory
as a parameter instead (remember to include the number after it).

Guys, I'll leave you this low-VRAM guide from oobabooga; it has some useful tips and parameters you can try:
https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide

Also, if you get this error after limiting the memory: "KeyError: 'model.layers.28.self_attn.q_proj.wf1'", you can use the
--pre_layer 35
parameter instead. If it gets you "Out of memory" after a few messages, you can try
--pre_layer 25
or go even lower if you need to. This parameter sets the number of layers that will be sent to the GPU.
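As a rough sketch, those flags could be combined in start-webui.bat like this (the numbers are examples only; tune them to your hardware):

rem example values: cap CPU RAM usage at 8 GB and send 25 layers to the GPU
call python server.py --chat --wbits 4 --groupsize 128 --cpu-memory 8 --pre_layer 25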

The awful thing about all this is that the performance is SUPER slow :(. It generates text so, so slowly. If anyone else finds a way to improve performance, please let us know.

I tried --pre_layer 25; I get stupid responses that don't make any sense.

@YouKnowWhoItIs2 yeah, it seems to be a CPU memory problem for me as well. The problem is I am allocating 10 GB and more to the UI (which I have available) and it still won't even launch the web UI.

@shuaibtkd720 go to the "Parameters" panel in the WebUI and increase "max_new_tokens" to the maximum. That seems to help a lot with the quality of the responses.

Someone has got to resolve this problem.

RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 141557760 bytes.

It happens to me too. I can't fix it with any of the workarounds proposed here or in other forums. I've got a 3070.

I have the same issue on launch, with a 12GB 3080, and 32GB of RAM. I have tried the following -

--auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 10 --cpu-memory 28

DefaultCPUAllocator: not enough memory: you tried to allocate 2211840 bytes.

Which CPU do you have? I own a 3700X... AMD.

I also have an AMD, a Ryzen 9 5900X.

The issue (for me) is the amount of swap memory (especially if you're using Vicuna etc.; the models are big).
I'm testing on a little desktop system with 16GB of RAM, so another 64GB of swap got it functional (but slow... that's the price you pay!) :) Hope that helps...

@scott-hugging didn't work for me 🥲 I don't know which memory it is running out of; there is a peak in Task Manager's memory graph.
I tried everything mentioned above; nothing worked.
I can run it in llama.cpp, but it's really, really slow, I don't even know how to change the parameters, and it uses all my RAM.
Is this also happening to anyone with more than 16 GB of memory?

I had this error. I closed a few of my browser tabs and tried again, and it worked. 10GB VRAM and 16GB RAM.

It happened to me yesterday. I have a 3060 12GB and a 12-core Ryzen 9 + 16 GB of RAM. I managed to run it just by manually setting the virtual memory for the hard drive my AI stuff is on. It works now, though sometimes the AI answers for me, and it's also a bit rude for some unknown reason.

Which code do you use? What did you change in your settings?

It's not really about code; just search for how to set the virtual memory for a hard drive. In my case, I have oobabooga on my D drive, and when I checked the virtual memory on it, it was disabled. All you need to do is select the hard drive where oobabooga is and check the "System managed size" box, and that's it.
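For those who prefer the command line over those GUI steps, the page file can also be configured with wmic from an elevated prompt. This is an untested sketch; the drive letter and sizes (in MB) are placeholders, so adjust them to your setup:

rem run from an elevated command prompt; D: and the sizes are placeholders
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset create name="D:\pagefile.sys"
wmic pagefileset where name="D:\\pagefile.sys" set InitialSize=16384,MaximumSize=65536

A reboot is usually needed before the new page file takes effect.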

Greetings,
When I run web UI I got the following error:

Starting the web UI...
Warning: --cai-chat is deprecated. Use --chat instead.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: C:\ai\LLM\oobabooga-windows\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll...
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
Found the following quantized model: models\anon8231489123_vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Traceback (most recent call last):
File "C:\ai\LLM\oobabooga-windows\text-generation-webui\server.py", line 346, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\ai\LLM\oobabooga-windows\text-generation-webui\modules\models.py", line 103, in load_model
model = load_quantized(model_name)
File "C:\ai\LLM\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 136, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
File "C:\ai\LLM\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 32, in _load_quant
model = AutoModelForCausalLM.from_config(config)
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 411, in from_config
return model_class._from_config(config, **kwargs)
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1138, in _from_config
model = cls(config, **kwargs)
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 614, in __init__
self.model = LlamaModel(config)
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in __init__
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 445, in <listcomp>
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 256, in __init__
self.mlp = LlamaMLP(
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 152, in __init__
self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
File "C:\ai\LLM\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 141557760 bytes.

1. It says I do not have enough memory. It tried to allocate 141557760 bytes (0.14 GB). I have 16 GB of RAM and an RTX 3060; that is approximately 0.875% of my RAM. Something does not add up.

2. I used a few parameters in the web UI .bat file, like --gpu-memory 3500MiB --cpu-memory 3000MiB (which constrain the GPU and CPU usage), --load-in-8bit, --auto-devices --cai-chat --wbits 4 --groupsize 128. None of them fixed the issue. BTW, I found these in https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide.

3. I selected option a) NVIDIA; however, based on the line "RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 141557760 bytes." I think it is running on the CPU, not the GPU. I am 100% certain that I selected option a) NVIDIA, which does not add up.

I have been working on this the whole day. At this point I have no clue what to do. Keep in mind I am pretty new to all this; I have no idea if I am just being stupid. Any help would be highly appreciated.

Hey guys, I have a solution. I was having the same "CUDA out of memory" issue with a 3080. I tried the flag --gpu-memory 10 and got this error:

Traceback (most recent call last):
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\text-generation-webui\server.py", line 346, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\models.py", line 103, in load_model
model = load_quantized(model_name)
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 147, in load_quantized
device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 567, in infer_auto_device_map
max_memory = get_max_memory(max_memory)
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 380, in get_max_memory
max_memory[key] = convert_file_size_to_int(max_memory[key])
File "C:\AI Programs\GPT Programs\oobabooga-windows\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 59, in convert_file_size_to_int
return int(size[:-3]) * (2**30)
ValueError: invalid literal for int() with base 10: '10GB'

The ValueError there is basically saying the integer conversion expects the input to be in bytes, not GB. This means every time you write --gpu-memory 10 it takes that as "10 bytes". If you specify "GB" it won't read it at all. So I changed it to --gpu-memory 10737418240 and that seems to have solved at least one of the issues. I am still following the breadcrumbs towards the ultimate issue here.
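If you want to try the same workaround, the change in start-webui.bat would look roughly like this (10737418240 bytes is 10 GiB; the other flags are just whatever you already pass):

rem 10737418240 bytes = 10 GiB; adjust the cap to your card
call python server.py --chat --wbits 4 --groupsize 128 --gpu-memory 10737418240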

It may sound weird, but just change the virtual memory of the drive where you have oobabooga. I answered earlier in this thread with the same problem and managed to fix it by just doing that.

You shouldn't have to do this with 32GB of RAM though (which I have). This still seems bugged to me, but it's likely an oobabooga thing, not a model thing.

If you watch the video made by Aitrepreneur on this whole thing, he can run it just fine and everything works perfectly well on oobabooga

(To you, and to anyone else who is reading) I have a system with 25+ GB, have actually read the guide, and have tried all of those suggestions. At the time of that post, my pre_layer was set to 50, and I went as far as using the --disk flag to offload there. Despite this, it had no effect on the error.

I tried @sfreeman88's tip about the --gpu-memory flag. My --gpu-memory was 7, as many guides advise. I adjusted it to --gpu-memory 7073741824, which should be roughly 7 gigabytes expressed in bytes.

The same error persisted, but the attempted allocation was 141,557,760 bytes, which doesn't even amount to 1 GB. Something is clearly wrong.

I tried the virtual memory option suggested by @xiiredrum. I typed "advanced system settings" into the search, clicked "Settings" under Performance, went to the "Advanced" tab, clicked the "Change" button, and unchecked the "Automatically manage paging file size for all drives" box. I noticed only my OS drive had a paging file enabled, which was also where oobabooga was located, but it was still worth a try. Checking "System managed size" for each drive didn't work, as my changes were discarded after leaving that window. Setting a custom size (5024-50024 MB) resolved this. After a restart, I checked my results.

The error code changed. I interpreted this as a positive, at least:

=========

Loading vicuna-13b-GPTQ-4bit-128g...
Loading model ...
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)

==========

After these warnings, it hangs and nothing happens. But I ran it a second time:

======

Loading vicuna-13b-GPTQ-4bit-128g...
Loading model ...
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
C:\Program Files\oobabooga-windows\installer_files\env\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
Done.
Using the following device map for the 4-bit model: {'': 0}
Loaded the model in 15.61 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True in launch().

=======

My flags were --wbits 4 --groupsize 128 --gpu-memory 77073741824. All my internet browser windows were also closed. It is possible all solutions lead to this outcome, which no longer seems related to memory hell, at least. I also tried it with GPT4 Alpaca, and the results for this whole post are the same.

Additional Observations:

I can confirm that it is functional, but leaving your browser open is a mistake on not-so-powerful GPUs; it, and everything else, becomes very bogged down. It loads quickly and then begins to draw resources upon generation. Enabling the auto-devices flag in this scenario is advised and still seems to allow operation with the above troubleshooting; so do the pre_layer flag and the threads flag. In my case, my 8GB of VRAM couldn't handle a --gpu-memory of 7, so I reduced it to 5, which I would also advise. Upon removing the byte format in --gpu-memory and instead passing, say, a "7", or expressing it in MiB, the UI treats every input as a successfully generated output from the terminal (I never tried to reproduce the error, so take that with a grain of salt); error below:

KeyError: 'model.layers.27.self_attn.q_proj.wf1'

So I reversed that change in the .bat file. I also learned that anything lower than pre_layer 25 is not that functional without a decently up-to-date GPU, so I wouldn't advise it. In an attempt to make it faster, I tested out things like DeepSpeed, bf16, FlexGen, CUDA features, etc. via the UI.

Conclusion:

After everything worked and all was said and done: if you don't have a very high-end PC, this won't be a good experience for you, and I would advise putting together a Colab version or trying another alternative. At 0.89 tokens per second, it feels like watching paint dry. This is also a very delicate program; just about anything breaks it or sets it off in the wrong direction. If you lack patience, try KoboldAI or NAI.

Despite this, thank you everyone for the help.

I am also unable to load the model in oobabooga. Always out of memory.

I'm having the same problem.

My web UI starts, but if I type something, this error occurs:

Starting the web UI...
Loading gpt4-x-alpaca-13b-native-4bit-128g...
Found the following quantized model: models\gpt4-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Loaded the model in 5.21 seconds.
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Traceback (most recent call last):
File "J:\oobabooga\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "J:\oobabooga\text-generation-webui\modules\text_generation.py", line 251, in generate_with_callback
shared.model.generate(**kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 305, in forward
hidden_states = self.mlp(hidden_states)
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 157, in forward
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "J:\oobabooga\installer_files\env\lib\site-packages\transformers\activations.py", line 150, in forward
return nn.functional.silu(input)
File "J:\oobabooga\installer_files\env\lib\site-packages\torch\nn\functional.py", line 2059, in silu
return torch._C._nn.silu(input)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 7.07 GiB already allocated; 0 bytes free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 2.33 seconds (0.00 tokens/s, 0 tokens, context 37, seed 2091766985)


These are my params:

call python server.py --chat --wbits 4 --groupsize 128 --bf16 --gpu-memory 7
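The CUDA OOM message above also suggests setting max_split_size_mb to reduce fragmentation. One way to try that (untested here; 128 is just an example value to tune) is to set the environment variable in the .bat right before the server is launched:

rem example only: ask PyTorch's CUDA allocator to limit split size; 128 MB is a guess
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
call python server.py --chat --wbits 4 --groupsize 128 --bf16 --gpu-memory 7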

OK, I finally got it running. Seriously, it is as simple as @xiiredrum described above; setting the virtual memory for the drive where the chat bot's files are located sufficed. At this point I was through all the other troubles, and it turns out setting the args is not necessary at all. lol
Set the virtual memory to system managed size. I tried a fixed value, which did not work; I don't know why, I have plenty.
Important: you don't need to set wbits and groupsize; there is a config in the models folder that has all the settings for these kinds of cases and picks the right profile automatically. That also means you can start different models without having to set the args each time.
It runs just fine without all these parameters for me. But if you specifically run out of GPU memory (CUDA error), for me it works if I write it like this:
--gpu-memory 10GiB (for a 12GB card; leaving a bit of headroom, the speed is just fine like that)
Same goes for the CPU memory (RAM). But when set to use a GPU it does not use much RAM at all.
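For reference, dropped into the launch line that notation would look roughly like this (the values are only examples for a 12GB card and plenty of RAM; the post above reports the GiB suffix being accepted):

rem example values; GiB suffix as described in the post above
call python server.py --chat --gpu-memory 10GiB --cpu-memory 16GiB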

As a side note, once you have your bot running with another model, ask it for advice. It can be a surprisingly good helper for setting itself up... lol
But don't expect a detailed manual; it can answer questions pretty well, but it can't know the specifics of your environment and problems if you don't provide that information.

/edit: just after I wrote this, I restarted the UI and it ran into the same memory problem. Rebooted the PC and it works again.

The problem with the low performance is that oobabooga is using just a few CPU threads. If someone knows how to increase that... With alpaca.cpp, CPU-only runs very well with 20 threads.

@Detpircsni
Sorry for my English. It seems like you overcame the 'KeyError: 'model.layers.27.self_attn.q_proj.wf1'' error.
I can run the model perfectly,
but I can't figure out what the problem is; the "--pre_layer" flag looks like the culprit for me. No matter what number I use, it seems like I can't generate text or use anything.

My current params are: --chat --model-menu --wbits 4 --groupsize 128 --gpu-memory 5
If you (or anyone) can help me, it would be appreciated.

Having the same issue; I have a very high-spec computer, followed all the steps like 4 times starting from scratch every time, did the virtual memory thing, set GPU memory to 10 GB, and every time I submit anything I get the same "CUDA out of memory" error. Please, someone needs to figure this out really badly.

I just saw that it is not enough to limit the GPU memory to 7 when there is 8 GB on the GPU.
I limited it to 5 and my GPU's memory still fills up to 6.9 GB. That means you have to give it a lot of headroom!
Same goes for cpu-memory. I have 24 GB installed and set it to 16.
While starting it took more than 16 GB; now it sits at 12 GB.

I have 24 GB of RAM and a 3070 with 8 GB, and I got it working with the following params:

--auto-devices --chat --wbits 4 --groupsize 128 --gpu-memory 5 --cpu-memory 16

But it is soooooo slow. Is this normal?

@xiiredrum
I have the same build as yours - a 3060 and 16 gigs of RAM - and manually specifying the pagefile min and max fixed it for me (I set it from 64GB to 100GB), not just setting it to system managed size.

In case anyone is still having an issue after trying the above things, check your drive's free storage space. I had ~5 GB free and was getting the error "DefaultCPUAllocator: not enough memory: you tried to allocate...". After I uninstalled some things and had ~30GB free, everything started working. I'm not sure at what point it would have started working; I only tested it at 5GB and 30GB. But I hope this helps someone!

Edit: I had to do this on my C drive, even though I had ~250GB free on the D drive where oobabooga was installed.

SOLUTION:
Error: DefaultCPUAllocator: not enough memory: you tried to allocate 141557760 bytes.
Change the system page file on the disk where the model is located to a custom size of 5024-50024 MB, as noted by @Detpircsni.
