Out of memory on two 3090s

#21
by gameveloster - opened

Tried loading the model using ExLlama on two 3090s but kept getting an out-of-memory error. When it crashes, the first GPU's VRAM is fully utilized (23.69 GB) while the second GPU is only using 7.87 GB.

$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 24,24
bin /home/gameveloster/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
2023-06-18 15:57:03 INFO:Loading TheBloke_guanaco-65B-GPTQ...
Traceback (most recent call last):
  File "/mnt/md0/text-generation-webui/server.py", line 1014, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 65, in load_model
    output = load_func_map[loader](model_name)
  File "/mnt/md0/text-generation-webui/modules/models.py", line 277, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "/mnt/md0/text-generation-webui/modules/exllama.py", line 41, in from_pretrained
    model = ExLlama(config)
  File "/mnt/md0/text-generation-webui/repositories/exllama/model.py", line 630, in __init__
    tensor = tensor.to(device, non_blocking = True)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 23.69 GiB total capacity; 23.01 GiB already allocated; 35.12 MiB free; 23.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Should this model be loadable on two 3090s when using ExLlama?

Yes, but you need an unequal split to allow for context on GPU 1. From the exllama README:

[Screenshot of the exllama README section on splitting the model weights across multiple GPUs]

So try --gpu-split 17.2,24 or similar
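Applied to the original launch command from above, that would be:

$ python server.py --model TheBloke_guanaco-65B-GPTQ --listen --chat --loader exllama --gpu-split 17.2,24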

I was running into this OOM issue even before ExLlama. Following this recommendation, --gpu-split 17.2,24, it now works perfectly and I am getting 12 tokens/s. Impressive!

import torch
from auto_gptq import AutoGPTQForCausalLM

# model_name_or_path and model_basename are assumed to be defined earlier in the script
use_triton = False
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    model_basename=model_basename,
    use_cuda_fp16=False,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    # allow up to ~24 GB per visible GPU so the weights are spread across both cards
    max_memory={i: '24000MB' for i in range(torch.cuda.device_count())},
    use_triton=use_triton,
    quantize_config=None)

You need to put less on GPU 1 to allow for context. Try 16 GB for GPU 1 and 24 GB for GPU 2.
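In the max_memory dict above, that would look something like this (a sketch, assuming "GPU 1" maps to device index 0):

# leave headroom on the first card for the context/cache
max_memory = {0: '16000MB', 1: '24000MB'}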

Or you'll get much better performance with ExLlama, and lower GPU usage too. Here's example code using ExLlama (there are more examples in the same repo): https://github.com/turboderp/exllama/blob/c16cf49c3f19e887da31d671a713619c8626484e/example_basic.py

To that basic ExLlama code you would add config.set_auto_map("17.2,24") and config.gpu_peer_fix = True to split the model over two GPUs.
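A condensed sketch of that, adapted from the linked example_basic.py and run from the exllama repo root (the model directory path and prompt are placeholders, not values from this thread):

import os, glob

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# placeholder path to the downloaded GPTQ model directory
model_directory = "/path/to/TheBloke_guanaco-65B-GPTQ/"
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.set_auto_map("17.2,24")  # unequal split: leave headroom on the first card for context
config.gpu_peer_fix = True      # for splitting over two GPUs

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Tell me about llamas.", max_new_tokens=128))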
