CUDA out of memory error when using device_map="auto"
I am trying to run a model across my 2 GPUs + RAM, mapped with device_map="auto".
However, I am getting a CUDA OOM error.
Using only one GPU, selected as follows, gives the same error:
import torch
import os

# Match device IDs to nvidia-smi ordering, then expose only GPU 1
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
With CPU only, generation takes about 30 minutes for max_length=2000 on my Core i9-9900K with 64 GB RAM, which is too slow. I had hoped my two 1080 Ti cards (11 GB each) could help speed up the text generation.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", offload_state_dict=True)  # no disk offloading
input_text = """
# The benefits of deadlifting
## INTRODUCTION
"""
randomizer_value = 0
repetitions = 1

# set seed to reproduce results; feel free to change it to get different results
torch.manual_seed(randomizer_value)
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # CPU only
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

# sample with top_k = 50, top_p = 0.95, one returned sequence
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=2000,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 10.92 GiB total capacity; 9.80 GiB already allocated; 9.75 MiB free; 9.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
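For reference, the allocator tweak that the error message itself suggests would look like the sketch below; the 128 MiB split size is an arbitrary starting value, and it only mitigates fragmentation, so it may not help when the weights simply don't fit.

import os

# Must be set before the first CUDA allocation (ideally before importing torch).
# max_split_size_mb:128 is an assumed starting value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"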
Answering my own question:
We need to remap the model so that the part of the map occupying each GPU is a little smaller, with the freed layers offloaded to another device such as the CPU or the other GPU.
Here is an example of how I did it for the galactica-30b model.
First, run the following code to inspect the mapping that device_map="auto" produces (the one that gives the CUDA error):
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map="auto", torch_dtype=torch.float16)
model.hf_device_map
It outputs the device map as a dictionary like this:
{'model.decoder.embed_tokens': 0,
'lm_head': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
...
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 0,
'model.decoder.layers.8': 1,
...
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 1,
'model.decoder.layers.17': 1,
'model.decoder.layers.18': 'cpu',
...
'model.decoder.layers.45': 'cpu',
'model.decoder.layers.46': 'cpu',
'model.decoder.layers.47': 'cpu'}
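To see at a glance how many modules sit on each device, you can tally the entries; a small sketch:

from collections import Counter

# For the map above this prints Counter({'cpu': 30, 0: 12, 1: 10}):
# 30 decoder layers on CPU, 12 modules on GPU 0, 10 layers on GPU 1.
print(Counter(model.hf_device_map.values()))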
Now reduce the share of GPU 0 and GPU 1 and move the freed layers to the CPU by defining the map below before initializing the model the next time you run the script. (Note how layer 7 moved from GPU 0 to GPU 1, and layers 16 and 17 moved from GPU 1 to the CPU, which decreases the mapping on both GPUs.)
device_map = {'model.decoder.embed_tokens': 0,
'lm_head': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
'model.decoder.layers.1': 0,
'model.decoder.layers.2': 0,
'model.decoder.layers.3': 0,
'model.decoder.layers.4': 0,
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 1,
'model.decoder.layers.8': 1,
'model.decoder.layers.9': 1,
'model.decoder.layers.10': 1,
'model.decoder.layers.11': 1,
'model.decoder.layers.12': 1,
'model.decoder.layers.13': 1,
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 'cpu',
'model.decoder.layers.17': 'cpu',
'model.decoder.layers.18': 'cpu',
'model.decoder.layers.19': 'cpu',
'model.decoder.layers.20': 'cpu',
'model.decoder.layers.21': 'cpu',
'model.decoder.layers.22': 'cpu',
'model.decoder.layers.23': 'cpu',
'model.decoder.layers.24': 'cpu',
'model.decoder.layers.25': 'cpu',
'model.decoder.layers.26': 'cpu',
'model.decoder.layers.27': 'cpu',
'model.decoder.layers.28': 'cpu',
'model.decoder.layers.29': 'cpu',
'model.decoder.layers.30': 'cpu',
'model.decoder.layers.31': 'cpu',
'model.decoder.layers.32': 'cpu',
'model.decoder.layers.33': 'cpu',
'model.decoder.layers.34': 'cpu',
'model.decoder.layers.35': 'cpu',
'model.decoder.layers.36': 'cpu',
'model.decoder.layers.37': 'cpu',
'model.decoder.layers.38': 'cpu',
'model.decoder.layers.39': 'cpu',
'model.decoder.layers.40': 'cpu',
'model.decoder.layers.41': 'cpu',
'model.decoder.layers.42': 'cpu',
'model.decoder.layers.43': 'cpu',
'model.decoder.layers.44': 'cpu',
'model.decoder.layers.45': 'cpu',
'model.decoder.layers.46': 'cpu',
'model.decoder.layers.47': 'cpu'}
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map=device_map, torch_dtype=torch.float16)  # manually device mapped; no disk offloading
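Generation then works the same as in the question; the only thing to remember is that the inputs go to the GPU holding model.decoder.embed_tokens (GPU 0 in the map above), and accelerate's hooks move activations between devices during the forward pass. A minimal sketch reusing the tokenizer and input_text from the question:

torch.manual_seed(0)  # reproducible sampling
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=2000,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))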
Keep experimenting: observe GPU memory utilization in `nvidia-smi` and increase or decrease the number of layers mapped to each GPU until you find the sweet spot.
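Alternatively, instead of editing the dictionary by hand, you can keep device_map="auto" and cap each device with max_memory, which accelerate uses when computing the map. A sketch; the GiB limits below are assumptions to tune for your own hardware, not recommendations:

model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-30b",
    device_map="auto",
    torch_dtype=torch.float16,
    # Reserve headroom on each 11 GB card and let the rest spill to RAM
    max_memory={0: "9GiB", 1: "9GiB", "cpu": "58GiB"},
)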
Hope that helps somebody :)
Hey, I'm using model = BartForSequenceClassification.from_pretrained("facebook/bart-base"), and your solution does not work for me. I'm getting a RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`. I made a forum question about it here.
Thanks so much! It helped me resolve the issue!