CUDA out of memory error when using device_map="auto"
I am trying to run a model across my 2 GPUs + RAM, mapped with device_map="auto".
However, I am getting a CUDA OOM error.
Using only one GPU, selected as follows, gives the same error:
import torch
import os

# Match device IDs to nvidia-smi ordering, then expose only GPU 1
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
With CPU only, generation takes about 30 minutes for max_length=2000 on my Core i9-9900K with 64 GB RAM, which is too slow. I had hoped my two 1080 Ti cards (11 GB each) could help speed up the text generation.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", offload_state_dict=True)  # no disk offloading
input_text = """
# The benefits of deadlifting
## INTRODUCTION
"""
randomizer_value = 0
repetitions = 1

# set seed to reproduce results; feel free to change it to get different results
torch.manual_seed(randomizer_value)
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids  # CPU only
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

# sample with top_k = 50, top_p = 0.95, one returned sequence
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=2000,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 10.92 GiB total capacity; 9.80 GiB already allocated; 9.75 MiB free; 9.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
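For reference, the allocator tweak that the error message itself suggests would look like the sketch below; the 128 MiB split size is an arbitrary starting value, and it only mitigates fragmentation, so it may not help when the weights simply don't fit.

import os

# Must be set before the first CUDA allocation (ideally before importing torch).
# max_split_size_mb:128 is an assumed starting value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"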
Answering my own question:
We need to remap the model so that the part of the map occupying each GPU is a little smaller, with the freed layers offloaded to another device such as the CPU or the other GPU.
Here is an example of how I did it for the galactica-30b model.
First, run the following code to inspect the mapping that device_map="auto" produces (the one that gives the CUDA error):
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map="auto", torch_dtype=torch.float16)
model.hf_device_map
It outputs the device map as a dictionary like this:
{'model.decoder.embed_tokens': 0,
'lm_head': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
...
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 0,
'model.decoder.layers.8': 1,
...
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 1,
'model.decoder.layers.17': 1,
'model.decoder.layers.18': 'cpu',
...
'model.decoder.layers.45': 'cpu',
'model.decoder.layers.46': 'cpu',
'model.decoder.layers.47': 'cpu'}
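To see at a glance how many modules sit on each device, you can tally the entries; a small sketch:

from collections import Counter

# For the map above this prints Counter({'cpu': 30, 0: 12, 1: 10}):
# 30 decoder layers on CPU, 12 modules on GPU 0, 10 layers on GPU 1.
print(Counter(model.hf_device_map.values()))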
Now reduce the share of GPU 0 and GPU 1 and move the freed layers to the CPU by defining the map below before initializing the model the next time you run the script. (Note how layer 7 moved from GPU 0 to GPU 1, and layers 16 and 17 moved from GPU 1 to the CPU, which decreases the mapping on both GPUs.)
device_map = {'model.decoder.embed_tokens': 0,
'lm_head': 0,
'model.decoder.embed_positions': 0,
'model.decoder.final_layer_norm': 0,
'model.decoder.layers.0': 0,
'model.decoder.layers.1': 0,
'model.decoder.layers.2': 0,
'model.decoder.layers.3': 0,
'model.decoder.layers.4': 0,
'model.decoder.layers.5': 0,
'model.decoder.layers.6': 0,
'model.decoder.layers.7': 1,
'model.decoder.layers.8': 1,
'model.decoder.layers.9': 1,
'model.decoder.layers.10': 1,
'model.decoder.layers.11': 1,
'model.decoder.layers.12': 1,
'model.decoder.layers.13': 1,
'model.decoder.layers.14': 1,
'model.decoder.layers.15': 1,
'model.decoder.layers.16': 'cpu',
'model.decoder.layers.17': 'cpu',
'model.decoder.layers.18': 'cpu',
'model.decoder.layers.19': 'cpu',
'model.decoder.layers.20': 'cpu',
'model.decoder.layers.21': 'cpu',
'model.decoder.layers.22': 'cpu',
'model.decoder.layers.23': 'cpu',
'model.decoder.layers.24': 'cpu',
'model.decoder.layers.25': 'cpu',
'model.decoder.layers.26': 'cpu',
'model.decoder.layers.27': 'cpu',
'model.decoder.layers.28': 'cpu',
'model.decoder.layers.29': 'cpu',
'model.decoder.layers.30': 'cpu',
'model.decoder.layers.31': 'cpu',
'model.decoder.layers.32': 'cpu',
'model.decoder.layers.33': 'cpu',
'model.decoder.layers.34': 'cpu',
'model.decoder.layers.35': 'cpu',
'model.decoder.layers.36': 'cpu',
'model.decoder.layers.37': 'cpu',
'model.decoder.layers.38': 'cpu',
'model.decoder.layers.39': 'cpu',
'model.decoder.layers.40': 'cpu',
'model.decoder.layers.41': 'cpu',
'model.decoder.layers.42': 'cpu',
'model.decoder.layers.43': 'cpu',
'model.decoder.layers.44': 'cpu',
'model.decoder.layers.45': 'cpu',
'model.decoder.layers.46': 'cpu',
'model.decoder.layers.47': 'cpu'}
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map=device_map, torch_dtype=torch.float16)  # manually device mapped; no disk offloading
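Generation then works the same as in the question; the only thing to remember is that the inputs go to the GPU holding model.decoder.embed_tokens (GPU 0 in the map above), and accelerate's hooks move activations between devices during the forward pass. A minimal sketch reusing the tokenizer and input_text from the question:

torch.manual_seed(0)  # reproducible sampling
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=2000,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))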
Keep experimenting: observe GPU memory utilization in `nvidia-smi` and increase or decrease the number of layers mapped to each GPU until you find the sweet spot.
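Alternatively, instead of editing the dictionary by hand, you can keep device_map="auto" and cap each device with max_memory, which accelerate uses when computing the map. A sketch; the GiB limits below are assumptions to tune for your own hardware, not recommendations:

model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-30b",
    device_map="auto",
    torch_dtype=torch.float16,
    # Reserve headroom on each 11 GB card and let the rest spill to RAM
    max_memory={0: "9GiB", 1: "9GiB", "cpu": "58GiB"},
)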
Hope that helps somebody :)
Hey, I'm using model = BartForSequenceClassification.from_pretrained("facebook/bart-base"), and your solution does not work for me. I'm getting a RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`. I made a forum question about it here.
Thanks so much! It helped me resolve the issue!