Out of memory error with an RTX 4090
I am getting an out-of-memory error when loading the model on an RTX 4090. I tried both Windows 11 and WSL, using CUDA 11.7.
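import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Use the first GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(
    'mosaicml/mpt-7b-chat',
    trust_remote_code=True
)
# Load the weights in bfloat16 (2 bytes per parameter).
model = AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-chat',
    torch_dtype=torch.bfloat16,
    config=config,
    trust_remote_code=True
)
model.to(device)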
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.16s/it]
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[5], line 17
10 #print(config)
11 model = AutoModelForCausalLM.from_pretrained(
12 'mosaicml/mpt-7b-chat',
13 torch_dtype=torch.bfloat16,
14 config=config,
15 trust_remote_code=True
16 )
---> 17 model.to(device)
File ~\Documents\DEV\lib\site-packages\transformers\modeling_utils.py:1878, in PreTrainedModel.to(self, *args, **kwargs)
1873 raise ValueError(
1874 "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
1875 " model has already been set to the correct devices and casted to the correct `dtype`."
1876 )
1877 else:
-> 1878 return super().to(*args, **kwargs)
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:1145, in Module.to(self, *args, **kwargs)
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
[... skipping similar frames: Module._apply at line 797 (2 times)]
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:820, in Module._apply(self, fn)
816 # Tensors stored in modules are graph leaves, and we don't want to
817 # track autograd history of `param_applied`, so we have to use
818 # `with torch.no_grad():`
819 with torch.no_grad():
--> 820 param_applied = fn(param)
821 should_use_set_data = compute_should_use_set_data(param, param_applied)
822 if should_use_set_data:
File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:1143, in Module.to.<locals>.convert(t)
1140 if convert_to_format is not None and t.dim() in (4, 5):
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.99 GiB total capacity; 11.89 GiB already allocated; 10.58 GiB free; 11.99 GiB allowed; 11.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
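One detail in the message above may be worth a look: it reports 10.58 GiB free on the device but only 11.99 GiB "allowed". As far as I know, PyTorch prints the "allowed" figure only when a per-process memory cap is in effect (for example via torch.cuda.set_per_process_memory_fraction); whether such a cap is set in this environment is an assumption, not something this thread confirms. A minimal sketch that pulls the figures out of the message to make the comparison explicit (parse_oom is a hypothetical helper name):

```python
import re

def parse_oom(msg):
    """Return the GiB figures named in a CUDA OOM message, keyed by label."""
    pattern = r"([\d.]+) GiB (total capacity|already allocated|free|allowed|reserved)"
    return {label: float(value) for value, label in re.findall(pattern, msg)}

# Message text copied from the traceback above.
msg = ("CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; "
       "23.99 GiB total capacity; 11.89 GiB already allocated; 10.58 GiB free; "
       "11.99 GiB allowed; 11.89 GiB reserved in total by PyTorch)")

stats = parse_oom(msg)
# Headroom left under the "allowed" cap, in MiB.
headroom_mib = (stats["allowed"] - stats["already allocated"]) * 1024
print(stats)
print(f"headroom under cap: {headroom_mib:.0f} MiB")
```

If the headroom under the cap is smaller than the 128 MiB request, the allocation fails even though the card itself still has memory free.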
Yes, same problem with my 24 GB RTX 3090!
I tried adding "torch_dtype=torch.bfloat16" to the model initialization. It's OK now!
I also have that in my code above, but it doesn't work for me.
@antman1p With "torch_dtype=torch.bfloat16", the 7B model should only take up ~14 GB. Here's what my nvidia-smi looks like with the model loaded:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:01:00.0  On |                  Off |
| 30%   36C    P8               10W / 450W|  14624MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
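The ~14 GB figure follows from simple arithmetic: bfloat16 stores each parameter in 2 bytes. A rough sketch, assuming a round 7 billion parameters (the exact count for this model may differ) and ignoring the CUDA context and any activations; weight_gib is a hypothetical helper name:

```python
def weight_gib(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 7e9  # assumption: nominal "7B" parameter count

bf16_gib = weight_gib(N_PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp32_gib = weight_gib(N_PARAMS, 4)  # float32: 4 bytes per parameter
print(f"bfloat16: {bf16_gib:.1f} GiB, float32: {fp32_gib:.1f} GiB")
```

By this estimate, float32 weights alone would not fit on a 24 GB card, which is why the dtype matters here.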
Just want to note that we added device_map support in case you have multiple smaller GPUs, in this PR: https://huggingface.co/mosaicml/mpt-7b-chat/discussions/17
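For anyone who wants to try that, here is a minimal sketch of what a sharded load might look like. It assumes the accelerate package is installed and uses the device_map and max_memory arguments of from_pretrained. The helper names, the two 12 GiB GPUs, the 2 GiB headroom, and the 32 GiB CPU budget are all illustrative assumptions, not values from this thread:

```python
def build_max_memory(gpu_gib, headroom_gib=2, cpu_gib=32):
    """Build per-device memory caps, leaving some headroom on each GPU."""
    caps = {i: f"{max(g - headroom_gib, 1)}GiB" for i, g in enumerate(gpu_gib)}
    caps["cpu"] = f"{cpu_gib}GiB"  # budget for layers offloaded to CPU RAM
    return caps

def load_sharded(name="mosaicml/mpt-7b-chat", gpu_gib=(12, 12)):
    """Load the model with its layers spread across the available devices."""
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",  # let accelerate decide where each layer goes
        max_memory=build_max_memory(gpu_gib),
    )

print(build_max_memory([12, 12]))
```

Note that a model loaded this way is already placed on its devices, so there is no need to call model.to(device) afterwards.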
I tried adding "torch_dtype=torch.bfloat16" to the model initialization. It's OK now!
I've tried this too. It works fine for inference, but as soon as I try fine-tuning I get an out-of-memory error during backpropagation. I couldn't get flash or triton attention to work: flash isn't supported on my RTX 3090 (it fails with "Expected is_sm80 || is_sm90 to be true, but got false."), and I'm having trouble configuring my system to get triton running.
/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacty of 23.70 GiB of which 131.25 MiB is free. Including non-PyTorch memory, this process has 23.35 GiB memory in use. Of the allocated memory 22.98 GiB is allocated by PyTorch, and 48.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Same here: I get Triton errors when trying to install it. We should start a new thread for that one.
It would be nice to know how much VRAM we need for fine-tuning with different options, e.g. torch.bfloat16, optimizer choice, etc.
Actually, it's written here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train#how-many-gpus-do-i-need-to-train-a-llm. It'd take about 84 GB as a ballpark figure.
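For scale, 84 GB for a 7B model works out to about 12 bytes per parameter. A back-of-the-envelope sketch using only that ratio follows. The 12 bytes/parameter default is inferred from the figure above rather than taken from the linked page, and the GPU count assumes memory can be sharded perfectly across cards, which real setups will not achieve:

```python
import math

def finetune_gb(params_billion, bytes_per_param=12):
    """Rough total training memory in GB (assumed bytes-per-parameter ratio)."""
    return params_billion * bytes_per_param

def min_gpus(total_gb, gpu_gb):
    """Lower bound on GPU count if memory were split evenly with no overhead."""
    return math.ceil(total_gb / gpu_gb)

need = finetune_gb(7)
print(f"~{need} GB total, at least {min_gpus(need, 24)} x 24 GB GPUs")
```

That is consistent with the reports above: a single 24 GB card can hold the bfloat16 weights for inference but runs out of memory once gradients and optimizer state are added.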