Out of memory error with an RTX 4090

#7
by antman1p - opened

I am getting an out-of-memory error on an RTX 4090. I have tried on both Windows 11 and WSL, using CUDA 11.7.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(
  'mosaicml/mpt-7b-chat',
  trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-chat',
  torch_dtype=torch.bfloat16,
  config=config,
  trust_remote_code=True
)
model.to(device)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.16s/it]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[5], line 17
     10 #print(config)
     11 model = AutoModelForCausalLM.from_pretrained(
     12   'mosaicml/mpt-7b-chat',
     13   torch_dtype=torch.bfloat16,
     14   config=config,
     15   trust_remote_code=True
     16 )
---> 17 model.to(device)

File ~\Documents\DEV\lib\site-packages\transformers\modeling_utils.py:1878, in PreTrainedModel.to(self, *args, **kwargs)
   1873     raise ValueError(
   1874         "`.to` is not supported for `8-bit` models. Please use the model as it is, since the"
   1875         " model has already been set to the correct devices and casted to the correct `dtype`."
   1876     )
   1877 else:
-> 1878     return super().to(*args, **kwargs)

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:1145, in Module.to(self, *args, **kwargs)
   1141         return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                     non_blocking, memory_format=convert_to_format)
   1143     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

    [... skipping similar frames: Module._apply at line 797 (2 times)]

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:797, in Module._apply(self, fn)
    795 def _apply(self, fn):
    796     for module in self.children():
--> 797         module._apply(fn)
    799     def compute_should_use_set_data(tensor, tensor_applied):
    800         if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    801             # If the new tensor has compatible tensor type as the existing tensor,
    802             # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    807             # global flag to let the user control whether they want the future
    808             # behavior of overwriting the existing tensor or not.

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:820, in Module._apply(self, fn)
    816 # Tensors stored in modules are graph leaves, and we don't want to
    817 # track autograd history of `param_applied`, so we have to use
    818 # `with torch.no_grad():`
    819 with torch.no_grad():
--> 820     param_applied = fn(param)
    821 should_use_set_data = compute_should_use_set_data(param, param_applied)
    822 if should_use_set_data:

File ~\Documents\DEV\lib\site-packages\torch\nn\modules\module.py:1143, in Module.to.<locals>.convert(t)
   1140 if convert_to_format is not None and t.dim() in (4, 5):
   1141     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
   1142                 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.99 GiB total capacity; 11.89 GiB already allocated; 10.58 GiB free; 11.99 GiB allowed; 11.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
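The figures in that message are easier to compare once they are pulled apart. Below is a minimal sketch that parses them with a regular expression; `parse_oom_figures` is a hypothetical helper written for this thread, not part of PyTorch. One reading, which is an assumption on my part: as far as I know, PyTorch only prints the `allowed` figure when a per-process memory limit is in effect (for example via `torch.cuda.set_per_process_memory_fraction`), so the failure here may come from a cap of about 12 GiB rather than from a full GPU.

```python
import re

# Figures copied from the error message above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.99 GiB total capacity; "
    "11.89 GiB already allocated; 10.58 GiB free; 11.99 GiB allowed; "
    "11.89 GiB reserved in total by PyTorch)"
)


def parse_oom_figures(message):
    """Map each '<number> MiB|GiB <label>' figure to its size in GiB (hypothetical helper)."""
    pattern = re.compile(r"(\d+(?:\.\d+)?) (MiB|GiB) ([A-Za-z ]+?)(?=[;)])")
    to_gib = {"MiB": 1 / 1024, "GiB": 1.0}
    return {
        label: float(number) * to_gib[unit]
        for number, unit, label in pattern.findall(message)
    }


figures = parse_oom_figures(OOM_MESSAGE)
requested_gib = 128 / 1024  # the 128.00 MiB allocation that failed
headroom_gib = figures["allowed"] - figures["already allocated"]

print(f"free on device:     {figures['free']:.2f} GiB")
print(f"headroom under cap: {headroom_gib:.2f} GiB")
print(f"requested:          {requested_gib:.3f} GiB")
```

If the cap reading is right, the request fails because it exceeds the headroom under the `allowed` limit, even though the device itself still reports over 10 GiB free.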

Yes, same problem with my 24 GB RTX 3090!

I tried adding "torch_dtype=torch.bfloat16" to the model initialization. It's OK now!

I also have that in my code above, but it doesn't work for me.

@antman1p With "torch_dtype=torch.bfloat16", the 7B model should only take up ~14 GB. Here's what my nvidia-smi looks like with the model loaded:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:01:00.0  On |                  Off |
| 30%   36C    P8               10W / 450W|  14624MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
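The ~14 GB figure follows from simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming a nominal 7 billion parameters (the exact count for this model may differ) and counting weights only, so no CUDA context, activations, or display usage:

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 7e9  # assumption: nominal "7B", not the exact parameter count

for dtype_name, width in {"float32": 4, "bfloat16": 2}.items():
    print(f"{dtype_name}: {weight_memory_gib(N_PARAMS, width):.2f} GiB")
```

Under these assumptions the bfloat16 weights come to roughly 13 GiB (14 GB in decimal units), which fits on a 24 GB card, while float32 weights would need about 26 GiB and would not.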

Just want to note that we added device_map support in case you have multiple smaller GPUs, in this PR: https://huggingface.co/mosaicml/mpt-7b-chat/discussions/17
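For anyone who wants to try this, here is a minimal sketch of loading with `device_map`. It assumes the `accelerate` package is installed (Transformers needs it for `device_map`) and that `"auto"` is an acceptable placement policy for your setup; `build_load_kwargs` and `load_model` are hypothetical helper names used only for illustration.

```python
def build_load_kwargs(torch_dtype, device_map="auto"):
    """Keyword arguments for from_pretrained (hypothetical helper)."""
    return {
        "torch_dtype": torch_dtype,
        "device_map": device_map,   # lets the weights be spread over the visible GPUs
        "trust_remote_code": True,  # MPT ships custom modeling code
    }


def load_model(name="mosaicml/mpt-7b-chat"):
    """Load the model with weights placed directly on the available device(s)."""
    # Imported lazily so the helper above can be used without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        name, **build_load_kwargs(torch.bfloat16)
    )


# Usage (downloads the checkpoint, so it is left commented out):
# model = load_model()
```

With `device_map` set, the weights are placed at load time, so a separate `model.to(device)` call afterwards should not be needed.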

> I tried adding "torch_dtype=torch.bfloat16" to the model initialization. It's OK now!

I've tried this too. It works fine for inference, but as soon as I try finetuning I get an out-of-memory error during backpropagation. I couldn't get flash or Triton attention to work. Flash isn't supported on my RTX 3090 and fails with the message "Expected is_sm80 || is_sm90 to be true, but got false.", and I'm having trouble configuring my system to get Triton running.

/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 204, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacty of 23.70 GiB of which 131.25 MiB is free. Including non-PyTorch memory, this process has 23.35 GiB memory in use. Of the allocated memory 22.98 GiB is allocated by PyTorch, and 48.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
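On the "Expected is_sm80 || is_sm90" message quoted in this post: read literally, it asks for compute capability 8.0 or 9.0 exactly. The sketch below mirrors that reading; it is an assumption about what the check does, and the real kernel code may differ between versions. The capability values listed for each card are the published ones.

```python
def passes_sm80_or_sm90_check(major, minor):
    """Literal reading of 'is_sm80 || is_sm90' (assumed to be an exact match)."""
    is_sm80 = (major, minor) == (8, 0)
    is_sm90 = (major, minor) == (9, 0)
    return is_sm80 or is_sm90


# Published compute capabilities for a few cards.
CARDS = {
    "A100": (8, 0),
    "RTX 3090": (8, 6),
    "RTX 4090": (8, 9),
    "H100": (9, 0),
}

for name, capability in CARDS.items():
    print(f"{name}: {passes_sm80_or_sm90_check(*capability)}")

# On a real machine, torch.cuda.get_device_capability() returns (major, minor).
```

Under this reading the RTX 3090 (8.6) and RTX 4090 (8.9) both fail the check, which matches the error reported here.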

Same here: I get errors when trying to install Triton. We should probably start a new thread for that one.

It would be nice to know how much VRAM we need for finetuning with different options, e.g. torch.bfloat16, optimizer choice, etc.

Actually, it's documented here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train#how-many-gpus-do-i-need-to-train-a-llm. As a ballpark figure, it would take about 84 GB.
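To put the 84 GB figure in terms of 24 GB cards, here is a back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and perfect sharding of the training state across GPUs with no per-GPU overhead, which is optimistic; the helper names are hypothetical.

```python
import math


def bytes_per_param(total_gb, n_params):
    """Training memory per parameter implied by a total in (decimal) GB."""
    return total_gb * 1e9 / n_params


def min_gpus(total_gb, gpu_gb):
    """Lower bound on GPU count, assuming perfect sharding and no overhead."""
    return math.ceil(total_gb / gpu_gb)


TOTAL_GB = 84   # ballpark figure quoted above
N_PARAMS = 7e9  # assumption: nominal "7B"

print(f"implied cost: {bytes_per_param(TOTAL_GB, N_PARAMS):.0f} bytes per parameter")
print(f"24 GB cards needed (lower bound): {min_gpus(TOTAL_GB, 24)}")
```

So 84 GB works out to about 12 bytes per parameter, versus 2 bytes per parameter for bfloat16 inference, and a single 24 GB card is well short of it.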

sam-mosaic changed discussion status to closed