torch.cuda.OutOfMemoryError when running embex.get_state_embs

#297
by Jonyyqn - opened

Hello, I tried to run the in silico perturbation analysis following the tutorial:

cell_states_to_model={"state_key": "Raw_annotation",
                      "start_state": "Tumor",
                      "goal_state": "Normal",
                      "alt_states": []}


from geneformer import EmbExtractor

embex = EmbExtractor(model_type="CellClassifier",
                     num_classes=2,
                     max_ncells=3000,
                     emb_mode='gene',
                     emb_layer=0,
                     filter_data=None,
                     summary_stat="exact_mean",
                     forward_batch_size=4,
                     nproc=8)

state_embs_dict = embex.get_state_embs(cell_states_to_model,
                                       fine_tuning_model_path,
                                       dataset_path,
                                       out_path,
                                       prefix)

An error occurs during embex.get_state_embs:

Traceback (most recent call last):
  File "/home/geneformer/run_perturb.py", line 33, in <module>
    state_embs_dict = embex.get_state_embs(cell_states_to_model,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/emb_extractor.py", line 683, in get_state_embs
    state_embs_dict[v] = self.extract_embs(
                         ^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/emb_extractor.py", line 570, in extract_embs
    embs = get_embs(
           ^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/emb_extractor.py", line 138, in get_embs
    embs_stack = pu.pad_tensor_list(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/perturber_utils.py", line 524, in pad_tensor_list
    return torch.cat(tensor_list, 0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.86 GiB. GPU 0 has a total capacity of 15.89 GiB of which 4.00 GiB is free. Including non-PyTorch memory, this process has 11.89 GiB memory in use. Of the allocated memory 10.07 GiB is allocated by PyTorch, and 1.53 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

As shown above, I have already set forward_batch_size very small (4), but the out-of-memory error is still raised.
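The error message also suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that hint before importing torch (this only reduces allocator fragmentation and may not help if memory is genuinely exhausted):

import os

# Allocator hint from the error message above: expandable segments reduce
# fragmentation in the CUDA caching allocator. It must take effect before
# the first CUDA allocation, so set it before importing torch / geneformer.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch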

Similarly, when I ran the subsequent code, I encountered the same kind of error:


from geneformer import InSilicoPerturber

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb='all',
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=2,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model=cell_states_to_model,
                        state_embs_dict=state_embs_dict,
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=50,
                        nproc=25)

isp.perturb_data(fine_tuning_model_path,
                 dataset_path,
                 out_path,
                 prefix)

  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py", line 435, in perturb_data
    self.isp_perturb_all(
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/in_silico_perturber.py", line 751, in isp_perturb_all
    full_perturbation_emb = get_embs(
                            ^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/geneformer/emb_extractor.py", line 88, in get_embs
    outputs = model(
              ^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 1564, in forward
    outputs = self.bert(
              ^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 1013, in forward
    encoder_outputs = self.encoder(
                      ^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 607, in forward
    layer_outputs = layer_module(
                    ^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 497, in forward
    self_attention_outputs = self.attention(
                             ^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 427, in forward
    self_outputs = self.self(
                   ^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/ai/lib/python3.11/site-packages/transformers/models/bert/modeling_bert.py", line 365, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacity of 15.89 GiB of which 14.12 MiB is free. Including non-PyTorch memory, this process has 15.88 GiB memory in use. Of the allocated memory 15.56 GiB is allocated by PyTorch, and 30.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Thank you for your interest in Geneformer! A 16 GB GPU is quite a bit smaller than the hardware we usually use. I would recommend reducing the batch size until the job fits on your device. Alternatively, if you have multiple 16 GB GPUs, you could distribute the model across them. Keep in mind that the model itself occupies a fixed amount of memory, so increasing GPU size frees up room for batch sizes much larger than the fold change in GPU size would suggest. In both cases you are close to fitting on your device, so reducing the batch size should resolve the issue. Please also note there are 12- and 6-layer models; if you are using the 12-layer one, switching to the 6-layer one would also help. The manuscript results reflect the 6-layer model.
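As a concrete illustration of the batch-size advice above, here is a minimal sketch reusing the InSilicoPerturber configuration from the question; the forward_batch_size value is an illustrative starting point only, not a recommended setting, and the fine-tuned model path stays your own:

from geneformer import InSilicoPerturber

# Illustrative only: a forward batch size of 50 is likely too large for a
# 16 GB GPU; reduce it stepwise until the forward pass fits in memory.
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=2,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model=cell_states_to_model,
                        state_embs_dict=state_embs_dict,
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=8,  # reduced from 50; lower further if OOM persists
                        nproc=25)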

ctheodoris changed discussion status to closed

Thank you for your reply!
