Sizes of tensors error in in_silico_perturber

#85
by Akutagawaluyifei - opened

Thank you for this amazing model!
When I was running in silico perturbation on a sampled dataset (2000 cells) of my own, I encountered an error concerning the sizes of tensors.

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1 but got size 1980 for tensor number 220 in the list.
My code is as follows:

RT = InSilicoPerturber(perturb_type="delete",
                      perturb_rank_shift=None,
                      genes_to_perturb="all",
                      combos=0,
                      anchor_gene=None,
                      model_type="Pretrained",
                      emb_mode="cell_and_gene",
                      cell_emb_style="mean_pool",
                      filter_data={"Celltype_2":["5","3","2"]},
                      cell_states_to_model={"Celltype_2":(["5"],["3"],["2"])},
                      emb_layer=-1,
                      forward_batch_size=10,
                      nproc=10,
                      save_raw_data=True)
RT.perturb_data("/cluster/home/yflu/Geneformer/",
               "/cluster/home/yflu/Geneformer/tokenized/20230704_01.dataset",
               "/cluster/home/yflu/Geneformer/perturb/",
               "20230704_sample10_new.RT")

The log before this error shows:


Map (num_proc=10):   0%|          | 0/1983 [00:00<?, ? examples/s]
Map (num_proc=10):   3%|▎         | 50/1983 [00:00<00:08, 229.35 examples/s]
Map (num_proc=10):  13%|█▎        | 249/1983 [00:00<00:02, 848.41 examples/s]
Map (num_proc=10):  27%|██▋       | 531/1983 [00:00<00:01, 1445.16 examples/s]
Map (num_proc=10):  48%|████▊     | 953/1983 [00:00<00:00, 2299.78 examples/s]
Map (num_proc=10):  74%|███████▍  | 1470/1983 [00:00<00:00, 3064.60 examples/s]
Map (num_proc=10):  92%|█████████▏| 1834/1983 [00:00<00:00, 2976.65 examples/s]

Map (num_proc=10):   0%|          | 0/1981 [00:00<?, ? examples/s]
Map (num_proc=10):   3%|▎         | 54/1981 [00:00<00:07, 253.86 examples/s]
Map (num_proc=10):  10%|▉         | 194/1981 [00:00<00:02, 700.13 examples/s]
Map (num_proc=10):  30%|██▉       | 593/1981 [00:00<00:00, 1763.90 examples/s]
Map (num_proc=10):  47%|████▋     | 933/1981 [00:00<00:00, 2177.79 examples/s]
Map (num_proc=10):  69%|██████▉   | 1369/1981 [00:00<00:00, 2761.15 examples/s]
Map (num_proc=10):  87%|████████▋ | 1730/1981 [00:00<00:00, 2952.12 examples/s]

Traceback (most recent call last):
  File "/cluster/home/yflu/Geneformer/Perturber_10.py", line 45, in <module>
    RT.perturb_data("/cluster/home/yflu/Geneformer",
  File "/cluster/home/yflu/Geneformer/geneformer/in_silico_perturber.py", line 563, in perturb_data
    self.in_silico_perturb(model,
  File "/cluster/home/yflu/Geneformer/geneformer/in_silico_perturber.py", line 661, in in_silico_perturb
    cos_sims_data = quant_cos_sims(model,
  File "/cluster/home/yflu/Geneformer/geneformer/in_silico_perturber.py", line 266, in quant_cos_sims
    cos_sims_vs_alt_dict[state] = torch.cat(cos_sims_vs_alt_dict[state])
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1 but got size 1980 for tensor number 220 in the list.

The code was running fine for 10+ hours before this error, and if I reduce the sampled dataset to the size of 1% (200 cells) of the original dataset, this error doesn't show up.

Thank you for your interest in Geneformer! Please confirm that you have pulled the current version of both the modules and the example notebooks, and that the module is running from the correct directory rather than an outdated one. There was a prior tensor issue that I updated the repository to resolve, so I want to make sure this isn't related to it.

This seems like it's probably different, though, because the expected size in your error message is 1, which may suggest that the code was trying to concatenate cell embeddings and gene embeddings, leading to the different sizes. One reason I think this may be the case is that we usually run cell_states_to_model with "cell" as the emb_mode rather than "cell_and_gene", since that is more relevant to modeling cell state shifts. However, I am not sure why the error does not show up until you are more than 200 cells through your dataset. Can you investigate whether there is anything unexpected about the cell at which the error occurs?

The model saves output as it goes in case there is an error, so you should have at least some of your data saved. We expect batch sizes of 200-400, so if you are using much smaller batch sizes due to resource limitations, you may want to change the code to save more frequently, since it will take longer to run. I would suggest you run the analysis with "cell" as the emb_mode, as this is more relevant to modeling the change between cell states; it will also likely run faster, considering how long your code is taking with the very low batch size.

The model sorts the cells by size so that the ones with the most genes detected are run first, in order to encounter memory limitations earlier rather than later. Since you have already run many of your cells, you can start after the last saved cell in the data that was already saved, and you may be able to run larger batch sizes if the cells from that point forward are smaller.
If you still encounter the error despite the above, please email me your dataset so I can reproduce the error and investigate it further.
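
To find out which entry breaks the concatenation, it can help to inspect the shapes of the list before it is concatenated. The following is a minimal pure-Python sketch that works on shape tuples; `find_cat_mismatches` and the example shapes are hypothetical and are not part of Geneformer. It only mirrors the rule stated in the error message, that all dimensions except dimension 0 must match the first entry.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def find_cat_mismatches(shapes: Sequence[Shape]) -> List[Tuple[int, Shape]]:
    """Return (index, shape) for every entry whose non-zero dimensions
    differ from those of the first entry (hypothetical helper)."""
    if not shapes:
        return []
    # Concatenation along dim 0 requires every other dimension to match.
    reference = tuple(shapes[0])[1:]
    return [
        (index, tuple(shape))
        for index, shape in enumerate(shapes)
        if tuple(shape)[1:] != reference
    ]


# Assumed shapes for illustration: 220 cell-level results of width 1,
# followed by one gene-level result of width 1980.
example_shapes = [(10, 1)] * 220 + [(10, 1980)]
print(find_cat_mismatches(example_shapes))  # [(220, (10, 1980))]
```

With real tensors, the shapes could be collected with something like `[tuple(t.shape) for t in tensors]` before calling the helper. The reported index then points at the batch, and hence the cells, worth inspecting.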

@Renqing Thank you for the discussion regarding your similar issue. Please check the open and closed discussions before starting a new discussion so that all the information is in one place for future users who have the same question. Please post here the last isp options you tried where the error occurred despite changing "genes_to_perturb" to "all". Please make sure that all code is posted as text and not as a screenshot. You can use Markdown syntax to insert code blocks into discussions. Please also see the comment above in the original discussion. Please investigate whether there is anything unexpected about the cell where the error occurs, as this may provide important information. When I run the analysis using what I believe were the last isp settings you tried, I do not encounter an error, so I need more information to reproduce the error in order to help you troubleshoot.

Also, I would like to note that we strongly recommend tuning learning hyperparameters for all downstream applications. I am mentioning this because I noticed your model filename "230627_geneformer_CellClassifier_L2048_B4_LR5e-05_LSlinear_WU500_E10_Oadamw_F0/" appears to indicate you used the example learning hyperparameters for the cell annotation example to create your disease classifier. These are not the hyperparameters we used in our manuscript (see Methods) and likely will yield worse results than the optimized hyperparameters. I added a note to both gene and cell classification example notebooks to emphasize this point. Please see the example for hyperparameter optimization for further information on how to optimize hyperparameters.

Sorry for the delayed reply; my account was recently created and was restricted from posting more replies over the past few days.
Thanks for your help! However, changing emb_mode from "cell_and_gene" to "cell" didn't solve the error; it occurs at the same location. I added a try/except to the code to skip cells that raise this error and found 2 such cells. When I looked into these 2 cells, I could not identify any unusual characteristics. I then fine-tuned the model as a cell classifier using my full dataset (20000+ cells) and used the fine-tuned model to run the in silico perturber on the same sampled dataset that raised this error; in this case, the error no longer appeared.
So for now, I can get some results by skipping the cells or by using a fine-tuned model, but I am still confused about why this error occurs and what impact it might have on the results. As noted, @Renqing and I both used rather small batch sizes for analysis and fine-tuning (geneformer_batch_size = 4) due to resource limitations. I will email you my sampled dataset so you can look into it.
Thanks for your help, and I look forward to further support!
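
The try/except workaround described above can be sketched roughly as follows. This is a generic illustration, not the poster's actual code: `run_skipping_failures` and `fake_process` are hypothetical names, and `fake_process` is only a stand-in for the real per-cell perturbation step.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_skipping_failures(
    items: Iterable[T], process: Callable[[T], R]
) -> Tuple[List[R], List[Tuple[int, str]]]:
    """Apply `process` to each item, recording instead of raising RuntimeError."""
    results: List[R] = []
    skipped: List[Tuple[int, str]] = []
    for index, item in enumerate(items):
        try:
            results.append(process(item))
        except RuntimeError as err:
            # Keep the index and message so the skipped cells can be inspected later.
            skipped.append((index, str(err)))
    return results, skipped


def fake_process(cell_length: int) -> int:
    """Stand-in for the real per-cell step; fails for one assumed length."""
    if cell_length == 1980:
        raise RuntimeError("Sizes of tensors must match except in dimension 0.")
    return cell_length * 2


results, skipped = run_skipping_failures([2048, 1980, 1500], fake_process)
print(results)  # [4096, 3000]
print(skipped)  # [(1, 'Sizes of tensors must match except in dimension 0.')]
```

Catching only `RuntimeError` keeps unrelated failures visible, and the recorded indices make it possible to report exactly which cells were dropped from the results.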

I also encountered this error. It occurs with a batch size of 14 as well, and I would like to know the reason behind it.


My code is:

isp = InSilicoPerturber(perturb_type="delete",
                      perturb_rank_shift=None,
                      genes_to_perturb='all',
                      combos=0,
                      anchor_gene=None,
                      model_type="CellClassifier",#Pretrained
                      num_classes=3,
                      emb_mode="cell",
                      cell_emb_style="mean_pool",
                      filter_data={"cell_type":["Cardiomyocyte1","Cardiomyocyte2","Cardiomyocyte3"]},
                      cell_states_to_model={"disease":(["nf"],["hcm"],["dcm"])},
                      max_ncells=None,
                      emb_layer=0,
                      forward_batch_size=4,
                      nproc=16,
                      save_raw_data=False)

isp.perturb_data("/cell_disease/230627_geneformer_CellClassifier_L2048_B4_LR5e-05_LSlinear_WU500_E10_Oadamw_F0/",
               "/Geneformer/example_input_file/cell_classification/disease_classification/Split_data/human_dcm_hcm_nf.dataset/",
               "Geneformer/example_input_file/cell_classification/test_silico_all/",
               "test")

@Akutagawaluyifei @DYXDAVE Have you solved this problem?

@Renqing Using a batch size of 100 doesn't raise this error for me. A larger batch size may be a workaround, but it doesn't explain the reason behind the error. I still need help from the author.

I'm getting a similar error which does not get resolved by increasing batch size. I'm using the more recent scripts.
RuntimeError: The size of tensor a (1958) must match the size of tensor b (2048) at non-singleton dimension 3

@ctheodoris @davidjwen Hello, I am also getting a similar error. I am trying to recreate Figure 5b from the paper. I did not see the arrow files used for this figure under Genecorpus-30M, so I created my own by extracting the fetal cardiomyocytes from GSE156793.

Here is the code I used:

isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=["ENSG00000136574"],
                        combos=0,
                        anchor_gene=None,
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cell_and_gene",
                        cell_emb_style="mean_pool",
                        cell_states_to_model=None,
                        max_ncells=200,
                        emb_layer=-1,
                        forward_batch_size=50,
                        nproc=16)
                    
isp.perturb_data("TOOLS/Geneformer/",
                 "/Geneformer/test_tokenizer2.dataset/",
                 "/Geneformer/",
                 "test_geneformer")

And the error that I get:

Traceback (most recent call last):
  File "/yanketn/ANALYSIS/Geneformer/Geneformer_Figure5_test.py", line 23, in <module>
    isp.perturb_data("//yanketn/TOOLS/Geneformer/",
  File "/anketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 975, in perturb_data
    self.in_silico_perturb(model,
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 1053, in in_silico_perturb
    cos_sims_data = quant_cos_sims(model, 
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 466, in quant_cos_sims
    cos_sims_stack = torch.cat(cos_sims)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 2047 but got size 617 for tensor number 1 in the list.

I also ran the exact same code using your arrow file from Genecorpus-30M/example_input_files/cell_classification/cell_type_annotation to see whether I had made a mistake at the tokenizing step, but I get a similar error:

Traceback (most recent call last):
  File "users/yanketn/ANALYSIS/Geneformer/Geneformer_Figure5_test.py", line 23, in <module>
    isp.perturb_data("//yanketn/TOOLS/Geneformer/",
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 975, in perturb_data
    self.in_silico_perturb(model,
  File "/anketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 1053, in in_silico_perturb
    cos_sims_data = quant_cos_sims(model, 
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 466, in quant_cos_sims
    cos_sims_stack = torch.cat(cos_sims)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1354 but got size 586 for tensor number 1 in the list.

Any advice you can give is greatly appreciated.
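
The two tracebacks above show per-batch results of different widths (2047 vs 617, 1354 vs 586) being concatenated. One general way to make variable-length rows stackable is to pad them to a common width first. The sketch below illustrates that idea in plain Python; `pad_rows` is a hypothetical helper, the values are made up, and this is not claimed to be how the repository handles the problem.

```python
from typing import List, Sequence


def pad_rows(rows: Sequence[Sequence[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Right-pad every row with `pad_value` so that all rows share the longest width."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


# Made-up similarity rows of different lengths.
rows = [[0.9, 0.8, 0.7], [0.6]]
padded = pad_rows(rows)
print(padded)  # [[0.9, 0.8, 0.7], [0.6, 0.0, 0.0]]
```

The pad value matters downstream: padded positions would need to be masked or ignored before averaging, otherwise they would bias the result.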

UPDATE

I noticed that @davidjwen made an update to this file in the past 12 hours (Fixed bugs related to overexpressing genes #229). I pulled the most recent version and re-ran my code. I now get a different error, which is identical to the one in Discussion #184:

Map (num_proc=4): 100%|██████████| 50/50 [00:00<00:00, 252.33 examples/s]
Map (num_proc=4):  40%|███       | 15/50 [00:00<00:01, 24.59 examples/s]
Map (num_proc=4):  94%|█████████▍| 47/50 [00:01<00:00, 29.23 examples/s]
Map (num_proc=4): 100%|██████████| 50/50 [00:02<00:00, 21.94 examples/s]
Traceback (most recent call last):
  File "/pusers/yanketn/ANALYSIS/Geneformer/Geneformer_Figure5_test.py", line 23, in <module>
    isp.perturb_data("/yanketn/TOOLS/Geneformer/",
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 981, in perturb_data
    self.in_silico_perturb(model,
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 1059, in in_silico_perturb
    cos_sims_data = quant_cos_sims(model, 
  File "/yanketn/miniconda/lib/python3.10/site-packages/geneformer/in_silico_perturber.py", line 445, in quant_cos_sims
    cos_sims += [cos(minibatch_emb, minibatch_comparison).to("cpu")]
  File "/yanketn/miniconda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/yanketn/miniconda/lib/python3.10/site-packages/torch/nn/modules/distance.py", line 87, in forward
    return F.cosine_similarity(x1, x2, self.dim, self.eps)
RuntimeError: The size of tensor a (2047) must match the size of tensor b (2046) at non-singleton dimension 1
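
The off-by-one sizes in this last error (2047 vs 2046) are what one would expect if, after a single gene is deleted, the original and perturbed embeddings are compared without first dropping the deleted position from the original. The sketch below illustrates that alignment step in plain Python. It is an assumption about the cause rather than a confirmed diagnosis, and `per_gene_cos_sims` and the toy vectors are hypothetical, not the repository's code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def per_gene_cos_sims(
    original: Sequence[Vector], perturbed: Sequence[Vector], deleted_index: int
) -> List[float]:
    """Compare gene embeddings position by position after removing the
    deleted gene's position from the original sequence."""
    aligned = [emb for i, emb in enumerate(original) if i != deleted_index]
    if len(aligned) != len(perturbed):
        raise ValueError(
            f"length mismatch after alignment: {len(aligned)} vs {len(perturbed)}"
        )
    return [cosine(o, p) for o, p in zip(aligned, perturbed)]


# Toy data: three genes originally, the gene at index 1 is deleted.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.0], [2.0, 2.0]]
sims = per_gene_cos_sims(original, perturbed, deleted_index=1)
print(len(sims))  # 2
```

Raising an explicit `ValueError` when the lengths still differ surfaces the mismatch before any tensor operation does, which makes the offending cell easier to identify.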

Thank you for your interest in Geneformer and for your patience! We pushed an update that should resolve this issue. If you continue to face errors after pulling the updated code, please let us know by either reopening this discussion if it's the same error or opening a new discussion if it's a new error. Thank you!

ctheodoris changed discussion status to closed
