I am very confused by the code in this section of In_silico_perturber
Dear sir:
I am very confused by the code in this section of In_silico_perturber:
After you have the initial embedding, the disease-specific embedding, and the embedding of the knockout cell, when you calculate the embedding shift after the knockout, the cos_sim_values are calculated between the average of all the embeddings in a batch of knockout cells and the embedding of a cell in a specific disease state, rather than between the embedding of each single knocked-out cell and the disease-state embedding. Therefore, when batch_size=400, this causes a very obvious logical error; obviously only batch_size=1 would be correct.
I also really want to know the detailed perturbation parameters used when running the in silico perturbation task in the article. Please tell me, thank you.
raw code:
def cos_sim_shift(original_emb, minibatch_emb, alt_emb, perturb_group):
    cos = torch.nn.CosineSimilarity(dim=2)
    original_emb = torch.mean(original_emb, dim=0, keepdim=True)
    if perturb_group == False:
        original_emb = original_emb[None, :]
    origin_v_end = cos(original_emb, alt_emb)
    perturb_emb = torch.mean(minibatch_emb, dim=1, keepdim=True)  # <<<<< here
    perturb_v_end = cos(perturb_emb, alt_emb)
    return [(perturb_v_end - origin_v_end).to("cpu")]
Thank you for your question! We have made some changes to this function since you posted this discussion. Please look over the updated code, and if there is still confusion, please feel free to re-open this discussion.
Dear sir,
You guys are doing a great job. Thank you very much for your reply.
Some adjustments have been made to the code, but the previous problem still exists. Perhaps that is because I did not explain it in detail, so I will write down the problem in full:
I am going to use some variables from your code to explain. original_emb refers to the average embedding of the initial cell state, minibatch_emb is the batch of embeddings in which one gene has been perturbed, and state_embs_dict[state] is the embedding of each disease state. After obtaining the above data, you then calculate, in effect:
A = cos(original_emb, state_embs_dict[state])         # origin_v_end in your code
B = cos(mean(minibatch_emb), state_embs_dict[state])  # perturb_v_end in your code
cos_sims_vs_alt_dict = B - A
The problem appears in B:
When minibatch_emb is produced, the default batch size is 400, so the minibatch_emb you obtain contains the set of embeddings produced by perturbing 400 genes in turn; it is not the same as the single embedding obtained after perturbing one gene. However, when calculating the value of B, you average this set of 400 perturbed embeddings and use the result as if it were the single embedding corresponding to one perturbed gene. Your code is here:
def cos_sim_shift(.........):
    ........
    if original_minibatch_lengths is not None:
        original_emb = mean_nonpadding_embs(original_emb, original_minibatch_lengths)
    # else:
    #     original_emb = torch.mean(original_emb, dim=1, keepdim=True)
    end_emb = torch.unsqueeze(end_emb, 1)
    origin_v_end = cos(original_emb, end_emb)
    origin_v_end = torch.squeeze(origin_v_end)
    if minibatch_lengths is not None:
        perturb_emb = mean_nonpadding_embs(minibatch_emb, minibatch_lengths)
    else:
        perturb_emb = torch.mean(minibatch_emb, dim=1, keepdim=True)  # <<<<< please, here
    perturb_v_end = cos(perturb_emb, end_emb)
    perturb_v_end = torch.squeeze(perturb_v_end)
    return [(perturb_v_end - origin_v_end).to("cpu")]
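For reference, my understanding is that mean_nonpadding_embs averages each cell's embedding over its non-padded gene positions only. A minimal sketch of what I assume such a helper does (this is my own illustration, not your actual implementation):

import torch

def mean_nonpadding_embs_sketch(embs, lengths):
    # embs: (batch_size, max_seq_len, emb_dim)
    # lengths: (batch_size,) number of non-padded gene positions per cell
    max_len = embs.size(1)
    # mask[i, j] is True for real (non-padded) positions j < lengths[i]
    mask = torch.arange(max_len, device=embs.device)[None, :] < lengths[:, None]
    mask = mask.unsqueeze(-1).to(embs.dtype)         # (batch_size, max_seq_len, 1)
    summed = (embs * mask).sum(dim=1, keepdim=True)  # (batch_size, 1, emb_dim)
    return summed / lengths[:, None, None].to(embs.dtype)

Either way, the output still has one row per cell in the batch, which is relevant to my question.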
That is to say, your code uses the average of 400 different embeddings, each with a different gene perturbed, as the embedding for one perturbed gene, and then uses this value to calculate cos_sims_vs_alt_dict for that single perturbed gene. This logic only works when the batch size is equal to 1.
In addition, I have another question: were the in silico perturbation results in the paper produced with this code? I used the same code and dataset, but my results were very different from those in the article.
Thank you for following up!
For the line you are asking about:
perturb_emb = torch.mean(minibatch_emb, dim=1, keepdim=True)
This line averages the embeddings over dimension 1 (the gene dimension), resulting in a tensor of dimensions (batch_size, 1, number_embedding_dimensions). So, it is not averaging over the batch dimension. You can try adding print statements to print the size of the tensor minibatch_emb before this line and of perturb_emb after this line to help you visualize the dimensions.
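For example, a minimal sketch of that shape check with dummy tensors (the sizes here are placeholders, not the model's actual dimensions):

import torch

# Placeholder sizes: 400 cells in the batch, 2048 gene positions, 256 embedding dims.
minibatch_emb = torch.randn(400, 2048, 256)
perturb_emb = torch.mean(minibatch_emb, dim=1, keepdim=True)

print(minibatch_emb.shape)  # torch.Size([400, 2048, 256])
print(perturb_emb.shape)    # torch.Size([400, 1, 256])

Each of the 400 rows of perturb_emb is still a separate perturbed cell; only the gene positions within each cell have been averaged into a single cell embedding, so the 400 perturbations are never mixed together.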
Regarding your other question, we did not have the code packaged into modules like this when running the initial analyses in the manuscript, but we directly converted that initial code into these modules so that it would be easier for others to use. In order to help you troubleshoot, we would need to know the specific analysis you are trying to run and the specific arguments you used to set it up. Please let us know so we can help pinpoint whether it is set up in a way that is the same as or different from the analysis in the manuscript you are trying to repeat.
Thank you very much for your patient reply. I have recently tried several times to replicate your in silico perturbation work, using the pretrained model you uploaded, the human_dcm_hcm_nf dataset, and the newly uploaded in silico perturbation code. However, there is a big difference between the results I obtained and the shift values after gene perturbation reported in the article, and I would like to know the reason. I will send my results to your email, thank you.
Thank you for following up! As mentioned previously, in order to help you troubleshoot, we would need to know the specific analysis you are trying to run and the specific arguments you used to set it up. In your last comment, you mentioned that you used the pretrained model for your analysis. We used a model fine-tuned for cardiomyopathy to perform the analyses relevant to the dataset you mentioned, so this would be one important difference. There could be others, but we would need to know the specifics of what you are trying to do and how you are trying to do it to help you troubleshoot.
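To illustrate the kind of setup detail that matters, here is a rough sketch of configuring the perturbation module for this kind of analysis. The argument names, dict format, and paths below are illustrative and should be checked against the current module; the key point is model_type, which should point to a fine-tuned classifier rather than the pretrained model:

from geneformer import InSilicoPerturber

# Illustrative configuration only -- argument names/values are a sketch,
# not the exact setup used for the manuscript analyses.
isp = InSilicoPerturber(
    perturb_type="delete",        # knock out each gene by deleting it from the rank encoding
    genes_to_perturb="all",
    model_type="CellClassifier",  # fine-tuned cardiomyopathy classifier, not "Pretrained"
    num_classes=3,                # e.g. dcm / hcm / nf disease states
    emb_mode="cell",
    cell_states_to_model={        # illustrative format; confirm against the module docs
        "state_key": "disease",
        "start_state": "dcm",
        "goal_state": "nf",
        "alt_states": ["hcm"],
    },
    forward_batch_size=400,
    nproc=16,
)

# Paths below are placeholders.
isp.perturb_data(
    "path/to/fine_tuned_model",
    "path/to/human_dcm_hcm_nf.dataset",
    "path/to/output_directory",
    "output_prefix",
)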