How to output gene embeddings after single gene perturbation, similar to Fig5a/b?

#256
by junguyen - opened

Hello,

I've had success outputting cosine shifts in cell embeddings when running emb_mode="cell" with the parameters listed below. However, when I change to emb_mode="cell_and_gene" and keep all other parameters unchanged, I get exactly the same output (cosine shifts in cell embeddings only).

My goal is to observe how all other gene embeddings are affected after a single gene perturbation, to identify important proteins in a network (similar to Fig5a/b in the paper). I'm currently using a small subset from the Genecorpus-30M dataset as my input data.

How should I change my parameters to get gene embedding outputs?

Thank you!

from geneformer import InSilicoPerturber, InSilicoPerturberStats

# Set perturbation parameters
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=["ENSG00000196262"],
                        combos=0,
                        anchor_gene=None,
                        model_type="Pretrained",
                        num_classes=0,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model=None,
                        max_ncells=None,
                        cell_inds_to_perturb={"start":0, "end":50},
                        emb_layer=-1,
                        forward_batch_size=50,
                        nproc=16,
                        token_dictionary_file="/home/ubuntu/Geneformer/geneformer/token_dictionary.pkl")

# Perturb data
isp.perturb_data("/home/ubuntu/Geneformer/",
                 "/data/genecorpus_filtered_nonhep/",
                 "/data/genecorpus_filtered_nonhep/delete_cell/",
                 "cell_and_gene_test_PPIA")

# Set perturbation stats
ispstats = InSilicoPerturberStats(mode="aggregate_data",
                                  genes_perturbed=["ENSG00000196262"],
                                  combos=0,
                                  anchor_gene=None,
                                  cell_states_to_model=None,
                                  token_dictionary_file="/home/ubuntu/Geneformer/geneformer/token_dictionary.pkl")

# Get perturbation stats
ispstats.get_stats("/data/genecorpus_filtered_nonhep/delete_cell/",
                   None,
                   "/data/genecorpus_filtered_nonhep/delete_cell/",
                   "delete_cell_and_gene_test_PPIA")

Thank you for your interest in Geneformer and for your patience! We pushed an update that should resolve this issue. If you continue to face errors after pulling the updated code, please let us know by either reopening this discussion if it's the same error or opening a new discussion if it's a new error. Thank you!

ctheodoris changed discussion status to closed

Hello @ctheodoris ,

I am still experiencing the same issue after updating my code to the newest version. My code is pretty much the same as above, except that I am using a fine-tuned model, which is a cell classifier. I set emb_mode="cell_and_gene" in the InSilicoPerturber call, and my perturbation is the deletion of one gene. I want to get the gene embedding shifts for the other genes, but my output only contains Cosine_shift at the cell level. Could you advise what might be going wrong?

Thanks a lot!

Hi - thanks for letting us know about the issue! We've had some trouble reproducing this issue. When emb_mode="cell_and_gene", can you confirm that two pickle result dictionaries are being created? And if so, are you running the stats on the one that corresponds to the gene embeddings? Thanks!
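One way to check this is to list the pickle files in the output directory and see what each one contains. The snippet below is only a minimal sketch: `summarize_pickles` is a hypothetical helper (not part of Geneformer), and the `*.pickle` pattern is an assumption about how the result files are named, so adjust it to match whatever files actually appear in your output directory.

```python
import pickle
from pathlib import Path


def summarize_pickles(output_dir, pattern="*.pickle"):
    """Return {file name: (object type name, number of entries or None)}.

    Hypothetical helper for inspection only; the default glob pattern is an
    assumption about the result file extension.
    """
    summary = {}
    for path in sorted(Path(output_dir).glob(pattern)):
        with open(path, "rb") as handle:
            obj = pickle.load(handle)
        # Dictionaries report their number of keys; other objects may not have a length.
        n_entries = len(obj) if hasattr(obj, "__len__") else None
        summary[path.name] = (type(obj).__name__, n_entries)
    return summary


if __name__ == "__main__":
    # Output directory taken from the perturb_data call earlier in this thread.
    results = summarize_pickles("/data/genecorpus_filtered_nonhep/delete_cell/")
    for name, (kind, size) in results.items():
        print(f"{name}: {kind} with {size} entries")
```

If only one file is listed, the gene-level dictionary was probably not written. If two are listed, comparing their file names and sizes should show which one to point the stats step at.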
