About in silico perturbation

#46
by Renqing - opened

| First of all, congratulations on your work.
| I see that the InSilicoPerturber class has a model_type parameter with three options {"Pretrained", "GeneClassifier", "CellClassifier"}. Does "Pretrained" indicate the pretrained model that has not been fine-tuned for a specific downstream task, "GeneClassifier" the model from your manuscript for distinguishing dosage-sensitive from dosage-insensitive genes, and "CellClassifier" the model from your manuscript for distinguishing cardiomyocytes from non-failing hearts versus hearts affected by hypertrophic or dilated cardiomyopathy?
| The function is called as isp.perturb_data("path/to/model", "path/to/input_data", "path/to/output_directory", "output_prefix"); the model and data passed here must be consistent with the options given to InSilicoPerturber, right?
| If I want to use the "GeneClassifier" and "CellClassifier" options, do I need to fine-tune the two models myself with the sample data you provided? The model I cloned from Hugging Face only contains the pretrained model, right?
| If I run in silico perturbation, how do the results differ between the three model types ("Pretrained", "GeneClassifier", "CellClassifier")?
| Would you provide a notebook that reproduces the results in the article?
| Looking forward to your reply.

Thank you for your interest in Geneformer. Geneformer is a foundation model pretrained on 30 million single cell transcriptomes to gain a fundamental understanding of network dynamics that can then be democratized to a multitude of downstream applications. The pretrained model can be used for inference directly with zero-shot learning, or the model can be fine-tuned for downstream applications with a task-specific learning objective and task-specific data. In this repository we provide the pretrained model (the 6 layer version reported in the manuscript as well as a 12 layer version). Fine-tuning towards the particular question at hand with relevant data will generally improve the predictive power in that task. The central idea of transfer learning is that this pretrained model can transfer its knowledge to a vast range of downstream tasks, not just 1 task, or the 2 tasks you mentioned. In the manuscript, we demonstrate the model's predictive power in a diverse panel of downstream applications (see below and in the manuscript).

The notebooks provided in this repository are meant to serve as general examples, but are by no means the only applications of the model. Users can fine-tune the model for any downstream application they are interested in with limited task-specific data. For example, users can fine-tune the model to distinguish genes of different classes/characteristics (GeneClassifier, analogous to token classification in NLP) or cells of different states (CellClassifier, analogous to sequence classification in NLP). Specifying the model type to the in silico perturber allows it to load the model appropriately: fine-tuning for gene or cell classification adds a head layer to the model, so the total number of layers is greater than in the pretrained model. The model type does not change the procedure of in silico deletion or activation. Please also refer to the code of the in silico perturber to understand the procedure.
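Conceptually, in silico deletion removes a gene's token from the cell's rank value encoding, re-embeds the perturbed cell, and quantifies how far the cell embedding shifts (e.g., via cosine similarity) relative to the unperturbed cell. A minimal, purely illustrative sketch of that procedure, where a deterministic toy function stands in for the actual Geneformer forward pass and the gene names and helper functions are hypothetical, not the repository's API:

```python
import hashlib
import math

def toy_embed(gene_tokens, dim=8):
    """Toy stand-in for a transformer cell embedding: each gene contributes
    a deterministic pseudo-random vector, weighted by its rank position
    (earlier rank = higher weight, mimicking rank value encoding)."""
    vec = [0.0] * dim
    n = len(gene_tokens)
    for rank, gene in enumerate(gene_tokens):
        h = hashlib.sha256(gene.encode()).digest()
        weight = (n - rank) / n  # top-ranked genes contribute more
        for i in range(dim):
            vec[i] += weight * ((h[i] / 255.0) - 0.5)
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def in_silico_delete(gene_tokens, gene):
    """Deletion perturbation: drop one gene from the rank value encoding."""
    return [g for g in gene_tokens if g != gene]

# Hypothetical cell: genes ordered by rank value (highest relative expression first).
cell = ["GATA4", "TBX5", "NKX2-5", "MYH7", "ACTC1"]
original = toy_embed(cell)
perturbed = toy_embed(in_silico_delete(cell, "TBX5"))

# Impact of the deletion on the cell state: 1 - cosine similarity of embeddings.
shift = 1.0 - cosine(original, perturbed)
print(f"embedding shift after deleting TBX5: {shift:.4f}")
```

The real InSilicoPerturber follows the same logic but uses the pretrained or fine-tuned transformer's embeddings; the model_type setting only tells it how to load the checkpoint (with or without a classification head), not how to perturb.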

Geneformer applications demonstrated in the manuscript:
Fine-tuning:

  • transcription factor dosage sensitivity
  • chromatin dynamics (bivalently marked promoters)
  • transcription factor regulatory range
  • gene network centrality
  • transcription factor targets
  • cell type annotation
  • cell state classification across differentiation
  • disease classification
  • in silico perturbation to determine disease-driving genes
  • in silico treatment to determine candidate therapeutic targets

Zero-shot learning:

  • gene context specificity
  • in silico reprogramming
  • in silico differentiation
  • in silico perturbation to determine impact on cell state
  • in silico perturbation to determine transcription factor targets
  • in silico perturbation to determine transcription factor cooperativity
ctheodoris changed discussion status to closed

Could you provide the notebook files for the results in the article, for a better understanding of the model and how to use it?
At present, the code and notebooks in the repository seem difficult to use for reproducing the results of your article.

Thank you for your question. The code in this repository is actually designed to make Geneformer easy to use. :) We have composed all the components into modules for easy use and reproduction, provided generalizable examples, and integrated with Hugging Face so users can take advantage of their user-friendly infrastructure. Please email me if you are having trouble reproducing any particular analysis, and I would be happy to help you troubleshoot.
