About the dataset for pretraining
Hi, thank you for the great work.
If we pretrain a new model from scratch on a corpus other than Genecorpus-30M, could we add expression data from cancer-related cells to the dataset? Cancer exists in real life, and gene expression may differ between normal and cancer cells. With such data included, the model might gain a more comprehensive understanding of cells in real-world settings. Is that reasonable?
Thank you for your question. As discussed in the manuscript, we excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines), because these could show substantial network rewiring and we had no companion genome sequencing to facilitate interpretation. If many genes are mutated such that the proteins function differently than they normally would, the model may misinterpret the data, because it has no information that the gene it is observing is normal in one cell but mutated in another. For example, a mutation can lead to a gain of function that is completely distinct from the normal protein function. Of note, gene deletions or extra copies of genes would not be a problem, because the change is evident in the transcriptome observed by the model (e.g. fewer counts for a gene deletion). We therefore opted to pretrain the model without cells with high mutational burdens and to allow users to fine-tune the model with malignant cells should they be interested in cancer applications and wish the model to be tuned to the network rewiring and/or altered gene function in that disease state. This is especially important because malignant and immortalized cells are overrepresented in publicly available data, as they are clinically accessible and easy to culture, respectively. Overall, we chose to include fewer, high-quality cells rather than more cells that may lead to misinterpretation by the model.
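If it helps when assembling your own corpus, here is a minimal sketch of that kind of split. It is not the pipeline used to build Genecorpus-30M. The metadata fields `is_malignant` and `is_cell_line` and the function `split_by_mutational_burden` are hypothetical names chosen for illustration; a real dataset would need its own annotations to decide which cells count as high mutational burden.

```python
# Hypothetical per-cell metadata. The field names are illustrative
# assumptions, not the schema of Genecorpus-30M.
cells = [
    {"cell_id": "c1", "is_malignant": False, "is_cell_line": False},
    {"cell_id": "c2", "is_malignant": True, "is_cell_line": False},
    {"cell_id": "c3", "is_malignant": False, "is_cell_line": True},
    {"cell_id": "c4", "is_malignant": False, "is_cell_line": False},
]


def split_by_mutational_burden(cells):
    """Separate cells flagged as malignant or immortalized from the rest.

    Returns (pretrain, high_burden): cells kept for pretraining, and cells
    set aside, e.g. as candidates for later fine-tuning.
    """
    pretrain, high_burden = [], []
    for cell in cells:
        if cell["is_malignant"] or cell["is_cell_line"]:
            high_burden.append(cell)
        else:
            pretrain.append(cell)
    return pretrain, high_burden


pretrain_cells, finetune_candidates = split_by_mutational_burden(cells)
print([c["cell_id"] for c in pretrain_cells])
print([c["cell_id"] for c in finetune_candidates])
```

The cells that are set aside are not discarded: they stay available for fine-tuning if cancer applications are of interest.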