How were gene tokens created?
Hi, first of all, thank you for releasing this code and working so hard to make it accessible to the community. This repo is great.
I see the tokenizing scripts, which access the token dictionary, but I'm wondering how to create new tokens for new genes. What was your process of creating tokens in the first place?
The reason I'm asking is that I'm considering retraining Geneformer for cells from another species. This will require creating new tokens for the new EnsemblIDs. If possible I'd like to keep the token embeddings close in latent space for orthologous genes.
Thanks!
Thank you for your question. The untrained tokens do not have a specific position. The pretraining process embeds them within a latent space that is updated to optimize the training objective. If you'd like to take advantage of the pretraining with human genes, one way would be to assign orthologous genes the same token (at least for the closest ortholog in non-1:1 cases) and to assign new tokens to genes without an ortholog. Then, instead of starting with randomly initialized weights as you usually would when pretraining a new model, you could start with the pretrained Geneformer weights. If you have a large amount of pretraining data, this should re-adjust the weights to your organism of interest without overweighting the human setting. It may also achieve better results, because orthologous genes start close to where they should be relative to one another, rather than at random positions.
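A minimal sketch of this token-assignment scheme, assuming the released token dictionary is a pickled `{ensembl_id: token_id}` mapping and that you have an ortholog map (e.g. exported from Ensembl BioMart). All gene IDs, file names, and values below are illustrative, not the actual Geneformer contents:

```python
import pickle  # in practice: pickle.load the repo's token_dictionary.pkl

# Toy stand-in for the human token dictionary ({ensembl_id: token_id}).
human_tokens = {"<pad>": 0, "<mask>": 1,
                "ENSG00000065135": 2, "ENSG00000139618": 3}

# Map from new-species Ensembl IDs to their closest human ortholog
# (illustrative pairs; build this from an ortholog resource like BioMart).
ortholog_map = {"ENSMUSG00000000001": "ENSG00000065135"}

new_species_genes = ["ENSMUSG00000000001", "ENSMUSG00000041147"]

new_tokens = dict(human_tokens)
next_id = max(human_tokens.values()) + 1
for gene in new_species_genes:
    human_gene = ortholog_map.get(gene)
    if human_gene in new_tokens:
        # Ortholog: reuse the pretrained human token id, so the gene's
        # embedding starts where the human gene's embedding ended up.
        new_tokens[gene] = new_tokens[human_gene]
    else:
        # No ortholog: assign a fresh token id; its embedding row will be
        # randomly initialized when you resize the model's vocabulary.
        new_tokens[gene] = next_id
        next_id += 1
```

With this mapping in hand, you would initialize from the pretrained Geneformer checkpoint and extend the embedding matrix only for the newly added token ids.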
@aribenjamin Hi, sorry to bother you, I have also encountered the same problem. Have you resolved it? And may I ask for your advice?
Thank you!
Hi! Yes, I've tried a few things, but I'm still figuring out the best way to do this (with regard to success at downstream tasks). My current metric is the MLM loss on an evaluation set in the new organism. By that metric, Christina's advice above works best. Take care in creating your new dataset, though. I found it necessary to normalize genes in the new organism by the medians of their human ortholog genes from the Geneformer dataset, i.e. the median dictionary created and distributed by Christina. If you instead compute medians of those genes from the new dataset, you introduce a distribution shift in the gene ordering that can be avoided. (Genes without orthologs must of course receive a new median computed from the new data.)
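A sketch of that normalization step, assuming Geneformer's rank-value encoding (per-cell count normalization, division by a per-gene corpus median, then rank ordering). The median values, gene IDs, and the 10,000 scale factor here are all illustrative placeholders, not the actual distributed values:

```python
# Toy stand-ins: in practice, load the distributed human median
# dictionary (gene_median_dictionary.pkl in the repo) and compute
# new-corpus medians only for genes without a human ortholog.
human_medians = {"ENSG00000065135": 1.2, "ENSG00000139618": 0.4}
ortholog_map = {"ENSMUSG00000000001": "ENSG00000065135",
                "ENSMUSG00000022346": "ENSG00000139618"}
new_corpus_medians = {"ENSMUSG00000041147": 0.8}  # no human ortholog

# One cell's raw counts for three new-species genes (illustrative).
cell_counts = {"ENSMUSG00000000001": 5.0,
               "ENSMUSG00000022346": 3.0,
               "ENSMUSG00000041147": 2.0}

total = sum(cell_counts.values())
normed = {}
for gene, count in cell_counts.items():
    human_gene = ortholog_map.get(gene)
    if human_gene in human_medians:
        # Use the HUMAN ortholog's median from the Geneformer corpus,
        # so the rank ordering matches the pretraining distribution.
        median = human_medians[human_gene]
    else:
        # No ortholog: fall back to a median from the new corpus.
        median = new_corpus_medians[gene]
    normed[gene] = (count / total) * 10_000 / median  # scale is illustrative

# Rank genes by normalized expression to form the cell's token sequence.
ranked = sorted(normed, key=normed.get, reverse=True)
```

The point is that the division step uses the same median a human cell would have used for the orthologous gene, so a gene's rank within a cell is comparable to its rank in the pretraining data.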