all cells for training

#349
by cstrlln - opened

I have a case where splitting the data into train, eval, and validation sets is complicated. We study a single cell type that has different maturation stages across organs, and it is pretty rare. I noticed there are multiple options in the classifier for giving different proportions or using all cells. My goal is to do in silico perturbation, not really something like comparing across patients. What are your thoughts on using all cells for training in this case?

Thank you for your question! We would suggest first checking whether your start and goal states are already well separated by the pretrained model, in which case fine-tuning would not be necessary. If fine-tuning is deemed necessary and you do not have sufficient samples to separate into train/valid/test sets, you could consider using default hyperparameters; since you are not optimizing hyperparameters, you can evaluate on a test set without needing a validation set. If the model looks appropriate, you could then go back, train on all the data, and use all the data for in silico perturbation. However, if you do not evaluate on some held-out data, you will not be able to determine whether the fine-tuned model is generalizable. Please note that these splits are recommended for fine-tuning; in silico perturbation itself involves no training of the model, only inference, so there is no need to split the data in the same way for that step.
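For concreteness, here is a minimal sketch of the train/test-only approach described above, assuming the tokenized cells are stored as a Hugging Face `datasets` dataset (as in the Geneformer examples); the file path, split proportion, and label column are placeholders, not prescribed values:

```python
from datasets import load_from_disk

# Load the tokenized dataset produced by the Geneformer tokenizer
# (path is hypothetical).
data = load_from_disk("path/to/tokenized_cells.dataset")

# With default hyperparameters there is no hyperparameter search, so a
# validation split is unnecessary: hold out a test set only.
# If your label column is a ClassLabel, stratify_by_column can help keep
# rare states represented in both splits.
splits = data.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

# 1) Fine-tune on train_ds with default hyperparameters.
# 2) Evaluate the fine-tuned classifier on test_ds to check generalizability.
# 3) If performance is acceptable, re-run fine-tuning on the full `data`
#    and use all cells for in silico perturbation.
```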

ctheodoris changed discussion status to closed

Thanks for the response.
Regarding this suggestion: "We would suggest checking if your start and goal states are already well-separated by the pretrained model" — what tool do I use to test that?

You can extract embeddings using the emb_extractor. There are various metrics you can use to check whether they are well separated (e.g., LISI), but as an initial check you can plot them as a heatmap with the provided plotting function to visualize them.
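As a rough sketch, extracting and plotting embeddings from the pretrained model might look like the following; the argument names follow the EmbExtractor example in the Geneformer repository, while the paths, the "maturation_stage" label column, and the numeric settings are placeholders for your own data:

```python
from geneformer import EmbExtractor

# Extract cell embeddings from the pretrained model (no fine-tuning),
# labeling each cell with the annotation that defines the start/goal states.
embex = EmbExtractor(model_type="Pretrained",
                     num_classes=0,
                     emb_mode="cell",
                     emb_layer=-1,
                     emb_label=["maturation_stage"],
                     labels_to_plot=["maturation_stage"],
                     max_ncells=2000,
                     forward_batch_size=100,
                     nproc=8)

embs = embex.extract_embs("path/to/pretrained_model",
                          "path/to/tokenized_cells.dataset",
                          "path/to/output_dir",
                          "emb_output")

# Plot the embeddings as a heatmap grouped by the chosen label to see
# whether the start and goal states are already well separated.
embex.plot_embs(embs=embs,
                plot_style="heatmap",
                output_directory="path/to/output_dir",
                output_prefix="emb_heatmap")
```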
