ctheodoris/Geneformer · Input files for examples

yb1996

Apr 8

Hi,
We are trying to test Geneformer for in silico perturbation and treatment perturbation.

Is it possible to publish the input files used to generate the in silico perturbation and treatment perturbation examples in the paper? It would be very helpful for understanding how to use the code effectively.
For a given perturbation, is it essential to first generate experimental data for that perturbation and then fine-tune?
Is it possible to fine-tune for perturbation with RNAseq data or even microarray data? The vast majority of available data is not single-cell.
Thanks.

Owner Apr 8

Thank you for your questions!

All data used in the analyses in the paper is publicly available. Please see the references in the paper for each analysis to identify the relevant dataset. Please also see the example input files here:
https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files
The purpose of fine-tuning is to better separate the classes within the embedding space so that the model can better distinguish which in silico perturbations shift cells between the now better-separated states. If the pretrained model already separates the states well, fine-tuning is likely not necessary and will likely not impact the results.
We have not tested using bulk RNAseq data, but this question has arisen previously. Please check the closed discussions to read about some potential approaches to using bulk RNAseq as an input format.
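
As a rough illustration of the point above about whether the pretrained model already separates the states, here is a minimal sketch of one way to quantify separation between two labeled states in an embedding space. It is not part of Geneformer: `silhouette_score` and `fake_embeddings` are hypothetical helpers written for this example, the embeddings are synthetic random vectors, and the silhouette coefficient with Euclidean distance is only one possible separation measure, chosen here as an assumption. In practice you would substitute cell embeddings extracted from the pretrained model.

```python
import math
import random


def silhouette_score(embeddings, labels):
    """Mean silhouette coefficient (Euclidean distance) over all cells.

    Values near 1 suggest well-separated states; values near 0 suggest overlap.
    Hypothetical helper for illustration only, not a Geneformer function.
    """
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    n = len(embeddings)
    scores = []
    for i in range(n):
        # Accumulate distances from cell i to every other cell, grouped by class.
        sums = {c: 0.0 for c in classes}
        counts = {c: 0 for c in classes}
        for j in range(n):
            if j == i:
                continue
            sums[labels[j]] += math.dist(embeddings[i], embeddings[j])
            counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            # A class with a single cell gets a silhouette of 0 by convention.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within the cell's own state
        b = min(sums[c] / counts[c] for c in classes if c != own)  # nearest other state
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


def fake_embeddings(rng, center, n_cells=40, dim=16):
    """Synthetic stand-in for model embeddings (assumption, not real data)."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n_cells)]


rng = random.Random(0)
labels = ["state_a"] * 40 + ["state_b"] * 40

# Two states with clearly different centers versus two states with the same center.
separated = fake_embeddings(rng, 0.0) + fake_embeddings(rng, 5.0)
overlapping = fake_embeddings(rng, 0.0) + fake_embeddings(rng, 0.0)

separated_score = silhouette_score(separated, labels)
overlapping_score = silhouette_score(overlapping, labels)
print(f"separated states:   {separated_score:.2f}")
print(f"overlapping states: {overlapping_score:.2f}")
```

A high score on the pretrained embeddings would be consistent with fine-tuning being unnecessary, while a score near zero would suggest that fine-tuning could help. Any threshold is a judgment call and not something stated in the reply above.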

ctheodoris changed discussion status to closed Apr 8