Input files for example notebooks

#4
by GMFranceschini - opened

First of all, congratulations on this terrific work. I was wondering whether the input files used in the notebooks are available anywhere (input files such as "/path/to/[...]" and so on). Could you please advise? I would love to run your example to understand how the model works.
Best

Also, is the anndata format allowed as an input file?


I'm putting together some of the input files under this repo. Please keep in mind that this is my attempt to reproduce the data analysis by following the reference publications (and not necessarily how the original authors generated the relevant input files). Hope you find the repo helpful nonetheless.

Update: we replicated the dosage sensitivity classification task result (Colab notebook) and fine-tuned on essentiality scores from DepMap (Colab notebook).

Thank you for your interest in Geneformer!

The example notebooks are intended to be generally applicable to downstream gene- or cell-classification tasks. We added a directory, "example_input_files", with the example input files for the gene-classification application of distinguishing dosage-sensitive from dosage-insensitive transcription factors. The labels are based on Ni et al. 2019, Shihab et al. 2017, and Lek et al. 2016, and the gene information is based on the Ensembl database. The "gene_train_data.dataset" can be particular cells of interest, or you can use Genecorpus-30M (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), from which the example notebook will extract random cells. Please note that Genecorpus-30M is a large dataset, so loading it for the first time may take some time. However, the Huggingface .dataset format caches the data, so future usage will be faster: the cache allows Datasets to avoid re-processing the entire dataset with each use. The "pretrained_model" is Geneformer, which is in the main directory of this repository.

Regarding the question about anndata: Geneformer takes as input tokenized datasets in the Huggingface .dataset format. The transcriptome tokenizer ("tokenizer.py") converts loom files to tokenized datasets. For use with anndata, we recommend converting the file to the loom format with anndata and then tokenizing it with the transcriptome tokenizer. [Update: please see the tokenizing example in the examples folder]

Please let us know if you have any further questions.

Hi @ctheodoris, very impressive work! I am wondering, would it make sense to train a version of Geneformer with bulk RNA-seq data? The possible downstream tasks do not seem specific to single-cell data. Why did you decide to train with single-cell rather than bulk RNA-seq?

Thank you for your question. We chose to pretrain Geneformer with single-cell data because of its increased precision: it measures gene expression within each individual cell, rather than averaged across multiple cells or multiple cell types as in bulk data. As a context-aware model, Geneformer makes predictions specific to the individual cell presented to it, so single-cell inputs best take advantage of this context-awareness.

ctheodoris changed discussion status to closed


@onuralp Thank you for your efforts in replicating the analysis. We also recommend tuning hyperparameters (e.g. maximum learning rate, learning schedule, number of layers to freeze) for your own applications, as this may significantly improve the predictive potential, as with other deep learning models. We include an example of this for disease classification in the examples folder.
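For instance, freezing a chosen number of encoder layers before fine-tuning can be sketched as follows for a BERT-style model (Geneformer is BERT-based, but the toy config values and the exact attribute paths here are assumptions for illustration, not Geneformer's actual settings):

```python
from transformers import BertConfig, BertForSequenceClassification

# Small toy config so the example runs without downloading pretrained weights.
config = BertConfig(hidden_size=64, num_hidden_layers=4, num_attention_heads=4,
                    intermediate_size=128, vocab_size=100, num_labels=2)
model = BertForSequenceClassification(config)

# "Number of layers to freeze" is one of the hyperparameters worth tuning.
n_freeze = 2
for layer in model.bert.encoder.layer[:n_freeze]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters after freezing {n_freeze} layers: {trainable}")
```

The frozen layers keep their pretrained representations while the remaining layers adapt to the downstream task; how many to freeze is best chosen empirically, alongside the learning rate and schedule.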
