Christina Theodoris committed • cf0d7d4
Parent(s): f91f132

add readthedocs link to model card

Files changed:
- README.md +2 -1
- docs/source/_static/css/custom.css +4 -4
- docs/source/_static/gf_logo.png +0 -0
- docs/source/about.rst +1 -1
README.md CHANGED
@@ -5,7 +5,8 @@ license: apache-2.0
 # Geneformer
 Geneformer is a foundation transformer model pretrained on a large-scale corpus of ~30 million single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.
 
-See [our manuscript](https://rdcu.be/ddrx0) for details.
+See [our manuscript](https://rdcu.be/ddrx0) for details.
+See [geneformer.readthedocs.io](https://geneformer.readthedocs.io) for documentation.
 
 # Model Description
 Geneformer is a foundation transformer model pretrained on [Genecorpus-30M](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M), a pretraining corpus comprised of ~30 million single cell transcriptomes from a broad range of human tissues. We excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that could lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. Each single cell’s transcriptome is presented to the model as a rank value encoding where genes are ranked by their expression in that cell normalized by their expression across the entire Genecorpus-30M. The rank value encoding provides a nonparametric representation of that cell’s transcriptome and takes advantage of the many observations of each gene’s expression across Genecorpus-30M to prioritize genes that distinguish cell state. Specifically, this method will deprioritize ubiquitously highly-expressed housekeeping genes by normalizing them to a lower rank. Conversely, genes such as transcription factors that may be lowly expressed when they are expressed but highly distinguish cell state will move to a higher rank within the encoding. Furthermore, this rank-based approach may be more robust against technical artifacts that may systematically bias the absolute transcript counts value while the overall relative ranking of genes within each cell remains more stable.
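The rank value encoding described in the Model Description above can be sketched in a few lines. This is an illustrative simplification, not the repository's actual tokenizer code: the function name, the toy gene labels, and the choice of per-gene corpus-wide normalization factors passed in as a plain list are all assumptions for the sake of the example.

```python
def rank_value_encode(cell_counts, corpus_norm_factors, gene_ids):
    # Normalize each gene's expression in this cell by its expression across
    # the corpus, so ubiquitously highly-expressed housekeeping genes fall in
    # rank while genes that distinguish cell state (e.g. transcription
    # factors) rise, as described in the model card.
    normalized = {
        gene: count / factor
        for gene, count, factor in zip(gene_ids, cell_counts, corpus_norm_factors)
        if count > 0  # only genes detected in this cell enter the encoding
    }
    # The encoding is the list of genes ranked by corpus-normalized
    # expression, highest first.
    return sorted(normalized, key=normalized.get, reverse=True)

# Toy example (hypothetical gene labels): a housekeeping gene with high raw
# counts but a high corpus-wide factor is outranked by a lowly expressed
# transcription factor that is rare across the corpus.
encoding = rank_value_encode(
    cell_counts=[100.0, 5.0, 0.0],
    corpus_norm_factors=[50.0, 1.0, 2.0],
    gene_ids=["HOUSEKEEPING", "TF_A", "UNDETECTED"],
)
# encoding == ["TF_A", "HOUSEKEEPING"]
```

Because only the relative ordering survives into the encoding, a technical artifact that scales all counts in a cell would leave the output unchanged, which is the robustness property the paragraph describes.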
docs/source/_static/css/custom.css CHANGED
@@ -26,8 +26,8 @@
 
 /* class object */
 .sig.sig-object {
-  padding: 5px 5px 5px
-  background-color: #
+  padding: 5px 5px 5px 5px;
+  background-color: #ececec;
   border-style: solid;
   border-color: black;
   border-width: 1px 0;
@@ -35,6 +35,6 @@
 
 /* parameter object */
 dt {
-  padding: 5px 5px 5px
-  background-color: #
+  padding: 5px 5px 5px 5px;
+  background-color: #ececec;
 }
docs/source/_static/gf_logo.png
CHANGED
docs/source/about.rst CHANGED
@@ -8,7 +8,7 @@ Model Description
 
 In `our manuscript <https://rdcu.be/ddrx0>`_, we report results for the 6 layer Geneformer model pretrained on Genecorpus-30M. We additionally provide within the repository a 12 layer Geneformer model, scaled up with retained width:depth aspect ratio, also pretrained on Genecorpus-30M.
 
-Both the 6 and 12 layer Geneformer models were pretrained in June 2021.
+Both the `6 <https://huggingface.co/ctheodoris/Geneformer/blob/main/pytorch_model.bin>`_ and `12 <https://huggingface.co/ctheodoris/Geneformer/blob/main/geneformer-12L-30M/pytorch_model.bin>`_ layer Geneformer models were pretrained in June 2021.
 
 Application
 -----------