# AIDO.Cell 3M

AIDO.Cell 3M is our smallest cellular foundation model, trained on 50 million cells from a diverse set of tissues and organs. The AIDO.Cell models handle the entire human transcriptome as input, learning accurate and general representations of the human cell's full transcriptional context. AIDO.Cell achieves state-of-the-art results on tasks such as zero-shot clustering, cell-type classification, and perturbation modeling.

## Model Architectural Details

AIDO.Cell uses an auto-discretization strategy to encode continuous gene expression values, with a bidirectional transformer encoder as its backbone. To learn semantically meaningful representations, we employ a BERT-style encoder-only dense transformer architecture, with minor updates to align with current best practices, including SwiGLU activations and LayerNorms. Below are more details about the model architecture, followed by a toy sketch of the auto-discretization encoding:

| Model | Layers | Hidden Size | Attention Heads | Intermediate Hidden Size |
|-------|:------:|:-----------:|:---------------:|:------------------------:|
| 3M    | 6      | 128         | 4               | 320                      |
| 10M   | 8      | 256         | 8               | 640                      |
| 100M  | 18     | 650         | 20              | 1664                     |
| 650M  | 32     | 1280        | 20              | 3392                     |
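
To make the encoding concrete, here is a minimal PyTorch sketch of the auto-discretization idea, in which each continuous expression value is softly assigned to a set of learnable bin embeddings. The module layout, bin count, and hidden size are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AutoDiscretization(nn.Module):
    """Illustrative sketch: encode a continuous expression value as a soft
    mixture of learnable bin embeddings (bin count is an assumption)."""

    def __init__(self, hidden_size: int, num_bins: int = 100):
        super().__init__()
        self.bin_scorer = nn.Linear(1, num_bins)                # scores each bin per value
        self.bin_embeddings = nn.Embedding(num_bins, hidden_size)

    def forward(self, expression: torch.Tensor) -> torch.Tensor:
        # expression: (batch, genes) continuous values
        weights = torch.softmax(self.bin_scorer(expression.unsqueeze(-1)), dim=-1)
        # Soft lookup: weighted sum over bin embeddings -> (batch, genes, hidden)
        return weights @ self.bin_embeddings.weight

encoder = AutoDiscretization(hidden_size=128)  # 128 matches the 3M row above
tokens = encoder(torch.rand(2, 16))            # -> torch.Size([2, 16, 128])
```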

## Pre-training of AIDO.Cell

Here we briefly introduce the pre-training of AIDO.Cell. For more detailed information, please refer to [our paper]().
AIDO.Cell uses the Read-Depth-Aware (RDA) pretraining objective, in which a single cell's expression profile is downsampled to a lower read depth and the model learns to predict the higher-read-depth expression counts of masked genes.
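
For intuition, the sketch below imitates the read-depth downsampling step with binomial thinning; the sampling scheme, rate, and toy data are assumptions for illustration, not the exact pretraining recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_counts(counts: np.ndarray, rate: float = 0.3) -> np.ndarray:
    """Binomial thinning: keep each read independently with probability `rate`,
    yielding a lower-read-depth view of the same cell (illustrative assumption)."""
    return rng.binomial(counts, rate)

high_depth = rng.poisson(lam=2.0, size=1000)  # toy expression counts for one cell
low_depth = downsample_counts(high_depth)     # model input; high-depth counts are targets
# During pretraining, a subset of genes in the low-depth view is masked and the
# model predicts their high-read-depth counts.
```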

### Data

AIDO.Cell was pretrained on a diverse dataset of 50 million cells from over 100 tissue types. We followed the list of datasets curated in the scFoundation supplementary materials, which includes datasets from the Gene Expression Omnibus (GEO), Deeply Integrated human Single-Cell Omics data (DISCO), the human ensemble cell atlas (hECA), the Single Cell Portal, and more. After preprocessing and quality control, the training dataset contained 50 million cells, or 963 billion gene tokens in total. We partitioned the dataset to set aside 100,000 cells as our validation set.

### Training Details

We trained our models in bfloat16 precision to optimize memory usage and speed. Training ran on 256 H100 GPUs, taking three days for the 100M model and eight days for the 650M model.

## Evaluation of AIDO.Cell

We evaluated AIDO.Cell on a series of zero-shot and fine-tuned tasks in single-cell genomics. For more detailed information, please refer to [our paper]().

## How to Use

### Build any downstream model from this backbone

#### Embedding
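
A minimal sketch of extracting embeddings with the ModelGenerator `Embed` task. The backbone identifier `aido_cell_3m` and the expression-matrix input convention are assumptions to verify against the Model Generator documentation.

```python
import torch
from modelgenerator.tasks import Embed

# Backbone name is an assumption; check the Model Generator docs for the exact identifier.
model = Embed.from_config({"model.backbone": "aido_cell_3m"}).eval()

# Toy stand-in for a (cells x genes) expression matrix aligned to the model's
# gene vocabulary; real inputs come from a preprocessed AnnData object.
expression = torch.randn(2, 512)

with torch.no_grad():
    batch = model.transform({"sequences": expression})  # input key is an assumption
    embeddings = model(batch)  # contextual embeddings for every gene in each cell

print(embeddings.shape)
```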

#### Sequence Level Classification
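
For cell-level labels such as cell-type classification, a `SequenceClassification` task can wrap the backbone; the config keys and class count below are illustrative assumptions patterned on the embedding example above.

```python
from modelgenerator.tasks import SequenceClassification

# Config keys and backbone name are illustrative assumptions.
model = SequenceClassification.from_config({
    "model.backbone": "aido_cell_3m",
    "model.n_classes": 10,  # hypothetical number of cell types
}).eval()
```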

#### Token Level Classification
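
For per-gene predictions, a `TokenClassification` task attaches a head to every gene position; as before, the configuration shown is an illustrative assumption.

```python
from modelgenerator.tasks import TokenClassification

# Per-gene (token-level) head; config keys are illustrative assumptions.
model = TokenClassification.from_config({
    "model.backbone": "aido_cell_3m",
    "model.n_classes": 2,  # hypothetical binary label per gene
}).eval()
```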

#### Or use our one-liner CLI to finetune or evaluate any of the above!
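
A hypothetical invocation of the `mgen` CLI from Model Generator; the task, backbone, and data-module flags are assumptions to check against the repository documentation.

```bash
# Hypothetical one-liner; verify flag names and data modules in the Model Generator docs.
mgen fit --model SequenceClassification --model.backbone aido_cell_3m --data YourDataModule
```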

For more information, visit: [Model Generator](https://github.com/genbio-ai/test)

## Citation

Please cite AIDO.Cell using the following BibTeX code:
```
@inproceedings{ho2024scaling,
  title={Scaling Dense Representations for Single Cell with Transcriptome-Scale Context},
  author={Nicholas Ho and Caleb N. Ellington and Jinyu Hou and Sohan Addagudi and Shentong Mo and Tianhua Tao and Dian Li and Yonghao Zhuang and Hongyi Wang and Xingyi Cheng and Le Song and Eric P. Xing},
  booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
  year={2024}
}
```