NickDST committed (verified)
Commit 69f6e65 · Parent(s): 4ba5602

Update README.md

Files changed (1):
  1. README.md (+5, -49)

README.md CHANGED
@@ -1,57 +1,14 @@
+## AIDO.Cell
 
-
-# AIDO.Cell 3M
-AIDO.Cell 3M is our smallest cellular foundation model, trained on 50 million cells from a diverse
-set of tissues and organs. The AIDO.Cell models are capable of handling the entire human transcriptome as input,
-thus learning accurate and general representations of the human cell's entire transcriptional context.
-AIDO.Cell achieves state-of-the-art results in tasks such as zero-shot clustering, cell-type classification, and perturbation modeling.
-
-## Model Architectural Details
-AIDO.Cell uses an auto-discretization strategy for encoding continuous gene expression values and a bidirectional transformer encoder as its backbone.
-To learn semantically meaningful representations, we employed a BERT-style encoder-only dense transformer architecture. We make minor updates to this architecture to align with current best practices, including SwiGLU and LayerNorms.
-Below are more details about the model architecture:
-
-| Model | Layers | Hidden | Heads | Intermediate Hidden Size |
-| ----- |:------:|:------:|:-----:|:------------------------:|
-| 3M    | 6      | 128    | 4     | 320                      |
-| 10M   | 8      | 256    | 8     | 640                      |
-| 100M  | 18     | 650    | 20    | 1664                     |
-| 650M  | 32     | 1280   | 20    | 3392                     |
-
-## Pre-training of AIDO.Cell
-Here we briefly introduce the pre-training of AIDO.Cell. For more detailed information, please refer to [our paper]().
-AIDO.Cell uses the Read Depth-Aware (RDA) pretraining objective, in which a single cell's expression profile is downsampled to a low read depth and the model
-learns to predict the higher-read-depth expression counts of masked genes.
-
-### Data
-AIDO.Cell was pretrained on a diverse dataset of 50 million cells from over 100 tissue types. We
-followed the list of data curated by scFoundation in its supplementary material. This list includes datasets
-from the Gene Expression Omnibus (GEO), the Deeply Integrated human Single-Cell Omics
-data (DISCO), the human ensemble cell atlas (hECA), the Single Cell Portal, and more.
-After preprocessing and quality control, the training dataset contained 50 million cells, or 963
-billion gene tokens in total. We partitioned the dataset to set aside 100,000 cells as our validation set.
-
-### Training Details
-We trained our models with bfloat16 precision to optimize memory use and speed. Training ran on 256 H100 GPUs for three days for
-the 100M model and eight days for the 650M model.
-
-## Evaluation of AIDO.Cell
-We evaluated AIDO.Cell on a series of zero-shot and fine-tuned tasks in single-cell genomics. For more detailed information, please refer to [our paper]().
+For a more detailed description, refer to the SOTA model in this collection: https://huggingface.co/genbio-ai/cellfoundation-100m
 
 ## How to Use
-### Build any downstream models from this backbone
-#### Embedding
-
-#### Sequence Level Classification
-
-#### Token Level Classification
 
-#### Or use our one-liner CLI to finetune or evaluate any of the above!
+For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
 
-For more information, visit: [Model Generator](https://github.com/genbio-ai/test)
+## Citation
 
 
-## Citation
 Please cite AIDO.Cell using the following BibTeX code:
 ```
 @inproceedings{ho2024scaling,
@@ -59,5 +16,4 @@ title={Scaling Dense Representations for Single Cell with Transcriptome-Scale Co
 author={Nicholas Ho, Caleb N. Ellington, Jinyu Hou, Sohan Addagudi, Shentong Mo, Tianhua Tao, Dian Li, Yonghao Zhuang, Hongyi Wang, Xingyi Cheng, Le Song, Eric P. Xing},
 booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
 year={2024}
-}
-```
+}
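
The pre-training text removed above describes the Read Depth-Aware (RDA) objective: a cell's expression profile is downsampled to a lower read depth, and the model is trained to recover the full-depth counts of masked genes. As a rough illustration of the downsampling step only, using simple binomial thinning (a common approach, not necessarily the exact procedure used for AIDO.Cell):

```python
# Illustration of read-depth downsampling via binomial thinning: each read of a
# gene survives independently with probability target_depth / total_reads.
import numpy as np

rng = np.random.default_rng(0)

def downsample_counts(counts: np.ndarray, target_depth: int) -> np.ndarray:
    """Thin a raw gene-count vector down to roughly `target_depth` total reads."""
    total = int(counts.sum())
    if total <= target_depth:
        return counts.copy()
    return rng.binomial(counts, target_depth / total)

# Toy "cell": raw counts for 10 genes at full read depth.
full_depth = np.array([120, 3, 0, 45, 7, 0, 310, 12, 1, 60])
low_depth = downsample_counts(full_depth, target_depth=100)
print(full_depth.sum(), "->", low_depth.sum())
# Under the RDA objective, the model sees the low-depth profile (with some genes
# masked) and is trained to predict the corresponding full-depth counts.
```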
 
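
The "How to Use" section that survives the edit defers to [Model Generator](https://github.com/genbio-ai/modelgenerator) for embedding, sequence-level, and token-level tasks. As a minimal sketch of the linear-probe pattern those tasks build on, assuming cell embeddings have already been exported from the backbone (the random tensor below is a stand-in for real embeddings; the 128-dimensional width simply mirrors the 3M model's hidden size):

```python
# Linear-probe sketch: train a small classifier on precomputed cell embeddings.
# The embeddings here are random stand-ins; real ones would be exported from the
# AIDO.Cell backbone (e.g. via Model Generator), one vector per cell.
import torch
from torch import nn

torch.manual_seed(0)
num_cells, embed_dim, num_cell_types = 1024, 128, 8  # 128 = hidden size of the 3M model
embeddings = torch.randn(num_cells, embed_dim)           # stand-in for backbone embeddings
labels = torch.randint(0, num_cell_types, (num_cells,))  # stand-in cell-type labels

probe = nn.Linear(embed_dim, num_cell_types)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(probe(embeddings), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: cross-entropy {loss.item():.3f}")
```

For end-to-end fine-tuning or evaluation, the one-liner Model Generator CLI referenced above is the intended entry point.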