## genbio-ai/proteinMoE-16b-Petal

**genbio-ai/proteinMoE-16b-Petal** is a fine-tuned version of **genbio-ai/proteinMoE-16b** specialized for protein structure prediction. The model takes amino acid sequences as input and predicts structure tokens that can be decoded into 3D structures by **genbio-ai/petal-decoder**. On structure prediction tasks it surpasses existing state-of-the-art models such as **ESM3-open**.

### Model Architecture Details

This model retains the architecture of AIDO.Protein 16B: a transformer encoder-only model in which the dense MLP layers are replaced by sparse Mixture of Experts (MoE) layers. Each token activates 2 experts via a top-2 routing mechanism (a minimal routing sketch follows the parameter table below). A visual summary of the architecture is provided below:

<center>
<img src="https://huggingface.co/genbio-ai/proteinMoE-16b/resolve/main/proteinmoe_architecture.png" alt="ProteinMoE Architecture" style="width:70%; height:auto;" />
</center>

#### Key Differences
The final output linear layer has been adapted to support a new vocabulary size (illustrated in the sketch below):
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens, no special tokens)

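As a minimal illustration of this adaptation, the sketch below builds an output head that projects the 2304-dimensional hidden states onto the 512 structure tokens instead of the 44-token amino acid vocabulary. The variable names are hypothetical and do not come from the actual model code.

```python
import torch.nn as nn

hidden_size = 2304            # backbone hidden size (see parameter table below)
structure_vocab_size = 512    # output vocabulary: structure tokens

# Hypothetical stand-in for the adapted final projection layer.
output_head = nn.Linear(hidden_size, structure_vocab_size)
```
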
#### Architecture Parameters
| Component                              | Value |
|----------------------------------------|-------|
| Number of Attention Heads              | 36    |
| Number of Hidden Layers                | 36    |
| Hidden Size                            | 2304  |
| Number of Experts per MoE Layer        | 8     |
| Number of Experts Activated per Token  | 2     |
| Input Vocabulary Size                  | 44    |
| Output Vocabulary Size                 | 512   |
| Context Length                         | 1024  |

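To make the MoE routing above concrete, here is a minimal sketch of a top-2 layer with 8 experts and hidden size 2304, matching the parameter table. It is illustrative only: the expert FFN width, module names, and the absence of load-balancing terms are assumptions rather than the actual AIDO.Protein/proteinMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a sparse MoE layer: each token is routed to its top-2 experts."""

    def __init__(self, hidden_size=2304, ffn_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (num_tokens, hidden_size)
        gate_logits = self.router(x)                       # (num_tokens, num_experts)
        weights, expert_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize the 2 selected gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e               # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny smoke test with reduced sizes (the real layer uses the table values).
tokens = torch.randn(4, 64)
layer = Top2MoELayer(hidden_size=64, ffn_size=128)
assert layer(tokens).shape == tokens.shape
```
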
### Training Details

Fine-tuning used roughly **0.4 trillion tokens** drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Batch Size**: Global batch size of 2048
- **Context Length**: 1024
- **Precision**: FP16
- **Hardware**: 64 NVIDIA A100 80GB GPUs
- **Learning Rate**: Max learning rate of 1e-4
- **Scheduler**: Cosine decay with 2.5% warmup (sketched below)
- **Tokens Trained**: 0.4T tokens
- **Training Steps**: 200k steps

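For reference, a small sketch of the learning-rate schedule implied by the list above (max LR 1e-4, cosine decay, 2.5% warmup, 200k steps). The linear warmup shape and the decay-to-zero floor are assumptions; only the headline numbers come from this card.

```python
import math

MAX_LR = 1e-4
TOTAL_STEPS = 200_000
WARMUP_STEPS = int(0.025 * TOTAL_STEPS)   # 2.5% warmup -> 5,000 steps

def learning_rate(step: int) -> float:
    """Cosine decay with linear warmup (assumed shape)."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))

# learning_rate(5_000)   -> 1e-4 at the end of warmup
# learning_rate(200_000) -> 0.0 at the final step
```
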
### Tokenization

Inputs should be single-chain amino acid sequences.

- **Input Tokenization**: Sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is mapped to a structure token. The output can be decoded into 3D structures in PDB format using **genbio-ai/petal-decoder** (see the usage sketch below).

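A hedged end-to-end sketch is shown below. It assumes the checkpoint loads through the Hugging Face `transformers` auto classes with `trust_remote_code=True` and that the forward pass exposes logits over the 512 structure tokens; the exact class names and output fields are assumptions, so consult the genbio-ai documentation for the supported loading path.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "genbio-ai/proteinMoE-16b-Petal"

# Assumption: the repository ships remote code compatible with the auto classes.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# Single-chain amino acid sequence (illustrative example).
sequence = "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG"
inputs = tokenizer(sequence, return_tensors="pt")  # amino-acid tokens + [SEP] (id=34)

with torch.no_grad():
    outputs = model(**inputs)

# Assumption: the model returns logits over the 512 structure tokens.
structure_tokens = outputs.logits.argmax(dim=-1)   # shape: (1, sequence length)

# `structure_tokens` can then be passed to genbio-ai/petal-decoder to obtain
# a 3D structure in PDB format.
```
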
### Results

TODO