Update README.md
README.md
CHANGED
@@ -13,7 +13,7 @@ This model retains the architecture of AIDO.Protein-16B, a transformer encoder-o
|
|
13 |
Each token activates 2 experts using a top-2 routing mechanism. A visual summary of the architecture is provided below:
|
14 |
|
15 |
<center>
|
16 |
-
<img src="https://huggingface.co/genbio-ai/
|
17 |
</center>
|
18 |
|
19 |
|
@@ -52,7 +52,7 @@ The fine-tuning process used **0.4 trillion tokens**, using AlphaFold database w
|
|
52 |
The input should be a single-chain amino acid sequence.
|
53 |
|
54 |
- **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
|
55 |
-
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using
|
56 |
|
57 |
## Results
|
58 |
|
@@ -69,29 +69,39 @@ To reproduce the structure prediction results described above, follow these step
|
|
69 |
2. Run the prediction command:
|
70 |
|
71 |
```bash
|
72 |
-
mgen predict --config experiments/
|
73 |
```
|
74 |
-
This will pull the CASP14, CASP15, and CAMEO dataset from
|
75 |
-
|
76 |
```bash
|
77 |
-
mgen predict --config
|
78 |
```
|
79 |
-
|
80 |
|
81 |
-
Alternatively, you can provide your own input amino acid sequence in a CSV file.
|
82 |
```
|
83 |
-
idx,
|
84 |
-
|
85 |
-
|
86 |
-
|
87 |
-
|
88 |
```
|
89 |
-
|
90 |
-
- The `idx` column assigns a unique index to each sample.
|
91 |
-
- The `seq_len` column represents the length of the sequence.
|
92 |
-
- The `aa_seq` column (amino acid sequence) is used as the input feature.
|
93 |
-
- The `struct_seq` column contains the reference structure tokens obtained from the structure tokenizer. You can set `struct_seq` to a list of zeros if no structure tokens are available.
|
94 |
-
|
95 |
|
96 |
### Build any downstream models from this backbone with ModelGenerator
|
97 |
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
|
|
|
13 |
Each token activates 2 experts using a top-2 routing mechanism. A visual summary of the architecture is provided below:
|
14 |
|
15 |
<center>
|
16 |
+
<img src="https://huggingface.co/genbio-ai/AIDO.Protein-16B/resolve/main/proteinmoe_architecture.png" alt="AIDO.Protein-16B Architecture" style="width:70%; height:auto;" />
|
17 |
</center>
|
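As a rough illustration of the top-2 routing described above, the following pure-Python sketch selects two experts for a single token and mixes their outputs. It is not the model's implementation: the function names, the number of experts, and the choice to renormalize the gate weights over only the two selected experts are all assumptions made for this example.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the two best experts.

    Assumption: gates are a softmax over the two selected logits only,
    so they sum to 1. The README does not state how gates are normalized.
    """
    if len(router_logits) < 2:
        raise ValueError("top-2 routing needs at least two experts")
    # Indices of the two largest router logits, highest first.
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


def moe_output(router_logits, expert_outputs):
    """Gate-weighted sum of the two selected experts' output vectors."""
    routes = top2_route(router_logits)
    dim = len(expert_outputs[0])
    return [sum(gate * expert_outputs[i][d] for i, gate in routes)
            for d in range(dim)]
```

Only the two selected experts contribute to the token's output, which is why the per-token compute stays close to that of a much smaller dense model.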
18 |
|
19 |
|
|
|
52 |
The input should be a single-chain amino acid sequence.
|
53 |
|
54 |
- **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
|
55 |
+
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
|
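The input convention above (one token per residue, then a `[SEP]` terminator) can be sketched as follows. Only the `[SEP]` id of 34 comes from this README. The residue-to-id mapping below is a hypothetical placeholder, since the real vocabulary is not given here, and `encode` is an illustrative name rather than a ModelGenerator function.

```python
# The 20 standard amino acids (assumption: no non-standard residues).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# HYPOTHETICAL ids: the real tokenizer's residue ids are not given in the README.
HYPOTHETICAL_VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# From the README: sequences are terminated with [SEP] (id=34).
SEP_ID = 34


def encode(aa_seq):
    """Tokenize at the amino acid level and append the [SEP] terminator."""
    unknown = sorted(set(aa_seq) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unsupported residues: {unknown}")
    return [HYPOTHETICAL_VOCAB[aa] for aa in aa_seq] + [SEP_ID]
```

A 20-residue sequence therefore produces 21 input ids: 20 residue tokens plus the terminator.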
56 |
|
57 |
## Results
|
58 |
|
|
|
69 |
2. Run the prediction command:
|
70 |
|
71 |
```bash
|
72 |
+
mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml
|
73 |
```
|
74 |
+
This will pull the CASP14, CASP15, and CAMEO datasets from [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins) and predict the structure tokens from the amino acid sequences.
|
75 |
+
|
76 |
+
3. Convert the output `.tsv` to `.pt` and extract the structure tokenizer codebook:
|
77 |
+
|
78 |
+
```bash
|
79 |
+
# convert the predicted structures in tsv into one pt file
|
80 |
+
python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
|
81 |
+
# extract the codebook of the structure tokenizer
|
82 |
+
python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt
|
83 |
+
```
|
84 |
+
4. Run the decoding command to get 3D structures in PDB format (this script currently supports only single-GPU inference):
|
85 |
```bash
|
86 |
+
CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
|
87 |
+
--data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
|
88 |
+
--data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=logs/protein2structoken_16b/predict_predictions.pt \
|
89 |
+
--data.init_args.config.struct_tokens_datasets_configs.codebook_path=logs/protein2structoken_16b/codebook.pt
|
90 |
```
|
91 |
+
The outputs are in `logs/protstruct_decode/protein2structoken_16b_pdb_files/`.
|
92 |
+
5. You can compare the predicted structures with the ground truth PDBs in [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins/tree/main).
|
93 |
|
94 |
+
Alternatively, you can provide your own input amino acid sequences in a CSV file. An example CSV is available at `experiments/AIDO.StructureTokenizer/protein2structoken_example_input.csv` in the `ModelGenerator` repository:
|
95 |
```
|
96 |
+
idx,aa_seq
|
97 |
+
example,KEFWNLDKNLQLRLGIVFLG
|
98 |
+
```
|
99 |
+
Here, `idx` is a unique name for each sample, and `aa_seq` is its amino acid sequence. To use a custom CSV file, replace the command in step 2 with:
|
100 |
+
```bash
|
101 |
+
mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml \
|
102 |
+
--data.init_args.path=experiments/AIDO.StructureTokenizer/ \
|
103 |
+
--data.init_args.test_split_files=[protein2structoken_example_input.csv]
|
104 |
```
|
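For preparing such a custom input file programmatically, a minimal sketch using Python's standard `csv` module might look like the following. The helper name `write_input_csv` is hypothetical and not part of ModelGenerator, and the check against the 20 standard amino acid letters is an assumption: the README does not say which residue codes the model accepts.

```python
import csv

# Assumption: only the 20 standard amino acid letters are valid input.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def write_input_csv(path, sequences):
    """Write {name: sequence} pairs as a CSV with `idx` and `aa_seq` columns."""
    rows = []
    for name, seq in sequences.items():
        seq = seq.strip().upper()
        bad = sorted(set(seq) - VALID_RESIDUES)
        if not seq or bad:
            raise ValueError(f"invalid sequence for {name!r}: {bad or 'empty'}")
        rows.append([name, seq])
    # Validate everything first so a bad entry never leaves a partial file.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["idx", "aa_seq"])  # header expected by the example CSV
        writer.writerows(rows)
```

The resulting file can then be passed to `mgen predict` through the `--data.init_args.path` and `--data.init_args.test_split_files` options shown above.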
105 |
|
106 |
### Build any downstream models from this backbone with ModelGenerator
|
107 |
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
|