genbio-ai/AIDO.Protein2StructureToken-16B

PyTorch

fm4bio

JiayouZhangGenbio committed on Dec 10, 2024

Commit

605fd54

verified ·

1 Parent(s): 6240c82

Update README.md

Files changed (1)

README.md +29 -19

@@ -13,7 +13,7 @@ This model retains the architecture of AIDO.Protein-16B, a transformer encoder-o
 Each token activates 2 experts using a top-2 routing mechanism. A visual summary of the architecture is provided below:
 <center>
-  <img src="https://huggingface.co/genbio-ai/proteinMoE-16b/resolve/main/proteinmoe_architecture.png" alt="ProteinMoE Architecture" style="width:70%; height:auto;" />
 </center>
@@ -52,7 +52,7 @@ The fine-tuning process used **0.4 trillion tokens**, using AlphaFold database w
 The input sequence should be single-chain amino acid sequences.
 - **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
-- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using **genbio-ai/petal-decoder**.
 ## Results
@@ -69,29 +69,39 @@ To reproduce the structure prediction results described above, follow these step
 2. Run the prediction command:
    ```bash
-   mgen predict --config experiments/struct_token/configs/finetuned_struct_token.yaml
    ```
-   This will pull the CASP14, CASP15, and CAMEO dataset from **genbio-ai/petal-test-struct-token**, and predict the structure tokens from the amino acid sequence.
-3. Run the decoding command to get 3D structures in PDB format:
    ```bash
-   mgen predict --config configs/protein_structure/petal_decode.yaml
    ```
-4. You can compare the predicted structures with the ground truths in **genbio-ai/petal-test-struct-token**.
-Alternatively, you can provide your own input amino acid sequence in a CSV file. The input format is as follows:
 ```
-idx,seq_len,aa_seq,struct_seq
-cameo:0,369,LRTPTTVSVSDFGAKGDGKTDDTQAFVNAWKKACSSNGAVNLLVPKGNTYLLKSIQLTGPCNSILTVQIFGTLSASQKRSDYKDISKWIMFDGVNNLSVDGGDTGVVDGNGETWWQNSCKRNKAKPCTKAPTALTFYNSKSLIVKNLKVRNAQQIQISIEKCSNVQVSNVVVTAPADSPNTDGIHITNTQNIRVSESIIGTGDDCISIESGSQNVQINDITCGPGHGISIGSLGDDNSKAFVSGVTVDGAKLSGTDNGVRIKTYQGGSGTASNIIFQNIQMDNVKNPIIIDQDYCDKSKCTTEKSAVQVKNVVYRDISGTSASENAITFNCSKNYPCQGIVLDRVNIKGGKATCTNANVVDKGAVLPQC,"[164, 287, 119, 293, 115, 126, 395, 7, 228, 81, 392, 480, 496, 482, 324, 241, 308, 382, 346, 367, 10, 481, 100, 436, 237, 463, 186, 324, 416, 509, 393, 401, 478, 270, 249, 181, 499, 258, 284, 442, 486, 228, 79, 365, 273, 397, 170, 294, 34, 359, 381, 91, 480, 412, 141, 408, 353, 293, 51, 56, 394, 129, 389, 249, 87, 179, 463, 109, 464, 502, 259, 227, 463, 360, 192, 201, 69, 271, 351, 314, 304, 83, 318, 409, 494, 333, 231, 67, 353, 424, 261, 219, 10, 470, 242, 240, 462, 26, 117, 35, 224, 335, 366, 302, 478, 186, 162, 35, 224, 250, 171, 175, 106, 83, 101, 226, 320, 212, 296, 213, 351, 303, 170, 88, 433, 454, 453, 450, 456, 83, 76, 313, 165, 377, 91, 313, 47, 435, 359, 291, 390, 506, 219, 137, 212, 181, 133, 35, 162, 208, 7, 125, 76, 83, 142, 224, 370, 149, 313, 37, 299, 482, 336, 240, 506, 368, 119, 227, 441, 89, 211, 330, 176, 24, 450, 288, 315, 394, 79, 23, 482, 84, 224, 119, 465, 464, 29, 388, 359, 154, 113, 89, 59, 137, 305, 284, 24, 368, 57, 103, 89, 481, 424, 500, 147, 119, 465, 464, 337, 412, 198, 38, 154, 81, 365, 186, 500, 485, 81, 423, 211, 127, 399, 473, 482, 461, 224, 337, 508, 233, 12, 476, 41, 421, 467, 331, 115, 368, 226, 67, 198, 38, 96, 460, 22, 186, 84, 388, 460, 423, 388, 80, 110, 113, 482, 432, 17, 147, 274, 80, 31, 337, 228, 83, 396, 280, 478, 229, 460, 388, 465, 51, 487, 12, 26, 464, 187, 139, 86, 126, 147, 45, 496, 482, 7, 495, 361, 159, 506, 233, 495, 361, 55, 230, 480, 3, 339, 234, 142, 34, 489, 163, 399, 322, 320, 1, 95, 19, 426, 244, 273, 88, 172, 264, 190, 72, 462, 19, 49, 294, 1, 304, 40, 123, 351, 389, 410, 152, 188, 219, 147, 64, 269, 129, 304, 278, 192, 130, 
143, 417, 71, 112, 434, 121, 98, 256, 89, 364, 401, 381, 493, 222, 43, 114, 431, 433, 16, 468, 248, 48, 131, 487, 324, 183, 243, 480, 381, 397, 227]"
-cameo:1,299,HMSIFSPLIKKAEDLAFLENEEALASLEIIGEVFKAELPGSNGKIIAVKKVIQPPKDADERQIRSEINTVGHIRHRNLLPLLAHVSRPECHYLVYEYMEKGSLQDILTDVQAGNQELMWPARHKIALGIAAGLEYLHMDHNPRIIHRDLKPANVLLDDDMEARISDFGLAKAMPDAVTHITTSHVAGTVGYIAPEFYQTHKFTDKCDIYSFGVILGILVIGKLPSDEFFQHTDEMSLIKWMRNIITSENPSLAIDPKLMDQGFDEQMLLVLKIACYCTLDDPKQRPNSKDVRTMLSQIK,"[260, 480, 485, 387, 305, 358, 355, 266, 414, 258, 474, 273, 162, 363, 97, 367, 277, 39, 275, 474, 225, 137, 367, 152, 436, 162, 386, 338, 409, 447, 59, 456, 142, 109, 85, 201, 249, 394, 449, 122, 460, 119, 247, 431, 469, 414, 177, 32, 137, 221, 337, 293, 502, 305, 53, 498, 474, 142, 175, 428, 489, 435, 506, 45, 492, 161, 88, 273, 291, 102, 482, 11, 488, 376, 136, 325, 410, 281, 0, 219, 91, 391, 285, 462, 285, 473, 387, 322, 489, 335, 152, 9, 119, 470, 203, 446, 254, 180, 234, 131, 219, 313, 464, 337, 432, 172, 233, 323, 401, 190, 453, 385, 308, 210, 198, 302, 461, 249, 202, 287, 388, 29, 248, 185, 39, 155, 51, 479, 247, 68, 126, 359, 46, 243, 473, 97, 286, 207, 469, 486, 319, 234, 469, 373, 248, 322, 248, 230, 310, 0, 38, 3, 119, 330, 2, 83, 40, 33, 375, 317, 134, 125, 32, 425, 422, 319, 18, 141, 507, 59, 433, 251, 29, 219, 364, 244, 226, 271, 97, 146, 167, 376, 88, 135, 209, 73, 299, 297, 386, 107, 330, 97, 68, 366, 278, 152, 450, 46, 249, 294, 404, 472, 299, 153, 283, 479, 122, 295, 27, 92, 42, 421, 417, 292, 452, 68, 267, 463, 202, 295, 190, 485, 90, 12, 268, 50, 443, 490, 101, 450, 42, 130, 25, 291, 496, 105, 459, 426, 345, 442, 17, 161, 496, 502, 14, 400, 303, 191, 322, 154, 278, 364, 152, 412, 470, 44, 250, 6, 296, 464, 157, 84, 173, 256, 498, 244, 471, 442, 398, 509, 67, 441, 263, 125, 63, 253, 418, 465, 103, 407, 496, 391, 209, 269, 16, 453, 317, 95, 168, 287, 426, 24, 307, 36, 128, 376, 287, 386, 183]"
-cameo:2,471,TTGEPLTAFETFLPRVVMAEKIQDYQDSDAHEYMKAVQGYLDRFAVGDRLQNATRDLLVTFALAETGEKLSKRLPDQRVYMRDTFERHKDSADDRSAYLRHLRDTAAFIGNAWEPANNSPRALPGLEASAMTDTVKLCLAFLNSLKHTIAIAPLVRFYSEAVHADEGEAREKRVAEFEKAIKAITAFTVFWRATRRGTGNIDSQYRAVMAGADSLTGIGPLARQWAEPDATKPDPDVDAEALKKELAARLSDPKGKGGVPNLASFLADASALPLYKISPPLARFLLLAAYHDTIEDPDNPGLIVQGKAGVASCFTADGWEDDTHLTIEHIAPQSATSGWDAEFYSDKETVHKLGNLVLAPGAANASLSSRPWTEKKVLYAALGASTADDAKSILNSSGFTFAQTTEDLAAMSRYLPHLRALGQREDELDPAFMDQRADVLLRLAYTRLKGWLGLELSDSSSDPVVKVDDVE,"[157, 443, 57, 220, 249, 356, 101, 222, 261, 198, 199, 27, 439, 292, 397, 195, 387, 383, 187, 458, 18, 501, 321, 415, 148, 234, 84, 363, 54, 414, 150, 390, 263, 151, 252, 408, 80, 206, 431, 14, 374, 234, 351, 59, 510, 356, 160, 99, 315, 428, 419, 347, 342, 507, 63, 456, 128, 423, 40, 155, 203, 479, 281, 332, 169, 187, 147, 485, 323, 32, 132, 276, 307, 150, 183, 338, 423, 120, 496, 40, 251, 304, 299, 484, 472, 183, 187, 51, 320, 18, 226, 122, 357, 54, 239, 428, 266, 309, 475, 386, 17, 422, 257, 270, 474, 475, 35, 80, 297, 109, 336, 270, 119, 30, 26, 273, 209, 349, 392, 316, 195, 13, 39, 439, 293, 101, 308, 111, 123, 101, 268, 34, 175, 213, 70, 486, 68, 25, 353, 186, 500, 322, 77, 491, 233, 235, 191, 276, 176, 413, 422, 359, 189, 247, 122, 192, 188, 128, 89, 451, 1, 221, 439, 103, 413, 427, 21, 163, 265, 428, 468, 282, 309, 63, 420, 51, 89, 268, 453, 192, 463, 37, 410, 162, 21, 189, 171, 297, 247, 31, 270, 461, 46, 68, 309, 54, 315, 456, 136, 436, 88, 466, 420, 483, 42, 28, 238, 17, 125, 156, 241, 284, 315, 60, 202, 114, 145, 246, 130, 241, 500, 165, 90, 486, 242, 268, 99, 304, 367, 81, 46, 53, 375, 172, 384, 93, 34, 89, 446, 492, 336, 88, 417, 135, 257, 42, 16, 480, 152, 12, 417, 238, 220, 378, 394, 376, 86, 255, 281, 480, 278, 216, 197, 352, 337, 180, 10, 241, 298, 188, 175, 360, 290, 465, 233, 345, 486, 3, 491, 3, 463, 506, 457, 117, 247, 501, 281, 233, 5, 452, 267, 181, 70, 19, 410, 166, 123, 492, 300, 306, 188, 286, 171, 134, 323, 434, 383, 347, 347, 302, 295, 134, 29, 337, 439, 76, 
357, 169, 353, 276, 179, 491, 225, 114, 274, 26, 422, 169, 8, 31, 412, 399, 371, 357, 415, 19, 321, 197, 241, 131, 367, 56, 16, 419, 498, 436, 339, 445, 154, 479, 268, 64, 322, 410, 328, 128, 0, 117, 102, 305, 439, 307, 223, 23, 498, 197, 274, 43, 250, 173, 237, 138, 308, 256, 423, 51, 290, 193, 501, 243, 389, 98, 224, 395, 334, 214, 154, 25, 113, 361, 115, 303, 172, 27, 310, 390, 238, 137, 387, 58, 360, 143, 473, 258, 245, 499, 505, 152, 389, 99, 268, 389, 105, 460, 261, 79, 203, 62, 395, 478, 29, 389, 450, 479, 443, 379, 271, 373, 433, 79, 99, 96, 38, 219, 490, 281, 165, 7, 486, 18, 465, 146, 32, 125, 29, 384, 215, 137, 263, 104, 93, 125, 13, 39, 325, 151, 199, 327, 308, 502, 338, 52, 247, 502, 324, 238, 480, 431, 28, 210, 383]"
-...
 ```
-- The `idx` column assigns a unique index to each sample.
-- The `seq_len` column represents the length of the sequence.
-- The `aa_seq` column (amino acid sequence) is used as the input feature.
-- The `struct_seq` column contains the reference structure tokens obtained from the structure tokenizer. You can set `struct_seq` to a list of zeros if no structure tokens are available.
 ### Build any downstream models from this backbone with ModelGenerator
 For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

 Each token activates 2 experts using a top-2 routing mechanism. A visual summary of the architecture is provided below:
 <center>
+  <img src="https://huggingface.co/genbio-ai/AIDO.Protein-16B/resolve/main/proteinmoe_architecture.png" alt="AIDO.Protein-16B Architecture" style="width:70%; height:auto;" />
 </center>
 The input sequence should be single-chain amino acid sequences.
 - **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
+- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
 ## Results
 2. Run the prediction command:
    ```bash
+   mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml
    ```
+   This will pull the CASP14, CASP15, and CAMEO datasets from [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins) and predict the structure tokens from the amino acid sequences.
+3. Convert the output `.tsv` to `.pt` and extract model codebook:
+   ```bash
+   # convert the predicted structures in tsv into one pt file
+   python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
+   # extract the codebook of the structure tokenizer
+   python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt
+   ```
+4. Run the decoding command to get 3D structures in PDB format (currently this script only supports single-GPU inference):
    ```bash
+   CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
+     --data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
+     --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=logs/protein2structoken_16b/predict_predictions.pt \
+     --data.init_args.config.struct_tokens_datasets_configs.codebook_path=logs/protein2structoken_16b/codebook.pt
    ```
+   The outputs are in `logs/protstruct_decode/protein2structoken_16b_pdb_files/`.
+5. You can compare the predicted structures with the ground truth PDBs in [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins/tree/main).
+Alternatively, you can provide your own input amino acid sequences in a CSV file. An example CSV is provided at `experiments/AIDO.StructureTokenizer/protein2structoken_example_input.csv` in `ModelGenerator`:
 ```
+idx,aa_seq
+example,KEFWNLDKNLQLRLGIVFLG
+```
+Here, `idx` is a unique name, and `aa_seq` is the amino acid sequence. To use this custom CSV file, replace the second step with:
+```bash
+mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml \
+ --data.init_args.path=experiments/AIDO.StructureTokenizer/ \
+ --data.init_args.test_split_files=[protein2structoken_example_input.csv]
 ```
 ### Build any downstream models from this backbone with ModelGenerator
 For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
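
The new README text reduces the custom input to a two-column CSV (`idx,aa_seq`). As a rough illustration of preparing such a file, here is a minimal Python sketch. The helper name `build_input_csv` is hypothetical and not part of ModelGenerator. Restricting sequences to the 20 standard residues is an assumption made here for safety, since the README does not say which symbols the tokenizer accepts.

```python
import csv
import io

# Assumption: only the 20 standard one-letter amino acid codes are allowed.
# The README does not state whether other symbols (e.g. X, U) are accepted.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def build_input_csv(records):
    """Return CSV text with the `idx,aa_seq` header shown in the example input.

    `records` is an iterable of (idx, aa_seq) pairs. This helper is a
    hypothetical convenience, not a ModelGenerator API.
    """
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["idx", "aa_seq"])
    for idx, seq in records:
        seq = seq.strip().upper()
        if not seq:
            raise ValueError(f"{idx}: empty sequence")
        if idx in seen:
            # The README describes `idx` as a unique name.
            raise ValueError(f"duplicate idx: {idx}")
        unknown = set(seq) - STANDARD_AA
        if unknown:
            raise ValueError(f"{idx}: non-standard residues {sorted(unknown)}")
        seen.add(idx)
        writer.writerow([idx, seq])
    return buffer.getvalue()


if __name__ == "__main__":
    # Reproduces the single-row example from the README.
    print(build_input_csv([("example", "KEFWNLDKNLQLRLGIVFLG")]), end="")
```

Saving the returned text to a file and pointing `--data.init_args.path` and `--data.init_args.test_split_files` at it, as in the command added by this commit, should then run prediction on the custom sequences.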