---
license: other
---

# AIDO.Protein2StructureToken-16B

**AIDO.Protein2StructureToken-16B** is a version of [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) fine-tuned for protein structure prediction. It takes amino acid sequences as input and predicts structure tokens that can be decoded into 3D structures by [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder). On structure prediction tasks it surpasses existing state-of-the-art models such as **ESM3-open**.

## Model Architecture Details

This model retains the architecture of AIDO.Protein-16B: a transformer encoder-only architecture with the dense MLP layers replaced by sparse Mixture of Experts (MoE) layers. Each token activates 2 experts via a top-2 routing mechanism (a minimal routing sketch follows the figure). A visual summary of the architecture is provided below:
*Figure: AIDO.Protein-16B architecture.*
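To make the top-2 routing concrete, here is a minimal PyTorch sketch of the mechanism. This is an illustrative assumption, not the released 16B implementation: the class name `Top2MoE`, the expert MLP shape, and all sizes are made up for demonstration; only the "8 experts, 2 active per token" pattern follows the description above.

```python
# Illustrative sketch of top-2 MoE routing (NOT the released implementation):
# a linear router scores each token, the two highest-scoring experts are
# selected, and their outputs are mixed with renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        gates = F.softmax(self.router(x), dim=-1)         # (tokens, experts)
        top2_gates, top2_idx = gates.topk(2, dim=-1)      # 2 experts per token
        top2_gates = top2_gates / top2_gates.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(2):                             # each selected slot
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, slot] == e             # tokens routed to e
                if mask.any():
                    out[mask] += top2_gates[mask, slot, None] * expert(x[mask])
        return out

moe = Top2MoE(hidden_size=64)
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

In practice MoE implementations batch tokens per expert for efficiency; the per-expert loop here is kept only for readability.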
### Key Differences

The final output linear layer has been adapted to support a new vocabulary size:

- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens, without special tokens)

### Architecture Parameters

| Component                      | Value |
|--------------------------------|-------|
| Number of Attention Heads      | 36    |
| Number of Hidden Layers        | 36    |
| Hidden Size                    | 2304  |
| Number of MoE Layers per Block | 8     |
| Number of MoE Layers per Token | 2     |
| Input Vocabulary Size          | 44    |
| Output Vocabulary Size         | 512   |
| Context Length                 | 1024  |

## Training Details

Fine-tuning consumed **0.4 trillion tokens** drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Batch Size**: Global batch size of 2048
- **Context Length**: 1024
- **Precision**: FP16
- **Hardware**: 64 NVIDIA A100 80GB GPUs
- **Learning Rate**: Max learning rate of 1e-4
- **Scheduler**: Cosine decay with 2.5% warmup
- **Tokens Trained**: 0.4T tokens
- **Training Steps**: 200k steps

## Tokenization

The input should be a single-chain amino acid sequence.

- **Input Tokenization**: Sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34), as illustrated in the sketch below.
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
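The per-residue input tokenization scheme can be illustrated with a small, self-contained sketch. Only the `[SEP]` id of 34 comes from this card; the residue-to-id mapping below is hypothetical (the real tokenizer ships with ModelGenerator).

```python
# Hypothetical illustration of input tokenization: one token per residue,
# terminated with [SEP] (id=34). The residue ids are made up for
# demonstration; use the ModelGenerator tokenizer for real ids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}  # assumed ids
SEP_ID = 34  # [SEP] id documented above

def tokenize(aa_seq: str) -> list:
    """Tokenize a single-chain amino acid sequence at the residue level."""
    return [VOCAB[aa] for aa in aa_seq] + [SEP_ID]

print(tokenize("KEFWNLDKNLQLRLGIVFLG"))  # 20 residue ids + [SEP]
```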
## Results

*Figure: Structure prediction results.*
## How to Use

### Structure Prediction

To reproduce the structure prediction results described above, follow these steps:

1. Install the [ModelGenerator package](https://github.com/genbio-ai/ModelGenerator/).
2. Run the prediction command:
   ```bash
   mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml
   ```
   This pulls the CASP14, CASP15, and CAMEO datasets from [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins) and predicts the structure tokens from the amino acid sequences.
3. Convert the output `.tsv` to `.pt` and extract the model codebook:
   ```bash
   # convert the predicted structures in tsv into one pt file
   python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
   # extract the codebook of the structure tokenizer
   python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt
   ```
4. Run the decoding command to get 3D structures in PDB format (currently this script only supports single-GPU inference):
   ```bash
   CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
     --data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
     --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=logs/protein2structoken_16b/predict_predictions.pt \
     --data.init_args.config.struct_tokens_datasets_configs.codebook_path=logs/protein2structoken_16b/codebook.pt
   ```
   The outputs are written to `logs/protstruct_decode/protein2structoken_16b_pdb_files/`.
5. Compare the predicted structures with the ground-truth PDBs in [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins/tree/main).

Alternatively, you can provide your own input amino acid sequences in a CSV file (a sketch for generating one programmatically follows at the end of this section). Here is an example CSV at `experiments/AIDO.StructureTokenizer/protein2structoken_example_input.csv` in `ModelGenerator`:

```
idx,aa_seq
example,KEFWNLDKNLQLRLGIVFLG
```

Here, `idx` is a unique name and `aa_seq` is the amino acid sequence. To use this customized CSV file, replace the second step with:

```bash
mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml \
  --data.init_args.path=experiments/AIDO.StructureTokenizer/ \
  --data.init_args.test_split_files=[protein2structoken_example_input.csv]
```

### Build any downstream models from this backbone with ModelGenerator

For more information, visit: [ModelGenerator](https://github.com/genbio-ai/modelgenerator)

```bash
mgen fit --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```

The usage of this model is the same as [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B); you only need to change `model.backbone` to `aido_protein2structoken_16b`.
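As referenced above, here is a minimal Python sketch for writing a custom input CSV in the expected `idx,aa_seq` format. The sequence names, second sequence, and output file name are placeholders.

```python
# Hypothetical helper that writes a custom input CSV in the idx,aa_seq
# format expected by the prediction config; names and paths are placeholders.
import csv

sequences = {
    "my_protein_1": "KEFWNLDKNLQLRLGIVFLG",  # example from this card
    "my_protein_2": "MKTAYIAKQRQISFVKSHFSRQ",  # placeholder sequence
}

with open("my_custom_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["idx", "aa_seq"])  # header expected by the data module
    for idx, aa_seq in sequences.items():
        writer.writerow([idx, aa_seq])
```

Point `--data.init_args.path` at the directory containing this file and pass its name via `--data.init_args.test_split_files`, as in the command above.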
### Or use directly in Python

#### Embedding

```python
from modelgenerator.tasks import Embed

model = Embed.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

#### Sequence-Level Classification

```python
import torch
from modelgenerator.tasks import SequenceClassification

model = SequenceClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Token-Level Classification

```python
import torch
from modelgenerator.tasks import TokenClassification

model = TokenClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Regression

```python
from modelgenerator.tasks import SequenceRegression

model = SequenceRegression.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
```

## Citation

Please cite AIDO.Protein and AIDO.StructureTokenizer using the following BibTeX entries:

```
@inproceedings{zhang_balancing_2024,
  title     = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
  url       = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
  doi       = {10.1101/2024.12.02.626366},
  publisher = {bioRxiv},
  author    = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
  year      = {2024},
  booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}

@inproceedings{sun_mixture_2024,
  title     = {Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
  url       = {https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1},
  doi       = {10.1101/2024.11.29.625425},
  publisher = {bioRxiv},
  author    = {Sun, Ning and Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Cheng, Xingyi and Song, Le and Xing, Eric P.},
  year      = {2024},
  booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
}
```