genbio-ai
/

AIDO.StructureEncoder

Safetensors

biology

genbio

model_hub_mixin

protein

pytorch_model_hub_mixin


jyz-mbzuai committed on Dec 10, 2024

Commit

1a1cab9

1 Parent(s): 35e3d0f

update readme


Files changed (2)

README.md +95 -3
assets/images/architecture.png +0 -0

README.md CHANGED

@@ -10,10 +10,103 @@ license: other
 # AIDO.StructureEncoder
-AIDO.StructureDecoder is the encoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) for tokenization of protein structures.
 ## How to Use
-Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator)
 # Citation
 Please cite AIDO.StructureTokenizer using the following BibTeX code:
@@ -28,4 +121,3 @@ Please cite AIDO.StructureTokenizer using the following BibTex code:
     booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
 }
 ```
-- Docs: [More Information Needed]

 # AIDO.StructureEncoder
+AIDO.StructureEncoder is the encoder-only component of [AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer) for tokenization of protein structures.
+## Model Description
+![Model Architecture](./assets/images/architecture.png)
+**AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
+- Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
+- Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
+- Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
+This model balances reconstruction fidelity against structural locality, which makes it well suited to downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
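The role of the discrete codebook can be illustrated with a toy nearest-neighbor lookup. This is a minimal sketch of the standard VQ-VAE quantization step, not the model's actual implementation: the README does not say how the lookup is performed, so the Euclidean distance, the function name `quantize`, and the tiny 4-entry codebook are all assumptions made for illustration (the real codebook has 512 entries).

```python
import math


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry.

    Assumption: nearest neighbor under Euclidean distance, as in a
    standard VQ-VAE. The actual tokenizer may differ.
    """
    tokens = []
    for z in latents:
        distances = [math.dist(z, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy codebook: 4 entries of dimension 2 (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy per-residue latent vectors (hypothetical values).
toy_latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

toy_tokens = quantize(toy_latents, toy_codebook)
print(toy_tokens)  # [0, 3, 2]
```

Each continuous latent vector is replaced by a single integer, which is what makes the encoded structure usable as a token sequence.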
+### Key Features
+- Encoding Structures into Tokens (See [below](#how-to-use))
+- Decoding Tokens into Structures (See [genbio-ai/AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder))
+- Reconstructing Structures (See [genbio-ai/AIDO.StructureTokenizer](https://huggingface.co/genbio-ai/AIDO.StructureTokenizer))
+- Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)
 ## How to Use
+Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.
+### Setup
+Install [Model Generator](https://github.com/genbio-ai/modelgenerator)
+#### Data preparation
+To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at [genbio-ai/sample-structure-dataset](https://huggingface.co/datasets/genbio-ai/sample-structure-dataset). It can be downloaded with:
+```bash
+huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/
+```
+This dataset is based on the CASP15 dataset, which can be referenced at:
+- [CASP15 Prediction Center](https://predictioncenter.org/casp15/)
+- [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15)
+The downloaded directory includes:
+- A `registries` folder containing a CSV file with metadata such as filenames and PDB IDs.
+- A `CASP15_merged` folder containing PDB files, where domains are merged in the same way as described in [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15).
+To use customized data, you can prepare a dataset with the following structure:
+- A folder containing PDB files (supported formats: `cif.gz`, `cif`, `ent.gz`, `pdb`).
+Then, you need to prepare a registry file in CSV format using the following command:
+```bash
+python experiments/AIDO.StructureTokenizer/register_dataset.py \
+    --folder_path /path/to/folder_path \
+    --format cif.gz \
+    --output_file /path/to/output_file.csv
+```
+In the following steps, set `folder_path` to this folder and `registry_path` to the generated CSV file.
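Before registering a custom folder, it can help to check which of the supported formats it actually contains, since `--format` takes a single value. The helper below is a hypothetical convenience script, not part of Model Generator. It only looks at the top level of the folder, and whether `register_dataset.py` also searches subfolders is not stated here.

```python
from collections import Counter
from pathlib import Path

# Formats listed as supported in this README.
SUPPORTED_FORMATS = ("cif.gz", "ent.gz", "cif", "pdb")


def count_structure_files(folder):
    """Count the files in `folder` (top level only) per supported format."""
    counts = Counter()
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        for fmt in SUPPORTED_FORMATS:
            if path.name.endswith("." + fmt):
                counts[fmt] += 1
                break
    return counts


# Example (path is a placeholder):
# print(count_structure_files("/path/to/folder_path"))
```

If more than one format shows up, one option is to split the files into separate folders and register each with its own `--format` value.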
+#### Running Encoding Task
+If you use the sample dataset, you can run the encoding task using the following command:
+```bash
+CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode.yaml
+```
+If you use your own dataset, you need to update the `folder_path` and the `registry_path` in the `encode.yaml` configuration file to point to your dataset folder and registry file. Alternatively, you can override these parameters when running the command:
+```bash
+CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode.yaml \
+    --data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
+    --data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
+    --data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
+    --trainer.callbacks.dict_kwargs.output_dir="your_output_dir"
+```
+**Input:**
+- The PDB files in the dataset folder.
+- The registry file in CSV format indicating the metadata of the dataset.
+**Output:**
+- The encoded tokens are saved in the output directory specified in the configuration file (`logs/protstruct_encode/` by default).
+- The encoded tokens are saved as a `.pt` file, which can be loaded with PyTorch. The file contains a dictionary that maps each protein name to its encoded tokens (`struct_tokens`) and to auxiliary information (`aatype`, `residue_index`) needed for reconstruction.
+  The dictionary is structured as follows:
+  ```python
+  {
+      'T1137s5_nan': { # `nan` stands in for the chain ID, which CASP15 entries do not have
+          'struct_tokens': tensor([449, 313, 207, 129, ...]),
+          'aatype': tensor([ 4,  7,  5, 17, ...]),
+          'residue_index': tensor([ 33,  34,  35,  36, ...]),
+      },
+      ...
+  }
+  ```
+- A codebook file (`codebook.pt`) that contains the embedding of each token. The shape is `(num_tokens, embedding_dim)`.
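As a rough sketch of how the saved output might be inspected, the snippet below loads a `.pt` file and reports the number of tokens per protein. The helper names `load_encoded` and `summarize` are made up for this example, and the file name in the usage comment is a placeholder, since the name of the saved file is not given here.

```python
def load_encoded(path):
    """Load a saved `.pt` file onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so `summarize` works without PyTorch

    return torch.load(path, map_location="cpu")


def summarize(encoded):
    """Map each protein name to its number of structure tokens."""
    return {name: len(entry["struct_tokens"]) for name, entry in encoded.items()}


# Example usage (placeholder file name inside the default output directory):
# encoded = load_encoded("logs/protstruct_encode/your_output_file.pt")
# for name, num_tokens in summarize(encoded).items():
#     print(name, num_tokens)
```

Assuming `struct_tokens` is an integer tensor and the codebook is loaded the same way, indexing the codebook with it (`codebook[entry["struct_tokens"]]`) should give one embedding per token, with shape `(len(struct_tokens), embedding_dim)`.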
+**Notes:**
+- Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
+- The auxiliary information (`aatype` and `residue_index`) can be substituted with placeholder values if not required.
+    - `aatype`: This parameter is used to reconstruct the protein sidechains. If sidechain reconstruction is not needed, you can assign dummy values (e.g., all zeros).
+    - `residue_index`: This parameter helps the model identify gaps in residue numbering, which can influence structural predictions. If gaps are present, the model may introduce holes in the structure. For structures without gaps, you can use a continuous sequence of integers (e.g., 0 to n-1).
+- You may need to adjust the `max_nb_res` parameter in the configuration file to match the longest protein in your dataset. Proteins with more than `max_nb_res` residues are truncated. The default value is 1024.
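The notes above can be sketched with plain Python lists. The helpers below are hypothetical and not part of Model Generator: `placeholder_aux` builds the dummy values described for `aatype` and `residue_index`, `find_gaps` reports where residue numbering is not consecutive, and `exceeds_limit` flags proteins that would be truncated. Converting the lists to tensors (for example with `torch.tensor`) is left to the caller.

```python
def placeholder_aux(num_residues):
    """Dummy auxiliary values: all-zero `aatype`, consecutive `residue_index`."""
    return {
        "aatype": [0] * num_residues,
        "residue_index": list(range(num_residues)),
    }


def find_gaps(residue_index):
    """Return (previous, next) pairs where the numbering is not consecutive."""
    return [
        (a, b)
        for a, b in zip(residue_index, residue_index[1:])
        if b - a != 1
    ]


def exceeds_limit(lengths, max_nb_res=1024):
    """Return the names of proteins longer than `max_nb_res` residues."""
    return [name for name, n in lengths.items() if n > max_nb_res]


print(placeholder_aux(3))
print(find_gaps([33, 34, 35, 40, 41]))  # [(35, 40)]
print(exceeds_limit({"short": 200, "long": 1500}))  # ['long']
```

For tensors loaded from the output file, calling `.tolist()` on `residue_index` first keeps `find_gaps` working on plain integers.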
 # Citation
 Please cite AIDO.StructureTokenizer using the following BibTeX code:
     booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
 }
 ```

assets/images/architecture.png ADDED