---
license: apache-2.0
---
# ProteinForceGPT: Generative strategies for modeling, design and analysis of protein mechanics
### Basic information
This protein language model is a 454M-parameter autoregressive, GPT-style transformer trained to analyze and predict the mechanical properties of a large number of protein sequences. The model has both forward and inverse capabilities; for instance, using its generative tasks, it can design novel proteins that meet one or more mechanical constraints.
This protein language foundation model is based on the GPT-NeoX architecture and uses rotary positional embeddings (RoPE). It has 16 attention heads, 36 hidden layers, a hidden size of 1024, an intermediate size of 4096, and a GeLU activation function.
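As a quick check, these hyperparameters can be read from the downloaded configuration (a minimal sketch; the attribute names assume a GPT-NeoX-style config and may differ for the custom model class shipped with this checkpoint):
```python
from transformers import AutoConfig

# Download only the configuration; trust_remote_code is required because the
# checkpoint ships a custom architecture implementation.
config = AutoConfig.from_pretrained("lamm-mit/ProteinForceGPT", trust_remote_code=True)

# Attribute names assume a GPT-NeoX-style config (adjust if they differ).
print(getattr(config, "num_attention_heads", "n/a"))  # expected: 16
print(getattr(config, "num_hidden_layers", "n/a"))    # expected: 36
print(getattr(config, "hidden_size", "n/a"))          # expected: 1024
print(getattr(config, "intermediate_size", "n/a"))    # expected: 4096
```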
The pretraining task is defined as "Sequence<...>", where "..." is an amino acid sequence.
Pretraining dataset: https://huggingface.co/datasets/lamm-mit/GPTProteinPretrained
Pretrained model: https://huggingface.co/lamm-mit/GPTProteinPretrained
In this fine-tuned model, mechanics-related forward and inverse tasks are:
```raw
CalculateForce<GEECDCGSPSNP...>
CalculateEnergy<GEECDCGSPSNP...>
CalculateForceEnergy<GEECDCGSPSNP...>
CalculateForceHistory<GEECDCGSPSNP...>
GenerateForce<0.262>
GenerateEnergy<0.220>
GenerateForceEnergy<0.262,0.220>
GenerateForceHistory<0.004,0.034,0.125,0.142,0.159,0.102,0.079,0.073,0.131,0.105,0.071,0.058,0.072,0.060,0.049,0.114,0.122,0.108,0.173,0.192,0.208,0.153,0.212,0.222,0.244>
```
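The prompts above follow a simple `Task<payload>` template, so they can be assembled with a small helper (a hypothetical convenience function, not part of the released code):
```python
def make_prompt(task: str, payload) -> str:
    """Format a ProteinForceGPT prompt such as 'CalculateForce<GEECDC...>'.

    `payload` is an amino-acid sequence for forward (Calculate...) tasks, or
    one or more numeric targets for inverse (Generate...) tasks.
    """
    if isinstance(payload, (list, tuple)):
        payload = ",".join(f"{v:.3f}" for v in payload)
    return f"{task}<{payload}>"

# Examples mirroring the task list above:
print(make_prompt("CalculateForce", "GEECDCGSPSNP"))       # CalculateForce<GEECDCGSPSNP>
print(make_prompt("GenerateForceEnergy", [0.262, 0.220]))  # GenerateForceEnergy<0.262,0.220>
```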
### Load model
You can load the model using the following code:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

ForceGPT_model_name = 'lamm-mit/ProteinForceGPT'

# Load the tokenizer; the pad token is set to the EOS token for generation.
tokenizer = AutoTokenizer.from_pretrained(ForceGPT_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load the model (custom architecture, hence trust_remote_code) and move it to the device.
model = AutoModelForCausalLM.from_pretrained(
    ForceGPT_model_name,
    trust_remote_code=True,
).to(device)
model.config.use_cache = False
```
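As a sanity check, the loaded model should report roughly 454M parameters:
```python
# Count the model parameters; the total should be close to 454M.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```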
### Inference
Sample inference using the "Sequence<...>" task, where here, the model will simply autocomplete the sequence starting with "AIIAA":
```python
prompt = "Sequence<GEECDC"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens = False)) .unsqueeze(0).to(device)
print(generated.shape, generated)
sample_outputs = model.generate(
inputs=generated,
eos_token_id =tokenizer.eos_token_id,
do_sample=True,
top_k=500,
max_length = 300,
top_p=0.9,
num_return_sequences=1,
temperature=1,
).to(device)
for i, sample_output in enumerate(sample_outputs):
print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
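The decoded output repeats the task wrapper, so the completed amino-acid sequence can be extracted with a regular expression (a sketch that assumes the model closes the `Sequence<...>` template as intended):
```python
import re

decoded = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)

# Pull out the amino-acid string between the angle brackets, if present.
match = re.search(r"Sequence<([A-Z]+)>", decoded)
if match:
    sequence = match.group(1)
    print(len(sequence), sequence)
```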
Sample inference using the "CalculateForce<...>" task, where here, the model will calculate the maximum unfolding force of a given sequence:
```python
prompt = "'CalculateForce<GEECDCGSPSNPCCDAATCKLRPGAQCADGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN>"
generated = torch.tensor(tokenizer.encode(prompt, add_special_tokens = False)) .unsqueeze(0).to(device)
sample_outputs = model.generate(
inputs=generated,
eos_token_id =tokenizer.eos_token_id,
do_sample=True,
top_k=500,
max_length = 300,
top_p=0.9,
num_return_sequences=3,
temperature=1,
).to(device)
for i, sample_output in enumerate(sample_outputs):
print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
```
Output:
```raw
0: CalculateForce<GEECDCGSPSNPCCDAATCKLRPGAQCADGLCCDQCRFKKKRTICRIARGDFPDDRCTGQSADCPRWN> [0.262]
```
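Because sampling is stochastic and `num_return_sequences=3`, one option is to parse the bracketed force value from each completion and average the results (a sketch that assumes the `[0.262]`-style output shown above):
```python
import re

forces = []
for sample_output in sample_outputs:
    decoded = tokenizer.decode(sample_output, skip_special_tokens=True)
    # The predicted maximum force is appended in square brackets, e.g. "[0.262]".
    match = re.search(r"\[([0-9.]+)\]", decoded)
    if match:
        forces.append(float(match.group(1)))

if forces:
    print(f"Mean predicted force: {sum(forces) / len(forces):.3f}")
```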
## Citations
To cite this work:
```
@article{GhafarollahiBuehler_2024,
    title   = {ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning},
    author  = {A. Ghafarollahi and M.J. Buehler},
    journal = {},
    year    = {2024},
    volume  = {},
    pages   = {},
    url     = {}
}
```
The dataset used to fine-tune the model is reported in:
```
@article{NiKaplanBuehler_2024,
    title   = {ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a protein language diffusion model},
    author  = {B. Ni and D.L. Kaplan and M.J. Buehler},
    journal = {Science Advances},
    year    = {2024},
    volume  = {},
    pages   = {},
    url     = {}
}
```