---
license: mit
tags:
- biology
- protein
---
# PLTNUM-SaProt-HeLa
PLTNUM is a protein language model trained to predict protein half-lives based on their sequences.
This model is based on [westlake-repl/SaProt_650M_AF2](https://huggingface.co/westlake-repl/SaProt_650M_AF2) and was trained on a protein half-life dataset from the HeLa human cell line ([paper link](https://pubmed.ncbi.nlm.nih.gov/29414762/)).
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/sagawatatsuya/PLTNUM
- **Paper:** [Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.09.10.612367v1)
- **Demo:** https://huggingface.co/spaces/sagawa/PLTNUM
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoConfig, PreTrainedModel, AutoTokenizer


class PLTNUM_PreTrainedModel(PreTrainedModel):
    config_class = AutoConfig

    def __init__(self, config):
        super(PLTNUM_PreTrainedModel, self).__init__(config)
        # Backbone language model, loaded from the path recorded in the config.
        self.model = AutoModel.from_pretrained(self.config._name_or_path)
        # Two dropout rates feed the same linear head; forward() averages them.
        self.fc_dropout1 = nn.Dropout(0.8)
        self.fc_dropout2 = nn.Dropout(0.4)
        self.fc = nn.Linear(self.config.hidden_size, 1)
        self._init_weights(self.fc)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                nn.init.constant_(module.weight[module.padding_idx], 0.0)
        elif isinstance(module, nn.LayerNorm):
            nn.init.constant_(module.bias, 0)
            nn.init.constant_(module.weight, 1.0)

    def forward(self, inputs):
        outputs = self.model(**inputs)
        # Hidden state of the first token, used as the sequence representation.
        last_hidden_state = outputs.last_hidden_state[:, 0]
        output = (
            self.fc(self.fc_dropout1(last_hidden_state))
            + self.fc(self.fc_dropout2(last_hidden_state))
        ) / 2
        return output

    def create_embedding(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_state = outputs.last_hidden_state[:, 0]
        return last_hidden_state


model = PLTNUM_PreTrainedModel.from_pretrained("sagawa/PLTNUM-SaProt-HeLa")
tokenizer = AutoTokenizer.from_pretrained("sagawa/PLTNUM-SaProt-HeLa")
# from_pretrained already returns the model in evaluation mode;
# calling eval() makes it explicit that dropout is disabled for inference.
model.eval()

seq = "MdSdGdRdGdKpQpGpGpKdApRpApKpAdKdTaRpScSvRvAlGvLaQpFfPrVlGvRvVqHvRvLvLvRvKvGvNpYpSdEpRdVdGdAsGcAnPsVsYvLvArAvVvLvErYvLvTvAvEqIlLcEvLqAlGcNvAqAcRvDvNvKvKhTrRdIrIdPlRlHsLsQqLvAsIqRcNvDdEpEvLsNcKvLvLcGvRpVpTdIrApQpGnGdVhLdPdNdIdQdApVvLpLdPdKdKdTdEpSpHpHpKpPpKpGdKd"
# Named `inputs` rather than `input` to avoid shadowing the Python builtin.
inputs = tokenizer(
    [seq],
    add_special_tokens=True,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_offsets_mapping=False,
    return_attention_mask=True,
    return_tensors="pt",
)

# No gradients are needed for prediction.
with torch.no_grad():
    print(torch.sigmoid(model(inputs)))
```
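The example sequence above alternates uppercase and lowercase letters. This appears to follow the structure-aware format of the SaProt base model, in which each uppercase amino-acid letter is followed by a lowercase Foldseek 3Di structure letter. The sketch below shows one way such a string could be assembled from separate amino-acid and 3Di strings. The helper names `to_structure_aware` and `split_structure_aware` are hypothetical and not part of the PLTNUM repository, and the use of `#` as a placeholder for unknown structure is an assumption taken from the SaProt convention, so check the SaProt documentation before relying on it.

```python
from typing import Optional, Tuple


def to_structure_aware(aa_seq: str, foldseek_seq: Optional[str] = None) -> str:
    """Interleave amino acids (uppercase) with 3Di letters (lowercase).

    Hypothetical helper. If no 3Di string is given, every structure
    position is filled with "#" (assumed mask symbol, see the text above).
    """
    aa_seq = aa_seq.upper()
    if foldseek_seq is None:
        foldseek_seq = "#" * len(aa_seq)
    if len(aa_seq) != len(foldseek_seq):
        raise ValueError("amino-acid and 3Di strings must have the same length")
    return "".join(aa + s.lower() for aa, s in zip(aa_seq, foldseek_seq))


def split_structure_aware(sa_seq: str) -> Tuple[str, str]:
    """Split an interleaved string back into (amino acids, structure letters)."""
    if len(sa_seq) % 2 != 0:
        raise ValueError("structure-aware sequences must have even length")
    return sa_seq[0::2], sa_seq[1::2]


# First five residues of the example sequence above.
print(to_structure_aware("MSGRG", "DDDDD"))  # MdSdGdRdGd
```

The resulting string can be passed to the tokenizer in place of `seq`. Splitting it again with `split_structure_aware` is a quick way to check that the residue and structure strings stayed aligned.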
## Citation
Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models
Tatsuya Sagawa, Eisuke Kanao, Kosuke Ogata, Koshi Imami, Yasushi Ishihama
bioRxiv 2024.09.10.612367; doi: https://doi.org/10.1101/2024.09.10.612367
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->