|
---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---
|
|
|
# DistilProtBert |
|
|
|
A distilled version of the [ProtBert-UniRef100](https://huggingface.co/Rostlab/prot_bert) model.

In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. Note that it works only with uppercase amino acid letters.
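The card does not include the loss implementation itself; as a rough sketch (not the authors' code), a DistilBERT-style combined objective of this kind can be written as below. The `temperature` and `alpha_*` weights are illustrative placeholders, not values from the paper; see the Git repository for the actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, labels,
                      temperature=2.0, alpha_mlm=1.0, alpha_ce=1.0, alpha_cos=1.0):
    """DistilBERT-style combined objective (weights and temperature are placeholders)."""
    vocab = student_logits.size(-1)

    # 1) MLM cross-entropy on the student's own predictions
    #    (labels use -100 for unmasked positions, following Hugging Face conventions).
    mlm_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                               labels.reshape(-1), ignore_index=-100)

    # 2) Soft-target loss between student and teacher distributions at a
    #    distillation temperature (KL divergence, as in the DistilBERT reference code).
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 3) Cosine loss aligning student and teacher hidden states.
    flat_student = student_hidden.reshape(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0), device=flat_student.device)
    cos_loss = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    return alpha_mlm * mlm_loss + alpha_ce * ce_loss + alpha_cos * cos_loss
```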
|
|
|
See the [Git](https://github.com/yarongef/DistilProtBert) repository for more details.
|
|
|
# Model details |
|
| **Model** | **# of parameters** | **# of hidden layers** | **Pretraining dataset** | **# of proteins** | **Pretraining hardware** |
|:--------------:|:-------------------:|:----------------------:|:-----------------------:|:------------------------------:|:------------------------:|
| ProtBert | 420M | 30 | UniRef100 | 216M | 512 16GB TPUs |
| DistilProtBert | 230M | 15 | UniRef50 | 43M | 5 V100 32GB GPUs |
|
|
|
## Intended uses & limitations |
|
|
|
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
|
|
|
### How to use |
|
|
|
The model is used in the same way as ProtBert, with ProtBert's tokenizer.
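A minimal usage sketch with the Transformers library, following ProtBert's input convention (uppercase amino acids separated by spaces, with the rare residues U, Z, O, B mapped to X). The checkpoint identifier `yarongef/DistilProtBert` is assumed here; substitute the identifier under which the weights are actually hosted.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint identifier; adjust to wherever the weights are hosted.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)
model.eval()

# ProtBert-style input: uppercase amino acids separated by spaces,
# with the rare residues U, Z, O, B mapped to X.
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R L G L I E V Q"
sequence = re.sub(r"[UZOB]", "X", sequence)

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings, including the [CLS] and [SEP] special tokens.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, number_of_tokens, hidden_size)
```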
|
|
|
## Training data |
|
|
|
DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
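For illustration only, a length filter matching the description above could look like the sketch below; Biopython is assumed purely for FASTA parsing, and the file name is hypothetical. The actual preprocessing code is not part of this card.

```python
from Bio import SeqIO  # Biopython is assumed here purely for FASTA parsing

def filter_by_length(fasta_path, min_len=20, max_len=512):
    """Yield records whose sequence length falls within [min_len, max_len]."""
    for record in SeqIO.parse(fasta_path, "fasta"):
        if min_len <= len(record.seq) <= max_len:
            yield record

# Hypothetical file name; count how many UniRef50 sequences pass the filter.
# kept = sum(1 for _ in filter_by_length("uniref50.fasta"))
```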
|
|
|
# Pretraining procedure |
|
|
|
Preprocessing was done using ProtBert's tokenizer. |
|
The masking procedure for each sequence followed the original BERT recipe (as described for [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
|
|
|
The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, the optimizer was AdamW with a learning rate of 5e-5, and mixed-precision training was used.
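As a hedged sketch (not the authors' training script), this setup maps onto the Transformers `Trainer` API roughly as follows. The tokenized UniRef50 dataset and the distillation-specific losses are omitted, and the student is shown only as a 15-layer BERT configuration; the output directory is a hypothetical path.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# ProtBert's tokenizer was used for preprocessing.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# A 15-layer student configuration derived from the teacher's config.
# (In practice the student would also be initialised from teacher weights; omitted here.)
student_config = BertConfig.from_pretrained("Rostlab/prot_bert", num_hidden_layers=15)
model = BertForMaskedLM(student_config)

# BERT-style dynamic masking: 15% of tokens, with the usual 80/10/10 replacement scheme.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="distilprotbert-pretraining",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=16,  # the "local batch size" of 16
    learning_rate=5e-5,              # Trainer's default optimizer is AdamW
    fp16=True,                       # mixed precision
)

# `train_dataset` would be the tokenized UniRef50 sequences (not shown here):
# trainer = Trainer(model=model, args=training_args,
#                   data_collator=data_collator, train_dataset=train_dataset)
# trainer.train()
```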
|
|
|
## Evaluation results |
|
|
|
When fine-tuned on downstream tasks, this model achieves the following results: |
|
|
|
| Task/Dataset | Secondary structure (3-state) | Membrane |
|:-----:|:-----:|:-----:|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
|
|
|
Distinguishing between real proteins and their k-let shuffled versions (an illustrative shuffling sketch follows the tables below):
|
|
|
_Singlet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.71 |
| ProtBert | 0.93 |
| DistilProtBert | 0.92 |
|
|
|
_Doublet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.68 |
| ProtBert | 0.92 |
| DistilProtBert | 0.91 |
|
|
|
_Triplet_ |
|
|
|
| Model | AUC |
|:--------------:|:-------:|
| LSTM | 0.61 |
| ProtBert | 0.92 |
| DistilProtBert | 0.87 |
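For context on the task above: a singlet shuffle simply permutes the residues of a sequence (preserving only its amino acid composition), whereas doublet and triplet shuffles must preserve 2-mer and 3-mer counts and therefore require a dedicated k-let-preserving shuffling algorithm. The snippet below sketches the singlet case only, with a made-up example sequence; it is illustrative rather than the evaluation code from the repository.

```python
import random

def singlet_shuffle(sequence, seed=None):
    """Randomly permute the residues of a protein sequence (singlet shuffle).

    This preserves only the amino acid composition; doublet/triplet shuffles,
    which preserve 2-mer/3-mer counts, need a k-let-preserving algorithm instead.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

# Hypothetical example sequence (not from the evaluation set).
print(singlet_shuffle("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0))
```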