---
license: mit
tags:
- protein language model
datasets:
- Uniref50
---

# DistilProtBert

A distilled version of the [ProtBert-UniRef100](https://huggingface.co/Rostlab/prot_bert) model.
In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective, and it works only with capital-letter amino acids.
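For intuition, the sketch below shows a DistilBERT-style combination of these three terms (soft-target cross entropy between teacher and student, cosine alignment of hidden states, and the standard MLM loss). The function, its equal loss weights, and the temperature are illustrative placeholders, not the exact values or code used to train DistilProtBert.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, temperature=2.0):
    """Hypothetical DistilBERT-style objective: soft-target cross entropy,
    cosine alignment of hidden states, and the usual MLM loss."""
    # Soft-target cross entropy (written as KL divergence, which differs
    # from cross entropy only by a constant w.r.t. the student).
    soft_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Cosine loss pulling student hidden states toward the teacher's.
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1))
    cosine = F.cosine_embedding_loss(s, t, s.new_ones(s.size(0)))

    # Masked language modeling loss on the true residues (-100 = not masked).
    mlm = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )

    # Equal weighting here is a placeholder.
    return soft_ce + cosine + mlm
```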

Check out our paper [DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts](https://doi.org/10.1093/bioinformatics/btac474) for more details.

The code is available in the [GitHub repository](https://github.com/yarongef/DistilProtBert).

# Model details
|    **Model**   | **# of parameters** | **# of hidden layers** | **Pretraining dataset** | **# of proteins** | **Pretraining hardware** |
|:--------------:|:-------------------:|:----------------------:|:-----------------------:|:------------------------------:|:------------------------:|
|    ProtBert    |         420M        |           30           |        UniRef100        |              216M              |       512 16GB TPUs      |
| DistilProtBert |         230M        |           15           |         UniRef50        |               43M              |     5 V100 32GB GPUs     |

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

### How to use

The model is used in the same way as ProtBert and with ProtBert's tokenizer.
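A minimal feature-extraction sketch is shown below, following ProtBert's input conventions (uppercase amino acids separated by spaces, rare residues U/Z/O/B mapped to X). The checkpoint id `yarongef/DistilProtBert` is assumed from this repository; replace it if your copy of the weights lives elsewhere.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

model_name = "yarongef/DistilProtBert"  # assumed checkpoint id

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)
model.eval()

# ProtBert-style input: uppercase amino acids separated by spaces,
# with rare residues (U, Z, O, B) replaced by X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings; the first and last positions are [CLS] and [SEP].
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, sequence length + 2, hidden size)
```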

## Training data

DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset of ~43 million protein sequences (only sequences between 20 and 512 amino acids in length were used).
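For illustration only, the length filter described above amounts to the following check (the helper name and the toy sequences are made up; loading UniRef50 itself is not shown):

```python
def keep_sequence(seq: str, min_len: int = 20, max_len: int = 512) -> bool:
    """Keep only sequences of 20 to 512 amino acids, as described above."""
    return min_len <= len(seq) <= max_len

# Toy example; in practice this is applied to the UniRef50 sequences.
sequences = ["MKT", "M" * 100, "A" * 600]
filtered = [s for s in sequences if keep_sequence(s)]
print(len(filtered))  # 1 -- only the 100-residue sequence passes
```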

# Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT scheme (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).

The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, the optimizer was AdamW with a learning rate of 5e-5, and mixed-precision training was used.
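As a rough sketch, the MLM stage with the hyperparameters stated above could be set up with the Hugging Face `Trainer` as follows. The student initialization, the dataset preparation (`tokenized_uniref50`), and the teacher-student distillation losses are omitted or only hinted at; this is not the authors' actual training code.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Placeholder student: a 15-layer BERT; in practice the student would be
# initialized from the teacher's weights, which is not shown here.
config = BertConfig.from_pretrained("Rostlab/prot_bert", num_hidden_layers=15)
student = BertForMaskedLM(config)

# BERT-style masking: 15% of tokens are selected for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="distilprotbert-mlm",
    per_device_train_batch_size=16,  # local batch size of 16
    learning_rate=5e-5,              # AdamW is the Trainer's default optimizer
    num_train_epochs=3,
    fp16=True,                       # mixed precision
)

# trainer = Trainer(model=student, args=args, data_collator=collator,
#                   train_dataset=tokenized_uniref50)  # dataset prep not shown
# trainer.train()
```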

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

| Task/Dataset | Secondary structure (3-state) | Membrane |
|:-----:|:-----:|:-----:|
|   CASP12  | 72 |    |
|   TS115   | 81 |    | 
|   CB513   | 79 |    |
|  DeepLoc  |    | 86 | 

Distinguishing between real proteins and their k-let shuffled versions (AUC):

_Singlet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_singlets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.71  |
|    ProtBert    |   0.93  |
| DistilProtBert |   0.92  |

_Doublet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_doublets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.68  |
|    ProtBert    |   0.92  |
| DistilProtBert |   0.91  |

_Triplet_ ([dataset](https://huggingface.co/datasets/yarongef/human_proteome_triplets))

|    Model   | AUC |
|:--------------:|:-------:|
|      LSTM      |   0.61  |
|    ProtBert    |   0.92  |
| DistilProtBert |   0.87  |

## Citation
If you use this model, please cite our paper:
```
@article{Geffen2022.05.09.491157,
	author = {Geffen, Yaron and Ofran, Yanay and Unger, Ron},
	title = {DistilProtBert: A distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts},
	year = {2022},
	doi = {10.1101/2022.05.09.491157},
	URL = {https://www.biorxiv.org/content/early/2022/05/10/2022.05.09.491157},
	eprint = {https://www.biorxiv.org/content/early/2022/05/10/2022.05.09.491157.full.pdf},
	journal = {bioRxiv}
}
```