---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---
# DistilProtBert model

A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) protein language model.

In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective, and it works only with capital-letter amino acids.
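As a rough illustration of how these three objectives can be combined (a minimal sketch in the spirit of DistilBERT-style distillation; the loss weights, temperature, and tensor shapes below are assumptions, not the published training configuration):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      mlm_labels, temperature=2.0):
    """Combine the three objectives mentioned above: soft teacher-student cross entropy,
    cosine alignment of hidden states, and standard MLM cross entropy.
    Expects logits of shape (N, vocab) and hidden states of shape (N, dim),
    i.e. already flattened over batch and sequence length."""
    # Soft cross entropy between teacher and student token distributions,
    # implemented via KL divergence on temperature-scaled logits.
    ce_teacher = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Cosine loss pulling the student's hidden states toward the teacher's.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cosine = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    # MLM cross entropy on the masked positions (labels of -100 are ignored).
    mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)

    # Equal weighting is an assumption; the real training may weight the terms differently.
    return ce_teacher + cosine + mlm
```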
# Model description

DistilProtBert was pretrained on millions of protein sequences.

The main differences between DistilProtBert and the original ProtBert are:

1. The size of the model
2. The size of the pretraining dataset
3. Time & hardware used for pretraining
## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks; a feature-extraction sketch is shown below.
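A minimal feature-extraction sketch, following the ProtBert usage conventions (the hub checkpoint ID `yarongef/DistilProtBert` and the example sequence are assumptions, not taken from this card):

```python
import re
from transformers import BertModel, BertTokenizer, pipeline

model_name = "yarongef/DistilProtBert"  # assumed hub ID; adjust if the checkpoint is published elsewhere

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)

# Amino acids must be upper case and space-separated; rare residues (U, Z, O, B)
# are mapped to X, following the ProtBert preprocessing convention.
sequence = "A E T C Z A O"
sequence = re.sub(r"[UZOB]", "X", sequence)

fe = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
embedding = fe(sequence)  # per-token embeddings, including the [CLS] and [SEP] positions
```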
### How to use

The model can be used in the same way as ProtBert; a masked-token prediction sketch follows.
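For example, masked-token prediction can be run with the `fill-mask` pipeline (the checkpoint ID and the input sequence below are illustrative assumptions):

```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model_name = "yarongef/DistilProtBert"  # assumed hub ID

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Input is an upper-case, space-separated amino-acid sequence containing one [MASK] token.
unmasker("D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T")
```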
## Training data

DistilProtBert was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of 20 to 512 amino acids were used); a sketch of this length filter is shown below.
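A minimal sketch of the length-filtering step, assuming a local UniRef50 FASTA file; the file path, helper name, and inclusive boundaries are assumptions rather than the authors' preprocessing code:

```python
def filter_by_length(fasta_path, min_len=20, max_len=512):
    """Keep only sequences whose length falls in [min_len, max_len] (boundaries assumed inclusive)."""
    kept, header, chunks = [], None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None and min_len <= len("".join(chunks)) <= max_len:
                    kept.append((header, "".join(chunks)))
                header, chunks = line, []
            else:
                chunks.append(line)
    # Don't forget the last record in the file.
    if header is not None and min_len <= len("".join(chunks)) <= max_len:
        kept.append((header, "".join(chunks)))
    return kept

sequences = filter_by_length("uniref50.fasta")  # assumed local path
```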
# Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.

The masking procedure for each sequence followed that of the original BERT (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).

The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, the optimizer was AdamW with a learning rate of 5e-5, and mixed-precision settings were used. A minimal sketch of such an MLM setup is shown below.
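A minimal sketch of an MLM setup with the 🤗 `Trainer` using the hyperparameters listed above (local batch size 16, AdamW at 5e-5, 3 epochs, mixed precision). The tiny inline dataset, the checkpoint IDs, and the 15% masking rate are assumptions, and the teacher-student distillation losses are omitted, so this is an illustration rather than the authors' training script:

```python
from datasets import Dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# ProtBert's tokenizer is reused for preprocessing, as described above.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Student model; "yarongef/DistilProtBert" is an assumed hub ID (a freshly initialized
# student config would be used when pretraining from scratch).
model = BertForMaskedLM.from_pretrained("yarongef/DistilProtBert")

# Tiny illustrative dataset; the real pretraining used the filtered UniRef50 sequences.
raw = Dataset.from_dict({"sequence": ["M K T A Y I A K Q R", "A E T C G A V L I P"]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["sequence"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["sequence"],
)

# BERT-style dynamic masking (15% of tokens).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilprotbert-mlm",
    per_device_train_batch_size=16,  # local batch size
    learning_rate=5e-5,              # AdamW is the Trainer's default optimizer
    num_train_epochs=3,
    fp16=True,                       # mixed precision
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=tokenized)
trainer.train()
```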
## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

| Task/Dataset | Secondary structure (3-state) | Secondary structure (8-state) | Localization | Membrane |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | 75 | 63 | | |
| TS115 | 83 | 72 | | |
| CB513 | 81 | 66 | | |
| DeepLoc | | | 79 | 91 |
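The numbers above come from fine-tuned models. As a rough, hedged illustration (not the authors' released fine-tuning code), the checkpoint can be loaded for a binary sequence-level task such as membrane prediction like this; the hub ID and label count are assumptions:

```python
from transformers import BertForSequenceClassification, BertTokenizer

model_name = "yarongef/DistilProtBert"  # assumed hub ID

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)

# num_labels=2 for a membrane vs. soluble task; secondary structure prediction would
# instead use a token-classification head (e.g. BertForTokenClassification).
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
```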
Distinguish between:

### BibTeX entry and citation info