---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---
# DistilProtBert model
A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) model.
In addition to cross entropy and cosine teacher-student losses, DistilProtBert was pretrained on a masked language modeling (MLM) objective. It only works with uppercase amino acid letters.
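
To make the three loss terms concrete, here is a minimal pure-Python sketch of how a teacher-student cross entropy, a cosine loss, and an MLM loss could be combined for a single masked position. It is not the training code used for DistilProtBert: the temperature, the equal weights, and all function names are assumptions chosen for illustration, loosely following the common DistilBERT-style recipe.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits, temperature=2.0):
    """Cross entropy of the student's softened distribution against the teacher's.

    The temperature value is an assumption, not taken from the model card.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def mlm_loss(student_logits, true_token_id):
    """Standard cross entropy against the true token at a masked position."""
    return -math.log(softmax(student_logits)[true_token_id])


def distillation_loss(student_logits, teacher_logits, student_hidden,
                      teacher_hidden, true_token_id,
                      w_ce=1.0, w_cos=1.0, w_mlm=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_ce * soft_cross_entropy(student_logits, teacher_logits)
            + w_cos * cosine_loss(student_hidden, teacher_hidden)
            + w_mlm * mlm_loss(student_logits, true_token_id))
```

In a real training loop these terms would be computed over batches of tensors and averaged over masked positions; the scalar version above only shows how the pieces fit together.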
# Model description
DistilProtBert was pretrained on millions of protein sequences.
A few important differences between DistilProtBert and the original ProtBert are:
1. Size of the model
2. Size of the pretraining dataset
3. Hardware used for pretraining
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
### How to use
The model can be used in the same way as ProtBert.
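
As a rough illustration of ProtBert-style usage, the sketch below formats a raw sequence (uppercase, rare residues mapped to `X`, one space between amino acids) and then extracts per-residue embeddings. The repository id `yarongef/DistilProtBert` and the helper names `prepare_sequence` and `embed` are assumptions made for this example; substitute the id under which this model is actually published.

```python
import re


def prepare_sequence(sequence):
    """Format a raw protein string the way ProtBert-style tokenizers expect."""
    sequence = sequence.strip().upper()          # the model only handles uppercase letters
    sequence = re.sub(r"[UZOB]", "X", sequence)  # map rare residues to X, as ProtBert's card does
    return " ".join(sequence)                    # one whitespace-separated token per residue


def embed(sequence, model_id="yarongef/DistilProtBert"):
    """Return per-residue embeddings; needs `transformers` and `torch` installed.

    `model_id` is an assumed repository name, not confirmed by this card.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=False)
    model = BertModel.from_pretrained(model_id).eval()
    inputs = tokenizer(prepare_sequence(sequence), return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state  # shape: (1, tokens, hidden)
```

Calling `embed("MKTAYIAKQRQISFVKSHFSRQ")` downloads the weights on first use; the first and last positions of the output correspond to the `[CLS]` and `[SEP]` tokens.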
## Training data
DistilProtBert was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
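
The length filter could look roughly like the sketch below. Treating both bounds as inclusive is an assumption, and `filter_by_length` is a hypothetical helper, not part of the original pipeline.

```python
def filter_by_length(sequences, min_len=20, max_len=512):
    """Keep sequences whose length falls within [min_len, max_len].

    Inclusive bounds are an assumption; the card only says "between 20 and 512".
    """
    return [seq for seq in sequences if min_len <= len(seq) <= max_len]
```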
# Pretraining procedure
Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT (as mentioned in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
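
For reference, the original BERT scheme selects about 15% of tokens for prediction and, among those, replaces 80% with `[MASK]`, 10% with a random token, and leaves 10% unchanged. A minimal sketch of that scheme follows; the per-token coin flip approximates BERT's sampling, and the vocabulary and function name are assumptions for illustration rather than the actual pretraining code.

```python
import random

# Illustrative vocabulary: the 20 standard amino acids.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Apply BERT-style masking; returns (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")                 # 80%: mask token
            elif roll < 0.9:
                masked.append(rng.choice(AMINO_ACIDS))  # 10%: random residue
            else:
                masked.append(token)                    # 10%: keep original
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels
```

Passing a seeded `random.Random` as `rng` makes the masking reproducible, which is convenient when debugging.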
The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, and the optimizer was AdamW with a learning rate of 5e-5, using mixed precision.
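
The hyperparameters above could be expressed with the Hugging Face `Trainer` API roughly as follows. This is a sketch under assumptions: the card does not say which training framework was used, "local batch size" is read as the per-device batch size, and `fp16` stands in for the unspecified mixed precision mode.

```python
# Hyperparameters stated in this card, mapped to TrainingArguments field names.
PRETRAINING_HPARAMS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,  # "local batch size" read as per-device
    "learning_rate": 5e-5,
    "optim": "adamw_torch",             # AdamW
    "fp16": True,                       # assumption: card says only "mixed precision"
}


def build_training_arguments(output_dir):
    """Build TrainingArguments; needs `transformers` and a CUDA device for fp16."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **PRETRAINING_HPARAMS)
```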
## Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
| Task/Dataset | Secondary structure (3-states) | Membrane |
|:-----:|:-----:|:-----:|
| CASP12 | 72 | |
| TS115 | 81 | |
| CB513 | 79 | |
| DeepLoc | | 86 |
Distinguish between:
### BibTeX entry and citation info