license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
DistilProtBert
A distilled version of the ProtBert model. In addition to the cross-entropy and cosine teacher-student losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It only works with uppercase amino acid letters.
Model description
DistilProtBert was pretrained on millions of protein sequences.
A few important differences between DistilProtBert and the original ProtBert are:
- Size of the model
- Size of the pretraining dataset
- Hardware used for pretraining
Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
How to use
The model can be used in the same way as ProtBert.
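As a rough illustration of ProtBert-style usage, the sketch below prepares a sequence and extracts per-residue embeddings with the `transformers` library. It assumes ProtBert's input convention (uppercase residues separated by spaces, with the rare residues U, Z, O and B mapped to X). The helper names `prepare_sequence` and `extract_features` are hypothetical, and the Hub repository ID is not stated in this card, so it is passed in as an argument.

```python
import re


def prepare_sequence(seq: str) -> str:
    """Format a raw protein sequence the way ProtBert expects it (assumed convention)."""
    seq = re.sub(r"\s+", "", seq).upper()   # the model only works with capital letters
    seq = re.sub(r"[UZOB]", "X", seq)       # map rare residues to X, as ProtBert does
    return " ".join(seq)                    # one space-separated token per residue


def extract_features(seq: str, model_name: str):
    """Return per-residue embeddings; model_name is the Hub ID of this model."""
    # Imported lazily so prepare_sequence can be used without torch installed.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = BertModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(prepare_sequence(seq), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Drop the [CLS] and [SEP] positions to keep one vector per residue.
    return outputs.last_hidden_state[0, 1:-1]
```

For fine-tuning, the same tokenizer and preprocessing would apply; only the model class changes (for example a token or sequence classification head).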
Training data
DistilProtBert was pretrained on UniRef50, a dataset of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).
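The length restriction can be sketched as a simple filter. This is only an illustration: the function name is hypothetical, and treating both bounds as inclusive is an assumption, since the card does not say how the endpoints were handled.

```python
MIN_LEN = 20   # shortest sequence kept, per the range stated above
MAX_LEN = 512  # longest sequence kept, per the range stated above


def filter_by_length(sequences, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep sequences whose length lies in [min_len, max_len] (inclusive bounds assumed)."""
    return [s for s in sequences if min_len <= len(s) <= max_len]
```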
Pretraining procedure
Sequences were tokenized with ProtBert's tokenizer. The masking procedure for each sequence followed the original BERT (as described for ProtBert).
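For reference, the original BERT masking scheme selects about 15% of tokens as prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The sketch below illustrates that scheme on a list of residues. It is not the authors' training code: the function name, the 20-letter vocabulary, and the per-token sampling are illustrative assumptions.

```python
import random

# Illustrative vocabulary of the 20 standard amino acids (not the real tokenizer vocab).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Apply BERT-style masking; returns (masked_tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"          # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(AMINO_ACIDS)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels
```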
The model was pretrained on a single DGX cluster for 3 epochs in total. The local batch size was 16, and the optimizer was AdamW with a learning rate of 5e-5. Training used mixed precision.
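The three loss terms named in the description (cross-entropy and cosine teacher-student losses plus MLM) are typically combined as a weighted sum in DistilBERT-style distillation. The formula below is a sketch of that general form, not a statement of the exact objective used here: the weights α, β, γ, any softmax temperature, and the precise definition of each term are not given in this card and are assumptions.

```latex
\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{CE}} + \beta\,\mathcal{L}_{\mathrm{cos}} + \gamma\,\mathcal{L}_{\mathrm{MLM}},
\qquad
\mathcal{L}_{\mathrm{CE}} = -\sum_{i} t_i \log s_i,
\qquad
\mathcal{L}_{\mathrm{cos}} = 1 - \cos\!\left(h^{S}, h^{T}\right)
```

Here t_i and s_i denote the teacher (ProtBert) and student (DistilProtBert) output probabilities over the vocabulary, and h^S and h^T are the corresponding student and teacher hidden states.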
Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
Task/Dataset | Secondary structure (3-states) | Membrane |
---|---|---|
CASP12 | 72 | |
TS115 | 81 | |
CB513 | 79 | |
DeepLoc | | 86 |
Distinguish between: