---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---
# DistilProtBert model
Distilled protein language model of [ProtBert](https://huggingface.co/Rostlab/prot_bert).
In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It works only with capital-letter amino acids.
## Model description
DistilProtBert was pretrained on millions of protein sequences.
A few important differences between the DistilProtBert model and the original ProtBert version are:
1. The size of the model
2. The size of the pretraining dataset
3. Time & hardware used for pretraining
## Intended uses & limitations
The model can be used for protein feature extraction or fine-tuned on downstream tasks.
### How to use
The model can be used in the same way as ProtBert, as sketched below.
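A minimal feature-extraction sketch with the `transformers` library, assuming the model is hosted at `yarongef/DistilProtBert` (adjust the identifier if the repository name differs). As with ProtBert, sequences should be uppercase and space-separated:

```python
import re
from transformers import AutoTokenizer, AutoModel

# Assumed model identifier; replace with the actual repository name if different.
model_name = "yarongef/DistilProtBert"

tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model = AutoModel.from_pretrained(model_name)

# ProtBert-style preprocessing: uppercase, space-separated amino acids,
# with rare residues (U, Z, O, B) mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # per-residue representations
```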
## Training data
The DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of 20 to 512 amino acids in length were used).
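A sketch of the length filter described above, using Biopython and assuming a local FASTA copy of UniRef50 (the file names are illustrative):

```python
from Bio import SeqIO

# Hypothetical local paths; point these at your own UniRef50 download.
input_path = "uniref50.fasta"
output_path = "uniref50_len20_512.fasta"

# Keep only sequences of 20 to 512 amino acids, as in the pretraining setup.
kept = (
    record
    for record in SeqIO.parse(input_path, "fasta")
    if 20 <= len(record.seq) <= 512
)
SeqIO.write(kept, output_path, "fasta")
```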
## Pretraining procedure
Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT recipe (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
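For illustration, a sketch of that standard BERT-style dynamic masking with the ProtBert tokenizer and the `transformers` data collator (15% masking probability); this is a simplified illustration, not the exact training script:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Preprocessing uses ProtBert's tokenizer, as stated above.
tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Standard BERT masking: 15% of tokens are selected; of those,
# 80% become [MASK], 10% a random token, 10% stay unchanged.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoded = tokenizer("M K T A Y I A K Q R Q I S F V K S H F S R Q", return_tensors="pt")
batch = collator([{"input_ids": encoded["input_ids"][0]}])
print(batch["input_ids"])  # inputs with some positions masked
print(batch["labels"])     # -100 for unmasked positions, original ids for masked ones
```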