---
license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---

# DistilProtBert model

Distilled protein language model of [ProtBert](https://huggingface.co/Rostlab/prot_bert).
In addition to cross entropy and cosine teacher-student losses, DistilProtBert was pretrained on a masked language modeling (MLM) objective. It only works with capital-letter amino acids.

# Model description

DistilProtBert was pretrained on millions of protein sequences.

A few important differences between DistilProtBert and the original ProtBert are:

1. The size of the model
2. The size of the pretraining dataset
3. Time & hardware used for pretraining

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

### How to use

The model can be used in the same way as ProtBert.

## Training data

DistilProtBert was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids long were used).

# Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original Bert (as mentioned in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
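
Since the model is described above as usable in the same way as ProtBert, feature extraction could look roughly like the sketch below. It is not an official usage snippet. It assumes the `transformers` and `torch` packages are installed, and that ProtBert's input conventions apply unchanged: residues separated by spaces, rare residues U, Z, O and B mapped to X. `MODEL_ID` is a placeholder, because this card does not state the repository id. `prepare_sequence` and `embed` are hypothetical helper names.

```python
import re

# Placeholder repository id (an assumption): replace it with the real one.
MODEL_ID = "your-namespace/DistilProtBert"
# Preprocessing is described as using ProtBert's tokenizer, so load that one.
TOKENIZER_ID = "Rostlab/prot_bert"


def prepare_sequence(sequence: str) -> str:
    """Format a raw amino-acid string the way ProtBert's tokenizer expects."""
    # Upper-case, since the model only works with capital-letter amino acids.
    sequence = sequence.strip().upper()
    # Map the rare residues U, Z, O, B to X (ProtBert convention, assumed here).
    sequence = re.sub(r"[UZOB]", "X", sequence)
    # Separate residues with single spaces so each one becomes its own token.
    return " ".join(sequence)


def embed(sequence: str, model_id: str = MODEL_ID):
    """Return the last hidden states for one protein sequence."""
    # Imported lazily so prepare_sequence works without these packages.
    import torch
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(TOKENIZER_ID, do_lower_case=False)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(prepare_sequence(sequence), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # One vector per token, including the [CLS] and [SEP] special tokens.
    return outputs.last_hidden_state
```

Calling `embed("MKTAYIAKQR")` with a valid repository id should return a tensor with one row per token. Averaging over the residue positions is one common way to get a single fixed-size vector per protein for downstream tasks.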