---

license: mit
language: protein
tags:
- protein language model
datasets:
- Uniref50
---


# DistilProtBert model

A distilled version of the [ProtBert](https://huggingface.co/Rostlab/prot_bert) protein language model.
In addition to the cross-entropy and cosine teacher-student distillation losses, DistilProtBert was pretrained with a masked language modeling (MLM) objective. It works only with capital-letter amino acid sequences.
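As an illustration of how these three objectives can be combined, here is a minimal DistilBERT-style loss sketch in PyTorch; the temperature, equal weighting, and variable names are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                      labels, temperature=2.0):
    """Illustrative combination of the three pretraining objectives (weights assumed)."""
    # 1. Soft-target cross entropy (KL divergence) between student and teacher predictions
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # 2. Standard MLM loss against the true amino acid labels (-100 marks unmasked positions)
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1), ignore_index=-100
    )

    # 3. Cosine loss aligning student and teacher hidden states
    target = torch.ones(
        student_hidden.size(0) * student_hidden.size(1), device=student_hidden.device
    )
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    return loss_ce + loss_mlm + loss_cos  # equal weighting assumed
```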

## Model description

DistilProtBert was pretrained on millions of protein sequences.

A few important differences between the DistilProtBert model and the original ProtBert are:
1. The size of the model
2. The size of the pretraining dataset
3. Time & hardware used for pretraining

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.

### How to use

The model can be used in the same way as ProtBert.
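For example, per-residue embeddings can be extracted with the Hugging Face Transformers library following the ProtBert usage pattern. In the sketch below, the model ID and the example sequence are assumptions for illustration only.

```python
import re
from transformers import BertModel, BertTokenizer

# Model ID assumed to be the repository hosting this card
tokenizer = BertTokenizer.from_pretrained("yarongef/DistilProtBert", do_lower_case=False)
model = BertModel.from_pretrained("yarongef/DistilProtBert")

# Arbitrary example protein sequence; amino acids must be upper case
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

# ProtBert-style preprocessing: map rare amino acids (U, Z, O, B) to X and space-separate residues
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

embeddings = outputs.last_hidden_state  # per-residue embeddings, shape (1, seq_len + 2, hidden_size)
```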

## Training data

The DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences of 20 to 512 amino acids in length were used).
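A minimal sketch of the length filter described above, assuming the data is available as a standard UniRef50 FASTA file; the file name and the use of Biopython are assumptions.

```python
from Bio import SeqIO

# Keep only sequences of 20 to 512 amino acids, as described above
filtered = []
for record in SeqIO.parse("uniref50.fasta", "fasta"):  # file name assumed
    if 20 <= len(record.seq) <= 512:
        filtered.append(str(record.seq).upper())

print(f"Kept {len(filtered)} sequences")
```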

## Pretraining procedure

Preprocessing was done using ProtBert's tokenizer.
The masking procedure for each sequence followed the original BERT scheme (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
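For reference, the standard BERT masking scheme (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) is available through the Transformers data collator. The sketch below is illustrative, not the original training script, and the example sequence is arbitrary.

```python
from transformers import BertTokenizer, DataCollatorForLanguageModeling

# Uses the ProtBert tokenizer, as described above
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# 15% masking probability; the 80/10/10 mask/random/keep split is applied internally
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("M K T A Y I A K Q R Q I S F V K S H F S R Q L")])
print(batch["input_ids"])   # some residue tokens replaced by [MASK] or random tokens
print(batch["labels"])      # original token ids at masked positions, -100 elsewhere
```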