Update README.md
README.md CHANGED
@@ -15,8 +15,6 @@ In addition to cross entropy and cosine teacher-student losses, DistilProtBert w
 # Model description

 DistilProtBert was pretrained on millions of proteins sequences.
-This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
-publicly available data) with an automatic process to generate inputs and labels from those protein sequences.

 Few important differences between DistilProtBert model and the original ProtBert version are:
 1. The size of the model
@@ -33,5 +31,9 @@ The model can be used the same as ProtBert.

 ## Training data

-DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences
+DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids in length were used).

+# Pretraining procedure
+
+Preprocessing was done using ProtBert's tokenizer.
+The details of the masking procedure for each sequence followed the original BERT (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
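For readers who want to see what the preprocessing described in the added lines might look like, below is a minimal sketch using the Hugging Face `transformers` API and the `Rostlab/prot_bert` tokenizer linked from the README. The example sequences are arbitrary, and the 15% masking rate is BERT's default rather than a value stated in this commit; this is an illustration under those assumptions, not the authors' actual training code.

```python
# Sketch only: BERT-style masked-LM preprocessing with ProtBert's tokenizer.
# Assumes the standard Hugging Face transformers API; not the authors' scripts.
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Arbitrary example sequences (single-letter amino acid codes).
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSILVTRPSPAGEEL",  # shorter than 20 residues, dropped by the filter below
]

# Keep only sequences of 20-512 amino acids, as stated under "Training data".
sequences = [s for s in sequences if 20 <= len(s) <= 512]

# ProtBert's tokenizer expects amino acids separated by spaces.
# (ProtBert's model card additionally maps rare amino acids U, Z, O, B to X.)
spaced = [" ".join(seq) for seq in sequences]
encodings = tokenizer(spaced, truncation=True, max_length=512)

# BERT-style dynamic masking; 15% is the original BERT default (assumed here).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```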