yarongef committed on
Commit
ca62e58
1 Parent(s): 36d6389

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -15,8 +15,6 @@ In addition to cross entropy and cosine teacher-student losses, DistilProtBert w
 # Model description
 
 DistilProtBert was pretrained on millions of protein sequences.
-This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
-publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
 
 A few important differences between the DistilProtBert model and the original ProtBert version are:
 1. The size of the model
@@ -33,5 +31,9 @@ The model can be used the same as ProtBert.
 
 ## Training data
 
-DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences after length filtering (only sequences of length 20 to 512 amino acid were used).
+DistilProtBert model was pretrained on [Uniref50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences (only sequences between 20 and 512 amino acids in length were used).
 
+# Pretraining procedure
+
+Preprocessing was done using ProtBert's tokenizer.
+The masking procedure for each sequence followed the original BERT (as described in [ProtBert](https://huggingface.co/Rostlab/prot_bert)).
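For readers who want to see what the added "Pretraining procedure" section amounts to in practice, the snippet below is a minimal sketch, assuming Hugging Face `transformers`; it is not the authors' training code. It length-filters a couple of made-up sequences, tokenizes them with ProtBert's tokenizer, and applies BERT-style masking with `DataCollatorForLanguageModeling`. The 15% masking probability is the original BERT default and is an assumption here, not something stated in this commit.

```python
# Minimal sketch of the described preprocessing; not the DistilProtBert training code.
from transformers import BertTokenizer, DataCollatorForLanguageModeling

# ProtBert's tokenizer expects uppercase amino acids separated by spaces.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Hypothetical sequences standing in for Uniref50 entries.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "MSILVTRPSPAGEEL",  # shorter than 20 residues, dropped by the filter
]

# Length filter from the Training data section: keep 20 to 512 amino acids.
filtered = [s for s in sequences if 20 <= len(s) <= 512]

# Tokenize with spaces between residues, truncating to the 512-token limit.
encodings = tokenizer([" ".join(s) for s in filtered],
                      truncation=True, max_length=512)

# BERT-style masked-language-model masking (15% of tokens, the BERT default).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])
print(batch["input_ids"].shape)          # masked, padded input batch
print((batch["labels"] != -100).sum())   # number of positions being predicted
```

Running the collator again on the same inputs produces a different mask each time, since the masking is applied dynamically per batch rather than fixed at preprocessing time.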