yarongef committed
Commit 36d6389
1 Parent(s): 0ac854d

Update README.md

Files changed (1)
README.md +22 -0
README.md CHANGED
@@ -13,3 +13,25 @@ Distilled protein language of [ProtBert](https://huggingface.co/Rostlab/prot_ber
In addition to cross entropy and cosine teacher-student losses, DistilProtBert was pretrained on a masked language modeling (MLM) objective and it only works with capital letter amino acids.

# Model description
+
+ DistilProtBert was pretrained on millions of protein sequences.
+ This means it was pretrained on raw protein sequences only, with no human labelling of any kind (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from those protein sequences.
+
+ A few important differences between the DistilProtBert model and the original ProtBert version are:
+ 1. The size of the model
+ 2. The size of the pretraining dataset
+ 3. Time & hardware used for pretraining
+
+ ## Intended uses & limitations
+
+ The model can be used for protein feature extraction or fine-tuned on downstream tasks.
+
+ ### How to use
+
+ The model can be used in the same way as ProtBert (see the sketch below).
+
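A minimal feature-extraction sketch with Hugging Face Transformers, mirroring the ProtBert usage examples, is shown below; the checkpoint id `yarongef/DistilProtBert` and the example sequence are assumptions made for illustration.

```python
import re

from transformers import BertModel, BertTokenizer

# Assumed Hub checkpoint id; adjust if the model is published under a different name.
MODEL_NAME = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = BertModel.from_pretrained(MODEL_NAME)

# DistilProtBert expects uppercase amino acids; as in the ProtBert examples,
# rare/ambiguous residues are mapped to X and residues are separated by spaces.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

# Per-residue embeddings for feature extraction:
# shape is [batch, sequence length incl. special tokens, hidden size].
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```

Since the model was pretrained with an MLM objective, the same checkpoint should also load with `BertForMaskedLM` for fill-mask style predictions.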
+ ## Training data
+
+ The DistilProtBert model was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences after length filtering (only sequences of 20 to 512 amino acids were used).
+
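For reference, the length filter described above could be sketched as follows; the FASTA file name and the `read_fasta` helper are illustrative assumptions, not the pipeline actually used.

```python
# Sketch of the 20-512 residue length filter applied to a UniRef50 FASTA file.
# The file name "uniref50.fasta" is an assumption for illustration.
MIN_LEN, MAX_LEN = 20, 512


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


kept = [
    (name, seq)
    for name, seq in read_fasta("uniref50.fasta")
    if MIN_LEN <= len(seq) <= MAX_LEN
]
print(f"{len(kept)} sequences kept after length filtering")
```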