Distilled protein language model of [ProtBert](https://huggingface.co/Rostlab/prot_bert).

In addition to the cross-entropy and cosine teacher-student losses, DistilProtBert was pretrained on a masked language modeling (MLM) objective, and it works only with capital-letter amino acids.
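
To make the MLM objective concrete, the masked-token head can be queried with the standard `transformers` fill-mask pipeline. This is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as `yarongef/DistilProtBert` (an assumed identifier) and following the ProtBert convention of upper-case, space-separated amino acids.

```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

# Assumed Hub identifier; replace with the actual DistilProtBert checkpoint id.
model_name = "yarongef/DistilProtBert"

tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForMaskedLM.from_pretrained(model_name)

# ProtBert-style input: upper-case amino acids separated by single spaces.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
predictions = unmasker("D L I P T S S K L V V [MASK] D T S L Q V K K A F F A L V T")

for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```
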
# Model description

DistilProtBert was pretrained on millions of protein sequences. This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those protein sequences.

A few important differences between the DistilProtBert model and the original ProtBert version are:
1. The size of the model (a parameter-count comparison is sketched after this list)
2. The size of the pretraining dataset
3. The time and hardware used for pretraining
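
As a rough illustration of the first difference, the two checkpoints' parameter counts can be compared directly. The DistilProtBert identifier below is an assumption, and both downloads are sizeable.

```python
from transformers import AutoModel

# "Rostlab/prot_bert" matches the ProtBert link above;
# the DistilProtBert identifier is an assumption.
for name in ["Rostlab/prot_bert", "yarongef/DistilProtBert"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```
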
## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.
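
For downstream tasks, a task-specific head can be attached in the usual `transformers` way. The sketch below assumes a hypothetical binary per-sequence classification task and the same assumed Hub identifier as above; it is not a recipe from the DistilProtBert authors.

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Assumed Hub identifier and hypothetical two-class task.
model_name = "yarongef/DistilProtBert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Upper-case, space-separated amino acids, matching the tokenizer's vocabulary.
sequences = ["M K T A Y I A K Q R", "S H M S L F D F F K"]
batch = tokenizer(sequences, padding=True, return_tensors="pt")

# Forward pass; during fine-tuning, passing `labels=` to the model
# also returns a standard classification loss.
logits = model(**batch).logits
print(logits.shape)  # (batch_size, num_labels)
```
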
### How to use

The model can be used in the same way as ProtBert.
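
For example, per-residue embeddings can be extracted as in the sketch below, which assumes the checkpoint is available on the Hub as `yarongef/DistilProtBert` and maps rare amino acids (U, Z, O, B) to `X`, following the ProtBert convention.

```python
import re

from transformers import BertModel, BertTokenizer

# Assumed Hub identifier; replace with the actual DistilProtBert checkpoint id.
model_name = "yarongef/DistilProtBert"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
model = BertModel.from_pretrained(model_name)

# Upper-case amino acids separated by spaces; rare residues mapped to X.
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R"
sequence = re.sub(r"[UZOB]", "X", sequence)

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

# Per-residue embeddings, shape (batch, tokens, hidden); includes [CLS]/[SEP].
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```
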
## Training data

The DistilProtBert model was pretrained on [UniRef50](https://www.uniprot.org/downloads), a dataset consisting of ~43 million protein sequences after length filtering (only sequences of 20 to 512 amino acids were used).
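
The length filter itself is simple to reproduce. The sketch below is illustrative only (not the authors' preprocessing code) and assumes a local UniRef50 FASTA file named `uniref50.fasta`.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


# Keep only sequences of 20 to 512 amino acids, as described above.
kept = [
    (name, seq)
    for name, seq in read_fasta("uniref50.fasta")
    if 20 <= len(seq) <= 512
]
print(f"{len(kept)} sequences kept after length filtering")
```
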