agemagician committed
Commit: 78376fc
Parent(s): 99b17de
update readme
README.md
CHANGED
@@ -15,7 +15,7 @@ Pretrained model on protein sequences using a masked language modeling (MLM) obj
 
 ## Model description
 
-ProtT5-XL-
+ProtT5-XL-UniRef50 is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
 This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
 publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
 
@@ -42,9 +42,9 @@ from transformers import T5Tokenizer, T5Model
 import re
 import torch
 
-tokenizer = T5Tokenizer.from_pretrained('Rostlab/
+tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
 
-model = T5Model.from_pretrained("Rostlab/
+model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
 
 sequences_Example = ["A E T C Z A O","S K T Z P"]
 
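For readers checking the updated snippet above: a minimal end-to-end sketch of how the renamed checkpoint is typically used for feature extraction is shown below. The tokenization and forward-pass details are assumptions pieced together from the surrounding README context (`sequences_Example`, `decoder_embedding = embedding[0].cpu().numpy()`), not a verbatim copy of the model card's code.

```python
import re
import torch
from transformers import T5Tokenizer, T5Model

# Load the renamed checkpoint exactly as in the updated lines above.
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

sequences_Example = ["A E T C Z A O", "S K T Z P"]

# Map rare/ambiguous amino acids (U, Z, O, B) to X, as the ProtT5 cards do.
sequences_Example = [re.sub(r"[UZOB]", "X", seq) for seq in sequences_Example]

# Tokenize with padding so both sequences fit in one batch.
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

# T5Model needs decoder inputs; feeding the encoder inputs back is one way to
# obtain decoder-side embeddings (an assumption, not necessarily the card's exact call).
with torch.no_grad():
    embedding = model(input_ids=input_ids,
                      attention_mask=attention_mask,
                      decoder_input_ids=input_ids)

# embedding[0] is the decoder's last hidden state, matching the
# `decoder_embedding = embedding[0].cpu().numpy()` context line in the diff.
encoder_embedding = embedding.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()
```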
@@ -65,7 +65,7 @@ decoder_embedding = embedding[0].cpu().numpy()
 
 ## Training data
 
-The ProtT5-XL-
+The ProtT5-XL-UniRef50 model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 45 million protein sequences.
 
 ## Training procedure
 
@@ -87,14 +87,15 @@ The details of the masking procedure for each sequence are as follows:
 
 ### Pretraining
 
-The model was trained on a single TPU Pod
+The model was trained on a single TPU Pod V2-256 for 600 thousand steps in total, using sequence length 512 (batch size 2k).
+It was trained using ProtT5-XL-BFD model as an initial checkpoint, rather than training from scratch.
 It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
 The optimizer used is AdaFactor with inverse square root learning rate schedule for pre-training.
 
 
 ## Evaluation results
 
-When the model is used for feature
+When the model is used for feature extraction, this model achieves the following results:
 
 Test results :
 
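On the updated pretraining description (AdaFactor with an inverse square root learning-rate schedule): the original ProtT5 pretraining ran in the T5/TPU stack, but as a rough, purely illustrative sketch, an equivalent optimizer setup with PyTorch `transformers` looks like the following. The `t5-small` placeholder and the dummy step are assumptions for illustration only.

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor, AdafactorSchedule

# Placeholder model; the actual pretraining used the ~3B-parameter ProtT5-XL
# architecture on a TPU Pod, not this PyTorch setup.
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# relative_step + warmup_init gives Adafactor's built-in schedule, which decays
# roughly as 1/sqrt(step) after warmup (the "inverse square root" schedule).
optimizer = Adafactor(
    model.parameters(),
    lr=None,               # let Adafactor derive the relative step size
    relative_step=True,
    warmup_init=True,
    scale_parameter=True,
)
lr_scheduler = AdafactorSchedule(optimizer)  # exposes the internal lr, e.g. for logging

# One dummy optimization step to show the update pattern.
input_ids = torch.tensor([[13, 5, 1]])       # arbitrary token ids
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```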