agemagician committed on
Commit
78376fc
1 Parent(s): 99b17de

update readme

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -15,7 +15,7 @@ Pretrained model on protein sequences using a masked language modeling (MLM) obj

## Model description

- ProtT5-XL-BFD is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
+ ProtT5-XL-UniRef50 is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data) with an automatic process to generate inputs and labels from those protein sequences.

@@ -42,9 +42,9 @@ from transformers import T5Tokenizer, T5Model
import re
import torch

- tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_bfd', do_lower_case=False)
+ tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)

- model = T5Model.from_pretrained("Rostlab/prot_t5_xl_bfd")
+ model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

sequences_Example = ["A E T C Z A O","S K T Z P"]

@@ -65,7 +65,7 @@ decoder_embedding = embedding[0].cpu().numpy()

## Training data

- The ProtT5-XL-BFD model was pretrained on [BFD](https://bfd.mmseqs.com/), a dataset consisting of 2.1 billion protein sequences.
+ The ProtT5-XL-UniRef50 model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 45 million protein sequences.

## Training procedure

@@ -87,14 +87,15 @@ The details of the masking procedure for each sequence are as follows:

### Pretraining

- The model was trained on a single TPU Pod V3-1024 for 1.2 million steps in total, using sequence length 512 (batch size 4k).
+ The model was trained on a single TPU Pod V2-256 for 600 thousand steps in total, using sequence length 512 (batch size 2k).
+ It was trained using ProtT5-XL-BFD model as an initial checkpoint, rather than training from scratch.
It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
The optimizer used is AdaFactor with inverse square root learning rate schedule for pre-training.


## Evaluation results

- When the model is used for feature etraction, this model achieves the following results:
+ When the model is used for feature extraction, this model achieves the following results:

Test results :
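
For reference, below is a minimal end-to-end version of the embedding-extraction snippet that this commit renames, using the new `Rostlab/prot_t5_xl_uniref50` identifier. Only the tokenizer and model lines appear in the hunks above; the rare-amino-acid mapping, the batching calls, and the output-field names are filled in here from the usual ProtT5 usage pattern and should be read as an illustrative sketch, not the exact README code.

```python
from transformers import T5Tokenizer, T5Model
import re
import torch

# Identifiers taken from the "+" lines of this diff.
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

sequences_Example = ["A E T C Z A O", "S K T Z P"]

# Assumption: map rare/ambiguous amino acids (U, Z, O, B) to X, as the full README does.
sequences_Example = [re.sub(r"[UZOB]", "X", seq) for seq in sequences_Example]

# Tokenize with padding so both sequences fit in one batch.
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

# T5Model is an encoder-decoder, so decoder inputs are required even for feature extraction.
with torch.no_grad():
    output = model(input_ids=input_ids,
                   attention_mask=attention_mask,
                   decoder_input_ids=input_ids)

# Per-residue embeddings; the attribute names are the standard transformers output fields,
# used here instead of the positional indexing shown in the README.
encoder_embedding = output.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = output.last_hidden_state.cpu().numpy()
```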
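
The `### Pretraining` hunk mentions AdaFactor with an inverse square root learning rate schedule, but the training code itself is not part of this repository. As a rough illustration only, the `Adafactor` implementation shipped with `transformers` reproduces that kind of schedule in its relative-step mode; the settings below are assumptions for demonstration, not the values used for pretraining.

```python
from transformers import T5Model
from transformers.optimization import Adafactor

model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Adafactor in "relative step" mode derives an inverse-square-root decaying
# step size internally, which matches the schedule named in the model card.
optimizer = Adafactor(
    model.parameters(),
    lr=None,               # let Adafactor compute the step size itself
    relative_step=True,    # inverse-sqrt decay of the relative step size
    warmup_init=True,      # gentle warm-up before the decay kicks in
    scale_parameter=True,
)
```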