agemagician committed
Commit: 78376fc
Parent(s): 99b17de
update readme
README.md
CHANGED
@@ -15,7 +15,7 @@ Pretrained model on protein sequences using a masked language modeling (MLM) obj
 
 ## Model description
 
-ProtT5-XL-
+ProtT5-XL-UniRef50 is based on the `t5-3b` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
 This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
 publicly available data) with an automatic process to generate inputs and labels from those protein sequences.
 
@@ -42,9 +42,9 @@ from transformers import T5Tokenizer, T5Model
 import re
 import torch
 
-tokenizer = T5Tokenizer.from_pretrained('Rostlab/
+tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
 
-model = T5Model.from_pretrained("Rostlab/
+model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
 
 sequences_Example = ["A E T C Z A O","S K T Z P"]
 
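For readers checking the updated snippet above: a minimal end-to-end sketch of how the renamed checkpoint is typically used for feature extraction is shown below. The tokenization and forward-pass details are assumptions pieced together from the surrounding README context (`sequences_Example`, `decoder_embedding = embedding[0].cpu().numpy()`), not a verbatim copy of the model card's code.

```python
import re
import torch
from transformers import T5Tokenizer, T5Model

# Load the renamed checkpoint exactly as in the updated lines above.
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")

sequences_Example = ["A E T C Z A O", "S K T Z P"]

# Map rare/ambiguous amino acids (U, Z, O, B) to X, as the ProtT5 cards do.
sequences_Example = [re.sub(r"[UZOB]", "X", seq) for seq in sequences_Example]

# Tokenize with padding so both sequences fit in one batch.
ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
input_ids = torch.tensor(ids['input_ids'])
attention_mask = torch.tensor(ids['attention_mask'])

# T5Model needs decoder inputs; feeding the encoder inputs back is one way to
# obtain decoder-side embeddings (an assumption, not necessarily the card's exact call).
with torch.no_grad():
    embedding = model(input_ids=input_ids,
                      attention_mask=attention_mask,
                      decoder_input_ids=input_ids)

# embedding[0] is the decoder's last hidden state, matching the
# `decoder_embedding = embedding[0].cpu().numpy()` context line in the diff.
encoder_embedding = embedding.encoder_last_hidden_state.cpu().numpy()
decoder_embedding = embedding[0].cpu().numpy()
```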
@@ -65,7 +65,7 @@ decoder_embedding = embedding[0].cpu().numpy()
 
 ## Training data
 
-The ProtT5-XL-
+The ProtT5-XL-UniRef50 model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 45 million protein sequences.
 
 ## Training procedure
 
@@ -87,14 +87,15 @@ The details of the masking procedure for each sequence are as follows:
 
 ### Pretraining
 
-The model was trained on a single TPU Pod
+The model was trained on a single TPU Pod V2-256 for 600 thousand steps in total, using sequence length 512 (batch size 2k).
+It was trained using ProtT5-XL-BFD model as an initial checkpoint, rather than training from scratch.
 It has a total of approximately 3B parameters and was trained using the encoder-decoder architecture.
 The optimizer used is AdaFactor with inverse square root learning rate schedule for pre-training.
 
 
 ## Evaluation results
 
-When the model is used for feature
+When the model is used for feature extraction, this model achieves the following results:
 
 Test results :
 
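On the updated pretraining description (AdaFactor with an inverse square root learning-rate schedule): the original ProtT5 pretraining ran in the T5/TPU stack, but as a rough, purely illustrative sketch, an equivalent optimizer setup with PyTorch `transformers` looks like the following. The `t5-small` placeholder and the dummy step are assumptions for illustration only.

```python
import torch
from transformers import T5ForConditionalGeneration
from transformers.optimization import Adafactor, AdafactorSchedule

# Placeholder model; the actual pretraining used the ~3B-parameter ProtT5-XL
# architecture on a TPU Pod, not this PyTorch setup.
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# relative_step + warmup_init gives Adafactor's built-in schedule, which decays
# roughly as 1/sqrt(step) after warmup (the "inverse square root" schedule).
optimizer = Adafactor(
    model.parameters(),
    lr=None,               # let Adafactor derive the relative step size
    relative_step=True,
    warmup_init=True,
    scale_parameter=True,
)
lr_scheduler = AdafactorSchedule(optimizer)  # exposes the internal lr, e.g. for logging

# One dummy optimization step to show the update pattern.
input_ids = torch.tensor([[13, 5, 1]])       # arbitrary token ids
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```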