agemagician committed on
Commit
574a8b5
1 Parent(s): d59b2c6

Update README.md


early beta release

Files changed (1): README.md (+120 -0)

README.md CHANGED

---
license: cc-by-nc-sa-4.0
tags:
- biology
- protein
- protein language model
- protein embedding
datasets:
- agemagician/uniref50
---

# ANKH2-Large model

Pretrained model on protein sequences using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/2301.06568) and first released in
[this repository](https://github.com/agemagician/Ankh). The model was trained on uppercase amino acids and only works with capital-letter amino acid sequences.


## Model description

ANKH2-Large is based on the `ANKH-Large` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
This means it was pretrained on the raw protein sequences only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those protein sequences.

One important difference between this ANKH2-Large model and the original ANKH-Large version is that it was trained for more epochs.

It has been shown that the features extracted from this self-supervised model (LM-embeddings) capture important biophysical properties governing protein shape.
This implies the model has learned some of the grammar of the language of life as realized in protein sequences.

## Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks.
We have noticed that on some tasks you can gain more accuracy by fine-tuning the model with the LoRA method rather than using it as a feature extractor (a minimal LoRA sketch follows the extraction example below).
We have also noticed that for feature extraction, it is better to use features extracted from the encoder rather than from the decoder.

### How to use

Here is how to use this model to extract the features of a given protein sequence in PyTorch:

```python
from transformers import AutoTokenizer, T5EncoderModel
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# load the tokenizer and the encoder only, since encoder features are recommended above
# (checkpoint ID assumed for illustration; replace it with this repository's model ID)
tokenizer = AutoTokenizer.from_pretrained("agemagician/ankh2-large")
model = T5EncoderModel.from_pretrained("agemagician/ankh2-large").to(device)
model.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1536)
print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")

# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1536)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1536)
print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
```
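
As noted above, LoRA fine-tuning can outperform feature extraction on some tasks. Below is a minimal sketch using the `peft` library, reusing the `model` object loaded above; the rank, alpha, dropout, and `target_modules` values are illustrative assumptions, not the settings used in our experiments:

```python
from peft import LoraConfig, get_peft_model

# illustrative LoRA settings (assumptions, not the values used in our experiments)
lora_config = LoraConfig(
    r=16,                       # low-rank adapter dimension
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5-style attention projection module names
    bias="none",
)

# wrap the encoder so that only the low-rank adapter weights are trainable
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

# attach your own task-specific head on top of lora_model's last_hidden_state
# and train it with a standard PyTorch or Trainer loop.
```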

## Training data

The ANKH2-Large model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 60 million protein sequences.
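
The corresponding Hub dataset is listed in the model card metadata (`agemagician/uniref50`). A minimal sketch for streaming it with the `datasets` library could look like this; the split name and record layout are assumptions, so check the dataset card before relying on them:

```python
from datasets import load_dataset

# stream UniRef50 from the Hub dataset referenced in the metadata
# (split name "train" is assumed; inspect the dataset card for the exact layout)
uniref50 = load_dataset("agemagician/uniref50", split="train", streaming=True)
first_record = next(iter(uniref50))
print(first_record.keys())
```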

## Training procedure

### Preprocessing

The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 25.
The inputs of the model are then of the form:

```
Protein Sequence </s>
```

The preprocessing step was performed on the fly, by truncating and padding the protein sequences to 512 tokens.

The details of the masking procedure for each sequence are as follows (illustrated below):
- 20% of the amino acids are masked.
- In 100% of the cases, the masked amino acids are replaced by an `<extra_id_num>` token, where "num" is a number between 0 and 115.
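
As a purely illustrative example of this denoising objective (following the standard T5-style sentinel convention; the exact corruption and target construction used during pretraining may differ in detail), a masked input/target pair could look like this:

```python
# illustrative only: standard T5-style sentinel masking, not the exact pretraining code
original  = "MKTAYIAKQR"                        # 10 residues
corrupted = "MK<extra_id_0>AYIA<extra_id_1>QR"  # 2 of 10 residues (~20%) replaced by sentinel tokens
targets   = "<extra_id_0>T<extra_id_1>K</s>"    # each sentinel is followed by the residue it replaced
```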

### Pretraining

The model was trained on a single TPU v4-256 pod for 45 epochs in total, using a sequence length of 512 (batch size 1k).
It was trained using the ANKH-Large model as its initial checkpoint, rather than being trained from scratch.
It has a total of approximately 2B parameters and uses an encoder-decoder architecture.
The optimizer used for pretraining is Adafactor with a linear warmup followed by a linear decay learning-rate schedule.
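
For reference, a comparable optimizer and schedule can be set up with the `transformers` helpers roughly as follows, reusing the `model` object loaded earlier; the learning rate, warmup steps, and total steps are placeholders, not the actual pretraining values:

```python
from transformers.optimization import Adafactor, get_linear_schedule_with_warmup

# Adafactor with an externally managed learning rate plus linear warmup / linear decay
# (lr, warmup steps, and total steps are placeholders, not the actual pretraining values)
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
```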


## Evaluation results

When used for feature extraction (FE) or parameter-efficient fine-tuning (LoRA), the model achieves the following results:

Test results:

| Task/Dataset | Method | Secondary structure (3-states) | Secondary structure (8-states) | Localization | Membrane | Solubility | Fluorescence |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| CASP12 | FE | coming soon | coming soon | | | | |
| CASP12 | LoRA | coming soon | coming soon | | | | |
| TS115 | FE | coming soon | coming soon | | | | |
| TS115 | LoRA | coming soon | coming soon | | | | |
| CB513 | FE | coming soon | coming soon | | | | |
| CB513 | LoRA | coming soon | coming soon | | | | |
| DeepLoc | FE | | | coming soon | coming soon | | |
| DeepLoc | LoRA | | | coming soon | coming soon | | |
| Solubility | FE | | | | | coming soon | |
| Solubility | LoRA | | | | | 74% | |
| Fluorescence | FE | | | | | | coming soon |
| Fluorescence | LoRA | | | | | | 68% |

### BibTeX entry and citation info

```bibtex
@article{elnaggar2023ankh,
  title={Ankh☥: Optimized protein language model unlocks general-purpose modelling},
  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
  journal={bioRxiv},
  pages={2023--01},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)