# GH29BERT
This repository contains the code and testing sequence data needed to reproduce the prediction results of GH29BERT, a protein functional cluster prediction model devised for GH29 family sequences. It is trained with a semi-supervised deep learning method on:
- a. 34,258 unlabelled, non-redundant GH29 sequences extracted from the CAZy and InterPro databases, and
- b. 2,796 labelled sequences covering 45 cluster classes, derived from a thorough SSN (sequence similarity network) analysis.
Specifically, reproducible testing materials (code and data) are provided for the following two types of GH29 sequences used in the submitted manuscript:
- 559 labelled GH29 testing sequences (from a random 80%-20% train/test split of the 2,796 labelled sequences), see `data/test.fasta`
- 15 held-out characterized sequences that were excluded from both pre-training and task-training, see `data/15_seq_for-test.fasta`
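For a quick sanity check of the downloaded data, counting FASTA headers should recover the sequence counts quoted above; a minimal, dependency-free sketch:

```python
# Count the sequences in the provided test files (a FASTA record starts with ">")
for path, expected in [("data/test.fasta", 559), ("data/15_seq_for-test.fasta", 15)]:
    with open(path) as handle:
        n = sum(1 for line in handle if line.startswith(">"))
    print(f"{path}: {n} sequences (expected {expected})")
```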
## Interactive deployment of GH29BERT for prediction testing
The GH29BERT model is also accessible through a user-friendly interface on Hugging Face: https://huggingface.co/spaces/Oiliver/GH29BERT. This web tool is the easiest way to test the provided GH29 sequences above or your own custom GH29 sequences.
## Prerequisites
### Repository download
To get started, clone this repository, e.g., by executing the following in a terminal: `git clone https://github.com/ke-xing/GH29BERT.git`
### Environment preparation
All required packages are listed in the file **environment.yml**.
With the help of [Conda](https://docs.conda.io/projects/conda/en/stable/user-guide/getting-started.html), run `conda env create --file environment.yml` to create an independent environment for running the tests.
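Because the model-loading snippets below place the models on `cuda:0`, it may be worth confirming first that PyTorch inside the new environment can see a GPU; a minimal check:

```python
import torch

# The loading snippets in this README assume a CUDA device ('cuda:0') is available
print(torch.__version__)
print(torch.cuda.is_available())
```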
### Model parameter download
Due to GitHub's limit on single-file size, the model parameter files are uploaded to the [Zenodo open repository](https://zenodo.org/records/10614689).
- GH29BERT
```python
import torch

# Load the GH29BERT pre-trained model
GH29BERT = torch.load('transformer1500_95p_500.pt')
GH29BERT = GH29BERT.module  # the checkpoint stores a wrapped model; keep the underlying module
GH29BERT = GH29BERT.to('cuda:0')
# Load GH29BERT task model
downstream_GH29BERT = torch.load('down_model_500_kfold1.pt').to('cuda:0')
```
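If no CUDA device is available, the same checkpoints can in principle be loaded onto the CPU with `map_location`; a minimal sketch under that assumption, using the same file names as above:

```python
import torch

# CPU-fallback sketch: map the CUDA-saved checkpoints onto the CPU
GH29BERT = torch.load('transformer1500_95p_500.pt', map_location='cpu')
GH29BERT = GH29BERT.module
downstream_GH29BERT = torch.load('down_model_500_kfold1.pt', map_location='cpu')
```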
- ProtT5-XL
- Reproducing the prediction tests with pre-trained ProtT5-XL requires installing extra dependencies:
```
pip install torch
pip install transformers
pip install sentencepiece
```
- For more details, please follow the instructions of the [ProtTrans](https://ieeexplore.ieee.org/document/9477085) repository on [GitHub](https://github.com/agemagician/ProtTrans/?tab=readme-ov-file).
```python
import torch
from transformers import T5Tokenizer, T5EncoderModel
# Load ProtT5_XL pre-trained model
ProtT5_XL = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", cache_dir='./').to('cuda:0')
# Load ProtT5_XL task model
downstream_ProtT5_XL = torch.load('down_model_500_kfold1.pt').to('cuda:0')
```
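For reference, per-residue ProtT5-XL embeddings are typically obtained in the standard ProtTrans way before being passed to a downstream classifier. The sketch below follows the ProtTrans README conventions; the example sequence is made up, and this is not the exact pipeline used in `test.py`:

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to('cuda:0')

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative sequence, not real GH29 data
sequence = re.sub(r"[UZOB]", "X", sequence)      # ProtTrans convention: map rare residues to X
inputs = tokenizer(" ".join(sequence), return_tensors="pt").to('cuda:0')  # residues are space-separated

with torch.no_grad():
    # Per-residue embeddings, shape (1, length + 1, 1024); the extra position is the EOS token
    embeddings = encoder(**inputs).last_hidden_state
```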
## Cluster prediction
Run `python test.py` to predict clusters for the FASTA data. Adjust the model and data loading directories if needed.
## Representation visualization
GH29 representations from GH29BERT or other pre-training models can be visualized by running `visualization by UMAP.py` to obtain the dimension-reduced intermediate representations, and then running `figure1.py` and `figure2.py` to generate the visualization maps.
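For orientation, the dimensionality-reduction step can be sketched with `umap-learn`; the `.npz` file name and array keys below are hypothetical placeholders, not the repository's actual output:

```python
import numpy as np
import umap                      # from the umap-learn package
import matplotlib.pyplot as plt

# Hypothetical layout: "embeddings" is an (N, D) array, "labels" an (N,) array of cluster ids
data = np.load("gh29_embeddings.npz")
X, y = data["embeddings"], data["labels"]

# Reduce to 2D and plot the representation map
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=5, cmap="tab20")
plt.savefig("umap_map.png", dpi=300)
```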
## Code for model training
We also provide the code for pre-training and downstream task-training. Run `python Pretrain/transformer/transformer_train.py` for GH29BERT model pre-training. Run `python classification/downstream_embedding.py` to load the pre-trained model parameters and prepare the embedding data (.npz) for task-training, and then run `python classification/downstream_train.py` to train the downstream cluster prediction model.
|