---
license: mit
language:
- en
library_name: transformers
tags:
- esm
- esm2
- biology
- protein
- protein language model
- cafa 5
- protein function prediction
datasets:
- AmelieSchreiber/cafa_5
metrics:
- f1
- recall
- precision
---
# ESM-2 for Protein Function Prediction

Please also see the more recent fine-tuned model [AmelieSchreiber/esm2_t6_8M_finetuned_cafa5](https://huggingface.co/AmelieSchreiber/esm2_t6_8M_finetuned_cafa5). 

This model is not intended for direct use in protein function prediction, but rather as a checkpoint for further fine-tuning,
especially with Low Rank Adaptation (LoRA). It is an experimental model fine-tuned from the
[esm2_t6_8M_UR50D](https://huggingface.co/facebook/esm2_t6_8M_UR50D) model
for multi-label classification. In particular, the model is fine-tuned on the CAFA-5 protein sequence dataset available
[here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5). In that dataset, the `train_sequences.fasta` file contains
the protein sequences used for training, and the `train_terms.tsv` file contains the Gene Ontology (GO) protein function
labels for each protein sequence. For more details on using
ESM-2 models for multi-label sequence classification, [see here](https://huggingface.co/docs/transformers/model_doc/esm).
Because the hierarchical ontology likely requires nontrivial class weighting, further fine-tuning is needed before the model produces useful predictions.
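
As a rough sketch of that LoRA route (not the exact recipe used for this checkpoint), adapters can be attached with the `peft` library. The rank, alpha, dropout, and target modules below are illustrative assumptions:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import EsmForSequenceClassification

base_model = EsmForSequenceClassification.from_pretrained(
    "AmelieSchreiber/cafa_5_protein_function_prediction"
)

# Illustrative LoRA hyperparameters; tune rank/alpha/dropout for your data.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in ESM-2
)

lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()  # only the adapter weights train
```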

## Fine-Tuning

The model was fine-tuned for 7 epochs at a learning rate of `5e-5`, and achieves the following metrics:
```
Validation Loss: 0.0027,
Validation Micro F1: 0.3672,
Validation Macro F1: 0.9967,
Validation Micro Precision: 0.6052,
Validation Macro Precision: 0.9996,
Validation Micro Recall: 0.2626,
Validation Macro Recall: 0.9966
```
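
For reference, a minimal sketch of this kind of multi-label fine-tuning step is shown below. It is an illustration, not the exact training script: `train_loader` and `num_labels` are assumed to come from your own preprocessing (e.g., `num_labels = len(unique_terms)` from the snippet in the next section).

```python
import torch
from torch.nn.functional import binary_cross_entropy_with_logits
from transformers import EsmForSequenceClassification

# Assumed to exist: num_labels, and train_loader, a DataLoader yielding
# dicts with input_ids, attention_mask, and multi-hot label tensors.
model = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D",
    num_labels=num_labels,
    problem_type="multi_label_classification",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(7):
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        # One independent binary decision per GO term.
        loss = binary_cross_entropy_with_logits(logits, batch["labels"].float())
        loss.backward()
        optimizer.step()
```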

## Using the model

First, download the `train_sequences.fasta` file and the `train_terms.tsv` file, and provide the local paths in the code below:

```python
from Bio import SeqIO

# Step 1: Data Preprocessing (Replace with your local paths)
fasta_file = "/Users/amelieschreiber/.cursor-tutor/projects/python/cafa5/cafa-5-protein-function-prediction/Train/train_sequences.fasta"
tsv_file = "/Users/amelieschreiber/.cursor-tutor/projects/python/cafa5/cafa-5-protein-function-prediction/Train/train_terms.tsv"

# Map each protein ID to its amino-acid sequence.
fasta_data = {}
for record in SeqIO.parse(fasta_file, "fasta"):
    fasta_data[record.id] = str(record.seq)

# Map each protein ID to its list of GO term labels. Each row of
# train_terms.tsv is a single (EntryID, term, aspect) association,
# so terms are accumulated per protein rather than overwritten.
tsv_data = {}
with open(tsv_file, "r") as f:
    next(f)  # skip the header row
    for line in f:
        parts = line.strip().split("\t")
        tsv_data.setdefault(parts[0], []).append(parts[1])

# ESM-2 handles sequences of up to 1022 residues (plus special tokens).
seq_length = 1022
# tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# tokenized_data = tokenizer(list(fasta_data.values()), padding=True, truncation=True, return_tensors="pt", max_length=seq_length)

# Label vocabulary: every GO term that appears in the training labels.
unique_terms = list(set(term for terms in tsv_data.values() for term in terms))
```
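
Before training, each protein's GO terms must be turned into a multi-hot label vector aligned with `unique_terms`. A minimal sketch of that step, continuing from the snippet above, might look like this (the helper names are illustrative):

```python
import numpy as np

# Map each GO term to a fixed index; the same ordering must be reused at inference.
term_to_index = {term: i for i, term in enumerate(unique_terms)}

def terms_to_multihot(terms):
    vector = np.zeros(len(unique_terms), dtype=np.float32)
    for term in terms:
        vector[term_to_index[term]] = 1.0
    return vector

labels = {protein_id: terms_to_multihot(terms) for protein_id, terms in tsv_data.items()}
```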


Second, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5),
store it locally, and provide the local path in the code below:

```python
import torch
from transformers import AutoTokenizer, EsmForSequenceClassification

# 1. Parsing the go-basic.obo file
def parse_obo_file(file_path):
    """Extract id, name, namespace, and definition from each [Term] stanza."""
    with open(file_path, "r") as f:
        data = f.read().split("[Term]")

    terms = []
    for entry in data[1:]:  # data[0] is the file header before the first [Term]
        lines = entry.strip().split("\n")
        term = {}
        for line in lines:
            if line.startswith("id:"):
                term["id"] = line.split("id:")[1].strip()
            elif line.startswith("name:"):
                term["name"] = line.split("name:")[1].strip()
            elif line.startswith("namespace:"):
                term["namespace"] = line.split("namespace:")[1].strip()
            elif line.startswith("def:"):
                term["definition"] = line.split("def:")[1].split('"')[1]
        terms.append(term)
    return terms

parsed_terms = parse_obo_file("go-basic.obo")  # Replace `go-basic.obo` with your path

# 2. Load the saved model and tokenizer
model_path = "AmelieSchreiber/cafa_5_protein_function_prediction"
loaded_model = EsmForSequenceClassification.from_pretrained(model_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)

# 3. The predict_protein_function function
def predict_protein_function(sequence, model, tokenizer, go_terms):
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        # Sigmoid turns each logit into an independent per-term probability.
        predictions = torch.sigmoid(outputs.logits)
        # 0.05 is a deliberately permissive cutoff; tune it for your use case.
        predicted_indices = torch.where(predictions > 0.05)[1].tolist()

    functions = []
    for idx in predicted_indices:
        term_id = unique_terms[idx]  # Use the unique_terms list from your training script
        for term in go_terms:
            if term["id"] == term_id:
                functions.append(term["name"])
                break

    return functions

# 4. Predicting protein function for an example sequence
example_sequence = "MAYLGSLVQRRLELASGDRLEASLGVGSELDVRGDRVKAVGSLDLEEGRLEQAGVSMA"  # Replace with your protein sequence
predicted_functions = predict_protein_function(example_sequence, loaded_model, loaded_tokenizer, parsed_terms)
print(predicted_functions)
```
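
The `0.05` cutoff in `predict_protein_function` is a permissive starting point rather than a calibrated threshold. To tune it, you can inspect the raw per-term probabilities directly; this sketch continues from the snippets above, and `k=10` is an arbitrary choice:

```python
inputs = loaded_tokenizer(example_sequence, return_tensors="pt", truncation=True, max_length=1022)
with torch.no_grad():
    probs = torch.sigmoid(loaded_model(**inputs).logits)[0]

# Show the ten highest-scoring GO terms with their probabilities.
top = torch.topk(probs, k=10)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{unique_terms[idx]}\t{score:.3f}")
```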