AmelieSchreiber committed "Update README.md" (commit 49e961f, parent 191695a)
---
metrics:
- precision
- recall
- matthews_correlation
language:
- en
tags:
- ESM-2
- protein language model
- binding sites
- biology
---

# ESM-2 for Binding Site Prediction

This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D) and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more detail). The model was finetuned with LoRA for the binary token classification task of predicting binding sites (and active sites) of protein sequences based on sequence alone. The model may be underfit and undertrained; however, it still achieved better performance on the test set in terms of loss, accuracy, precision, recall, F1 score, ROC AUC, and Matthews correlation coefficient (MCC) than the models trained on the smaller dataset [found here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family) of ~209K protein sequences.
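Concretely, the token classification setup assigns each residue one binary label (1 = binding/active site, 0 = other). A minimal sketch of this per-residue label encoding; the sequence, the site positions, and the `encode_labels` helper below are hypothetical illustrations, not taken from the actual dataset:

```python
# Sketch of per-residue binary labeling for token classification.
# The sequence and site positions are made up for illustration only.
def encode_labels(sequence: str, site_positions: set) -> list:
    """Return one 0/1 label per residue; 1 marks a binding/active site."""
    return [1 if i in site_positions else 0 for i in range(len(sequence))]

seq = "MKTAYIAKQR"   # hypothetical 10-residue protein
sites = {2, 3, 7}    # hypothetical binding-site indices (0-based)
labels = encode_labels(seq, sites)
print(labels)        # one label per residue, aligned with the sequence
```

The model then scores each token against these labels, which is why the metrics below are computed per residue rather than per sequence.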
## Training procedure

This model was finetuned on ~549K protein sequences from the UniProt database. The dataset can be found [here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains the following test metrics:

```
Test: (Epoch 1)
{'Training Loss': 0.037400,
 'Validation Loss': 0.301413,
 'accuracy': 0.939431,
 'precision': 0.366282,
 'recall': 0.833003,
 'f1': 0.508826,
 'auc': 0.888300,
 'mcc': 0.528311}
```
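For reference, the reported precision, recall, F1, and MCC all follow from the per-residue confusion counts. A small self-contained sketch of those standard formulas; the counts below are illustrative placeholders chosen to show the high-recall/modest-precision pattern, not the actual test-set counts:

```python
import math

# Illustrative per-residue confusion counts (NOT the actual test-set values).
tp, fp, fn, tn = 80, 140, 16, 2000

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Matthews correlation coefficient from the four confusion counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} mcc={mcc:.3f}")
```

As in the metrics above, recall can be much higher than precision when the model flags many candidate residues; MCC balances all four counts, which is why it is reported alongside F1.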

The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metrics.