AmelieSchreiber committed on
Commit
49e961f
1 Parent(s): 191695a

Update README.md

Files changed (1)
  1. README.md +26 -3
README.md CHANGED
@@ -10,17 +10,40 @@ metrics:
 - precision
 - recall
 - matthews_correlation
+language:
+- en
+tags:
+- ESM-2
+- protein language model
+- binding sites
+- biology
 ---
 
+# ESM-2 for Binding Site Prediction
+
+This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D)
+and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more detail). The model was finetuned with LoRA for
+the binary token classification task of predicting binding sites (and active sites) of protein sequences based on sequence alone.
+The model may be underfit and undertrained; however, it still achieved better performance on the test set in terms of loss, accuracy,
+precision, recall, F1 score, ROC AUC, and Matthews correlation coefficient (MCC) than the models trained on the smaller
+dataset [found here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family) of ~209K protein sequences.
+
 ## Training procedure
 
 This model was finetuned on ~549K protein sequences from the UniProt database. The dataset can be found
 [here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains
-the following test metrics (scroll to the right):
 
 ```
-Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        AUC       MCC
-1      0.037400       0.301413         0.939431  0.366282   0.833003  0.508826  0.888300  0.528311
+the following test metrics:
+Test: (Epoch 1)
+{'Training Loss': 0.037400,
+ 'Validation Loss': 0.301413,
+ 'accuracy': 0.939431,
+ 'precision': 0.366282,
+ 'recall': 0.833003,
+ 'f1': 0.508826,
+ 'auc': 0.888300,
+ 'mcc': 0.528311}
 ```
 
 The dataset size increase from ~209K protein sequences to ~549K clearly improved performance on the test metrics.
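The updated card describes a per-residue (token classification) task but includes no inference snippet. Below is a minimal sketch of the post-processing step only, assuming the usual convention that class index 1 marks a binding-site residue; the function name and toy logits are illustrative, not taken from the training code:

```python
from typing import List


def predict_binding_sites(logits: List[List[float]]) -> List[int]:
    """Argmax over per-residue class logits.

    Each inner list holds the two class scores for one residue
    (index 0 = non-site, index 1 = binding site).
    """
    return [max(range(len(scores)), key=scores.__getitem__) for scores in logits]


# Toy logits for a 5-residue sequence.
toy_logits = [[2.0, -1.0], [0.1, 1.5], [3.0, 0.2], [-0.5, 0.9], [1.2, 1.1]]
print(predict_binding_sites(toy_logits))  # → [0, 1, 0, 1, 0]
```

In practice the logits would come from the LoRA-adapted ESM-2 token-classification head, with special tokens (CLS/EOS) masked out before the argmax.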
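The card leans on MCC because binding-site labels are heavily imbalanced (note the high accuracy but modest precision above). For reference, MCC can be computed directly from confusion-matrix counts; the helper name and toy counts below are illustrative:

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when any row or column of the confusion matrix is empty,
    matching the common convention for the degenerate case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy counts for an imbalanced task where sites are the rare positive class:
# high recall (80/100) but low precision (80/180), similar in spirit to the
# precision/recall trade-off reported above.
print(matthews_corrcoef(tp=80, tn=900, fp=100, fn=20))
```

Unlike accuracy, MCC stays near 0 for a classifier that simply predicts the majority class, which is why it is the headline metric here.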