AmelieSchreiber committed "Update README.md" (commit 49e961f, parent 191695a)
---
metrics:
- precision
- recall
- matthews_correlation
language:
- en
tags:
- ESM-2
- protein language model
- binding sites
- biology
---

# ESM-2 for Binding Site Prediction

This model is a finetuned version of the 35M parameter `esm2_t12_35M_UR50D` ([see here](https://huggingface.co/facebook/esm2_t12_35M_UR50D) and [here](https://huggingface.co/docs/transformers/model_doc/esm) for more detail). The model was finetuned with LoRA for the binary token classification task of predicting binding sites (and active sites) of protein sequences based on sequence alone. The model may be underfit and undertrained; however, it still achieved better performance on the test set in terms of loss, accuracy, precision, recall, F1 score, ROC AUC, and Matthews correlation coefficient (MCC) than the models trained on the smaller dataset [found here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family) of ~209K protein sequences.
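Concretely, the token classification setup assigns each residue one binary label (1 = binding/active site, 0 = other). A minimal sketch of this per-residue label encoding; the sequence, the site positions, and the `encode_labels` helper below are hypothetical illustrations, not taken from the actual dataset:

```python
# Sketch of per-residue binary labeling for token classification.
# The sequence and site positions are made up for illustration only.
def encode_labels(sequence: str, site_positions: set) -> list:
    """Return one 0/1 label per residue; 1 marks a binding/active site."""
    return [1 if i in site_positions else 0 for i in range(len(sequence))]

seq = "MKTAYIAKQR"   # hypothetical 10-residue protein
sites = {2, 3, 7}    # hypothetical binding-site indices (0-based)
labels = encode_labels(seq, sites)
print(labels)        # one label per residue, aligned with the sequence
```

The model then scores each token against these labels, which is why the metrics below are computed per residue rather than per sequence.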
## Training procedure

This model was finetuned on ~549K protein sequences from the UniProt database. The dataset can be found [here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains the following test metrics:

```
Test: (Epoch 1)
{'Training Loss': 0.037400,
 'Validation Loss': 0.301413,
 'accuracy': 0.939431,
 'precision': 0.366282,
 'recall': 0.833003,
 'f1': 0.508826,
 'auc': 0.888300,
 'mcc': 0.528311}
```
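For reference, the reported precision, recall, F1, and MCC all follow from the per-residue confusion counts. A small self-contained sketch of those standard formulas; the counts below are illustrative placeholders chosen to show the high-recall/modest-precision pattern, not the actual test-set counts:

```python
import math

# Illustrative per-residue confusion counts (NOT the actual test-set values).
tp, fp, fn, tn = 80, 140, 16, 2000

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
# Matthews correlation coefficient from the four confusion counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} mcc={mcc:.3f}")
```

As in the metrics above, recall can be much higher than precision when the model flags many candidate residues; MCC balances all four counts, which is why it is reported alongside F1.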

The dataset size increase from ~209K protein sequences to ~549K clearly improved performance in terms of test metrics.