AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1

Token Classification

protein language model

AmelieSchreiber committed on Sep 13, 2023

Commit 6187032 · 1 Parent(s): a728b2c

Update README.md

Files changed (1)

README.md +70 -1

@@ -1,14 +1,83 @@
 ---
 library_name: peft
 license: mit
 ---
 ## Training procedure
 ```
 Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        AUC       MCC
 1      0.037400       0.301413         0.939431  0.366282   0.833003  0.508826  0.888300  0.528311
 ```
-### Framework versions
 - PEFT 0.5.0

 ---
 library_name: peft
 license: mit
+datasets:
+- AmelieSchreiber/binding_sites_random_split_by_family_550K
+metrics:
+- accuracy
+- f1
+- roc_auc
+- precision
+- recall
+- matthews_correlation
 ---
 ## Training procedure
+This model was finetuned on ~549K protein sequences from the UniProt database. The dataset can be found
+[here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). The model obtains
+the following test metrics:
 ```
 Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1        AUC       MCC
 1      0.037400       0.301413         0.939431  0.366282   0.833003  0.508826  0.888300  0.528311
 ```
+Increasing the dataset size from ~209K to ~549K protein sequences improved the test metrics.
+We used Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) library to finetune with Low-Rank Adaptation (LoRA). We chose
+a LoRA rank of 2 because it slightly improved the test metrics compared to rank 8 and rank 16 on the
+same model trained on the smaller dataset.
+### Framework versions
 - PEFT 0.5.0
+## Using the model
+To use the model on one of your own protein sequences, try running the following:
+```python
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+from peft import PeftModel
+import torch
+# Path to the saved LoRA model
+model_path = "AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp1"
+# ESM2 base model
+base_model_path = "facebook/esm2_t12_35M_UR50D"
+# Load the model
+base_model = AutoModelForTokenClassification.from_pretrained(base_model_path)
+loaded_model = PeftModel.from_pretrained(base_model, model_path)
+# Ensure the model is in evaluation mode
+loaded_model.eval()
+# Load the tokenizer
+loaded_tokenizer = AutoTokenizer.from_pretrained(base_model_path)
+# Protein sequence for inference
+protein_sequence = "MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT"  # Replace with your actual sequence
+# Tokenize the sequence
+inputs = loaded_tokenizer(protein_sequence, return_tensors="pt", truncation=True, max_length=1024, padding='max_length')
+# Run the model
+with torch.no_grad():
+    logits = loaded_model(**inputs).logits
+# Get predictions
+tokens = loaded_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # Convert input ids back to tokens
+predictions = torch.argmax(logits, dim=2)
+# Define labels
+id2label = {
+    0: "No binding site",
+    1: "Binding site"
+}
+# Print the predicted labels for each token
+for token, prediction in zip(tokens, predictions[0].numpy()):
+    if token not in ['<pad>', '<cls>', '<eos>']:
+        print((token, id2label[prediction]))
+```
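
The usage snippet above prints one `(token, label)` pair per residue. For downstream work it can be handier to collect only the positions predicted as binding sites. The following is a minimal sketch of that post-processing step, not part of the committed README. The helper name `binding_site_positions` and the example tokens and predictions are hypothetical, and the example values are made up rather than real model output. It assumes the same special tokens the snippet skips (`<pad>`, `<cls>`, `<eos>`) and that label `1` means "Binding site", as in `id2label`.

```python
from typing import List, Sequence

# Same special tokens that the usage snippet filters out before printing.
SPECIAL_TOKENS = {"<pad>", "<cls>", "<eos>"}


def binding_site_positions(tokens: Sequence[str], predictions: Sequence[int]) -> List[int]:
    """Return 1-based residue positions whose predicted label is 1 ("Binding site")."""
    positions = []
    residue_index = 0
    for token, label in zip(tokens, predictions):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are not residues, so they do not advance the index
        residue_index += 1
        if int(label) == 1:
            positions.append(residue_index)
    return positions


# Hypothetical inputs for illustration only (not real model output).
example_tokens = ["<cls>", "M", "A", "V", "P", "E", "<eos>", "<pad>", "<pad>"]
example_predictions = [0, 0, 1, 1, 0, 1, 0, 0, 0]
print(binding_site_positions(example_tokens, example_predictions))
```

With the real model, `tokens` and `predictions[0].tolist()` from the snippet above could be passed in directly. Counting residues while skipping special tokens keeps the positions aligned with the input sequence regardless of how much padding the tokenizer adds.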