AmelieSchreiber
/

cafa_5_protein_function_prediction

Text Classification

protein language model

protein function prediction

Inference Endpoints

AmelieSchreiber committed on Aug 21, 2023

Commit

0a52d93

•

1 Parent(s): 19e5543

Update README.md

Files changed (1)

README.md +74 -0

@@ -1,3 +1,77 @@
 ---
 license: mit
+language:
+- en
+library_name: transformers
+tags:
+- esm
+- esm2
+- biology
+- protein
+- protein language model
+- cafa 5
+- protein function prediction
 ---
+## Using the model
+First, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5)
+and store it locally, then provide the local path in the code below:
+```python
+import torch
+from transformers import AutoTokenizer, EsmForSequenceClassification
+# 1. Parse the go-basic.obo file into a list of GO term dicts
+def parse_obo_file(file_path):
+    with open(file_path, 'r') as f:
+        data = f.read().split("[Term]")
+    terms = []
+    for entry in data[1:]:
+        # The last chunk also contains the [Typedef] stanzas; drop them so
+        # their `id:` and `name:` lines do not overwrite the final term's fields
+        lines = entry.split("[Typedef]")[0].strip().split("\n")
+        term = {}
+        for line in lines:
+            if line.startswith("id:"):
+                term["id"] = line.split("id:")[1].strip()
+            elif line.startswith("name:"):
+                term["name"] = line.split("name:")[1].strip()
+            elif line.startswith("namespace:"):
+                term["namespace"] = line.split("namespace:")[1].strip()
+            elif line.startswith("def:"):
+                # The definition text is the first double-quoted string on the line
+                term["definition"] = line.split("def:")[1].split('"')[1]
+        terms.append(term)
+    return terms
+parsed_terms = parse_obo_file("go-basic.obo")  # Replace `go-basic.obo` with your path
+# 2. Load the saved model and tokenizer
+model_path = "AmelieSchreiber/cafa_5_protein_function_prediction"
+loaded_model = EsmForSequenceClassification.from_pretrained(model_path)
+loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)
+# 3. The predict_protein_function function
+# `unique_terms` is the ordered list of GO term IDs used as labels in your training script.
+# It is not defined in this snippet and must be supplied in the same order as during training.
+def predict_protein_function(sequence, model, tokenizer, go_terms, unique_terms, threshold=0.05):
+    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True, max_length=1022)
+    model.eval()
+    with torch.no_grad():
+        outputs = model(**inputs)
+    # Multi-label head: apply an independent sigmoid to each GO term logit
+    predictions = torch.sigmoid(outputs.logits)
+    predicted_indices = torch.where(predictions > threshold)[1].tolist()
+    # Map GO term IDs to names once instead of scanning go_terms for every prediction
+    id_to_name = {term["id"]: term["name"] for term in go_terms if "id" in term and "name" in term}
+    functions = []
+    for idx in predicted_indices:
+        term_id = unique_terms[idx]
+        if term_id in id_to_name:
+            functions.append(id_to_name[term_id])
+    return functions
+# 4. Predicting protein function for an example sequence
+example_sequence = "MAYLGSLVQRRLELASGDRLEASLGVGSELDVRGDRVKAVGSLDLEEGRLEQAGVSMA"  # Replace with your protein sequence
+predicted_functions = predict_protein_function(example_sequence, loaded_model, loaded_tokenizer, parsed_terms, unique_terms)
+print(predicted_functions)
+```
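
The decoding step in the snippet above (sigmoid, threshold, index-to-term lookup) can be tried in isolation without downloading the model. The following is a minimal sketch in plain Python, not the author's code: `decode_predictions`, `label_ids`, and `toy_logits` are hypothetical names introduced for illustration, and the logits are made-up numbers rather than model output. `label_ids` stands in for the ordered list of GO term IDs from the training script.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_predictions(logits, label_ids, id_to_name, threshold=0.05):
    """Return (term_id, name, probability) for each label above threshold, highest first."""
    if len(logits) != len(label_ids):
        raise ValueError("logits and label_ids must have the same length")
    hits = []
    for term_id, logit in zip(label_ids, logits):
        prob = sigmoid(logit)
        if prob > threshold:
            hits.append((term_id, id_to_name.get(term_id, "<unknown term>"), prob))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Toy stand-ins (assumptions, not real model output):
# parsed_terms mimics the shape of the dicts built from go-basic.obo.
parsed_terms = [
    {"id": "GO:0005515", "name": "protein binding"},
    {"id": "GO:0003677", "name": "DNA binding"},
    {"id": "GO:0005524", "name": "ATP binding"},
]
id_to_name = {term["id"]: term["name"] for term in parsed_terms}

# Hypothetical label order; in practice this must match the training script.
label_ids = ["GO:0003677", "GO:0005515", "GO:0005524"]
toy_logits = [-4.0, 2.0, -1.0]

results = decode_predictions(toy_logits, label_ids, id_to_name)
for term_id, name, prob in results:
    print(f"{term_id}\t{name}\t{prob:.3f}")
```

Returning the probability alongside each name makes it easier to judge how the 0.05 threshold behaves on your own sequences; a higher threshold trades recall for precision.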