AmelieSchreiber committed on
Commit 8294a2b
1 Parent(s): 4a0c4d7

Update README.md

Files changed (1)
  1. README.md +38 -1
README.md CHANGED
@@ -44,7 +44,44 @@ Validation Macro Recall: 0.9966
  ```

  ## Using the model
- First, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5)
+
+ First, download the `train_sequences.fasta` file and the `train_terms.tsv` file, and provide the local paths in the code below:
+
+ ```python
+ import os
+ import numpy as np
+ import torch
+ from transformers import AutoTokenizer, EsmForSequenceClassification, AdamW
+ from torch.nn.functional import binary_cross_entropy_with_logits
+ from sklearn.model_selection import train_test_split
+ from sklearn.metrics import f1_score, precision_score, recall_score
+ # from accelerate import Accelerator
+ from Bio import SeqIO
+
+ # Step 1: Data Preprocessing (Replace with your local paths)
+ fasta_file = "/Users/amelieschreiber/.cursor-tutor/projects/python/cafa5/cafa-5-protein-function-prediction/Train/train_sequences.fasta"
+ tsv_file = "/Users/amelieschreiber/.cursor-tutor/projects/python/cafa5/cafa-5-protein-function-prediction/Train/train_terms.tsv"
+
+ fasta_data = {}
+ tsv_data = {}
+
+ # Parse the FASTA file into a dict mapping protein ID -> sequence
+ for record in SeqIO.parse(fasta_file, "fasta"):
+     fasta_data[record.id] = str(record.seq)
+
+ # Read the TSV, keyed by the first column (protein ID), storing the remaining columns
+ with open(tsv_file, 'r') as f:
+     for line in f:
+         parts = line.strip().split("\t")
+         tsv_data[parts[0]] = parts[1:]
+
+ # tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
+ seq_length = 1022
+ # tokenized_data = tokenizer(list(fasta_data.values()), padding=True, truncation=True, return_tensors="pt", max_length=seq_length)
+
+ unique_terms = list(set(term for terms in tsv_data.values() for term in terms))
+ ```
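
For the multi-label fine-tuning that this README builds toward, the per-protein terms are typically encoded as multi-hot vectors over the label vocabulary. The following is a minimal sketch of one way to do that, continuing from the variables defined above; the names `term_to_index`, `protein_ids`, and `label_matrix` are illustrative and not part of the original script:

```python
import numpy as np

# Illustrative sketch: build a multi-hot label matrix whose rows follow the
# protein order in fasta_data and whose columns correspond to the entries
# collected in unique_terms.
term_to_index = {term: i for i, term in enumerate(unique_terms)}
protein_ids = list(fasta_data.keys())

label_matrix = np.zeros((len(protein_ids), len(unique_terms)), dtype=np.float32)
for row, pid in enumerate(protein_ids):
    for term in tsv_data.get(pid, []):  # terms recorded for this protein, if any
        label_matrix[row, term_to_index[term]] = 1.0

print(label_matrix.shape)  # (number of proteins, size of label vocabulary)
```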
+
+
+ Second, download the file `go-basic.obo` [from here](https://huggingface.co/datasets/AmelieSchreiber/cafa_5)
  and store the file locally, then provide the local path in the code below:

  ```python