Knots ProtBert-BFD AlphaFold
Fine-tuned ProtBert-BFD to classify proteins as knotted vs. unknotted.
Model Details
- Model type: BERT
- Language: proteins (amino acid sequences)
- Finetuned from model: Rostlab/prot_bert_bfd
Model Sources:
- Repository: CEITEC
- Paper: TBD
Usage
Dataset format:
id,sequence,label
A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
...
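Before running prediction, it can help to check that a file really follows this three-column layout. The following is a minimal sketch using only the standard library; the helper name `validate_rows` and the specific checks are illustrative assumptions, not part of the model's tooling. It assumes labels are the strings `0` or `1`, as in the example row.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "sequence", "label"]

def validate_rows(csv_text):
    """Return a list of (line_number, problem) tuples for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header must match the documented column names exactly.
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [(1, f"unexpected header: {reader.fieldnames}")]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        sequence = row["sequence"]
        if not sequence or not sequence.isalpha():
            problems.append((line_number, "sequence is empty or has non-letter characters"))
        if row["label"] not in {"0", "1"}:
            problems.append((line_number, "label is not 0 or 1"))
    return problems
```

An empty result means the checks found nothing to flag. It does not guarantee that every residue letter is in the tokenizer's vocabulary.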
Load the dataset:
import pandas as pd
from datasets import Dataset
# INPUT is the path to a CSV file in the format shown above.
df = pd.read_csv(INPUT, sep=',')
dss = Dataset.from_pandas(df)
Predict:
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')
def tokenize_function(s):
    # ProtBert expects residues separated by spaces, e.g. "M G G I F".
    seq_split = ' '.join(s['sequence'])
    return tokenizer(seq_split)
tokenized_dataset = dss.map(tokenize_function, num_proc=4)
tokenized_dataset.set_format('pt')
# Replace <PATH> with an output directory. fp16=True requires a CUDA GPU.
training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer
)
# trainer.predict returns (logits, label_ids, metrics); only the logits are used.
predictions, _, _ = trainer.predict(tokenized_dataset)
# Softmax probability of class 1 for each sequence.
predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
df['preds'] = predictions
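The last step turns each pair of logits into the softmax probability of class 1. For two classes this equals a sigmoid of the logit difference, which can be computed without overflow even for very large logits. The sketch below illustrates that and adds a thresholding step; the helper names and the 0.5 cutoff are assumptions for illustration, since the model card does not state a decision threshold.

```python
import math

def class1_probability(logits):
    """Softmax probability of class 1 for one pair of logits [l0, l1]."""
    diff = logits[1] - logits[0]
    # Two-class softmax reduces to a sigmoid of the logit difference.
    # Branching on the sign keeps the exp() argument non-positive.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels (threshold is an assumed cutoff)."""
    return [int(p >= threshold) for p in probabilities]
```

Applied to the raw logits returned by `trainer.predict`, `[class1_probability(p) for p in logits]` should match the values stored in `df['preds']` up to floating-point rounding.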
Evaluation
Per protein family metrics:
M1 ProtBert-BFD | Dataset size | Unknotted set size | Accuracy | TPR | TNR |
---|---|---|---|---|---|
All | 39412 | 19718 | 0.9845 | 0.9865 | 0.9825 |
SPOUT | 7371 | 550 | 0.9887 | 0.9951 | 0.9090 |
TDD | 612 | 24 | 0.9901 | 0.9965 | 0.8333 |
DUF | 716 | 429 | 0.9748 | 0.9721 | 0.9766 |
AdoMet synthase | 1794 | 240 | 0.9899 | 0.9929 | 0.9708 |
Carbonic anhydrase | 1531 | 539 | 0.9588 | 0.9737 | 0.9313 |
UCH | 477 | 125 | 0.9056 | 0.9602 | 0.7520 |
ATCase/OTCase | 3799 | 3352 | 0.9994 | 0.9977 | 0.9997 |
ribosomal-mitochondrial | 147 | 41 | 0.8571 | 1.0000 | 0.4878 |
membrane | 8225 | 1493 | 0.9811 | 0.9904 | 0.9390 |
VIT | 14262 | 12555 | 0.9872 | 0.9420 | 0.9933 |
biosynthesis of lantibiotics | 392 | 286 | 0.9642 | 0.9528 | 0.9685 |
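The table reports accuracy, TPR and TNR without defining them. The sketch below assumes the standard definitions and assumes that knotted proteins (label 1) are the positive class and unknotted proteins (label 0) the negative class. The function name `classification_metrics` is hypothetical, and this is not the authors' evaluation code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, true positive rate and true negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        # Fraction of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else nan,
        # TPR: fraction of positives (assumed: knotted) that are recovered.
        "tpr": tp / (tp + fn) if (tp + fn) else nan,
        # TNR: fraction of negatives (assumed: unknotted) that are recovered.
        "tnr": tn / (tn + fp) if (tn + fp) else nan,
    }
```

With a per-family subset of the data, the same function yields one row of a table like the one above. Small negative sets, such as a family with only a few dozen unknotted proteins, make TNR sensitive to a handful of errors.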
Citation
BibTeX: TODO
Model Authors
- Simecek: simecek@mail.muni.cz
- Klimentova: vae@mail.muni.cz
- Sramkova: denisa.sramkova@mail.muni.cz