---
datasets:
- EvaKlimentova/knots_AF
metrics:
- accuracy
---

# Knots ProtBert-BFD AlphaFold

A fine-tuned [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) model that classifies protein sequences as knotted or unknotted.

## Model Details

- **Model type:** BERT
- **Language:** proteins (amino acid sequences)
- **Finetuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)

Model Sources:

- **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
- **Paper:** TBD

## Usage

Dataset format:
```csv
id,sequence,label
A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
...
```

Load the dataset:
```python
import pandas as pd
from datasets import Dataset

# INPUT is the path to a CSV file in the format shown above
df = pd.read_csv(INPUT)
dss = Dataset.from_pandas(df)
```
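
Alternatively, the dataset listed in this card's metadata is hosted on the Hugging Face Hub, so it can presumably be loaded directly; the split and column layout are not confirmed by this card, so inspect the result before use:

```python
from datasets import load_dataset

# Split names and columns are an assumption; print the DatasetDict to check
ds = load_dataset('EvaKlimentova/knots_AF')
print(ds)
```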

Predict:
```python
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')

def tokenize_function(s):
    # ProtBert expects the amino acids to be separated by spaces
    seq_split = ' '.join(s['sequence'])
    return tokenizer(seq_split)

tokenized_dataset = dss.map(tokenize_function, num_proc=4)
tokenized_dataset.set_format('pt')

# <PATH> is a placeholder for the Trainer output directory
training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')

# Passing the tokenizer lets the default data collator pad each batch
trainer = Trainer(
    model,
    training_args,
    tokenizer=tokenizer
)

predictions, _, _ = trainer.predict(tokenized_dataset)
# Softmax over the two logits; keep the probability of the knotted class (label 1)
predictions = [np.exp(p[1]) / np.sum(np.exp(p)) for p in predictions]
df['preds'] = predictions
```
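
For a quick check on a single sequence, the `Trainer` machinery is not required. Below is a minimal sketch reusing the `tokenizer` and `model` loaded above; the example fragment and the 0.5 decision threshold are illustrative assumptions, not part of this card:

```python
import torch

# Hypothetical fragment; ProtBert expects residues separated by spaces
seq = ' '.join('MGGIFRVNTYYTDLEPYLQSTK')
inputs = tokenizer(seq, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Probability that the sequence is knotted (label 1)
p_knotted = torch.softmax(logits, dim=-1)[0, 1].item()
print('knotted' if p_knotted > 0.5 else 'unknotted', p_knotted)
```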

## Evaluation

Per-protein-family metrics for the M1 ProtBert-BFD model:

|        Protein family        | Dataset size | Unknotted set size | Accuracy |   TPR  |   TNR  |
|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
| All                          | 39412        | 19718              | **0.9845**   | 0.9865 | 0.9825 |
| SPOUT                        | 7371         | 550                | 0.9887   | 0.9951 | 0.9090 |
| TDD                          | 612          | 24                 | 0.9901   | 0.9965 | 0.8333 |
| DUF                          | 716          | 429                | 0.9748   | 0.9721 | 0.9766 |
| AdoMet synthase              | 1794         | 240                | 0.9899   | 0.9929 | 0.9708 |
| Carbonic anhydrase           | 1531         | 539                | 0.9588   | 0.9737 | 0.9313 |
| UCH                          | 477          | 125                | 0.9056   | 0.9602 | 0.7520 |
| ATCase/OTCase                | 3799         | 3352               | 0.9994   | 0.9977 | 0.9997 |
| ribosomal-mitochondrial      | 147          | 41                 | 0.8571   | 1.0000 | 0.4878 |
| membrane                     | 8225         | 1493               | 0.9811   | 0.9904 | 0.9390 |
| VIT                          | 14262        | 12555              | 0.9872   | 0.9420 | 0.9933 |
| biosynthesis of lantibiotics | 392          | 286                | 0.9642   | 0.9528 | 0.9685 |
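
TPR and TNR are the recall on the knotted (label 1) and unknotted (label 0) classes, respectively. A minimal sketch for recomputing these metrics from the predictions above, assuming `df` carries the ground-truth `label` column and using an illustrative 0.5 threshold:

```python
import numpy as np

y_true = df['label'].to_numpy()
y_pred = (df['preds'].to_numpy() > 0.5).astype(int)  # threshold is an assumption

accuracy = (y_pred == y_true).mean()
tpr = (y_pred[y_true == 1] == 1).mean()  # true positive rate on knotted proteins
tnr = (y_pred[y_true == 0] == 0).mean()  # true negative rate on unknotted proteins
```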


## Citation

**BibTeX:** TODO

## Model Authors

- Simecek: simecek@mail.muni.cz
- Klimentova: vae@mail.muni.cz
- Sramkova: denisa.sramkova@mail.muni.cz