Commit afc2bea · Parent: 066f3ec
updated README.md

README.md CHANGED
@@ -1,98 +1,39 @@

Previous version (removed):

---
language:
- ru
- en
thumbnail: https://raw.githubusercontent.com/JetRunner/BERT-of-Theseus/master/bert-of-theseus.png
tags:
- translation
- fsmt
license: Apache 2.0
datasets:
- wmt19
metrics:
- bleu
- sacrebleu
---

# MyModel

## Model description

This is a

For more details, please see, [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616).

The abbreviation FSMT stands for FairSeqMachineTranslation

* [wmt19-ru-en](https://huggingface.co/facebook/wmt19-ru-en)
* [wmt19-en-de](https://huggingface.co/facebook/wmt19-en-de)
* [wmt19-de-en](https://huggingface.co/facebook/wmt19-de-en)

## Intended uses & limitations

#### How to use

```python
from transformers.tokenization_fsmt import FSMTTokenizer
from transformers.modeling_fsmt import FSMTForConditionalGeneration
mname = "facebook/wmt19-ru-en"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

print(decoded) # Machine learning is great, isn't it?
```

- The original (and this ported model) doesn't seem to handle well inputs with repeated sub-phrases, [content gets truncated](https://discuss.huggingface.co/t/issues-with-translating-inputs-containing-repeated-phrases/981)

## Training data

## Training procedure

## Eval results

The score was calculated using this code:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
export PAIR=ru-en
export DATA_DIR=data/$PAIR
export SAVE_DIR=data/$PAIR
export BS=8
export NUM_BEAMS=15
mkdir -p $DATA_DIR
sacrebleu -t wmt19 -l $PAIR --echo src > $DATA_DIR/val.source
sacrebleu -t wmt19 -l $PAIR --echo ref > $DATA_DIR/val.target
echo $PAIR
PYTHONPATH="src:examples/seq2seq" python examples/seq2seq/run_eval.py facebook/wmt19-$PAIR $DATA_DIR/val.source $SAVE_DIR/test_translations.txt --reference_path $DATA_DIR/val.target --score_path $SAVE_DIR/test_bleu.json --bs $BS --task translation --num_beams $NUM_BEAMS
```

### BibTeX entry and citation info

```bibtex
@inproceedings{...,
  year={2020},
  title={Facebook FAIR's WMT19 News Translation Task Submission},
  author={Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
  booktitle={Proc. of WMT},
}
```
New version:

# Albumin-15s

## Model description

This is a version of [Albert-base-v2](https://huggingface.co/albert-base-v2) for comparing aptamers 15 nucleotides long, to determine which one is more affine to the target protein Albumin.

The Albert model was pretrained on English text. Protein and aptamer sequences share many similarities with natural language, which is why we fine-tuned the model: it learns embedded positioning for aptamers and can distinguish between sequences better.

More information can be found in our [github]() and our iGEM [wiki]().

## Intended uses & limitations

You can use the fine-tuned model for masked aptamer pair sequence classification (predicting which of two aptamers is more affine to the target protein Albumin), but it is mostly intended to be fine-tuned again on aptamers of a different length or on expanded datasets.

#### How to use

This model can be used to predict comparative affinity, together with a dataset preprocessing function that encodes records of the form (Sequence1, Sequence2, Label), where Label is a binary indicator of whether Sequence1 is more affine to the target protein Albumin.
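As a minimal illustration of that record shape (the `AptamerPair` name and the example sequences are assumptions for this sketch, not the repository's actual types):

```python
from typing import NamedTuple

class AptamerPair(NamedTuple):
    """One preprocessed record: two 15-nucleotide aptamers plus a binary label."""
    sequence1: str  # 15-nucleotide aptamer
    sequence2: str  # 15-nucleotide aptamer
    label: int      # 1 if sequence1 is more affine to Albumin, else 0

# Hypothetical example record; real data comes from the training/evaluation sets.
record = AptamerPair("ACGTACGTACGTACG", "TGCATGCATGCATGC", 1)
print(record.label)
```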

```python
from transformers import AutoTokenizer, BertModel

mname = "Vilnius-Lithuania-iGEM/Albumin"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = BertModel.from_pretrained(mname)
```

To predict batches of sequences you have to employ the custom functions shown in [git/prediction.ipynb]().
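That notebook is not reproduced here, but the batching side of such a function can be sketched roughly as follows (the helper name and default batch size are assumptions, not the notebook's actual code):

```python
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]  # (Sequence1, Sequence2)

def iter_batches(pairs: List[Pair], batch_size: int = 32) -> Iterator[List[Pair]]:
    """Yield fixed-size chunks of aptamer pairs for the model to score."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]

# 70 hypothetical pairs split into batches of at most 32.
pairs = [("ACGTACGTACGTACG", "TGCATGCATGCATGC")] * 70
print([len(batch) for batch in iter_batches(pairs)])  # → [32, 32, 6]
```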

#### Limitations and bias

- The fine-tuned Albert model appears to be limited to about 90% accuracy on this task of predicting which aptamer is more suitable for a target protein. Albert-large or an immense dataset of 15-mer aptamers could raise accuracy by a few percent. However, the extrapolation case has not been studied, and we cannot confirm that this model is state-of-the-art when one of the aptamers is exceptionally good (has almost maximum entropy with respect to Albumin).

## Eval results

- accuracy : 0.8601
- precision: 0.8515
- recall : 0.8725
- f1 : 0.8618
- roc_auc : 0.9388

The scores were calculated using `sklearn.metrics`.
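For reference, the same five metrics can be computed from labels and scores like this (the toy `y_true`/`y_pred`/`y_score` values below are illustrative, not the model's actual evaluation outputs):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Illustrative placeholders; the real values come from the evaluation split.
y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                  # ground-truth labels
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                  # hard 0/1 predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # probabilities for roc_auc

print(f"accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"precision: {precision_score(y_true, y_pred):.4f}")
print(f"recall   : {recall_score(y_true, y_pred):.4f}")
print(f"f1       : {f1_score(y_true, y_pred):.4f}")
print(f"roc_auc  : {roc_auc_score(y_true, y_score):.4f}")
```

Note that `roc_auc_score` takes the continuous scores, while the other four metrics take the thresholded 0/1 predictions.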