Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/savasy/bert-base-turkish-ner-cased/README.md
README.md
---
language: tr
---

# An easy-to-use NER application for Turkish

**An easy-to-use Python NER (BERT + transfer learning) model for Turkish (named entity recognition)...**
The model was fine-tuned for NER on the WikiANN dataset (Pan et al., 2017), which provides annotations for the PER, ORG, and LOC classes. WikiANN was constructed from the linked entities in Wikipedia pages and covers 282 languages, including Turkish.
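As a side note, the WikiANN data can also be loaded programmatically with the Hugging Face `datasets` library instead of the manual download below; this sketch is mine and is not part of the original training flow (the `wikiann` dataset name and `tr` configuration are the standard ones):

```
# A minimal sketch: load the Turkish WikiANN splits with the `datasets` library.
from datasets import load_dataset

wikiann_tr = load_dataset("wikiann", "tr")
print(wikiann_tr)                          # train/validation/test splits
print(wikiann_tr["train"][0]["tokens"])    # word tokens of one sentence
print(wikiann_tr["train"][0]["ner_tags"])  # integer-encoded NER labels
```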
Thanks to @stefan-it, I downloaded the data from the link as follows:

```
mkdir tr-data
cd tr-data
for file in train.txt dev.txt test.txt labels.txt
do
wget https://schweter.eu/storage/turkish-bert-wikiann/$file
done
cd ..
```

This downloads the pre-processed dataset, with training, dev, and test splits, into the tr-data folder.
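To make sure the download succeeded, you can quickly check that the four files exist and are non-empty (a small sketch of mine; the file names come from the loop above):

```
# Sketch: confirm the downloaded files exist and count their lines.
from pathlib import Path

for name in ["train.txt", "dev.txt", "test.txt", "labels.txt"]:
    path = Path("tr-data") / name
    lines = sum(1 for _ in path.open(encoding="utf-8"))
    print(f"{name}: {lines} lines")
```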
## Training

After downloading the dataset, training can be started. First, set the following environment variables:
```
export MAX_LENGTH=128
export BERT_MODEL=dbmdz/bert-base-turkish-cased
export OUTPUT_DIR=tr-new-model
export BATCH_SIZE=32
export NUM_EPOCHS=3
export SAVE_STEPS=625
export SEED=1
```
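A note on SAVE_STEPS=625 (my inference, not stated in the original card): the WikiANN Turkish training split contains 20,000 sentences, so at batch size 32 one epoch is exactly 625 optimizer steps, meaning a checkpoint is saved once per epoch:

```
# Sketch: SAVE_STEPS=625 corresponds to one checkpoint per epoch,
# assuming the standard 20,000-sentence WikiANN "tr" train split.
train_sentences = 20000
batch_size = 32
print(train_sentences // batch_size)  # 625
```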
Then I ran the following NER training command (run_ner_old.py can be found in the transformers GitHub repository; only part of the argument list is shown here):

```
python3 run_ner_old.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
...
--fp16
```
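For reference, a plausible reconstruction of the full command is given below; the elided flags are my assumption, based on the standard transformers token-classification example and the environment variables exported above:

```
python3 run_ner_old.py --data_dir ./tr-data \
--model_type bert \
--labels ./tr-data/labels.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_gpu_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--do_train \
--do_eval \
--do_predict \
--fp16
```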
If you don't have a GPU-enabled machine, skip the last --fp16 parameter.

Finally, you can find your trained model and its performance results under the tr-new-model folder.
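Once training finishes, the output folder can be loaded directly with from_pretrained, just like a hub model (a small sketch of mine; the example sentence is made up):

```
# Sketch: load the freshly trained model from the local output folder.
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("tr-new-model")
tokenizer = AutoTokenizer.from_pretrained("tr-new-model")
ner = pipeline("ner", model=model, tokenizer=tokenizer)
print(ner("Ankara, Türkiye'nin başkentidir."))
```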
## Some Results

### Performance on the WikiANN dataset
```
cat tr-new-model-1/eval_results.txt
cat tr-new-model-1/test_results.txt
```

Eval results:

* precision = 0.916400580551524
* recall = 0.9342309684101502
* f1 = 0.9252298787412536
* loss = 0.11335893666411284

Test results:

* precision = 0.9192058759362955
* recall = 0.9303010230367262
* f1 = 0.9247201697271198
* loss = 0.11182546521618497
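As a quick sanity check (my own verification, not part of the original card), the reported f1 is consistent with the harmonic mean of precision and recall:

```
# Sketch: verify f1 = 2PR / (P + R) for the eval results above.
p, r = 0.916400580551524, 0.9342309684101502
print(2 * p * r / (p + r))  # ~0.92523, matching the reported f1
```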
### Performance on a second dataset

The performance on the data provided by @kemalaraz (https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt) is as follows:

```
savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
precision = 0.9461980692049029
recall = 0.959309358847465
f1 = 0.9527086063783312
loss = 0.037054269206847804

savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
precision = 0.9458370635631155
recall = 0.9588201928530913
f1 = 0.952284378344882
loss = 0.035431676572445225
```
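These result files are plain "key = value" text, so they can be read programmatically (a small helper sketch of mine, matching the format shown above):

```
# Sketch: parse a run_ner "key = value" results file into a dict.
def read_results(path):
    results = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)
                results[key.strip()] = float(value)
    return results

print(read_results("tr-new-model-1/eval_results.txt"))
```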
# Usage
You should install the transformers library first:

```
pip install transformers
```

Then, open a Python environment and run the following code:
```
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
ner = pipeline('ner', model=model, tokenizer=tokenizer)
ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
```

This produces output like:

```
[{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'}, {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'}, {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'}, {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
```
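The raw pipeline output labels each token separately. To merge tokens into whole entity spans, newer transformers releases accept an aggregation option (this snippet is a sketch; the exact parameter depends on your transformers version, and older releases used grouped_entities=True instead):

```
# Sketch: merge token-level predictions into whole entity spans.
from transformers import pipeline

ner = pipeline("ner", model="savasy/bert-base-turkish-ner-cased",
               aggregation_strategy="simple")
print(ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı."))
# Expected: one PER span ("Mustafa Kemal Atatürk") and one LOC span ("Samsun")
```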