julien-c (HF staff) committed · Commit a76392a · 1 Parent(s): 8eabdf3

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/savasy/bert-base-turkish-ner-cased/README.md

Files changed (1)
  1. README.md +38 -86
README.md CHANGED
@@ -1,18 +1,13 @@
- # How the model was trained
- This model is based on BERTurk
- https://huggingface.co/dbmdz/bert-base-turkish-cased

- ## DataSet
- Training dataset is WikiAnn

- * The WikiANN dataset (Pan et al. 2017) is a dataset with NER annotations for PER, ORG and LOC. It has been constructed using the linked entities in Wikipedia pages for 282 different languages including Danish. The dataset can be loaded with the DaNLP package:

- https://www.aclweb.org/anthology/P17-1178.pdf

- Thank to @stefan-it, I downloaded the data from the link as follows
-
- ```
- mkdir tr-data

  cd tr-data
 
@@ -20,29 +15,24 @@ for file in train.txt dev.txt test.txt labels.txt
  do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
  done
- ```

- ## Fine-tuning the bert-model
- The base bert model is dbmdz/bert-base-turkish-cased . With following system environment

  ```
  export MAX_LENGTH=128
- export BERT_MODEL=dbmdz/bert-base-turkish-cased
  export OUTPUT_DIR=tr-new-model
  export BATCH_SIZE=32
  export NUM_EPOCHS=3
  export SAVE_STEPS=625
  export SEED=1
-
  ```
-
- I run the following ner-training code(you can find it under transformer github repo)
-
-
  ```
- Then run training:
-
- python3 run_ner.py --data_dir ./tr-data3 \
  --model_type bert \
  --labels ./tr-data/labels.txt \
  --model_name_or_path $BERT_MODEL \
@@ -58,84 +48,46 @@ python3 run_ner.py --data_dir ./tr-data3 \
  --fp16
  ```

- If you dont have GPU-enabled computer, please skip last --fp16 parameter.
- Finally, you can find your trained model and model performance unde tr-new-model folder
-
-
- ## Some Results
-
- ###Performance for WikiANN dataset
- ```
- cat tr-new-model-1/eval_results.txt
- cat tr-new-model-1/test_results.txt
-
- *Eval Results:*
-
- precision = 0.916400580551524
- recall = 0.9342309684101502
- f1 = 0.9252298787412536
- loss = 0.11335893666411284
-
- *Test Results:*
- precision = 0.9192058759362955
- recall = 0.9303010230367262
- f1 = 0.9247201697271198
- loss = 0.11182546521618497
-
- ```
-
- ### Performance with another dataset at the link
- https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
-
- ```
- savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
- precision = 0.9461980692049029
- recall = 0.959309358847465
- f1 = 0.9527086063783312
- loss = 0.037054269206847804
-
- savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
- precision = 0.9458370635631155
- recall = 0.9588201928530913
- f1 = 0.952284378344882
- loss = 0.035431676572445225
- ```
 
  # Usage

- You should install transformer library first
-
- ```
- pip install transformers
- ```
-
- And, open python environment and run the following code
-
  ```
  from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
-
  model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
  tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
  ner=pipeline('ner', model=model, tokenizer=tokenizer)
  ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
-
- [{'word': 'Mustafa', 'score': 0.9938516616821289, 'entity': 'B-PER'}, {'word': 'Kemal', 'score': 0.9881671071052551, 'entity': 'I-PER'}, {'word': 'Atatürk', 'score': 0.9957979321479797, 'entity': 'I-PER'}, {'word': 'Samsun', 'score': 0.9059973359107971, 'entity': 'B-LOC'}]
-
  ```







-
-
-
-
-
-
-
-
-
 
 
@@ -1,18 +1,13 @@
+ ---
+ language: tr
+ ---

+ # For Turkish language, here is an easy-to-use NER application.
+ ** An easy Python NER (BERT + Transfer Learning) (Named Entity Recognition) model for Turkish...
 
+ Thanks to @stefan-it, I applied the followings for training


  cd tr-data
 
 
@@ -20,29 +15,24 @@ for file in train.txt dev.txt test.txt labels.txt
  do
  wget https://schweter.eu/storage/turkish-bert-wikiann/$file
  done

+ cd ..
+ It will download the pre-processed datasets with training, dev and test splits and put them in a tr-data folder.

+ Run pre-training
+ After downloading the dataset, pre-training can be started. Just set the following environment variables:
  ```
  export MAX_LENGTH=128
+ export BERT_MODEL=dbmdz/bert-base-turkish-cased
  export OUTPUT_DIR=tr-new-model
  export BATCH_SIZE=32
  export NUM_EPOCHS=3
  export SAVE_STEPS=625
  export SEED=1
  ```
+ Then run pre-training:
  ```
+ python3 run_ner_old.py --data_dir ./tr-data3 \
  --model_type bert \
  --labels ./tr-data/labels.txt \
  --model_name_or_path $BERT_MODEL \
 
@@ -58,84 +48,46 @@ python3 run_ner.py --data_dir ./tr-data3 \
  --fp16
  ```


  # Usage

  ```
  from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
  model = AutoModelForTokenClassification.from_pretrained("savasy/bert-base-turkish-ner-cased")
  tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-ner-cased")
  ner=pipeline('ner', model=model, tokenizer=tokenizer)
  ner("Mustafa Kemal Atatürk 19 Mayıs 1919'da Samsun'a ayak bastı.")
  ```
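
Reviewer note, not part of the commit: the `ner(...)` call in the usage block returns one dictionary per tagged token with `word`, `score`, and `entity` keys (the pre-migration README listed sample output in that shape). A minimal sketch of merging `B-`/`I-` tags into whole entities could look like the following. The helper name `group_entities` is hypothetical, and the sketch assumes whole-word tokens (no `##` word pieces) and that consecutive items in the list are adjacent in the sentence.

```python
def group_entities(tokens):
    """Merge BIO-tagged tokens into (text, label) pairs."""
    entities = []
    for tok in tokens:
        tag = tok["entity"]
        if tag == "O":  # token outside any entity: nothing to merge
            continue
        prefix, _, label = tag.partition("-")
        # Extend the previous entity only on an I- tag with a matching label;
        # anything else (B- tag, or a label change) starts a new entity.
        if prefix == "I" and entities and entities[-1][1] == label:
            text, _ = entities[-1]
            entities[-1] = (text + " " + tok["word"], label)
        else:
            entities.append((tok["word"], label))
    return entities


# Token list copied from the sample output in the pre-migration README.
sample = [
    {"word": "Mustafa", "score": 0.9938516616821289, "entity": "B-PER"},
    {"word": "Kemal", "score": 0.9881671071052551, "entity": "I-PER"},
    {"word": "Atatürk", "score": 0.9957979321479797, "entity": "I-PER"},
    {"word": "Samsun", "score": 0.9059973359107971, "entity": "B-LOC"},
]

print(group_entities(sample))
```

For the sample sentence this yields one `PER` entity, "Mustafa Kemal Atatürk", and one `LOC` entity, "Samsun". Because the adjacency assumption is not checked, two separate mentions with the same label could be merged if the second one starts with an `I-` tag.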
+ # Some results
+ Data1: For the data above
+ Eval Results:

+ * precision = 0.916400580551524
+ * recall = 0.9342309684101502
+ * f1 = 0.9252298787412536
+ * loss = 0.11335893666411284

+ Test Results:
+ * precision = 0.9192058759362955
+ * recall = 0.9303010230367262
+ * f1 = 0.9247201697271198
+ * loss = 0.11182546521618497



+ Data2:
+ https://github.com/stefan-it/turkish-bert/files/4558187/nerdata.txt
+ The performance for the data given by @kemalaraz is as follows

+ savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat eval_results.txt
+ * precision = 0.9461980692049029
+ * recall = 0.959309358847465
+ * f1 = 0.9527086063783312
+ * loss = 0.037054269206847804

+ savas@savas-lenova:~/Desktop/trans/tr-new-model-1$ cat test_results.txt
+ * precision = 0.9458370635631155
+ * recall = 0.9588201928530913
+ * f1 = 0.952284378344882
+ * loss = 0.035431676572445225
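
Reviewer note, not part of the commit: as a sanity check on the figures above, each reported `f1` should equal the harmonic mean of its precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for the four result sets. The helper name `f1_from` and the dictionary labels are illustrative, and the numbers are copied from the results in this diff.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) for each result set in the model card.
reported = {
    "Data1 eval": (0.916400580551524, 0.9342309684101502, 0.9252298787412536),
    "Data1 test": (0.9192058759362955, 0.9303010230367262, 0.9247201697271198),
    "Data2 eval": (0.9461980692049029, 0.959309358847465, 0.9527086063783312),
    "Data2 test": (0.9458370635631155, 0.9588201928530913, 0.952284378344882),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_from(p, r)
    # A difference far above floating-point noise would suggest a transcription error.
    print(f"{name}: recomputed={recomputed:.6f} reported={f1:.6f} "
          f"diff={abs(recomputed - f1):.2e}")
```

A difference much larger than rounding error for any row would be a reason to re-check that row against `eval_results.txt` or `test_results.txt`.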
 
 
 
 