burakaytan committed on
Commit 0519b79 · 1 Parent(s): aba33b4

Update README.md

  1. README.md +75 -1
README.md CHANGED
@@ -1,3 +1,77 @@
  ---
- license: apache-2.0
+ language: tr
+ license: mit
  ---
+ 🇹🇷 RoBERTaTurk-Small-Clean
+
+ ## Model description
+ This is a small Turkish RoBERTa model trained on cleaned Turkish text.
+ The training data comes from Turkish Wikipedia, Turkish OSCAR, and news websites.
+ The raw corpus was 38 GB; after removing every sentence that contained errors,
+ 20 GB of clean data remained for training. As a result, the model is best suited to well-formed, error-free Turkish text.
+
+ The model is smaller than the standard RoBERTa base model: it has 8 layers instead of 12, which makes it faster and lighter to run while remaining effective on Turkish text.
+
+ Thanks to Turkcell, we were able to train the model for 1.5M steps on an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz with 256 GB RAM and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.
+
+ ## Usage
+ Load the tokenizer and model with the transformers library:
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
+ model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-small-turkish-clean-uncased")
+ ```
+
+ ## Fill Mask Usage
+
+ ```python
+ from transformers import pipeline
+
+ # Build a fill-mask pipeline backed by this model and its tokenizer
+ fill_mask = pipeline(
+     "fill-mask",
+     model="burakaytan/roberta-small-turkish-clean-uncased",
+     tokenizer="burakaytan/roberta-small-turkish-clean-uncased"
+ )
+
+ # Prompt in English: "<mask> began between the two countries"
+ fill_mask("iki ülke arasında <mask> başladı")
+
+ # Output:
+ [{'sequence': 'iki ülke arasında savaş başladı',
+   'score': 0.14830906689167023,
+   'token': 1745,
+   'token_str': ' savaş'},
+  {'sequence': 'iki ülke arasında çatışmalar başladı',
+   'score': 0.1442396193742752,
+   'token': 18223,
+   'token_str': ' çatışmalar'},
+  {'sequence': 'iki ülke arasında gerginlik başladı',
+   'score': 0.12025047093629837,
+   'token': 13638,
+   'token_str': ' gerginlik'},
+  {'sequence': 'iki ülke arasında çatışma başladı',
+   'score': 0.0615813322365284,
+   'token': 5452,
+   'token_str': ' çatışma'},
+  {'sequence': 'iki ülke arasında görüşmeler başladı',
+   'score': 0.04512731358408928,
+   'token': 4736,
+   'token_str': ' görüşmeler'}]
+ ```
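The fill-mask result is a plain list of dictionaries, so it can be post-processed without any model code. The following is a minimal sketch, not part of the original README: the helper name `top_candidates` and its `min_score` threshold are hypothetical, and the sample data is copied from the output shown above so the snippet runs without downloading the model.

```python
from typing import Any, Dict, List, Tuple

# Sample predictions copied from the fill-mask output shown above.
predictions: List[Dict[str, Any]] = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.14830906689167023, "token": 1745, "token_str": " savaş"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.1442396193742752, "token": 18223, "token_str": " çatışmalar"},
    {"sequence": "iki ülke arasında gerginlik başladı",
     "score": 0.12025047093629837, "token": 13638, "token_str": " gerginlik"},
    {"sequence": "iki ülke arasında çatışma başladı",
     "score": 0.0615813322365284, "token": 5452, "token_str": " çatışma"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.04512731358408928, "token": 4736, "token_str": " görüşmeler"},
]


def top_candidates(preds: List[Dict[str, Any]],
                   min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Hypothetical helper: keep (word, score) pairs scoring at least min_score.

    The leading space in token_str is stripped, and results are sorted
    from highest to lowest score.
    """
    kept = [(p["token_str"].strip(), p["score"])
            for p in preds if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Print each surviving candidate with its score rounded to three decimals.
for word, score in top_candidates(predictions):
    print(f"{word}: {score:.3f}")
```

With the assumed threshold of 0.1, only the first three candidates (savaş, çatışmalar, gerginlik) are kept. In real use, `predictions` would be the return value of the `fill_mask(...)` call instead of the hard-coded sample.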
+ ## Citation and Related Information
+
+ To cite this model:
+ ```bibtex
+ @article{aytan2023deep,
+   title={Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model},
+   author={AYTAN, BURAK and {\c{S}}AKAR, CEMAL OKAN},
+   journal={Turkish Journal of Electrical Engineering and Computer Sciences},
+   volume={31},
+   number={3},
+   pages={581--595},
+   year={2023}
+ }
+ ```