ctoraman committed on
Commit
f2db548
1 Parent(s): 14e5cec

Update README.md

Files changed (1)
  1. README.md +50 -41
README.md CHANGED
@@ -1,41 +1,50 @@
- ---
- language:
- - tr
- tags:
- - roberta
- license: cc-by-nc-sa-4.0
- datasets:
- - oscar
- ---
-
- # RoBERTa Turkish medium WordPiece 28k (uncased)
-
- Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
- The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
-
- Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is WordPiece. Vocabulary size is 28.6k.
-
- The details can be found at this paper:
- https://arxiv.org/...
-
- The following code can be used for model loading and tokenization, example max length (514) can be changed:
- ```
- model = AutoModel.from_pretrained([model_path])
- #for sequence classification:
- #model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
-
- tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
- tokenizer.mask_token = "[MASK]"
- tokenizer.cls_token = "[CLS]"
- tokenizer.sep_token = "[SEP]"
- tokenizer.pad_token = "[PAD]"
- tokenizer.unk_token = "[UNK]"
- tokenizer.bos_token = "[CLS]"
- tokenizer.eos_token = "[SEP]"
- tokenizer.model_max_length = 514
- ```
-
- ### BibTeX entry and citation info
- ```bibtex
- @article{}
- ```
+ ---
+ language:
+ - tr
+ tags:
+ - roberta
+ license: cc-by-nc-sa-4.0
+ datasets:
+ - oscar
+ ---
+
+ # RoBERTa Turkish medium WordPiece 28k (uncased)
+
+ Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
+ The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
+
+ Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is WordPiece. Vocabulary size is 28.6k.
+
+ The details and performance comparisons can be found at this paper:
+ https://arxiv.org/abs/2204.08832
+
+ The following code can be used for model loading and tokenization, example max length (514) can be changed:
+ ```
+ model = AutoModel.from_pretrained([model_path])
+ #for sequence classification:
+ #model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
+
+ tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
+ tokenizer.mask_token = "[MASK]"
+ tokenizer.cls_token = "[CLS]"
+ tokenizer.sep_token = "[SEP]"
+ tokenizer.pad_token = "[PAD]"
+ tokenizer.unk_token = "[UNK]"
+ tokenizer.bos_token = "[CLS]"
+ tokenizer.eos_token = "[SEP]"
+ tokenizer.model_max_length = 514
+ ```
+
+ ### BibTeX entry and citation info
+ ```bibtex
+ @misc{https://doi.org/10.48550/arxiv.2204.08832,
+ doi = {10.48550/ARXIV.2204.08832},
+ url = {https://arxiv.org/abs/2204.08832},
+ author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
+ keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+ title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
+ publisher = {arXiv},
+ year = {2022},
+ copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
+ }
+ ```
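
The README in this commit states that the tokenization algorithm is WordPiece and that the model is uncased. As a rough illustration of what that means at inference time, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_tokenize`, the `##` continuation prefix, and the tiny `TOY_VOCAB` are assumptions made for this example; they are not taken from the released tokenizer file, whose actual vocabulary has about 28.6k entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest-match-first.

    Hypothetical helper for illustration; not the released tokenizer.
    """
    # Uncased: lowercase first. Note that str.lower() is not Turkish-locale
    # aware (it does not map "I" to the dotless "ı").
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example only.
TOY_VOCAB = {"kitap", "##lar", "##ı", "##m", "ev", "##ler", "##de"}

print(wordpiece_tokenize("kitaplarım", TOY_VOCAB))
print(wordpiece_tokenize("evlerde", TOY_VOCAB))
print(wordpiece_tokenize("okul", TOY_VOCAB))
```

With this toy vocabulary, a suffixed word is split into a stem plus continuation pieces, while a word with no matching initial piece falls back to the `[UNK]` token that the README's snippet assigns to `tokenizer.unk_token`.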