mertcobanov committed on
Commit abbfe42 · verified · 1 Parent(s): 8b1439d

Create README.md

Files changed (1): README.md (+91 −0)

README.md ADDED
@@ -0,0 +1,91 @@
---
license: apache-2.0
language:
- tr
tags:
- turkish
---
# Turkish WordPiece Tokenizer

This repository contains a **WordPiece tokenizer** trained on **1 billion Turkish sentences**, making it well suited to natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and comes in both cased and uncased versions.

## Repository Structure

| File Name | Description |
|--------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `special_tokens_map.json`                               | Maps special tokens such as `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers.   |
| `tokenizer_config.json`                                 | Configuration details for the tokenizer, including model type and special-token settings.             |
| `turkish_wordpiece_tokenizer.json`                      | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased).                        |
| `turkish_wordpiece_tokenizer_uncased.json`              | The uncased version of the WordPiece tokenizer.                                                        |
| `turkish_wordpiece_tokenizer_post_token_uncased.json`   | The post-tokenization configuration for the uncased tokenizer (see the sketch below the table).       |

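The effect of the post-tokenization configuration can be checked directly. The snippet below is a minimal sketch that assumes both JSON files are full `tokenizers` serializations and that the post-token variant carries a post-processor which adds the special tokens; the file paths are placeholders:

```python
from tokenizers import Tokenizer

# Assumption: both files load as complete tokenizers; the post-token variant
# is expected to wrap each encoded sequence in [CLS] ... [SEP] automatically.
base = Tokenizer.from_file("turkish_wordpiece_tokenizer_uncased.json")
post = Tokenizer.from_file("turkish_wordpiece_tokenizer_post_token_uncased.json")

print(base.encode("Merhaba dünya!").tokens)  # plain subword pieces
print(post.encode("Merhaba dünya!").tokens)  # pieces surrounded by special tokens
```
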
## Features

- **WordPiece Tokenization**: Breaks words into subword units for better handling of rare or unseen words.
- **Support for Cased and Uncased Text**: Separate tokenizers are provided for preserving case and for ignoring it (compared in the sketch after this list).
- **Optimized for Turkish**: Trained on a large-scale Turkish dataset (1 billion sentences) for strong coverage of Turkish vocabulary and morphology.
- **Special Tokens**: Includes the commonly used tokens:
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)

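To see the cased/uncased distinction in practice, the two tokenizers can be loaded side by side. This is a minimal sketch; the sample sentence is arbitrary and the exact subword pieces depend on the learned vocabularies:

```python
from tokenizers import Tokenizer

# Load both variants from local copies of the repository files.
cased = Tokenizer.from_file("turkish_wordpiece_tokenizer.json")
uncased = Tokenizer.from_file("turkish_wordpiece_tokenizer_uncased.json")

sentence = "İstanbul Boğazı'nda yürüyüş yaptık."

# The cased tokenizer keeps capitalization, while the uncased one lowercases
# the text before splitting it into WordPiece subword units.
print(cased.encode(sentence).tokens)
print(uncased.encode(sentence).tokens)
```
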
## Usage

The tokenizer can be loaded with either the `tokenizers` library or the Hugging Face `transformers` library; examples of both follow.

### Loading with `tokenizers`

```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```

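### Loading with `transformers`

Because the tokenizer JSON files use non-default names, one straightforward option is `PreTrainedTokenizerFast` with an explicit `tokenizer_file`. This is a minimal sketch that assumes the JSON file is a full `tokenizers` serialization; the special-token strings are the ones listed above and should match `special_tokens_map.json`:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the serialized tokenizer; special tokens are passed explicitly so that
# padding and sequence-pair helpers know which strings to use.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer_uncased.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

print(tokenizer.tokenize("Merhaba dünya!"))
```
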
## Tokenizer Training Details

- **Dataset**: 1 billion Turkish sentences drawn from diverse domains (news, social media, literature, etc.).
- **Model**: WordPiece tokenizer, trained with a vocabulary size suitable for Turkish (a training sketch follows this list).
- **Uncased Variant**: Lowercases all text during tokenization, so case distinctions are ignored.

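The exact training script and vocabulary size are not part of this repository. The snippet below is only a minimal sketch of how a comparable uncased WordPiece tokenizer can be trained with the `tokenizers` library; the normalizers, the 32k vocabulary size, and the corpus path are illustrative assumptions, not the settings used for the released files:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Empty WordPiece model with a simple uncased normalization pipeline (assumed).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=32_000,  # placeholder; the released vocabulary size may differ
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)

# corpus.txt: one Turkish sentence per line (placeholder path).
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("my_turkish_wordpiece_uncased.json")
```
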
## Applications

- **Text Classification**
- **Machine Translation**
- **Question Answering**
- **Text Summarization**
- **Named Entity Recognition (NER)**

## Citation

If you use this tokenizer in your research or applications, please cite it as follows:

```bibtex
@misc{turkish_wordpiece_tokenizer,
  title  = {Turkish WordPiece Tokenizer},
  author = {Mert Cobanov},
  year   = {2024},
  url    = {https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
```

## Contributions

Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.