mertcobanov
committed on
Create README.md
README.md
ADDED
---
license: apache-2.0
language:
- tr
tags:
- turkish
---

# Turkish WordPiece Tokenizer

This repository contains a **WordPiece tokenizer** trained on **1 billion Turkish sentences**, making it well suited for natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and is provided in both cased and uncased versions for flexibility.

## Repository Structure

| File Name | Description |
|-----------|-------------|
| `special_tokens_map.json` | Maps special tokens like `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers. |
| `tokenizer_config.json` | Contains configuration details for the tokenizer, including model type and special token settings. |
| `turkish_wordpiece_tokenizer.json` | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
| `turkish_wordpiece_tokenizer_uncased.json` | The uncased version of the WordPiece tokenizer. |
| `turkish_wordpiece_tokenizer_post_token_uncased.json` | The post-tokenization configuration for the uncased tokenizer. |

## Features

- **WordPiece Tokenization**: Breaks words into subword units for better handling of rare or unseen words.
- **Support for Cased and Uncased Text**: Includes separate tokenizers for preserving case sensitivity and for ignoring case.
- **Optimized for Turkish**: Trained on a large-scale Turkish dataset (1 billion sentences), ensuring strong coverage of Turkish vocabulary and grammar.
- **Special Tokens**: Includes commonly used tokens such as the following (an ID-lookup sketch appears after this list):
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)
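
As a quick way to confirm these special tokens are registered in the vocabulary, the following minimal sketch looks them up with the `tokenizers` API. The file path is a placeholder, and the exact IDs depend on the trained vocabulary:

```python
from tokenizers import Tokenizer

# Load one of the tokenizer files from this repository (placeholder path).
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# token_to_id returns None for tokens missing from the vocabulary,
# so this doubles as a sanity check that all special tokens are present.
for token in ["[UNK]", "[PAD]", "[CLS]", "[SEP]"]:
    print(token, "->", tokenizer.token_to_id(token))
```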

## Usage

To use the tokenizer, you can load it with the Hugging Face `transformers` library or the `tokenizers` library.

### Loading with `tokenizers`:

```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```
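
### Loading with `transformers` (sketch):

The tokenizer can also be used through the Hugging Face `transformers` library. One way to do this is to wrap the tokenizer file in `PreTrainedTokenizerFast`; this is a minimal sketch rather than an official recipe for this repository, the special-token names are assumed to match `special_tokens_map.json` above, and the file path is a placeholder:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers file so it behaves like a standard transformers tokenizer.
# Special-token names below are assumed to match special_tokens_map.json.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer_uncased.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

encoded = hf_tokenizer("Merhaba dünya!")
print(encoded["input_ids"])
print(hf_tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

Depending on how `tokenizer_config.json` is set up, `AutoTokenizer.from_pretrained` may also work directly with the files in this repository.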

## Tokenizer Training Details

- **Dataset**: 1 billion Turkish sentences, sourced from diverse domains (news, social media, literature, etc.).
- **Model**: WordPiece tokenizer, trained with a vocabulary size suitable for the Turkish language.
- **Uncased Variant**: Lowercases all text during tokenization to ignore case distinctions (a sanity check follows this list).
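
Because the uncased tokenizer ships as a separate file, a quick check can confirm the lowercasing behaviour. The sketch below assumes the uncased tokenizer applies a lowercasing normalizer and uses a placeholder path:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# If case is normalized away, differently-cased spellings of the same
# sentence should map to the same token sequence.
cased = tokenizer.encode("Merhaba Dünya!").tokens
lowered = tokenizer.encode("merhaba dünya!").tokens
print(cased == lowered)  # expected True when lowercasing is applied
```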

## Applications

- **Text Classification**
- **Machine Translation**
- **Question Answering**
- **Text Summarization**
- **Named Entity Recognition (NER)**

## Citation

If you use this tokenizer in your research or applications, please cite it as follows:

```bibtex
@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
```

## Contributions

Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.