---
license: mit
language:
- tr
library_name: transformers
---

# TS-Corpus WordPiece Tokenizer (256k, Cased)

## Overview
This repository contains a cased WordPiece tokenizer with a vocabulary size of 256,000, trained on various datasets from the TS Corpus website. It is designed to handle Turkish text, drawing on rich and diverse sources to provide a robust tool for natural language processing tasks.

## Dataset Sources
The tokenizer was trained on multiple corpora from TS Corpus, specifically:
- [TS Corpus V2](https://tscorpus.com/corpora/ts-corpus-v2/)
- [TS Wikipedia Corpus](https://tscorpus.com/corpora/ts-wikipedia-corpus/)
- [TS Abstract Corpus](https://tscorpus.com/corpora/ts-abstract-corpus/)
- [TS Idioms and Proverbs Corpus](https://tscorpus.com/corpora/ts-idioms-and-proverbs-corpus/)
- [Syllable Corpus](https://tscorpus.com/corpora/syllable-corpus/)
- [Turkish Constitution Corpus](https://tscorpus.com/corpora/turkish-constitution-corpus/)

These diverse sources span a wide range of texts, from encyclopedic articles to legal documents, providing a comprehensive linguistic foundation for the tokenizer.

## Tokenizer Model
The tokenizer uses the WordPiece model, which is widely used in modern NLP systems. Its subword segmentation approach makes it particularly effective for morphologically rich languages like Turkish. Being cased, the tokenizer distinguishes between uppercase and lowercase letters.
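
To see how WordPiece subword segmentation handles Turkish agglutination, here is a minimal sketch using the `tokenizers` library with a tiny hand-built vocabulary. The vocabulary and example word below are illustrative assumptions, not taken from this repository's actual 256k vocabulary:

```python
# Minimal illustration of WordPiece subword segmentation (a tiny
# hand-built vocab for demonstration, not this repo's 256k vocabulary).
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace

# Stem "ev" (house) plus common Turkish suffixes, marked with the "##"
# continuation prefix WordPiece uses for non-initial subwords.
vocab = {"[UNK]": 0, "ev": 1, "##ler": 2, "##im": 3, "##iz": 4, "##de": 5}
tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# "evlerimizde" ("in our houses") is segmented greedily, longest match first:
print(tok.encode("evlerimizde").tokens)
# → ['ev', '##ler', '##im', '##iz', '##de']
```

The greedy longest-match-first strategy is what lets a single stem combine with many suffix pieces, keeping the vocabulary compact despite Turkish's productive morphology.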

## Usage
To use this tokenizer, load it via the Hugging Face `transformers` library as follows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tahaenesaslanturk/ts-corpus-wordpiece-256k-cased")
```