mertcobanov
committed on
Create README.md
README.md
ADDED
---
license: apache-2.0
language:
- tr
tags:
- turkish
---

# Turkish WordPiece Tokenizer

This repository contains a **WordPiece tokenizer** trained on **1 billion Turkish sentences**, making it well suited for natural language processing (NLP) tasks in Turkish. The tokenizer was built with the `tokenizers` library and is provided in both cased and uncased versions for flexibility.

## Repository Structure

| File Name | Description |
|-----------|-------------|
| `special_tokens_map.json` | Maps special tokens like `[UNK]`, `[PAD]`, `[CLS]`, and `[SEP]` to their respective identifiers. |
| `tokenizer_config.json` | Contains configuration details for the tokenizer, including model type and special token settings. |
| `turkish_wordpiece_tokenizer.json` | The primary WordPiece tokenizer trained on 1 billion Turkish sentences (cased). |
| `turkish_wordpiece_tokenizer_uncased.json` | The uncased version of the WordPiece tokenizer. |
| `turkish_wordpiece_tokenizer_post_token_uncased.json` | The post-tokenization configuration for the uncased tokenizer. |

## Features

- **WordPiece Tokenization**: Breaks words into subword units for better handling of rare or unseen words.
- **Support for Cased and Uncased Text**: Includes separate tokenizers for preserving case sensitivity and for ignoring case.
- **Optimized for Turkish**: Trained on a large-scale Turkish dataset (1 billion sentences), ensuring strong coverage of Turkish vocabulary and grammar.
- **Special Tokens**: Includes commonly used tokens such as the following (an ID-lookup sketch appears after this list):
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classification token)
  - `[SEP]` (separator token)
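
As a quick way to confirm these special tokens are registered in the vocabulary, the following minimal sketch looks them up with the `tokenizers` API. The file path is a placeholder, and the exact IDs depend on the trained vocabulary:

```python
from tokenizers import Tokenizer

# Load one of the tokenizer files from this repository (placeholder path).
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# token_to_id returns None for tokens missing from the vocabulary,
# so this doubles as a sanity check that all special tokens are present.
for token in ["[UNK]", "[PAD]", "[CLS]", "[SEP]"]:
    print(token, "->", tokenizer.token_to_id(token))
```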

## Usage

To use the tokenizer, you can load it with the Hugging Face `transformers` library or the `tokenizers` library.

### Loading with `tokenizers`:

```python
from tokenizers import Tokenizer

# Load the uncased tokenizer
tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# Tokenize a sentence
output = tokenizer.encode("Merhaba dünya!")
print(output.tokens)
```
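
### Loading with `transformers` (sketch):

The tokenizer can also be used through the Hugging Face `transformers` library. One way to do this is to wrap the tokenizer file in `PreTrainedTokenizerFast`; this is a minimal sketch rather than an official recipe for this repository, the special-token names are assumed to match `special_tokens_map.json` above, and the file path is a placeholder:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers file so it behaves like a standard transformers tokenizer.
# Special-token names below are assumed to match special_tokens_map.json.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="path/to/turkish_wordpiece_tokenizer_uncased.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

encoded = hf_tokenizer("Merhaba dünya!")
print(encoded["input_ids"])
print(hf_tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

Depending on how `tokenizer_config.json` is set up, `AutoTokenizer.from_pretrained` may also work directly with the files in this repository.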

## Tokenizer Training Details

- **Dataset**: 1 billion Turkish sentences, sourced from diverse domains (news, social media, literature, etc.).
- **Model**: WordPiece tokenizer, trained with a vocabulary size suitable for the Turkish language.
- **Uncased Variant**: Lowercases all text during tokenization to ignore case distinctions (a sanity check follows this list).
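
Because the uncased tokenizer ships as a separate file, a quick check can confirm the lowercasing behaviour. The sketch below assumes the uncased tokenizer applies a lowercasing normalizer and uses a placeholder path:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/turkish_wordpiece_tokenizer_uncased.json")

# If case is normalized away, differently-cased spellings of the same
# sentence should map to the same token sequence.
cased = tokenizer.encode("Merhaba Dünya!").tokens
lowered = tokenizer.encode("merhaba dünya!").tokens
print(cased == lowered)  # expected True when lowercasing is applied
```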

## Applications

- **Text Classification**
- **Machine Translation**
- **Question Answering**
- **Text Summarization**
- **Named Entity Recognition (NER)**

## Citation

If you use this tokenizer in your research or applications, please cite it as follows:

```bibtex
@misc{turkish_wordpiece_tokenizer,
  title={Turkish WordPiece Tokenizer},
  author={Mert Cobanov},
  year={2024},
  url={https://huggingface.co/mertcobanov/turkish-wordpiece-tokenizer}
}
```

## Contributions

Contributions are welcome! If you have suggestions or improvements, please create an issue or submit a pull request.