Model Card for Turkish Byte Pair Encoding Tokenizer
This model provides a tokenizer designed specifically for Turkish. Its vocabulary contains nearly 25,000 Turkish word roots and all Turkish suffixes in both lowercase and uppercase forms, extended with approximately 14,000 additional tokens obtained using Byte Pair Encoding (BPE). The tokenizer is intended to improve tokenization quality for NLP tasks involving Turkish text.
Model Details
Model Description
This tokenizer is designed to handle the complex, agglutinative morphology of Turkish. By combining a predefined set of word roots and suffixes with BPE, it aims to tokenize efficiently while preserving linguistic structure and keeping the vocabulary compact for downstream tasks.
- Developed by: Ali Arda Fincan
- Model type: Tokenizer (Byte Pair Encoding & Pre-Defined Turkish Words)
- Language(s) (NLP): Turkish
- License: Apache-2.0
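The idea of combining predefined roots and suffixes with a subword fallback can be illustrated with a small sketch. It is a toy under stated assumptions, not the tokenizer's actual algorithm: the `ROOTS` and `SUFFIXES` sets, the greedy longest-match strategy, and the single-character fallback (standing in for BPE pieces) are all hypothetical and chosen only for illustration.

```python
# Toy vocabulary: every entry is illustrative, not taken from the real tokenizer.
ROOTS = {"ev", "kitap", "göz"}
SUFFIXES = {"ler", "lar", "im", "de", "den", "lik"}


def longest_match(text, start, vocab):
    """Return the longest vocab entry found in `text` at position `start`, or None."""
    for end in range(len(text), start, -1):
        if text[start:end] in vocab:
            return text[start:end]
    return None


def segment(word):
    """Split a word into a root, then suffixes, falling back to single characters."""
    pieces = []
    pos = 0
    root = longest_match(word, 0, ROOTS)
    if root is not None:
        pieces.append(root)
        pos = len(root)
    while pos < len(word):
        # Suffixes are only tried once a known root has been found.
        piece = longest_match(word, pos, SUFFIXES) if root is not None else None
        if piece is None:
            # Stand-in for the BPE fallback: emit one character.
            piece = word[pos]
        pieces.append(piece)
        pos += len(piece)
    return pieces


print(segment("evlerimde"))
print(segment("kitaplar"))
print(segment("xyz"))
```

A word built from a known root and known suffixes comes out as morpheme-like pieces, while a word with no known root degrades to the fallback units. A real BPE fallback would merge frequent character sequences instead of emitting single characters.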
Model Sources
- Repository: umarigan/turkish_corpus_small
Direct Use
This tokenizer can be used directly to tokenize Turkish text for tasks such as text classification, translation, or sentiment analysis. Because its vocabulary is built around Turkish roots and suffixes, it is also suited to tasks that benefit from morphologically informed text processing.
Downstream Use
The tokenizer can be integrated into NLP pipelines for Turkish language processing, for both model training and inference, and can be adapted further where a task requires it.
Out-of-Scope Use
The tokenizer is not designed for non-Turkish languages or tasks requiring domain-specific tokenization not covered in its training.
Bias, Risks, and Limitations
While this tokenizer is optimized for Turkish, biases may arise if the training data contains imbalances or stereotypes. It may also perform suboptimally on highly informal or domain-specific text.
Recommendations
Users should evaluate the tokenizer on their specific datasets and tasks to identify any biases or limitations. Supplementary preprocessing or token adjustments may be required for optimal results.
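One way to carry out such an evaluation is to measure how many tokens the tokenizer produces per word and how often it emits an unknown token. The sketch below is a minimal example of that, not an official evaluation script: the `tokenizer_stats` helper, the `"[UNK]"` default, and the `chunk_tokenize` stand-in are assumptions introduced here so the code runs without downloading anything.

```python
def tokenizer_stats(texts, tokenize, unk_token="[UNK]"):
    """Compute tokens per whitespace-separated word and the share of unknown tokens."""
    n_words = 0
    n_tokens = 0
    n_unk = 0
    for text in texts:
        tokens = tokenize(text)
        n_words += len(text.split())
        n_tokens += len(tokens)
        n_unk += sum(1 for t in tokens if t == unk_token)
    return {
        "fertility": n_tokens / n_words if n_words else 0.0,
        "unk_rate": n_unk / n_tokens if n_tokens else 0.0,
    }


def chunk_tokenize(text):
    """Hypothetical stand-in tokenizer: cut each word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = ["Türkçe metin işleme", "bir örnek"]
print(tokenizer_stats(sample, chunk_tokenize))
```

To evaluate the real tokenizer, `chunk_tokenize` would be replaced with `tokenizer.tokenize` and `unk_token` with the tokenizer's own unknown token, if it defines one. Lower tokens-per-word values on in-domain text generally indicate a better fit, and a high unknown-token rate points to text the vocabulary does not cover.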
How to Get Started with the Model
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("aliarda/turkish_tokenizer")

# Example usage: split a Turkish sentence into tokens.
text = "Türkçe metin işleme için bir örnek."
tokens = tokenizer.tokenize(text)
print(tokens)