---
license: mit
language:
- ky
tags:
- tokenization
- WordPiece
- kyrgyz
- tokenizer
---

A tokenizer for the Kyrgyz language based on WordPiece subword segmentation. Its 100,000-subword vocabulary is intended to give broad coverage of Kyrgyz word forms across a range of NLP tasks. It was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.

Features:

- Language: Kyrgyz
- Vocabulary Size: 100,000 subwords
- Method: WordPiece

Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.

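To make the WordPiece method listed above more concrete, here is a minimal sketch of the greedy longest-match-first segmentation that WordPiece tokenizers commonly apply to each word at inference time. The `wordpiece_segment` helper, the toy vocabulary, and the BERT-style `##` continuation prefix and `[UNK]` token are assumptions for illustration only; they are not taken from this tokenizer's actual vocabulary or configuration.

```python
def wordpiece_segment(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: treat the whole word as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real vocabulary of this tokenizer.
toy_vocab = {"кыргыз", "тил", "##и", "##дер", "бай"}

print(wordpiece_segment("тили", toy_vocab))    # ['тил', '##и']
print(wordpiece_segment("тилдер", toy_vocab))  # ['тил', '##дер']
print(wordpiece_segment("кооз", toy_vocab))    # ['[UNK]']
```

With a large vocabulary, frequent words tend to survive as single pieces, while rarer inflected forms are split into a stem and suffix pieces, as the toy output suggests.
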
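Usage Example (Python with transformers):

```python
from transformers import AutoTokenizer

# "Your/Tokenizer/Path" is a placeholder: use the tokenizer's repository ID or a local directory.
tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")

# Sample sentence: "The Kyrgyz language is a rich and beautiful language."
text = "Кыргыз тили – бай жана кооз тил."

# Calling the tokenizer returns an encoding (input IDs and related fields), not a list of strings.
tokens = tokenizer(text)
print(tokens)
```

Tip: Applying text normalization or lemmatization during preprocessing may further improve results.
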
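As one possible reading of the normalization step mentioned in the tip, the sketch below uses only Python's standard library to apply Unicode NFC normalization, collapse whitespace, and optionally lowercase the text. The `normalize_text` helper and the choice of steps are assumptions for illustration. Whether lowercasing is appropriate depends on whether the tokenizer's vocabulary is cased, which this card does not state.

```python
import re
import unicodedata


def normalize_text(text, lowercase=True):
    """Illustrative preprocessing; the steps are assumptions, not part of the tokenizer."""
    # Compose characters into canonical form (NFC), e.g. "и" + combining breve -> "й".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including non-breaking spaces) into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    return text


print(normalize_text("  Кыргыз   тили \u2013 бай жана кооз тил.  "))
```

The normalized string can then be passed to the tokenizer in place of the raw text.
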
License and Attribution

This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.

Feedback and Contributions

We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.