metinovadilet / KyrgyzTokenizer-WordPiece-100k


	---
	license: mit
	language:
	- ky
	tags:
	- tokenization
	- WordPiece
	- kyrgyz
	- tokenizer
	---
	A tokenizer tailored for the Kyrgyz language, now using WordPiece segmentation for efficient subword tokenization. Its 100,000-subword vocabulary is intended to give broad coverage of Kyrgyz text across a range of NLP tasks. It was developed in collaboration with UlutSoft LLC and is meant to reflect authentic Kyrgyz language usage.
	Features:

	Language: Kyrgyz
	Vocabulary Size: 100,000 subwords
	Method: WordPiece
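
For readers unfamiliar with WordPiece, the sketch below shows the greedy longest-match-first segmentation commonly used at inference time. It is an illustration only: the function name `wordpiece_tokenize`, the toy vocabulary, the `##` continuation prefix, and the `[UNK]` token are assumptions borrowed from BERT-style tokenizers, not values read from this tokenizer's files.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the tokenizer's real 100,000 entries.
toy_vocab = {"кыргыз", "тил", "##и", "##дер", "бай"}

print(wordpiece_tokenize("тили", toy_vocab))    # ['тил', '##и']
print(wordpiece_tokenize("тилдер", toy_vocab))  # ['тил', '##дер']
print(wordpiece_tokenize("кооз", toy_vocab))    # ['[UNK]']
```

With a large vocabulary, fewer words should fall through to the unknown token and common words are more likely to stay whole, though the actual behavior depends on the trained vocabulary.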

	Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
	Usage Example (Python with transformers):

	```python
	from transformers import AutoTokenizer

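	# Replace the placeholder with the tokenizer's Hub repo id or a local directory.
	tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")
	text = "Кыргыз тили – бай жана кооз тил."
	# Calling the tokenizer returns an encoding with input_ids and related fields.
	encoding = tokenizer(text)
	print(encoding)
	# Show the subword strings behind the ids.
	print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))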
	```
	Tip: Consider applying normalization or lemmatization during preprocessing to further enhance the results.
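
One possible form of the normalization mentioned in the tip is sketched below, using only the Python standard library. The helper name `normalize_kyrgyz` and the choice of steps (Unicode NFC composition and whitespace collapsing) are assumptions for illustration. Whether to lowercase as well depends on whether the tokenizer is cased, which this card does not state, so the sketch leaves case untouched.

```python
import re
import unicodedata


def normalize_kyrgyz(text):
    """Light cleanup before tokenization (illustrative, not the author's pipeline)."""
    # Compose base letters with combining marks, e.g. "и" + U+0306 becomes "й".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# The input contains a decomposed "й" and irregular spacing.
raw = "Кыргыз   тили \u2013 баи\u0306 жана кооз тил.\n"
clean = normalize_kyrgyz(raw)
print(clean)
```

The cleaned string can then be passed to the tokenizer in place of the raw text, so that visually identical inputs map to the same subwords.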

	License and Attribution
	This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.

	Feedback and Contributions
	We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.