Odia SentencePiece Tokenizer Model
This repository hosts a SentencePiece tokenizer model for the Odia language, built to tokenize Odia text efficiently in NLP applications. The tokenizer was trained on a diverse dataset of Odia text to give broad coverage of the language.
Model Details
- Model Prefix: odia_tokenizers_test
- Model Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 50,000 tokens
File Structure
- odia_tokenizers_test.model: SentencePiece tokenizer model file.
- odia_tokenizers_test.vocab: Vocabulary file containing all token mappings.
Installation and Usage
To load and use this tokenizer model, install the sentencepiece package. The example below also needs huggingface_hub to download the model file:
pip install sentencepiece huggingface_hub
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Download the model file from Hugging Face
model_path = hf_hub_download(repo_id="shantipriya/OdiaTokenizer", filename="odia_tokenizers_test.model")
# Load the tokenizer model
sp = spm.SentencePieceProcessor()
sp.load(model_path)
# Sample text for tokenization
text = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"
# Tokenize the text into pieces (subwords or tokens)
tokens = sp.encode_as_pieces(text)
# Tokenize the text into token IDs (integer representations of the tokens)
token_ids = sp.encode_as_ids(text)
# Print the tokenized output
print("Tokens:", tokens)
print("Token IDs:", token_ids)
Sample Tokenization
The model was trained on a diverse corpus of Odia text. Here is an example of how it tokenizes an Odia sentence:
Input: ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।
Tokens: ['▁ଦୀପାବଳି', '▁ଏକ', '▁ଭାରତୀୟ', '▁ପର୍ବ', '▁।']
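Token IDs: [1234, 5678, 91011, 121314, 1516] (placeholder IDs for illustration only; real IDs will differ and always fall below the vocabulary size of 50,000)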
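The `▁` (U+2581) prefix on each piece marks a position where a space preceded it in the input. As a minimal sketch of how pieces map back to text, assuming SentencePiece's default behaviour of adding a dummy leading marker to the first piece, the pieces can be joined and the markers turned back into spaces. The name `pieces_to_text` is illustrative and not part of the library; on a loaded model, `sp.decode_pieces(tokens)` does this for you.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece uses for a preceding space


def pieces_to_text(pieces):
    """Join subword pieces and turn boundary markers back into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece carries a marker for a dummy leading space; drop it.
    return text[1:] if text.startswith(" ") else text


# Pieces copied from the sample output above.
pieces = ["▁ଦୀପାବଳି", "▁ଏକ", "▁ଭାରତୀୟ", "▁ପର୍ବ", "▁।"]
restored = pieces_to_text(pieces)
print(restored)
```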
Vocabulary Coverage
The vocabulary size was chosen to balance memory efficiency with language coverage, making it suitable for applications ranging from language modeling to text classification.
Vocabulary Statistics
- Total Tokens: 50,000
- Average Token Length: 6.46
- Max Token Length: 16
- Min Token Length: 1
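Statistics of this kind can be recomputed from the .vocab file, which stores one piece per line followed by a tab and a score. The sketch below is an assumption about the method, not the authors' script: it measures length in Unicode code points and counts the `▁` marker as one character, so its numbers may not match the figures above exactly. The helper name `token_length_stats` and the sample lines are hypothetical.

```python
import statistics


def token_length_stats(vocab_lines):
    """Summarise piece lengths from lines of a SentencePiece .vocab file."""
    lengths = []
    for line in vocab_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each line is "<piece>\t<score>"; only the piece is needed here.
        piece = line.split("\t")[0]
        lengths.append(len(piece))  # length in code points (assumption)
    return {
        "total": len(lengths),
        "average": statistics.mean(lengths),
        "max": max(lengths),
        "min": min(lengths),
    }


# Tiny made-up sample in .vocab format, for illustration only.
sample_lines = ["<unk>\t0\n", "▁ଏକ\t-1.5\n", "▁।\t-2.0\n", "ର\t-3.0\n"]
stats = token_length_stats(sample_lines)
print(stats)
```

To run it on the real file, pass an open file handle, for example `token_length_stats(open("odia_tokenizers_test.vocab", encoding="utf-8"))`.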
Training and Configuration Details
The tokenizer was trained using the SentencePiece library with the following configurations:
- Character Coverage: 99.995%
- Input Sentence Size: 200 million sentences
- Maximum Sentence Length: 4192 bytes (SentencePiece measures this limit in bytes, not characters)
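SentencePiece's trainer documents its `max_sentence_length` option as a byte count, which matters for Odia because each Odia code point takes three bytes in UTF-8. The sketch below compares character and byte lengths for a sentence and checks it against the limit listed above. The helper names `utf8_length` and `fits_limit` are illustrative, and treating over-length sentences as skipped during training is an assumption about trainer behaviour rather than something this model card states.

```python
MAX_SENTENCE_LENGTH = 4192  # value listed in this model card


def utf8_length(text):
    """Length of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def fits_limit(text, limit=MAX_SENTENCE_LENGTH):
    """True if the sentence is within the byte limit."""
    return utf8_length(text) <= limit


sentence = "ଦୀପାବଳି ଏକ ଭାରତୀୟ ପର୍ବ ।"
print("characters:", len(sentence))
print("UTF-8 bytes:", utf8_length(sentence))
print("within limit:", fits_limit(sentence))
```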
Model Training Parameters:
shuffle_input_sentence=True
split_by_unicode_script=True
split_by_whitespace=True
byte_fallback=True
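For readers who want to train a comparable tokenizer, the settings listed in this model card can be gathered into one call to `spm.SentencePieceTrainer.train`. This is a sketch under stated assumptions, not the authors' training script: the corpus path `odia_corpus.txt` is hypothetical, and any option not listed in this card is left at the library default.

```python
def build_training_args(corpus_path):
    """Collect the settings listed in this model card as trainer keyword arguments."""
    return {
        "input": corpus_path,                 # hypothetical: one sentence per line
        "model_prefix": "odia_tokenizers_test",
        "model_type": "bpe",
        "vocab_size": 50000,
        "character_coverage": 0.99995,        # 99.995%
        "input_sentence_size": 200_000_000,   # 200 million sentences
        "max_sentence_length": 4192,
        "shuffle_input_sentence": True,
        "split_by_unicode_script": True,
        "split_by_whitespace": True,
        "byte_fallback": True,
    }


def train(corpus_path):
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported here so the settings can be inspected without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path))


args = build_training_args("odia_corpus.txt")
for name, value in args.items():
    print(f"{name}={value}")
```

Calling `train("odia_corpus.txt")` with a real corpus file would start training; the block above only prints the settings.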
Intended Use
This model is intended for use in various NLP applications involving the Odia language, such as:
- Language Modeling
- Text Classification
- Named Entity Recognition (NER)
- Translation tasks involving Odia
License
This model is released under the CC BY-NC-SA 4.0 license.
Acknowledgments
This model was developed as part of a project to support low-resource language processing. Thanks to OdiaGenAI for providing the initial training data, which made this model possible.
Contributors
- Shantipriya Parida
- Sambit Sekhar
- Sahil Khan