Swahili BERT WordPiece Tokenizer
A BERT WordPiece tokenizer trained specifically for the Swahili language. It provides effective tokenization of Swahili text for BERT-based models and other transformer architectures.
Model Details
- Model type: BERT WordPiece Tokenizer
- Language: Swahili
- Vocabulary size: 30,000 tokens
- Training data: publicly available online Swahili text combined with proprietary data from the 3D & Robotics Lab
Features
- Specifically optimized for Swahili language patterns
- Handles common Swahili morphological structures
- Includes standard BERT special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
- Full compatibility with HuggingFace Transformers library
Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Benjamin-png/bert-tokenizer-swahili_30000_minfreq_2")

# Example usage
text = "Habari za asubuhi"
encoded = tokenizer(text)
print(encoded.tokens())  # tokens() is a method on the encoding returned by fast tokenizers
```
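When preparing inputs for a model, texts are typically encoded in batches with padding and truncation. A minimal sketch using the standard Transformers API (assumes PyTorch is installed; the sentences are illustrative):

```python
batch = tokenizer(
    ["Habari za asubuhi", "Ninafurahi kukutana nawe"],
    padding=True,          # pad shorter sequences to the longest in the batch
    truncation=True,       # truncate overly long sequences
    return_tensors="pt",   # return PyTorch tensors
)
print(batch["input_ids"].shape)       # (2, sequence_length)
print(batch["attention_mask"].shape)  # same shape as input_ids
```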
Training Details
The tokenizer was trained with the following specifications:
- Vocabulary size: 30,000 tokens
- Minimum frequency: 2
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
- Clean text: True
- Handle Chinese characters: False
- Strip accents: True
- Lowercase: True
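These specifications map directly onto the options of `BertWordPieceTokenizer` in the Hugging Face `tokenizers` library. A minimal sketch of how a tokenizer with the same configuration could be trained; the corpus file path is a placeholder, since the actual training data is not published:

```python
import os
from tokenizers import BertWordPieceTokenizer

# Normalization options matching the specifications above.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True,
)

# "swahili_corpus.txt" is a hypothetical path standing in for the training corpus.
tokenizer.train(
    files=["swahili_corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

os.makedirs("swahili-tokenizer", exist_ok=True)
tokenizer.save_model("swahili-tokenizer")  # writes vocab.txt to the directory
```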
Example Outputs
```
Input:  "Habari za asubuhi"
Tokens: ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']

Input:  "Ninafurahi kukutana nawe"
Tokens: ['[CLS]', 'ninafurahi', 'kukutana', 'nawe', '[SEP]']

Input:  "Karibu Tanzania"
Tokens: ['[CLS]', 'karibu', 'tanzania', '[SEP]']
```
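With the tokenizer loaded as in the Usage section, these outputs can be reproduced directly:

```python
for text in ["Habari za asubuhi", "Ninafurahi kukutana nawe", "Karibu Tanzania"]:
    print(tokenizer(text).tokens())
```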
Limitations
- The vocabulary reflects only the training corpora listed above; out-of-vocabulary words are split into subword pieces
- Performance may vary for specialized domains or dialects not well represented in the training data
- Rare or morphologically complex Swahili words may be split into several subwords, as shown in the sketch below
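A quick way to inspect how a given word is segmented. The split in the comment is illustrative only; the actual pieces depend on the learned vocabulary:

```python
# Inspect the WordPiece segmentation of a long, morphologically rich word.
pieces = tokenizer.tokenize("tutakapokutana")  # roughly "when we will meet"
print(pieces)  # e.g. ['tutaka', '##po', '##kutana'] -- illustrative, actual split may differ
```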
Intended Use
This tokenizer is designed for:
- Pre-processing Swahili text for BERT-based models (see the sketch after this list)
- Natural Language Processing tasks in Swahili
- Text analysis and processing applications
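For the first use case, a minimal preprocessing sketch with the Hugging Face `datasets` library; the in-memory corpus, column name, and `max_length` are illustrative choices, not part of this repository:

```python
from datasets import Dataset

# Illustrative in-memory corpus; in practice, load your own Swahili dataset.
ds = Dataset.from_dict({"text": ["Habari za asubuhi", "Karibu Tanzania"]})

def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded_ds = ds.map(tokenize_batch, batched=True)
print(encoded_ds.column_names)  # ['text', 'input_ids', 'token_type_ids', 'attention_mask']
```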
Citation
If you use this tokenizer in your research, please cite:
```bibtex
@misc{swahili-bert-tokenizer,
  author       = {Benjamin-png},
  title        = {BERT WordPiece Tokenizer for Swahili},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Benjamin-png/bert-tokenizer-swahili}}
}
```
Contact
For questions and feedback, please open an issue in the GitHub repository or contact through Hugging Face.
License
MIT License