
Swahili BERT WordPiece Tokenizer

A BERT WordPiece tokenizer trained specifically for Swahili. It is designed for effective tokenization of Swahili text and supports BERT-based models and other transformer architectures that expect BERT-style input.

Model Details

  • Model type: BERT WordPiece Tokenizer
  • Language: Swahili
  • Vocabulary size: 30,000 tokens
  • Training data: publicly available online Swahili text combined with proprietary data from the 3D & Robotics Lab

Features

  • Specifically optimized for Swahili language patterns
  • Handles common Swahili morphological structures
  • Includes standard BERT special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
  • Full compatibility with HuggingFace Transformers library

Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Benjamin-png/bert-tokenizer-swahili_30000_minfreq_2")

# Example usage
text = "Habari za asubuhi"
encoded = tokenizer(text)
print(encoded.tokens())  # tokens() is a method on fast tokenizers
# ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']
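For batches, the tokenizer can also pad sequences to a common length. A minimal sketch of batch padding follows; it builds a tiny stand-in vocabulary with `BertTokenizerFast` so it runs offline, whereas in practice you would load the published tokenizer with `AutoTokenizer.from_pretrained` as above:

```python
import os
import tempfile

from transformers import BertTokenizerFast

# Tiny stand-in WordPiece vocab so the sketch runs offline;
# the published tokenizer ships a 30,000-token vocab.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "habari", "za", "asubuhi", "karibu", "tanzania"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(vocab))
    vocab_file = f.name

tok = BertTokenizerFast(vocab_file=vocab_file, do_lower_case=True)

# padding=True pads the shorter sequence to the longest in the batch
batch = tok(["Habari za asubuhi", "Karibu Tanzania"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])

os.unlink(vocab_file)
```

The attention mask marks padded positions with 0 so downstream models can ignore them.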

Training Details

The tokenizer was trained with the following specifications:

  • Vocabulary size: 30,000 tokens
  • Minimum frequency: 2
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Clean text: True
  • Handle Chinese characters: False
  • Strip accents: True
  • Lowercase: True
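These settings map directly onto the `tokenizers` library's BertWordPieceTokenizer. A hedged sketch of how training with this configuration could look — the three-sentence corpus here is a placeholder for the real training data, so with min_frequency=2 it mostly yields character-level pieces:

```python
from tokenizers import BertWordPieceTokenizer

# Placeholder corpus; the actual tokenizer was trained on a large
# Swahili corpus (public web data plus proprietary lab data).
corpus = [
    "Habari za asubuhi",
    "Ninafurahi kukutana nawe",
    "Karibu Tanzania",
]

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True,
)
tokenizer.train_from_iterator(
    corpus,
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

enc = tokenizer.encode("Habari za asubuhi")
print(enc.tokens)  # mostly character-level pieces with this toy corpus
```

A `tokenizer.save_model(directory)` call would then write out the `vocab.txt` that BERT-style tokenizers load.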

Example Outputs

Input: "Habari za asubuhi"
Tokens: ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']

Input: "Ninafurahi kukutana nawe"
Tokens: ['[CLS]', 'ninafurahi', 'kukutana', 'nawe', '[SEP]']

Input: "Karibu Tanzania"
Tokens: ['[CLS]', 'karibu', 'tanzania', '[SEP]']

Limitations

  • The vocabulary covers only what was observed in the training datasets listed above
  • Performance may vary for specialized domains or dialects not well-represented in the training data
  • Rare or complex Swahili words might be split into subwords
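The subword splitting in the last point uses WordPiece's "##" continuation prefix. A small offline illustration with a hand-built stand-in vocabulary — the piece inventory is invented for the example, not taken from the real vocab:

```python
import os
import tempfile

from transformers import BertTokenizerFast

# Invented stand-in vocab: "ninafurahi" is absent as a whole word,
# but its pieces are present, so WordPiece must split it.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "nina", "##furahi"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(vocab))
    vocab_file = f.name

tok = BertTokenizerFast(vocab_file=vocab_file, do_lower_case=True)
print(tok.tokenize("ninafurahi"))  # → ['nina', '##furahi']

os.unlink(vocab_file)
```

WordPiece greedily matches the longest prefix in the vocabulary ("nina"), then marks the remainder as a continuation piece ("##furahi"); words with no matching pieces at all map to [UNK].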

Intended Use

This tokenizer is designed for:

  • Pre-processing Swahili text for BERT-based models
  • Natural Language Processing tasks in Swahili
  • Text analysis and processing applications

Citation

If you use this tokenizer in your research, please cite:

@misc{swahili-bert-tokenizer,
  author = {Benjamin-png},
  title = {BERT WordPiece Tokenizer for Swahili},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Benjamin-png/bert-tokenizer-swahili}}
}

Contact

For questions and feedback, please open an issue in the GitHub repository or contact through Hugging Face.

License

MIT License
