PuoBERTa: A curated Setswana Language Model

Zenodo DOI badge, arXiv, 🤗 https://huggingface.co/dsfsi/PuoBERTa

Give Feedback 📑: DSFSI Resource Feedback Form

A RoBERTa-based language model designed specifically for Setswana and trained on the new PuoData dataset.

Model Details

Model Description

This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications from translation to content creation. It's powered by the PuoData dataset to ensure accuracy and cultural relevance.

  • Developed by: Vukosi Marivate (@vukosi), Moseli Mots'Oehli (@MoseliMotsoehli), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
  • Model type: RoBERTa Model
  • Language(s) (NLP): Setswana
  • License: CC BY 4.0

Usage

Use this model to fill in masked tokens, or fine-tune it for downstream tasks. Here's a simple example that loads the model and tokenizer:

from transformers import RobertaTokenizer, RobertaModel

# Load model and tokenizer
model = RobertaModel.from_pretrained('dsfsi/PuoBERTa')
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTa')
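
To go from loading the model to an actual masked prediction, one option is the `fill-mask` pipeline from `transformers`. The sketch below is illustrative and not the authors' own code: `top_predictions` and `load_fill_mask` are hypothetical helper names, and it assumes the published checkpoint includes a masked-LM head that the pipeline can load. The input sentence is left as a placeholder.

```python
from typing import Callable, List, Tuple


def top_predictions(fill_mask: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs for the single masked slot in `text`.

    `fill_mask` is any callable that accepts `top_k` and returns a list of
    dicts with `token_str` and `score` keys, which is the shape the
    transformers fill-mask pipeline returns for one mask token.
    """
    candidates = fill_mask(text, top_k=k)
    return [(c["token_str"].strip(), round(float(c["score"]), 4)) for c in candidates]


def load_fill_mask(model_id: str = "dsfsi/PuoBERTa"):
    """Build a fill-mask pipeline (hypothetical helper; downloads weights on first use)."""
    # Imported lazily so that top_predictions can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


# Example usage (needs network access and the transformers package):
# fill_mask = load_fill_mask()
# sentence = "<your Setswana text> " + fill_mask.tokenizer.mask_token
# print(top_predictions(fill_mask, sentence))
```

Reading the mask token from `fill_mask.tokenizer.mask_token` avoids hard-coding it, in case the tokenizer uses something other than RoBERTa's usual `<mask>`.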

Downstream Use

Downstream Performance

Daily News Dikgang

Learn more about the dataset in the Dataset Folder

| Model | 5-fold Cross Validation F1 | Test F1 |
|---|---|---|
| Logistic Regression + TFIDF | 60.1 | 56.2 |
| NCHLT TSN RoBERTa | 64.7 | 60.3 |
| PuoBERTa | 63.8 | 62.9 |
| PuoBERTa+JW300 | 66.2 | 65.4 |

Downstream News Categorisation model 🤗 https://huggingface.co/dsfsi/PuoBERTa-News
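
For readers unfamiliar with the "5-fold Cross Validation F1" column, the sketch below shows one way such a score could be computed in plain Python. It is not the authors' evaluation code. The card does not say which F1 average was used, so macro averaging is an assumption, and `macro_f1`, `k_fold_indices` and `cross_validated_f1` are hypothetical names. `train_and_predict` stands in for any classifier, such as a TF-IDF baseline or a fine-tuned PuoBERTa.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging; the card does not specify one)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def k_fold_indices(n: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds and return (train_idx, val_idx) pairs.

    No shuffling or stratification is done here; a real experiment would
    usually shuffle (and often stratify by label) first.
    """
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


def cross_validated_f1(texts: Sequence[str], labels: Sequence[str],
                       train_and_predict: Callable, k: int = 5) -> float:
    """Average macro-F1 over k folds.

    `train_and_predict(train_texts, train_labels, val_texts)` must return
    one predicted label per item in `val_texts`.
    """
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(texts), k):
        preds = train_and_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in val_idx],
        )
        fold_scores.append(macro_f1([labels[i] for i in val_idx], preds))
    return mean(fold_scores)
```

The "Test F1" column would then correspond to calling `macro_f1` once on a held-out test split, under the same averaging assumption.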

MasakhaPOS

Performance of models on the MasakhaPOS downstream task.

| Model | Test Performance |
|---|---|
| Multilingual Models | |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| Monolingual Models | |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | 83.4 |
| PuoBERTa+JW300 | 84.1 |

Downstream POS model 🤗 https://huggingface.co/dsfsi/PuoBERTa-POS

MasakhaNER

Performance of models on the MasakhaNER downstream task.

| Model | Test Performance (F1 score) |
|---|---|
| Multilingual Models | |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| Monolingual Models | |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | 78.2 |
| PuoBERTa+JW300 | 80.2 |

Downstream NER model 🤗 https://huggingface.co/dsfsi/PuoBERTa-NER
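
NER F1 scores like those above are commonly computed over whole entity spans rather than individual tokens. The sketch below illustrates that idea in plain Python. It assumes BIO-style tags and exact-match, micro-averaged scoring, neither of which the card confirms, and `extract_spans` and `entity_f1` are hypothetical names rather than the authors' evaluation code.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (e.g. B-PER, I-PER, O).

    An I- tag that does not continue a span of the same type is ignored,
    which is stricter than some evaluation tools.
    """
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match entity spans across all sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)  # a span counts only if boundaries and type both match
        n_gold += len(g)
        n_pred += len(p)
    return 2 * tp / (n_gold + n_pred) if n_gold + n_pred else 0.0
```

Under this scheme a prediction that gets an entity's type right but its boundary wrong earns no credit, which is one reason span-level F1 tends to be lower than token-level accuracy.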

Pre-Training Dataset

We used the PuoData dataset, a rich source of Setswana text, ensuring that our model is well-trained and culturally attuned.

GitHub, 🤗 https://huggingface.co/datasets/dsfsi/PuoData

Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

Contributing

Your contributions are welcome! Feel free to improve the model.

Model Card Authors

Vukosi Marivate

Model Card Contact

For more details, reach out or check our website.

Email: vukosi.marivate@cs.up.ac.za

Enjoy exploring Setswana through AI!

Model size: 83.5M params (Safetensors, tensor type F32)