---
license: cc-by-4.0
datasets:
- dsfsi/vukuzenzele-monolingual
- nchlt
- dsfsi/PuoData
- dsfsi/gov-za-monolingual
language:
- tn
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked language model
- setswana
---
# PuoBERTaJW300: A Curated Setswana Language Model (trained on PuoData + JW300 Setswana)
[DOI](https://doi.org/10.5281/zenodo.8434795) [arXiv](https://arxiv.org/abs/2310.09141) 🤗 [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)
A RoBERTa-based language model specially designed for Setswana, trained on the new PuoData dataset plus the JW300 Setswana corpus.
**NOTE**: If you are looking for the model without JW300, go to [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)
## Model Details
### Model Description
This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications, from translation to content creation. It is powered by the PuoData dataset plus the JW300 Setswana corpus, ensuring accuracy and cultural relevance.
- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co/@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0
### Usage
Use this model to fill in masked tokens, or fine-tune it for downstream tasks. Here's a simple example of masked-token prediction (the sample sentence is only illustrative):
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load model and tokenizer; the masked-LM head is needed for mask filling
tokenizer = AutoTokenizer.from_pretrained("dsfsi/PuoBERTaJW300")
model = AutoModelForMaskedLM.from_pretrained("dsfsi/PuoBERTaJW300")

# RoBERTa-style models use "<mask>" as the mask token
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("Gauteng ke <mask>."))
```
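The same checkpoint can also be fine-tuned for classification. Below is a minimal, hypothetical sketch using the 🤗 `Trainer` API; the toy dataset, label count, and hyperparameters are placeholders for illustration, not the settings used for the results reported below.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dsfsi/PuoBERTaJW300")
model = AutoModelForSequenceClassification.from_pretrained(
    "dsfsi/PuoBERTaJW300", num_labels=2)  # num_labels depends on your task

# Toy two-example dataset purely for illustration; substitute real labelled data
train_ds = Dataset.from_dict({
    "text": ["Dumela lefatshe", "Ke a leboga"],
    "label": [0, 1],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="puoberta-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=train_ds,
)
trainer.train()
```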
### Downstream Use
Fine-tuned versions of this model are available for news categorisation, part-of-speech tagging, and named-entity recognition; their performance is summarised below.
## Downstream Performance
### Daily News Dikgang
Learn more about the dataset in the [Dataset Folder](daily-news-dikgang)
| **Model** | **5-fold Cross-Validation F1** | **Test F1** |
|---|---|---|
| Logistic Regression + TF-IDF | 60.1 | 56.2 |
| NCHLT TSN RoBERTa | 64.7 | 60.3 |
| PuoBERTa | 63.8 | 62.9 |
| PuoBERTaJW300 | **66.2** | **65.4** |
Downstream News Categorisation model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-News](https://huggingface.co/dsfsi/PuoBERTa-News)
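If you want to call the categoriser directly, a minimal sketch (assuming the checkpoint ships a standard sequence-classification head):

```python
from transformers import pipeline

# Assumption: dsfsi/PuoBERTa-News loads as a standard text-classification model
classifier = pipeline("text-classification", model="dsfsi/PuoBERTa-News")
print(classifier("..."))  # replace "..." with a Setswana news snippet
```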
### MasakhaPOS
Performance of models on the MasakhaPOS downstream task.
| Model | Test Performance |
|---|---|
| **Multilingual Models** | |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | 83.4 |
| PuoBERTa+JW300 | **84.1** |
Downstream POS model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-POS](https://huggingface.co/dsfsi/PuoBERTa-POS)
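A minimal sketch of tagging a sentence with the downstream model (assuming it ships a standard token-classification head):

```python
from transformers import pipeline

# Assumption: dsfsi/PuoBERTa-POS loads as a standard token-classification model
pos_tagger = pipeline("token-classification", model="dsfsi/PuoBERTa-POS")
for token in pos_tagger("Ke rata Setswana"):  # "I love Setswana"
    print(token["word"], token["entity"])
```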
### MasakhaNER
Performance of models on the MasakhaNER downstream task.
| Model | Test Performance (F1 score) |
|---|---|
| **Multilingual Models** | |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | 78.2 |
| PuoBERTa+JW300 | **80.2** |
Downstream NER model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-NER](https://huggingface.co/dsfsi/PuoBERTa-NER)
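A minimal sketch of extracting entities with the downstream model (assuming it ships a standard token-classification head; `aggregation_strategy="simple"` merges sub-word pieces into whole entity spans):

```python
from transformers import pipeline

# Assumption: dsfsi/PuoBERTa-NER loads as a standard token-classification model
ner = pipeline("token-classification", model="dsfsi/PuoBERTa-NER",
               aggregation_strategy="simple")
print(ner("Vukosi Marivate o dira kwa Pretoria."))
```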
## Pre-Training Dataset
We used the PuoData dataset, a rich source of Setswana text, together with the JW300 Setswana corpus, to ensure the model is well trained and culturally attuned.
[Github](https://github.com/dsfsi/PuoData), 🤗 [https://huggingface.co/datasets/dsfsi/PuoData](https://huggingface.co/datasets/dsfsi/PuoData)
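To take a quick look at the corpus, a minimal sketch (assuming the dataset loads with its default 🤗 Datasets configuration):

```python
from datasets import load_dataset

# Assumption: the default configuration exposes the Setswana text splits
puodata = load_dataset("dsfsi/PuoData")
print(puodata)
```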
## Citation Information
Bibtex Reference
```bibtex
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
```
## Contributing
Your contributions are welcome! Feel free to improve the model.
## Model Card Authors
Vukosi Marivate
## Model Card Contact
For more details, reach out or check our [website](https://dsfsi.github.io/).
Email: vukosi.marivate@cs.up.ac.za
**Enjoy exploring Setswana through AI!**