---
license: cc-by-4.0
datasets:
- dsfsi/vukuzenzele-monolingual
- nchlt
- dsfsi/PuoData
- dsfsi/gov-za-monolingual
language:
- tn
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked language model
- setswana
---
# PuoBertaJW300: A curated Setswana Language Model (trained on PuoData + JW300 Setswana)
[![Zenodo doi badge](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.8434795-blue.svg)](https://doi.org/10.5281/zenodo.8434795) [![arXiv](https://img.shields.io/badge/arXiv-2310.09141-b31b1b.svg)](https://arxiv.org/abs/2310.09141) 🤗 [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)

A RoBERTa-based language model designed specifically for Setswana, trained on the new PuoData dataset plus the JW300 Setswana corpus.

**NOTE**: If you are looking for the model without JW300, go to [https://huggingface.co/dsfsi/PuoBERTa](https://huggingface.co/dsfsi/PuoBERTa)

## Model Details

### Model Description

This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications, from translation to content creation. It is powered by the PuoData dataset plus the JW300 Setswana corpus to ensure accuracy and cultural relevance.

- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co/@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0


### Usage

Use this model to fill in masks or fine-tune it for downstream tasks. Here's a simple example of masked prediction (the example sentence is illustrative):

```python
from transformers import pipeline

# Load a fill-mask pipeline, which pulls in the model and its tokenizer
unmasker = pipeline("fill-mask", model="dsfsi/PuoBERTaJW300")

# Predict the masked word; the example sentence is illustrative
print(unmasker("Ke rata go <mask> dijo."))
```
### Downstream Use 
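
To adapt the model to a downstream task such as news categorisation, fine-tune it with a task-specific head. Below is a minimal sketch using the 🤗 `Trainer` API; `train_dataset`, `eval_dataset`, `num_labels`, and the output directory are placeholders for your own task setup.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load the pre-trained encoder with a fresh classification head
tokenizer = AutoTokenizer.from_pretrained("dsfsi/PuoBERTaJW300")
model = AutoModelForSequenceClassification.from_pretrained(
    "dsfsi/PuoBERTaJW300",
    num_labels=10,  # placeholder: set to the number of classes in your task
)

# train_dataset / eval_dataset are placeholders for your own tokenized datasets
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="puoberta-jw300-finetuned", num_train_epochs=3),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```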

## Downstream Performance

### Daily News Dikgang

Learn more about the dataset in the [Dataset Folder](daily-news-dikgang)

| **Model**                   | **5-fold Cross Validation F1**       | **Test F1**       |
|-----------------------------|--------------------------------------|-------------------|
| Logistic Regression + TFIDF | 60.1                                 | 56.2              |
| NCHLT TSN RoBERTa           | 64.7                                 | 60.3              |
| PuoBERTa                    | 63.8                                 | 62.9              |
| PuoBERTaJW300               | **66.2**                             | **65.4**          |

Downstream News Categorisation model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-News](https://huggingface.co/dsfsi/PuoBERTa-News)
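
For quick inference, the fine-tuned model can be loaded through a `text-classification` pipeline (a minimal sketch; the category labels come from the PuoBERTa-News model's own configuration, and the example sentence is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned news categorisation model
classifier = pipeline("text-classification", model="dsfsi/PuoBERTa-News")

# The input sentence is illustrative
print(classifier("Dikgang tsa metshameko tsa gompieno"))
```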

### MasakhaPOS

Performance of models on the MasakhaPOS downstream task.

| Model | Test Performance |
|---|---|
| **Multilingual Models** |  |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| **Monolingual Models** |  |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | 83.4 |
| PuoBERTa+JW300 | **84.1** |

Downstream POS model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-POS](https://huggingface.co/dsfsi/PuoBERTa-POS)

### MasakhaNER

Performance of models on the MasakhaNER downstream task.

| Model | Test Performance (F1 score) |
|---|---|
| **Multilingual Models** |  |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| **Monolingual Models** |  |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | 78.2 |
| PuoBERTa+JW300 | **80.2** |

Downstream NER model 🤗 [https://huggingface.co/dsfsi/PuoBERTa-NER](https://huggingface.co/dsfsi/PuoBERTa-NER)
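
For token-level predictions, the NER model can likewise be loaded through a `token-classification` pipeline (a minimal sketch; `aggregation_strategy="simple"` merges sub-word tokens into whole-word entities, and the example sentence is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned NER model; aggregation merges sub-word tokens
ner = pipeline("token-classification", model="dsfsi/PuoBERTa-NER",
               aggregation_strategy="simple")

# The input sentence is illustrative
print(ner("Vukosi Marivate o bereka kwa Pretoria."))
```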

## Pre-Training Dataset

We used the PuoData dataset, a rich source of Setswana text, together with the JW300 Setswana corpus, ensuring that our model is well trained and culturally attuned.

[Github](https://github.com/dsfsi/PuoData), 🤗 [https://huggingface.co/datasets/dsfsi/PuoData](https://huggingface.co/datasets/dsfsi/PuoData)
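
PuoData can be loaded directly with the 🤗 `datasets` library (a minimal sketch; split and field names follow the dataset's own configuration):

```python
from datasets import load_dataset

# Load PuoData from the Hugging Face Hub
puodata = load_dataset("dsfsi/PuoData")
print(puodata)
```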

## Citation Information

BibTeX reference:

```
@inproceedings{marivate2023puoberta,
  title        = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author       = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year         = {2023},
  booktitle    = {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url          = {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords     = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url  = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
```

## Contributing

Your contributions are welcome! Feel free to improve the model.

## Model Card Authors

Vukosi Marivate

## Model Card Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

Email: vukosi.marivate@cs.up.ac.za

**Enjoy exploring Setswana through AI!**