---
license: cc
language:
- ve
- ts
- zu
- xh
- nso
- tn
library_name: transformers
tags:
- low-resource
- masked-language-model
- south africa
- tshivenda
---

# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

> Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of Bantu languages spoken in South Africa. These models are inspired by the work on AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to use language models to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

# Model Details

- **Model Name:** Zabantu-XLM-Roberta
- **Model Version:** 0.0.1
- **Model Architecture:** [XLM-RoBERTa](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80-250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.

## Usage example(s)

```python
from transformers import pipeline

# Initialize the fill-mask pipeline.
# Note: you may need to log in and request permission to access dsfsi models
# while they are in private beta.
unmasker = pipeline('fill-mask', model='dsfsi/zabantu-bantu-250m')

sample_sentences = {
    'zulu': "Le ndoda ithi izo____ ukudla.",        # masked word for Zulu
    'tshivenda': "Mufana uyo____ vhukuma.",         # masked word for Tshivenda
    'sepedi': "Mosadi o ____ pheka.",               # masked word for Sepedi
    'tswana': "Monna o ____ tsamaya.",              # masked word for Setswana
    'tsonga': "N'wana wa xisati u ____ ku tsaka."   # masked word for Xitsonga
}

for language, sentence in sample_sentences.items():
    # Replace the placeholder with the tokenizer's actual mask token
    masked_sentence = sentence.replace('____', unmasker.tokenizer.mask_token)
    # Get the model predictions
    results = unmasker(masked_sentence)
    print(f"Original sentence ({language}): {sentence}")
    print(f"Top prediction for the masked token: {results[0]['sequence']}\n")
```

* For fine-tuning tasks, check out these examples (a minimal fine-tuning sketch also appears after the performance tables below):
  * [Text Classification]()
  * [NER]()
  * [POS]()

## Model Variants

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages.
These include:

- [Zabantu-VEN](https://huggingface.co/dsfsi/zabantu-ven-120m): A monolingual language model trained on 73k raw sentences in Tshivenda
- [Zabantu-NSO](https://huggingface.co/dsfsi/zabantu-nso-80m): A monolingual language model trained on 179k raw sentences in Sepedi
- [Zabantu-NSO+VEN](https://huggingface.co/dsfsi/zabantu-nso-ven-170m): A bilingual language model trained on 179k raw sentences in Sepedi and 73k raw sentences in Tshivenda
- [Zabantu-SOT+VEN](https://huggingface.co/dsfsi/zabantu-sot-ven-170m): A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- [Zabantu-BANTU](https://huggingface.co/dsfsi/zabantu-bantu-250m): A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages

## Intended Use

Like any [masked language model (MLM)](https://huggingface.co/docs/transformers/tasks/masked_language_modeling), Zabantu models can be adapted to a variety of semantic tasks, such as:

- **Text Classification/Categorization:** Assigning categories or labels to a whole document, or to sections of a document, based on its content.
- **Sentiment Analysis:** Determining the sentiment of a text, such as whether the expressed opinion is positive, negative, or neutral.
- **Named Entity Recognition (NER):** Identifying and classifying key information (entities) in text into predefined categories such as names of people, organizations, locations, expressions of time, quantities, monetary values, and percentages.
- **Part-of-Speech Tagging (POS):** Assigning a word class (noun, verb, adjective, etc.) to each word, based on both its definition and its context.
- **Semantic Text Similarity:** Measuring how similar two pieces of text are, which is useful in applications such as information retrieval, document clustering, and duplicate detection (see the sketch at the end of this card).

## Performance and Limitations

- **Performance:** The Zabantu models demonstrate promising performance on various NLP tasks, including news topic classification, with competitive results compared to similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).

**Monolingual test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------------|----------|----------------|----------------|---------------|
| nso             | 71.4            | 71.6     | 74.3           | 69.0           | 70.6          |
| ven             | 74.3            | 74.1     | 77.0           | 76.0           | 75.6          |

**Few-shot (50 shots) test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------|----------|----------------|----------------|---------------|
| ven             | 60        | 62       | 66             | 69             | 55            |

- **Limitations:**
  * Although efforts have been made to include a wide range of South African languages, the model's coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and IsiXhosa.
  * We also acknowledge the potential to further improve the models by training on more data, including additional domains and topics.
  * As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.
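The exact training setup behind the scores above is not part of this card. As a rough starting point, here is a minimal sketch of fine-tuning a Zabantu checkpoint for news topic classification with the Hugging Face `Trainer`; the CSV file name, label count, and hyperparameters are illustrative placeholders, not the settings used for the reported results.

```python
# Minimal fine-tuning sketch for news topic classification.
# Assumes a hypothetical CSV with 'text' and 'label' columns; any Zabantu
# checkpoint can be substituted for MODEL_NAME.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "dsfsi/zabantu-bantu-250m"
NUM_LABELS = 5  # placeholder: number of news topic classes in your data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

# 'news_topics.csv' is a placeholder path to your labelled data.
dataset = load_dataset("csv", data_files="news_topics.csv", split="train")
dataset = dataset.train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="zabantu-news-topic",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding
)
trainer.train()
```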
# Training Data

The models have been trained on a corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data covers a wide range of topics and domains, notably religion, politics, academics, and health (mostly Covid-19).
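As a companion to the semantic text similarity use case listed under Intended Use, the sketch below derives sentence vectors by mean-pooling the encoder's last hidden state and compares them with cosine similarity. The checkpoint choice and pooling strategy are illustrative assumptions rather than an officially endorsed recipe; a fine-tuned sentence encoder would normally give stronger similarity scores.

```python
# Semantic text similarity sketch: mean-pool the last hidden state into
# sentence vectors, then compare them with cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dsfsi/zabantu-ven-120m"  # any Zabantu checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(sentences):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state    # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Placeholder inputs: substitute real sentences in any supported language.
embeddings = embed(["first example sentence", "second example sentence"])
score = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {score.item():.3f}")
```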