library_name: transformers
tags: []
Lugha-Llama/Lugha-Llama-8B-wura
Lugha-Llama is an Africa-centric language model developed through continual pretraining on the WURA dataset, a large corpus covering sixteen low-resource African languages and four high-resource languages commonly spoken on the African continent. Using UniMax sampling, we sample as uniformly as possible across languages while limiting how often data is repeated: rare languages are upsampled for at most four epochs. We combine WURA data with high-quality English documents from FineWeb-Edu and OpenWebMath, which yields the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively. On the challenging IrokoBench dataset, our models consistently achieve the best performance among similarly sized baselines. In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.
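The UniMax idea described above can be illustrated with a minimal sketch. It follows the commonly described procedure: visit languages from smallest to largest, give each an equal share of the remaining token budget, and cap that share at a fixed number of epochs over the language's data. The function names, the toy language labels, and the token counts below are hypothetical and chosen only for illustration; this is not the training code used for Lugha-Llama, and the actual implementation may differ in its details.

```python
def unimax_budgets(token_counts, total_budget, max_epochs=4):
    """Allocate a token budget per language, UniMax-style (illustrative sketch).

    token_counts: dict mapping language -> number of available tokens
    total_budget: total number of tokens to sample during training
    max_epochs:   maximum number of passes allowed over any one language
    """
    budgets = {}
    remaining = float(total_budget)
    # Visit languages from the smallest corpus to the largest.
    languages = sorted(token_counts, key=token_counts.get)
    for i, lang in enumerate(languages):
        # Equal share of whatever budget is still unallocated.
        fair_share = remaining / (len(languages) - i)
        # Never repeat a language's data more than max_epochs times.
        cap = token_counts[lang] * max_epochs
        budgets[lang] = min(fair_share, cap)
        remaining -= budgets[lang]
    return budgets


def unimax_probabilities(token_counts, total_budget, max_epochs=4):
    """Normalize the per-language budgets into sampling probabilities."""
    budgets = unimax_budgets(token_counts, total_budget, max_epochs)
    total = sum(budgets.values())
    return {lang: b / total for lang, b in budgets.items()}


# Hypothetical toy corpus sizes (in tokens), not the real WURA statistics.
toy_counts = {"rare": 10, "medium": 100, "large": 1000}
toy_budgets = unimax_budgets(toy_counts, total_budget=600, max_epochs=4)
toy_probs = unimax_probabilities(toy_counts, total_budget=600, max_epochs=4)
```

In this toy setting the rare language hits its four-epoch cap (40 tokens), and the budget it could not absorb is split evenly between the two larger languages (280 tokens each). If the total budget exceeds what all caps allow, the allocated budgets simply sum to less than the requested total.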
We present these findings in our paper, Adapting Large Language Models for African Languages: The Lugha-Llama Model.
Authors: Happy Buzaaba*, Alexander Wettig*, David Ifeoluwa Adelani, Christiane Fellbaum (* equal contribution)
Contact: {happy.buzaaba@, awettig@cs}princeton.edu
- Translated Swahili data (200M tokens): FineWeb_Edu-swahili-translated