Happyb committed on
Commit 8cb3929 · verified · 1 Parent(s): e11673b

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -8,7 +8,9 @@ tags: []
 <!-- Provide a quick summary of what the model is/does. -->
 
 Lugha-Llama is an Africa-centric language model developed through continual pretraining on the [WURA dataset](https://huggingface.co/datasets/castorini/wura), a large African-language corpus that consists of sixteen low-resource African languages and four high-resource
-languages commonly spoken on the African continent. Using [UniMax sampling](https://openreview.net/forum?id=kXwdL1cWOAi), we sample as uniformly as possible across languages while limiting the number of times data is repeated, upsampling rare languages by at most four epochs.
+languages commonly spoken on the African continent.
+
+To train the model, we sample as uniformly as possible across languages while limiting the number of times data is repeated, upsampling rare languages by at most four epochs.
 We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
 On the challenging [IrokoBench](https://huggingface.co/collections/masakhane/irokobench-665a21b6d4714ed3f81af3b1) dataset, our models consistently achieve the best performance amongst similarly-sized baselines. In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or its English source language.
 
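
For readers unfamiliar with the sampling scheme mentioned in the removed line, the sketch below illustrates a UniMax-style token-budget allocation: the budget is shared as uniformly as possible across languages, but no language is repeated for more than a fixed number of epochs (four, per the README). This is a minimal illustrative sketch, not the authors' training code; the function name `unimax_allocation`, the toy language sizes, and the token budget are all assumptions.

```python
"""Hedged sketch of UniMax-style sampling, assuming per-language corpus
sizes in tokens and a fixed total training budget."""


def unimax_allocation(corpus_sizes: dict[str, int],
                      total_budget: int,
                      max_epochs: int = 4) -> dict[str, int]:
    """Allocate a token budget as uniformly as possible across languages,
    capping any language at max_epochs passes over its corpus."""
    allocation: dict[str, int] = {}
    # Visit the smallest corpora first: they are the ones that hit the epoch cap.
    remaining = sorted(corpus_sizes, key=corpus_sizes.get)
    budget_left = total_budget
    while remaining:
        lang = remaining.pop(0)
        # Uniform share of what is left, split over the languages not yet allocated.
        uniform_share = budget_left / (len(remaining) + 1)
        # Never exceed max_epochs repetitions of this language's corpus.
        allocation[lang] = int(min(uniform_share, max_epochs * corpus_sizes[lang]))
        budget_left -= allocation[lang]
    return allocation


if __name__ == "__main__":
    # Toy sizes (tokens); the real WURA language sizes differ.
    sizes = {"swa": 5_000_000_000, "hau": 1_000_000_000, "lin": 50_000_000}
    print(unimax_allocation(sizes, total_budget=6_000_000_000))
```

In this toy run the smallest language is capped at four epochs of its 50M-token corpus (200M tokens), and the leftover budget is split evenly between the two larger languages, which mirrors the "as uniform as possible, with a repetition cap" behaviour described in the diff.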