Update README.md
README.md
CHANGED
@@ -8,7 +8,9 @@ tags: []
<!-- Provide a quick summary of what the model is/does. -->

Lugha-Llama is an Africa-centric language model developed through continual pretraining on the [WURA dataset](https://huggingface.co/datasets/castorini/wura), a large corpus of African languages that consists of sixteen low-resource African languages and four high-resource languages commonly spoken on the African continent.
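As a quick-start illustration, a minimal `transformers` loading sketch is shown below. The repository id and the Swahili prompt are placeholders for illustration only and are not taken from this card.

```python
# Quick-start sketch; the repository id below is an assumption, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lugha-Llama/Lugha-Llama-8B-wura"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Swahili prompt meaning roughly "The history of Africa is".
inputs = tokenizer("Historia ya Afrika ni", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```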
To train the model, we sample as uniformly as possible across languages while limiting the number of times data is repeated, upsampling rare languages by at most four epochs.
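The sketch below illustrates this sampling idea. It is a simplified, UniMax-style approximation of "as uniform as possible with a four-epoch cap", not the exact recipe used for Lugha-Llama, and the token counts are made up.

```python
# UniMax-style illustration (an assumption, not the exact Lugha-Llama recipe):
# give every language an equal share of the token budget, but never repeat a
# language's data more than max_epochs (four) times.
def budget_per_language(tokens_per_language, total_budget, max_epochs=4):
    allocation = {}
    remaining = total_budget
    langs = sorted(tokens_per_language, key=tokens_per_language.get)  # smallest first
    for i, lang in enumerate(langs):
        equal_share = remaining / (len(langs) - i)
        allocation[lang] = min(tokens_per_language[lang] * max_epochs, equal_share)
        remaining -= allocation[lang]
    return allocation

# Made-up token counts in billions, for illustration only.
corpus = {"yor": 0.15, "hau": 0.6, "swa": 1.2, "eng": 20.0}
print(budget_per_language(corpus, total_budget=10.0))
# Rare languages are capped at four epochs; the remaining budget is split evenly.
```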
We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
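One way such a mixture could be assembled with the Hugging Face `datasets` library is sketched below; the language config, column name, and 90/10 mixing ratio are illustrative assumptions rather than the values used to train Lugha-Llama-Edu.

```python
# Data-mixing sketch; dataset configs, column names, and the 90/10 ratio are assumptions.
from datasets import load_dataset, interleave_datasets

# Stream one WURA language and the English FineWeb-Edu corpus.
wura = load_dataset("castorini/wura", "swa", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# Keep only the text field (assumed to be named "text" in both corpora).
wura = wura.select_columns(["text"])
fineweb_edu = fineweb_edu.select_columns(["text"])

# Interleave the two streams; the mixing probabilities here are only an example.
mixture = interleave_datasets(
    [wura, fineweb_edu],
    probabilities=[0.9, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixture.take(3):
    print(example["text"][:80])
```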
On the challenging [IrokoBench](https://huggingface.co/collections/masakhane/irokobench-665a21b6d4714ed3f81af3b1) dataset, our models consistently achieve the best performance among similarly sized baselines. In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.