Update README.md

Lugha-Llama is an Africa-centric language model developed through continual pretraining on [WURA data](https://huggingface.co/datasets/castorini/wura), which covers languages commonly spoken on the African continent.

To train the model, we sample as uniformly as possible across languages, limiting the number of times data is repeated and upsampling rare languages by at most four epochs.
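
As a rough sketch of this sampling scheme (the token counts, the `max_epochs` cap name, and how leftover probability mass is reallocated are placeholder assumptions, not the exact recipe from the paper):

```python
# Hedged sketch: near-uniform language sampling with an upsampling cap.
# Token counts below are made-up placeholders, not real WURA statistics.

def sampling_weights(tokens_per_language, max_epochs=4):
    """Aim for a uniform share per language, but cap each language's
    contribution at max_epochs passes over its data; the shortfall is
    redistributed to uncapped languages by the final normalization."""
    n = len(tokens_per_language)
    total = sum(tokens_per_language.values())
    target = total / n  # ideal uniform share per language
    contribution = {
        lang: min(target, max_epochs * toks)
        for lang, toks in tokens_per_language.items()
    }
    budget = sum(contribution.values())
    return {lang: c / budget for lang, c in contribution.items()}

weights = sampling_weights({"swa": 2.0e9, "yor": 0.3e9, "eng": 50.0e9})
print(weights)  # rare languages are upsampled, but by at most 4 epochs
```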

We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
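
A minimal sketch of assembling such a mixture with the `datasets` library; the dataset ids and configs, the shared `text` column, and the mixing probabilities are illustrative assumptions rather than the exact training mixture:

```python
from datasets import load_dataset, interleave_datasets

# Streaming to avoid downloading full corpora.
# Dataset ids, config names, and the "text" field are assumptions.
wura = load_dataset("castorini/wura", "swa", split="train", streaming=True)
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
openwebmath = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

# Keep only a shared text column so the corpora can be interleaved.
sources = [d.select_columns(["text"]) for d in (wura, fineweb_edu, openwebmath)]

# Mix African-language data with high-quality English documents.
mixture = interleave_datasets(
    sources,
    probabilities=[0.8, 0.1, 0.1],  # placeholder mixing ratios
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixture.take(3):
    print(example["text"][:100])
```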

Our models consistently achieve the best performance amongst similarly-sized baselines.

In a separate ablation experiment, we translate English educational documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.

* Translated Swahili data (200M tokens): [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated)
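
A minimal loading sketch for this dataset (the split name and printed fields are assumptions):

```python
from datasets import load_dataset

# Load the 200M-token Swahili translation of FineWeb-Edu documents.
swahili_edu = load_dataset(
    "princeton-nlp/fineweb_edu-swahili-translated",
    split="train",
    streaming=True,
)

for doc in swahili_edu.take(2):
    print(doc)
```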

We demonstrate the findings in our paper [Adapting Large Language Models for African Languages: The Lugha-Llama Model]().

Authors: [Happy Buzaaba](https://buzaabah.github.io/)\*, Alexander Wettig, …

Contact: {happy.buzaaba@, awettig@cs}princeton.edu

## Lugha-Llama models

* [Lugha-Llama/Lugha-Llama-8B-wura](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura)
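
A minimal sketch for loading and prompting the model with Hugging Face Transformers (standard `AutoModelForCausalLM` usage; the prompt and generation settings are illustrative, not recommendations from the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lugha-Llama/Lugha-Llama-8B-wura"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative Swahili prompt; the model is a continually pretrained base LM,
# so prompts are continued rather than answered instruction-style.
prompt = "Habari ya leo ni"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```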