Happyb committed (verified)
Commit 2155474 · 1 Parent(s): 8cb3929

Update README.md

Files changed (1): README.md (+7 -5)
README.md CHANGED
@@ -11,8 +11,12 @@ Lugha-Llama is an Africa-centric language model developed through continual pret
 languages commonly spoken on the African continent.
 
 To train the model, we sample as uniformly as possible across languages, limiting how many times data is repeated and upsampling rare languages by at most four epochs.
-We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
-On the challenging [IrokoBench](https://huggingface.co/collections/masakhane/irokobench-665a21b6d4714ed3f81af3b1) dataset, our models consistently achieve the best performance among similarly sized baselines. In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.
+We combine [WURA data](https://huggingface.co/datasets/castorini/wura) with high-quality English documents from [FineWeb-Edu](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1) and [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math), which results in the improved Lugha-Llama-Edu and Lugha-Llama-Maths models, respectively.
+Our models consistently achieve the best performance among similarly sized baselines.
+
+In a separate ablation experiment, we translate English education documents to Swahili to study whether the performance gains from FineWeb-Edu data are due to its content or to its English source language.
+* Translated Swahili data (200M tokens): [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated)
+
 
 We demonstrate the findings in our paper [Adapting Large Language Models for African Languages:
 The Lugha-Llama Model]()
@@ -22,9 +26,6 @@ Authors: [Happy Buzaaba](https://buzaabah.github.io/)\*, [Alexander Wettig](http
 contact {happy.buzaaba@, awettig@cs}princeton.edu
 
 
-* Translated Swahili data 200M tokens: [FineWeb_Edu-swahili-translated](https://huggingface.co/datasets/princeton-nlp/fineweb_edu-swahili-translated)
-
-
 ## Lugha-Llama models
 
 * [Lugha-Llama/Lugha-Llama-8B-wura](https://huggingface.co/Lugha-Llama/Lugha-Llama-8B-wura)
@@ -35,3 +36,4 @@ contact {happy.buzaaba@, awettig@cs}princeton.edu
 
 
 
+
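To make the sampling rule in the README concrete: near-uniform shares across languages with an at-most-four-epoch repetition cap behaves like a water-filling allocation over corpus sizes. The sketch below is an illustration only, assuming a hypothetical `mixing_weights` helper and made-up corpus sizes; it is not taken from the Lugha-Llama training code.

```python
# Illustrative water-filling allocation for the mixing rule described in
# the README: split a token budget as uniformly as possible across
# languages, but never repeat a language's corpus more than `max_epochs`
# (four) times. All names and numbers here are hypothetical.

def mixing_weights(tokens_per_lang: dict[str, float],
                   budget: float,
                   max_epochs: float = 4.0) -> dict[str, float]:
    """Return the number of tokens to draw per language under an epoch cap."""
    alloc: dict[str, float] = {}
    remaining = budget
    open_langs = set(tokens_per_lang)
    while open_langs:
        share = remaining / len(open_langs)
        # Languages too small to absorb an equal share are capped at
        # max_epochs passes over their data ...
        capped = {l for l in open_langs
                  if tokens_per_lang[l] * max_epochs <= share}
        if not capped:
            # ... and the remaining languages split the leftover budget evenly.
            for l in open_langs:
                alloc[l] = share
            break
        for l in capped:
            alloc[l] = tokens_per_lang[l] * max_epochs
            remaining -= alloc[l]
        open_langs -= capped
    return alloc

# Hypothetical corpus sizes in billions of tokens:
corpus = {"swa": 2.0, "hau": 1.0, "amh": 0.5, "yor": 0.1}
print(mixing_weights(corpus, budget=4.0))
# -> yor is capped at 0.4 (4 epochs x 0.1B); swa, hau, amh get 1.2B each.
```

Under this scheme, high-resource languages are sampled for less than one epoch while the rarest ones are repeated up to the four-epoch cap, matching the "as uniformly as possible" description above.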