---
license: mit
tags:
- generated_from_trainer
model-index:
- name: afro-xlmr-large-61L
  results: []
language:
- en
- am
- ar
- so
- sw
- pt
- af
- fr
- zu
- mg
- ha
- sn
- arz
- ny
- ig
- xh
- yo
- st
- rw
- tn
- ti
- ts
- om
- run
- nso
- ee
- ln
- tw
- pcm
- gaa
- loz
- lg
- guw
- bem
- efi
- lue
- lua
- toi
- ve
- tum
- tll
- iso
- kqn
- zne
- umb
- mos
- tiv
- lu
- ff
- kwy
- bci
- rnd
- luo
- wal
- ss
- lun
- wo
- nyk
- kj
- ki
- fon
---


# afro-xlmr-large-61L

AfroXLMR-large was created by MLM adaptation of the XLM-R-large model on 61 languages widely spoken in Africa, including 4 high-resource languages.
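
As a quick sanity check, the adapted model can be queried as a masked language model. The sketch below uses the Hugging Face `transformers` fill-mask pipeline; the Hub repository id `Davlan/afro-xlmr-large-61L` and the example sentence are assumptions for illustration, not part of the original card.

```python
# Minimal sketch: masked-token prediction with the adapted model.
# The repository id "Davlan/afro-xlmr-large-61L" is assumed; adjust it
# if the model is hosted under a different name.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="Davlan/afro-xlmr-large-61L")

# XLM-R-based models use "<mask>" as the mask token.
for prediction in unmasker("The capital of Nigeria is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 4))
```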

### Pre-training corpus
A mix of mC4, Wikipedia, and OPUS data.

### Languages

There are 61 languages available:
- English (eng)
- Amharic (amh)
- Arabic (ara) 
- Somali (som)
- Kiswahili (swa)
- Portuguese (por)
- Afrikaans (afr)
- French (fra)
- isiZulu (zul)
- Malagasy (mlg)
- Hausa (hau)
- chiShona (sna)
- Egyptian Arabic (arz)
- Chichewa (nya)
- Igbo (ibo)
- isiXhosa (xho)
- Yorùbá (yor)
- Sesotho (sot)
- Kinyarwanda (kin)
- Tigrinya (tir)
- Tsonga (tso)
- Oromo (orm)
- Rundi (run)
- Northern Sotho (nso)
- Ewe (ewe)
- Lingala (lin)
- Twi (twi)
- Nigerian Pidgin (pcm)
- Ga (gaa)
- Lozi (loz)
- Luganda (lug)
- Gun (guw)
- Bemba (bem)
- Efik (efi)
- Luvale (lue) 
- Luba-Lulua (lua)
- Tonga (toi)
- Tshivenḓa (ven)
- Tumbuka (tum)
- Tetela (tll)
- Isoko (iso)
- Kaonde (kqn)
- Zande (zne)
- Umbundu (umb)
- Mossi (mos)
- Tiv (tiv)
- Luba-Katanga (lub)
- Fula (fuv)
- San Salvador Kongo (kwy)
- Baoulé (bci)
- Ruund (rnd)
- Luo (luo)
- Wolaitta (wal) 
- Swazi (ssw)
- Lunda (lun)
- Wolof (wol)
- Nyaneka (nyk) 
- Kwanyama (kua)
- Kikuyu (kik)
- Fon (fon)


### Acknowledgment
We would like to thank Google Cloud for providing us access to a TPU v3-8 through free cloud credits. The model was trained using Flax and then converted to PyTorch.


### BibTeX entry and citation info
```
@inproceedings{alabi-etal-2022-adapting,
    title = "Adapting Pre-trained Language Models to {A}frican Languages via Multilingual Adaptive Fine-Tuning",
    author = "Alabi, Jesujoba O.  and
      Adelani, David Ifeoluwa  and
      Mosbach, Marius  and
      Klakow, Dietrich",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.382",
    pages = "4336--4349",
    abstract = "Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) {---} fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.",
}

```