---
license: mit
tags:
- generated_from_trainer
model-index:
- name: afro-xlmr-large-76L_script
  results: []
language:
- en
- am
- ar
- so
- sw
- pt
- af
- fr
- zu
- mg
- ha
- sn
- arz
- ny
- ig
- xh
- yo
- st
- rw
- tn
- ti
- ts
- om
- run
- nso
- ee
- ln
- tw
- pcm
- gaa
- loz
- lg
- guw
- bem
- efi
- lue
- lua
- toi
- ve
- tum
- tll
- iso
- kqn
- zne
- umb
- mos
- tiv
- lu
- ff
- kwy
- bci
- rnd
- luo
- wal
- ss
- lun
- wo
- nyk
- kj
- ki
- fon
- bm
- cjk
- din
- dyu
- kab
- kam
- kbp
- kr
- kmb
- kg
- nus
- sg
- taq
- tzm
- nqo
---

# afro-xlmr-large-76L_script

AfroXLMR-large was created in two steps: first, the XLM-R-large model was augmented with two missing scripts (N'Ko and Tifinagh); then the expanded model was adapted with masked language modeling (MLM) on 76 languages widely spoken in Africa, including 4 high-resource languages.

### Pre-training corpus

A mix of mC4, Wikipedia and OPUS data.

### Languages

There are 76 languages available:

- English (eng)
- Amharic (amh)
- Arabic (ara)
- Somali (som)
- Kiswahili (swa)
- Portuguese (por)
- Afrikaans (afr)
- French (fra)
- isiZulu (zul)
- Malagasy (mlg)
- Hausa (hau)
- chiShona (sna)
- Egyptian Arabic (arz)
- Chichewa (nya)
- Igbo (ibo)
- isiXhosa (xho)
- Yorùbá (yor)
- Sesotho (sot)
- Kinyarwanda (kin)
- Setswana (tsn)
- Tigrinya (tir)
- Tsonga (tso)
- Oromo (orm)
- Rundi (run)
- Northern Sotho (nso)
- Ewe (ewe)
- Lingala (lin)
- Twi (twi)
- Nigerian Pidgin (pcm)
- Ga (gaa)
- Lozi (loz)
- Luganda (lug)
- Gun (guw)
- Bemba (bem)
- Efik (efi)
- Luvale (lue)
- Luba-Lulua (lua)
- Tonga (toi)
- Tshivenḓa (ven)
- Tumbuka (tum)
- Tetela (tll)
- Isoko (iso)
- Kaonde (kqn)
- Zande (zne)
- Umbundu (umb)
- Mossi (mos)
- Tiv (tiv)
- Luba-Katanga (lub)
- Fula (fuv)
- San Salvador Kongo (kwy)
- Baoulé (bci)
- Ruund (rnd)
- Luo (luo)
- Wolaitta (wal)
- Swazi (ssw)
- Lunda (lun)
- Wolof (wol)
- Nyaneka (nyk)
- Kwanyama (kua)
- Kikuyu (kik)
- Fon (fon)
- Bambara (bam)
- Chokwe (cjk)
- Dinka (dik)
- Dyula (dyu)
- Kabyle (kab)
- Kamba (kam)
- Kabiyè (kbp)
- Kanuri (knc)
- Kimbundu (kmb)
- Kikongo (kon)
- Nuer (nus)
- Sango (sag)
- Tamasheq (taq)
- Tamazight (tzm)
- N'Ko (nqo)

### Acknowledgment

We would like to thank Google Cloud for providing access to TPU v3-8 through free cloud credits. The model was trained using Flax and then converted to PyTorch.

### BibTeX entry and citation info.

```
@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects},
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
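
### Example: querying the MLM head

Because the model was adapted with masked language modeling, one simple way to try it is to mask a word and ask for the most likely fillers. The sketch below is illustrative rather than an official usage recipe. It assumes the checkpoint is published on the Hugging Face Hub as `Davlan/afro-xlmr-large-76L_script` (this card does not state the Hub ID) and that the tokenizer keeps XLM-R's `<mask>` token. The helper names `mask_word` and `top_predictions` are hypothetical.

```python
# Assumed Hub ID; this card does not state it, so adjust it to your checkpoint.
MODEL_ID = "Davlan/afro-xlmr-large-76L_script"
# XLM-R tokenizers use "<mask>"; assumed to be unchanged in the expanded vocabulary.
MASK_TOKEN = "<mask>"


def mask_word(sentence: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(masked_sentence: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    transformers is imported lazily, and the checkpoint is downloaded
    the first time this function is called.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    predictions = unmasker(masked_sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For instance, `top_predictions(mask_word("Habari ya asubuhi", "asubuhi"))` would query the model with `Habari ya <mask>`. The returned tokens and scores depend on the checkpoint, so none are shown here.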