Davlan's picture
Update README.md
c59daa4 verified
|
raw
history blame
2.76 kB
metadata
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: afro-xlmr-large-76L_script
    results: []
language:
  - en
  - am
  - ar
  - so
  - sw
  - pt
  - af
  - fr
  - zu
  - mg
  - ha
  - sn
  - arz
  - ny
  - ig
  - xh
  - yo
  - st
  - rw
  - tn
  - ti
  - ts
  - om
  - run
  - nso
  - ee
  - ln
  - tw
  - pcm
  - gaa
  - loz
  - lg
  - guw
  - bem
  - efi
  - lue
  - lua
  - toi
  - ve
  - tum
  - tll
  - iso
  - kqn
  - zne
  - umb
  - mos
  - tiv
  - lu
  - ff
  - kwy
  - bci
  - rnd
  - luo
  - wal
  - ss
  - lun
  - wo
  - nyk
  - kj
  - ki
  - fon
  - bm
  - cjk
  - din
  - dyu
  - kab
  - kam
  - kbp
  - kr
  - kmb
  - kg
  - nus
  - sg
  - taq
  - tzm
  - nqo

afro-xlmr-large-75L_script

AfroXLMR-large was created by first augmenting the XLM-R-large model with missing scripts (N'Ko and Tifinagh), followed by an MLM adaptation of the expanded XLM-R-large model on 76 languages widely spoken in Africa including 4 high-resource languages.

Pre-training corpus

A mix of mC4, Wikipedia and OPUS data

Languages

There are 75 languages available :

  • English (eng)
  • Amharic (amh)
  • Arabic (ara)
  • Somali (som)
  • Kiswahili (swa)
  • Portuguese (por)
  • Afrikaans (afr)
  • French (fra)
  • isiZulu (zul)
  • Malagasy (mlg)
  • Hausa (hau)
  • chiShona (sna)
  • Egyptian Arabic (arz)
  • Chichewa (nya)
  • Igbo (ibo)
  • isiXhosa (xho)
  • Yorùbá (yor)
  • Sesotho (sot)
  • Kinyarwanda (kin)
  • Tigrinya (tir)
  • Tsonga (tso)
  • Oromo (orm)
  • Rundi (run)
  • Northern Sotho (nso)
  • Ewe (ewe)
  • Lingala (lin)
  • Twi (twi)
  • Nigerian Pidgin (pcm)
  • Ga (gaa)
  • Lozi (loz)
  • Luganda (lug)
  • Gun (guw)
  • Bemba (bem)
  • Efik (efi)
  • Luvale (lue)
  • Luba-Lulua (lua)
  • Tonga (toi)
  • Tshivenḓa (ven)
  • Tumbuka (tum)
  • Tetela (tll)
  • Isoko (iso)
  • Kaonde (kqn)
  • Zande (zne)
  • Umbundu (umb)
  • Mossi (mos)
  • Tiv (tiv)
  • Luba-Katanga (lub)
  • Fula (fuv)
  • San Salvador Kongo (kwy)
  • Baoulé (bci)
  • Ruund (rnd)
  • Luo (luo)
  • Wolaitta (wal)
  • Swazi (ssw)
  • Lunda (lun)
  • Wolof (wol)
  • Nyaneka (nyk)
  • Kwanyama (kua)
  • Kikuyu (kik)
  • Fon (fon)
  • Bambara (bam)
  • Chokwe (cjk)
  • Dinka (dik)
  • Dyula (dyu)
  • Kabyle (kab)
  • Kamba (kam)
  • Kabiyè (kbp)
  • Kanuri (knc)
  • Kimbundu (kmb)
  • Kikongo (kon)
  • Nuer (nus)
  • Sango (sag)
  • Tamasheq (taq)
  • Tamazight (tzm)

Acknowledgment

We would like to thank Google Cloud for providing us access to TPU v3-8 through the free cloud credits. Model trained using flax, before converted to pytorch.

BibTeX entry and citation info.

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}