license: cc-by-nc-4.0
language:
- ab
- af
- am
- ar
- as
- az
- ba
- be
- bn
- bo
- bs
- br
- bg
- ca
- cs
- cv
- cy
- da
- de
- dv
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- tl
- fi
- fr
- fy
- ga
- gl
- gv
- gn
- gu
- ht
- ha
- he
- hi
- hr
- hu
- hy
- ig
- ia
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- km
- rw
- ky
- ku
- ko
- lo
- la
- lv
- ln
- lt
- lb
- lg
- ml
- mr
- mk
- mg
- mt
- mn
- mi
- ms
- my
- ne
- nl
- nn
- no
- oc
- or
- pa
- pl
- pt
- ps
- ro
- ru
- sa
- si
- sl
- sk
- sn
- sd
- so
- st
- es
- sq
- sc
- sr
- su
- sw
- sv
- ta
- tt
- te
- tg
- th
- tn
- tk
- tr
- tw
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- yo
- zh
## mHuBERT-147 models
mHuBERT-147 are multilingual general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.
This repository contains:
* Fairseq checkpoint (original);
* Hugging Face checkpoint;
* Faiss index for continuous pre-training (`OPQ16_64,IVF1000_HNSW32,PQ16x4fsr`).
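The Faiss index above is identified only by its factory string. As a rough reading aid, the sketch below splits that string into its numeric settings, following the usual Faiss `index_factory` naming conventions (OPQ pre-transform, IVF coarse quantizer backed by HNSW, fast-scan product quantizer). The helper name `describe_factory_string` and the dictionary keys are hypothetical and are not part of Faiss or of this repository; the code does not require Faiss to be installed.

```python
import re


def describe_factory_string(spec: str) -> dict:
    """Split a three-part Faiss factory string into its numeric settings.

    Hypothetical helper for illustration only. It handles strings shaped like
    "OPQ<M>_<D>,IVF<N>_HNSW<L>,PQ<M>x<B>[fs][r]" and nothing else.
    """
    pre, coarse, code = spec.split(",")

    m_pre = re.fullmatch(r"OPQ(\d+)_(\d+)", pre)
    m_coarse = re.fullmatch(r"IVF(\d+)_HNSW(\d+)", coarse)
    m_code = re.fullmatch(r"PQ(\d+)x(\d+)(fs)?(r)?", code)
    if not (m_pre and m_coarse and m_code):
        raise ValueError(f"unsupported factory string: {spec!r}")

    pq_m, pq_bits = int(m_code[1]), int(m_code[2])
    return {
        # OPQ rotation: number of sub-vectors and output dimensionality.
        "opq_subvectors": int(m_pre[1]),
        "opq_output_dim": int(m_pre[2]),
        # Inverted file: number of lists, with an HNSW graph over the centroids.
        "ivf_lists": int(m_coarse[1]),
        "hnsw_links": int(m_coarse[2]),
        # Product quantizer: sub-quantizers x bits per sub-quantizer.
        "pq_subquantizers": pq_m,
        "pq_bits": pq_bits,
        "pq_code_bytes": pq_m * pq_bits // 8,
        # "fs" selects the fast-scan variant, "r" encodes residuals.
        "fast_scan": m_code[3] is not None,
        "by_residual": m_code[4] is not None,
    }


info = describe_factory_string("OPQ16_64,IVF1000_HNSW32,PQ16x4fsr")
for key, value in info.items():
    print(f"{key}: {value}")
```

If Faiss is available, the same string can be passed to `faiss.index_factory(d, spec)` to build an untrained index of this shape, where `d` is the dimensionality of the input features (not stated here).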
# Citing
```
[PAPER GOES HERE]
```
# Other information
**Languages present in the training data but not indexed by Hugging Face:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).
**Datasets:**
* Aishell
* BibleTTS
* ClovaCall
* CommonVoice v11
* Google TTS data
* IISc-MILE
* JVS
* Kokoro
* Kosp2e
* Media Speech
* Multilingual LibriSpeech
* Samrómur
* THCHS-30 and THUYG-20
* VoxLingua107
* VoxPopuli