---
license: cc-by-nc-4.0
language:
- ab
- af
- am
- ar
- as
- az
- ba
- be
- bn
- bo
- bs
- br
- bg
- ca
- cs
- cv
- cy
- da
- de
- dv
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- tl
- fi
- fr
- fy
- ga
- gl
- gv
- gn
- gu
- ht
- ha
- he
- hi
- hr
- hu
- hy
- ig
- ia
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- km
- rw
- ky
- ku
- ko
- lo
- la
- lv
- ln
- lt
- lb
- lg
- ml
- mr
- mk
- mg
- mt
- mn
- mi
- ms
- my
- ne
- nl
- nn
- no
- oc
- or
- pa
- pl
- pt
- ps
- ro
- ru
- sa
- si
- sl
- sk
- sn
- sd
- so
- st
- es
- sq
- sc
- sr
- su
- sw
- sv
- ta
- tt
- te
- tg
- th
- tn
- tk
- tr
- tw
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- yo
- zh
---

## mHuBERT-147 models

The mHuBERT-147 models are multilingual, general-purpose HuBERT models trained on 90K hours of open-license data in 147 languages.

This repository contains:

* Fairseq checkpoint (original);
* HuggingFace checkpoint;
* Faiss index for continuous pre-training (OPQ16_64,IVF1000_HNSW32,PQ16x4fsr).

# Citing

```
[PAPER GOES HERE]
```

# Other information

**Languages present but not indexed by HuggingFace:** Asturian (ast), Basaa (bas), Cebuano (ceb), Central Kurdish/Sorani (ckb), Hakha Chin (cnh), Hawaiian (haw), Upper Sorbian (hsb), Kabyle (kab), Moksha (mdf), Meadow Mari (mhr), Hill Mari (mrj), Erzya (myv), Taiwanese Hokkien (nan-tw), Sursilvan (rm-sursilv), Vallader (rm-vallader), Sakha (sah), Santali (sat), Scots (sco), Saraiki (skr), Tigre (tig), Tok Pisin (tpi), Akuapem Twi (tw-akuapem), Asante Twi (tw-asante), Votic (vot), Waray (war), Cantonese (yue).

**Datasets:**

* Aishell
* BibleTTS
* ClovaCall
* CommonVoice v11
* Google TTS data
* IISc-MILE
* JVS
* Kokoro
* Kosp2e
* Media Speech
* Multilingual LibriSpeech
* Samrómur
* THCHS-30 and THUYG-20
* VoxLingua107
* VoxPopuli
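
**Faiss index settings:** The factory string given for this repository's Faiss index, `OPQ16_64,IVF1000_HNSW32,PQ16x4fsr`, packs several settings into one line. The sketch below spells them out using only the Python standard library. `describe_factory` is a hypothetical helper written for this illustration, not part of Faiss, and it only understands this three-part pattern. The field names and their meanings follow the usual Faiss index-factory conventions as an interpretation of the string; they are not taken from the authors' training code.

```python
import re

# Factory string quoted in this README for the Faiss index.
FACTORY = "OPQ16_64,IVF1000_HNSW32,PQ16x4fsr"


def describe_factory(factory: str) -> dict:
    """Split a factory string of the shape used above into named settings.

    Hypothetical helper for illustration only; other factory strings
    are rejected with a ValueError.
    """
    pattern = re.compile(
        r"OPQ(?P<opq_m>\d+)_(?P<opq_dim>\d+),"      # OPQ rotation: blocks, output dim
        r"IVF(?P<nlist>\d+)_HNSW(?P<hnsw_m>\d+),"   # inverted lists, HNSW coarse quantizer
        r"PQ(?P<pq_m>\d+)x(?P<pq_bits>\d+)(?P<suffix>fsr|fs)?"  # product quantizer
    )
    match = pattern.fullmatch(factory)
    if match is None:
        raise ValueError(f"unsupported factory string: {factory!r}")

    fields = match.groupdict()
    suffix = fields.pop("suffix") or ""
    settings = {name: int(value) for name, value in fields.items()}
    settings["fast_scan"] = suffix.startswith("fs")   # "fs": fast-scan PQ variant
    settings["by_residual"] = suffix.endswith("r")    # "r": encode residuals
    # Size of one compressed vector: sub-quantizers x bits per code, in bytes.
    settings["code_bytes"] = settings["pq_m"] * settings["pq_bits"] // 8
    return settings


if __name__ == "__main__":
    for name, value in describe_factory(FACTORY).items():
        print(f"{name}: {value}")
```

Read this way, the index reduces features to 64 dimensions with a 16-block OPQ transform, assigns them to 1000 inverted lists through an HNSW coarse quantizer, and stores each vector as 16 four-bit codes (8 bytes). With Faiss installed, the same string can be passed to `faiss.index_factory(d, FACTORY)`, where `d` is the input feature dimensionality, which this README does not state.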