---
license: cc-by-nc-4.0
tags:
- mms
---

# Massively Multilingual Speech (MMS) - Common Crawl Language Models

This repository contains n-gram language models (LMs) trained on Common Crawl data ([Conneau et al. 2020b](https://aclanthology.org/2020.acl-main.747/), [NLLB_Team et al. 2022](https://arxiv.org/abs/2207.04672)) using the [KenLM library](https://github.com/kpu/kenlm).

The LMs for the following languages are not included in this repository (due to the 50GB limit on HuggingFace) and can be downloaded from the links below:

- Mandarin Chinese (Simplified) - [Download LM](https://dl.fbaipublicfiles.com/mms/lms/cmn-script_simplified/char_20gram.bin)
- Japanese - [Download LM](https://dl.fbaipublicfiles.com/mms/lms/jpn/char_20gram.bin)
- Thai - [Download LM](https://dl.fbaipublicfiles.com/mms/lms/tha/char_20gram.bin)
- Cantonese (Traditional) - [Download LM](https://dl.fbaipublicfiles.com/mms/lms/yue-script_traditional/char_20gram.bin)

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

Check out the code in [asr.py](https://huggingface.co/spaces/mms-meta/MMS/blob/main/asr.py), which uses the LMs to decode the output of ASR models.

## Supported Languages

We provide language models for 102 languages. Click the following to toggle the list of all languages supported by this checkpoint, given as [ISO 639-3 codes](https://en.wikipedia.org/wiki/ISO_639-3). You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).
<details>
<summary>Click to toggle</summary>

- afr
- amh
- ara
- asm
- ast
- azj
- bel
- ben
- bos
- bul
- cat
- ceb
- ces
- ckb
- cmn
- cym
- dan
- deu
- ell
- eng
- est
- fas
- fin
- fra
- ful
- gle
- glg
- guj
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kam
- kan
- kat
- kaz
- kea
- khm
- kir
- kor
- lao
- lav
- lin
- lit
- ltz
- lug
- luo
- mal
- mar
- mkd
- mlt
- mon
- mri
- mya
- nld
- nob
- npi
- nso
- nya
- oci
- orm
- ory
- pan
- pol
- por
- pus
- ron
- rus
- slk
- slv
- sna
- snd
- som
- spa
- srp
- swe
- swh
- tam
- tel
- tgk
- tgl
- tha
- tur
- ukr
- umb
- urd
- uzb
- vie
- wol
- xho
- yor
- yue
- zlm
- zul

</details>
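
Four of the codes above (cmn, jpn, tha, yue) have LMs hosted outside this repository, at the download links given in the introduction. The following is a minimal sketch of how one of those files might be fetched and loaded. The URLs are copied from this README; the helper names, the dictionary keys, and the local `lms/` directory layout are assumptions made for illustration, and loading relies on the separately installed `kenlm` Python package.

```python
import urllib.request
from pathlib import Path

# Download links copied from the introduction of this README.
# Keying them by plain ISO 639-3 code is a choice made for this sketch.
EXTERNAL_LM_URLS = {
    "cmn": "https://dl.fbaipublicfiles.com/mms/lms/cmn-script_simplified/char_20gram.bin",
    "jpn": "https://dl.fbaipublicfiles.com/mms/lms/jpn/char_20gram.bin",
    "tha": "https://dl.fbaipublicfiles.com/mms/lms/tha/char_20gram.bin",
    "yue": "https://dl.fbaipublicfiles.com/mms/lms/yue-script_traditional/char_20gram.bin",
}


def external_lm_url(lang: str) -> str:
    """Return the download URL for a language whose LM is hosted externally."""
    try:
        return EXTERNAL_LM_URLS[lang]
    except KeyError:
        raise KeyError(f"{lang!r} has no externally hosted LM listed in this README") from None


def download_lm(lang: str, target_dir: str = "lms") -> Path:
    """Download the LM for `lang` into `target_dir` (hypothetical layout)."""
    url = external_lm_url(lang)
    out_path = Path(target_dir) / lang / "char_20gram.bin"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if not out_path.exists():  # the files are large, so skip repeat downloads
        urllib.request.urlretrieve(url, out_path)
    return out_path


def load_lm(path: Path):
    """Load a binary KenLM model from `path`."""
    import kenlm  # imported lazily so the URL helpers work without kenlm installed

    return kenlm.Model(str(path))
```

With these helpers, `load_lm(download_lm("jpn"))` would return a `kenlm.Model` whose `order` attribute reports the n-gram order. How the resulting model is combined with an ASR decoder is shown in the `asr.py` script linked in the Example section and is not reproduced here.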
## Model details

- **Developed by:** Vineel Pratap et al.
- **Model type:** Multi-Lingual Automatic Speech Recognition model
- **Language(s):** 102 languages, see [supported languages](#supported-languages)
- **License:** CC-BY-NC 4.0 license
- **Num parameters**: 1 billion
- **Audio sampling rate**: 16,000 Hz (16 kHz)
- **Cite as:**

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other **MMS** checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)