metadata

tags:
  - mms
language:
  - ab
  - af
  - ak
  - am
  - ar
  - as
  - av
  - ay
  - az
  - ba
  - bm
  - be
  - bn
  - bi
  - bo
  - sh
  - br
  - bg
  - ca
  - cs
  - ce
  - cv
  - ku
  - cy
  - da
  - de
  - dv
  - dz
  - el
  - en
  - eo
  - et
  - eu
  - ee
  - fo
  - fa
  - fj
  - fi
  - fr
  - fy
  - ff
  - ga
  - gl
  - gn
  - gu
  - zh
  - ht
  - ha
  - he
  - hi
  - hu
  - hy
  - ig
  - ia
  - ms
  - is
  - it
  - jv
  - ja
  - kn
  - ka
  - kk
  - kr
  - km
  - ki
  - rw
  - ky
  - ko
  - kv
  - lo
  - la
  - lv
  - ln
  - lt
  - lb
  - lg
  - mh
  - ml
  - mr
  - mk
  - mg
  - mt
  - mn
  - mi
  - my
  - nl
  - 'no'
  - ne
  - ny
  - oc
  - om
  - or
  - os
  - pa
  - pl
  - pt
  - ps
  - qu
  - qu
  - ro
  - rn
  - ru
  - sg
  - sk
  - sl
  - sm
  - sn
  - sd
  - so
  - es
  - sq
  - su
  - sv
  - sw
  - ta
  - tt
  - te
  - tg
  - tl
  - th
  - ti
  - ts
  - tr
  - uk
  - vi
  - wo
  - xh
  - yo
  - zu
  - za
license: cc-by-nc-4.0
datasets:
  - google/fleurs
metrics:
  - acc

Massively Multilingual Speech (MMS) - Finetuned LID

This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech project. It is based on the Wav2Vec2 architecture and maps raw audio input to a probability distribution over 512 output classes, each representing a language. The checkpoint has 1 billion parameters and was fine-tuned from facebook/mms-1b on 512 languages.

Example

This MMS checkpoint can be used with Transformers to identify the language spoken in an audio clip. It can recognize the 512 languages listed under Supported Languages.

Let's look at a simple example.

First, we install transformers and some other libraries:

pip install torch accelerate torchaudio datasets
pip install --upgrade transformers

Note: To use MMS you need transformers >= 4.30. If version 4.30 is not yet available on PyPI, install transformers from source:

pip install git+https://github.com/huggingface/transformers.git

Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]

Next, we load the model and processor:

from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-512"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

Now we process the audio data and pass it to the model to classify it into a language, just as we usually do for Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:

# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
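The argmax above keeps only the single best class. Since the model scores every language, it can be useful to look at the top few candidates together with their probabilities. The following is a minimal sketch, assuming a 1-D tensor of logits for one utterance (for example `outputs[0]`) and a label mapping such as `model.config.id2label`; the helper `top_k_languages` and the toy values are hypothetical and not real model output.

```python
import torch


def top_k_languages(logits: torch.Tensor, id2label: dict, k: int = 5):
    """Return the k most probable (label, probability) pairs for one utterance."""
    # Softmax turns raw class scores into a probability distribution
    probs = torch.softmax(logits, dim=-1)
    # Never ask for more entries than there are classes
    k = min(k, probs.numel())
    top = torch.topk(probs, k)
    return [
        (id2label[idx], prob)
        for prob, idx in zip(top.values.tolist(), top.indices.tolist())
    ]


# Toy scores and a three-label mapping, made up for illustration
toy_logits = torch.tensor([2.0, 0.5, -1.0])
toy_labels = {0: "eng", 1: "ara", 2: "fra"}
ranking = top_k_languages(toy_logits, toy_labels, k=2)
```

A small gap between the first and second probability suggests the prediction is uncertain, which can happen for very short clips.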

To see all the supported languages of a checkpoint, you can print out the language ids as follows:

model.config.id2label.values()
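To check ahead of time whether specific languages are covered, the label mapping can be turned into a set. A minimal sketch follows; `find_unsupported` and the toy mapping are hypothetical, and in practice one would pass `model.config.id2label` instead.

```python
def find_unsupported(id2label: dict, wanted_codes: list) -> list:
    """Return the codes in `wanted_codes` that are not labels of the checkpoint."""
    supported = set(id2label.values())
    return sorted(code for code in wanted_codes if code not in supported)


# Toy mapping standing in for model.config.id2label
toy_id2label = {0: "eng", 1: "ara", 2: "fra"}
missing = find_unsupported(toy_id2label, ["eng", "deu", "ara"])
```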

For more details about the architecture, please have a look at the official docs.

Supported Languages

This model supports 512 languages, each identified by its ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.

Model details

Developed by: Vineel Pratap et al.
Model type: Multilingual speech language identification (LID) model
Language(s): 512 languages, see supported languages
License: CC-BY-NC 4.0
Num parameters: 1 billion
Audio sampling rate: 16,000 Hz (16 kHz)

Cite as:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
  journal={arXiv},
  year={2023}
}

Additional Links

Blog post
Transformers documentation
Paper
GitHub Repository
Other MMS checkpoints
MMS base checkpoints:
- facebook/mms-1b
- facebook/mms-300m
Official Space