+---
+language: multilingual
+thumbnail:
+tags:
+- audio-classification
+- speechbrain
+- embeddings
+- Language
+- Identification
+- pytorch
+- ECAPA-TDNN
+- TDNN
+- VoxLingua107
+license: "apache-2.0"
+datasets:
+- VoxLingua107
+metrics:
+- Accuracy
+widget:
+- label: English Sample
+  src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
+---
+# VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
+## Model description
+This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
+The model uses the ECAPA-TDNN architecture that has previously been used for speaker recognition. However, it uses
+more fully connected hidden layers after the embedding layer, and cross-entropy loss was used for training.
+We observed that this improved the performance of extracted utterance embeddings for downstream tasks.
+The model can classify a speech utterance according to the language spoken.
+It covers 107 different languages (
+Abkhazian,
+Afrikaans,
+Amharic,
+Arabic,
+Assamese,
+Azerbaijani,
+Bashkir,
+Belarusian,
+Bulgarian,
+Bengali,
+Tibetan,
+Breton,
+Bosnian,
+Catalan,
+Cebuano,
+Czech,
+Welsh,
+Danish,
+German,
+Greek,
+English,
+Esperanto,
+Spanish,
+Estonian,
+Basque,
+Persian,
+Finnish,
+Faroese,
+French,
+Galician,
+Guarani,
+Gujarati,
+Manx,
+Hausa,
+Hawaiian,
+Hindi,
+Croatian,
+Haitian,
+Hungarian,
+Armenian,
+Interlingua,
+Indonesian,
+Icelandic,
+Italian,
+Hebrew,
+Japanese,
+Javanese,
+Georgian,
+Kazakh,
+Central Khmer,
+Kannada,
+Korean,
+Latin,
+Luxembourgish,
+Lingala,
+Lao,
+Lithuanian,
+Latvian,
+Malagasy,
+Maori,
+Macedonian,
+Malayalam,
+Mongolian,
+Marathi,
+Malay,
+Maltese,
+Burmese,
+Nepali,
+Dutch,
+Norwegian Nynorsk,
+Norwegian,
+Occitan,
+Panjabi,
+Polish,
+Pushto,
+Portuguese,
+Romanian,
+Russian,
+Sanskrit,
+Scots,
+Sindhi,
+Sinhala,
+Slovak,
+Slovenian,
+Shona,
+Somali,
+Albanian,
+Serbian,
+Sundanese,
+Swedish,
+Swahili,
+Tamil,
+Telugu,
+Tajik,
+Thai,
+Turkmen,
+Tagalog,
+Turkish,
+Tatar,
+Ukrainian,
+Urdu,
+Uzbek,
+Vietnamese,
+Waray,
+Yiddish,
+Yoruba,
+Mandarin Chinese).
+## Intended uses & limitations
+The model has two uses:
+  - use 'as is' for spoken language recognition
+  - use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data
+The model is trained on automatically collected YouTube data. For more
+information about the dataset, see [here](http://bark.phon.ioc.ee/voxlingua107/).
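The second use, building a dedicated language ID model on top of the utterance embeddings, could look roughly like the sketch below. It is not the authors' method: it swaps in a simple nearest-centroid classifier with cosine similarity, and the helper names (`l2_normalize`, `fit_centroids`, `predict`), the 3-dimensional toy vectors, and the labels are all made up for illustration. In practice the vectors would be the 256-dimensional embeddings produced by the model, as in the usage example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(embeddings, labels):
    """Average the unit-length embeddings of each label into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        unit = l2_normalize(vec)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
            counts[label] = 0
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def predict(centroids, embedding):
    """Return the label whose centroid has the highest cosine similarity."""
    unit = l2_normalize(embedding)
    return max(
        centroids,
        key=lambda label: sum(c * u for c, u in zip(centroids[label], unit)),
    )


# Toy data: hypothetical 3-dimensional "embeddings" for two languages.
train_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.8, 0.3],
]
train_labels = ["et", "et", "fi", "fi"]

centroids = fit_centroids(train_embeddings, train_labels)
predicted = predict(centroids, [0.95, 0.15, 0.05])
print(predicted)
```

A nearest-centroid rule is only a baseline; with enough labeled data, a logistic regression or a small neural network trained on the same embeddings is a common alternative.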
+#### How to use
+```python
+import torchaudio
+from speechbrain.pretrained import EncoderClassifier
+language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn-ce", savedir="tmp")
+# Download a Thai language sample from Omniglot and convert it to a suitable form
+signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_th.mp3")
+prediction = language_id.classify_batch(signal)
+print(prediction)
+  (tensor([[-2.8646e+01, -3.0346e+01, -2.0748e+01, -2.9562e+01, -2.2187e+01,
+         -3.2668e+01, -3.6677e+01, -3.3573e+01, -3.2545e+01, -2.4365e+01,
+         -2.4688e+01, -3.1171e+01, -2.7743e+01, -2.9918e+01, -2.4770e+01,
+         -3.2250e+01, -2.4727e+01, -2.6087e+01, -2.1870e+01, -3.2821e+01,
+         -2.2128e+01, -2.2822e+01, -3.0888e+01, -3.3564e+01, -2.9906e+01,
+         -2.2392e+01, -2.5573e+01, -2.6443e+01, -3.2429e+01, -3.2652e+01,
+         -3.0030e+01, -2.4607e+01, -2.2967e+01, -2.4396e+01, -2.8578e+01,
+         -2.5153e+01, -2.8475e+01, -2.6409e+01, -2.5230e+01, -2.7957e+01,
+         -2.6298e+01, -2.3609e+01, -2.5863e+01, -2.8225e+01, -2.7225e+01,
+         -3.0486e+01, -2.1185e+01, -2.7938e+01, -3.3155e+01, -1.9076e+01,
+         -2.9181e+01, -2.2160e+01, -1.8352e+01, -2.5866e+01, -3.3636e+01,
+         -4.2016e+00, -3.1581e+01, -3.1894e+01, -2.7834e+01, -2.5429e+01,
+         -3.2235e+01, -3.2280e+01, -2.8786e+01, -2.3366e+01, -2.6047e+01,
+         -2.2075e+01, -2.3770e+01, -2.2518e+01, -2.8101e+01, -2.5745e+01,
+         -2.6441e+01, -2.9822e+01, -2.7109e+01, -3.0225e+01, -2.4566e+01,
+         -2.9268e+01, -2.7651e+01, -3.4221e+01, -2.9026e+01, -2.6009e+01,
+         -3.1968e+01, -3.1747e+01, -2.8156e+01, -2.9025e+01, -2.7756e+01,
+         -2.8052e+01, -2.9341e+01, -2.8806e+01, -2.1636e+01, -2.3992e+01,
+         -2.3794e+01, -3.3743e+01, -2.8332e+01, -2.7465e+01, -1.5085e-02,
+         -2.9094e+01, -2.1444e+01, -2.9780e+01, -3.6046e+01, -3.7401e+01,
+         -3.0888e+01, -3.3172e+01, -1.8931e+01, -2.2679e+01, -3.0225e+01,
+         -2.4995e+01, -2.1028e+01]]), tensor([-0.0151]), tensor([94]), ['th'])
+# The scores in the prediction[0] tensor can be interpreted as log-likelihoods that
+# the given utterance belongs to the given language (i.e., the larger the better)
+# The linear-scale likelihood can be retrieved using the following:
+print(prediction[1].exp())
+  tensor([0.9850])
+# The identified language ISO code is given in prediction[3]
+print(prediction[3])
+  ['th']
+# Alternatively, use the utterance embedding extractor:
+emb = language_id.encode_batch(signal)
+print(emb.shape)
+  torch.Size([1, 1, 256])
+```
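The score handling described in the comments above can be sketched without loading the model. The helper `top_k_languages` below is hypothetical (it is not part of SpeechBrain), and it assumes the scores have already been copied out of `prediction[0]` into a plain Python list along with a matching list of language codes. Two of the toy scores are taken from the example output above; the third score and the `lang_a`/`lang_b` labels are placeholders.

```python
import math


def top_k_languages(log_likelihoods, labels, k=3):
    """Return the k best (label, linear-scale likelihood) pairs, best first."""
    if len(log_likelihoods) != len(labels):
        raise ValueError("need exactly one label per score")
    ranked = sorted(
        zip(labels, log_likelihoods), key=lambda pair: pair[1], reverse=True
    )
    # exp() maps each log-likelihood back to the linear scale.
    return [(label, math.exp(score)) for label, score in ranked[:k]]


# Toy input: "th" uses the best score from the example output (-1.5085e-02).
scores = [-4.2016, -0.015085, -18.352]
labels = ["lang_a", "th", "lang_b"]

top = top_k_languages(scores, labels, k=2)
for label, likelihood in top:
    print(f"{label}: {likelihood:.4f}")
```

Looking at more than the single best score can help when deciding whether a prediction is confident enough to act on.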
+#### Limitations and bias
+Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
+ - Its accuracy on smaller languages is probably quite limited
+ - It probably works worse on female speech than on male speech (because YouTube data includes much more male speech)
+ - Based on subjective experiments, it doesn't work well on speech with a foreign accent
+ - It probably doesn't work well on children's speech or on speech from persons with speech disorders
+## Training data
+The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
+VoxLingua107 is a speech dataset for training spoken language identification models.
+The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
+VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
+The average amount of data per language is 62 hours. However, the real amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.
+## Training procedure
+We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
+The training recipe will be published soon.
+## Evaluation results
+Error rate: 7% on the development dataset
+### BibTeX entry and citation info
+```bibtex
+@inproceedings{valk2021slt,
+  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
+  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
+  booktitle={Proc. IEEE SLT Workshop},
+  year={2021},
+}
+```