TanelAlumae committed on
Commit
5665f33
1 Parent(s): ecba8fc

Create README.md

  1. README.md +240 -0
README.md ADDED
@@ -0,0 +1,240 @@
+ ---
+ language: multilingual
+ thumbnail:
+ tags:
+ - audio-classification
+ - speechbrain
+ - embeddings
+ - Language
+ - Identification
+ - pytorch
+ - ECAPA-TDNN
+ - TDNN
+ - VoxLingua107
+ license: "apache-2.0"
+ datasets:
+ - VoxLingua107
+ metrics:
+ - Accuracy
+ widget:
+ - label: English Sample
+ src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
+ ---
+
+ # VoxLingua107 ECAPA-TDNN Spoken Language Identification Model
+
+ ## Model description
+
+ This is a spoken language recognition model trained on the VoxLingua107 dataset using SpeechBrain.
+ The model uses the ECAPA-TDNN architecture, which has previously been used for speaker recognition. However, it uses
+ more fully connected hidden layers after the embedding layer and was trained with cross-entropy loss.
+ We observed that this improved the performance of the extracted utterance embeddings on downstream tasks.
+
+ The model can classify a speech utterance according to the language spoken.
+ It covers 107 different languages (
+ Abkhazian,
+ Afrikaans,
+ Amharic,
+ Arabic,
+ Assamese,
+ Azerbaijani,
+ Bashkir,
+ Belarusian,
+ Bulgarian,
+ Bengali,
+ Tibetan,
+ Breton,
+ Bosnian,
+ Catalan,
+ Cebuano,
+ Czech,
+ Welsh,
+ Danish,
+ German,
+ Greek,
+ English,
+ Esperanto,
+ Spanish,
+ Estonian,
+ Basque,
+ Persian,
+ Finnish,
+ Faroese,
+ French,
+ Galician,
+ Guarani,
+ Gujarati,
+ Manx,
+ Hausa,
+ Hawaiian,
+ Hindi,
+ Croatian,
+ Haitian,
+ Hungarian,
+ Armenian,
+ Interlingua,
+ Indonesian,
+ Icelandic,
+ Italian,
+ Hebrew,
+ Japanese,
+ Javanese,
+ Georgian,
+ Kazakh,
+ Central Khmer,
+ Kannada,
+ Korean,
+ Latin,
+ Luxembourgish,
+ Lingala,
+ Lao,
+ Lithuanian,
+ Latvian,
+ Malagasy,
+ Maori,
+ Macedonian,
+ Malayalam,
+ Mongolian,
+ Marathi,
+ Malay,
+ Maltese,
+ Burmese,
+ Nepali,
+ Dutch,
+ Norwegian Nynorsk,
+ Norwegian,
+ Occitan,
+ Panjabi,
+ Polish,
+ Pushto,
+ Portuguese,
+ Romanian,
+ Russian,
+ Sanskrit,
+ Scots,
+ Sindhi,
+ Sinhala,
+ Slovak,
+ Slovenian,
+ Shona,
+ Somali,
+ Albanian,
+ Serbian,
+ Sundanese,
+ Swedish,
+ Swahili,
+ Tamil,
+ Telugu,
+ Tajik,
+ Thai,
+ Turkmen,
+ Tagalog,
+ Turkish,
+ Tatar,
+ Ukrainian,
+ Urdu,
+ Uzbek,
+ Vietnamese,
+ Waray,
+ Yiddish,
+ Yoruba,
+ Mandarin Chinese).
+
+ ## Intended uses & limitations
+
+ The model has two uses:
+
+ - use 'as is' for spoken language recognition
+ - use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data
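
The second use can be pictured with a small sketch. This is only one possible approach and not the authors' method: a nearest-centroid classifier with cosine similarity, built on utterance embeddings that are assumed to be already extracted. The helper names (`l2_normalize`, `fit_centroids`, `predict`) and the toy 3-dimensional vectors are hypothetical; real embeddings from this model are 256-dimensional.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(embeddings, labels):
    """Average the normalized embeddings of each language into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        unit = l2_normalize(vec)
        if label not in sums:
            sums[label] = [0.0] * len(unit)
        sums[label] = [s + u for s, u in zip(sums[label], unit)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def predict(centroids, embedding):
    """Return the label whose centroid is most similar (cosine) to the embedding."""
    unit = l2_normalize(embedding)
    return max(
        centroids,
        key=lambda label: sum(c * u for c, u in zip(centroids[label], unit)),
    )


# Toy vectors standing in for real utterance embeddings (hypothetical data).
train_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.8, 0.3]]
train_labels = ["et", "et", "fi", "fi"]

centroids = fit_centroids(train_embeddings, train_labels)
print(predict(centroids, [0.95, 0.15, 0.05]))
print(predict(centroids, [0.05, 0.9, 0.25]))
```

A trained classifier (for example logistic regression) on the same embeddings would be a natural next step; the centroid version is used here only because it needs no extra dependencies.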
+
+ The model is trained on automatically collected YouTube data. For more
+ information about the dataset, see the [VoxLingua107 page](http://bark.phon.ioc.ee/voxlingua107/).
+
+
+ #### How to use
+
+ ```python
+ import torchaudio
+ from speechbrain.pretrained import EncoderClassifier
+ language_id = EncoderClassifier.from_hparams(source="TalTechNLP/voxlingua107-epaca-tdnn-ce", savedir="tmp")
+ # Download a Thai language sample from Omniglot and convert it to a suitable form
+ signal = language_id.load_audio("https://omniglot.com/soundfiles/udhr/udhr_th.mp3")
+ prediction = language_id.classify_batch(signal)
+ print(prediction)
+ # (tensor([[-2.8646e+01, -3.0346e+01, -2.0748e+01, -2.9562e+01, -2.2187e+01,
+ #          -3.2668e+01, -3.6677e+01, -3.3573e+01, -3.2545e+01, -2.4365e+01,
+ #          -2.4688e+01, -3.1171e+01, -2.7743e+01, -2.9918e+01, -2.4770e+01,
+ #          -3.2250e+01, -2.4727e+01, -2.6087e+01, -2.1870e+01, -3.2821e+01,
+ #          -2.2128e+01, -2.2822e+01, -3.0888e+01, -3.3564e+01, -2.9906e+01,
+ #          -2.2392e+01, -2.5573e+01, -2.6443e+01, -3.2429e+01, -3.2652e+01,
+ #          -3.0030e+01, -2.4607e+01, -2.2967e+01, -2.4396e+01, -2.8578e+01,
+ #          -2.5153e+01, -2.8475e+01, -2.6409e+01, -2.5230e+01, -2.7957e+01,
+ #          -2.6298e+01, -2.3609e+01, -2.5863e+01, -2.8225e+01, -2.7225e+01,
+ #          -3.0486e+01, -2.1185e+01, -2.7938e+01, -3.3155e+01, -1.9076e+01,
+ #          -2.9181e+01, -2.2160e+01, -1.8352e+01, -2.5866e+01, -3.3636e+01,
+ #          -4.2016e+00, -3.1581e+01, -3.1894e+01, -2.7834e+01, -2.5429e+01,
+ #          -3.2235e+01, -3.2280e+01, -2.8786e+01, -2.3366e+01, -2.6047e+01,
+ #          -2.2075e+01, -2.3770e+01, -2.2518e+01, -2.8101e+01, -2.5745e+01,
+ #          -2.6441e+01, -2.9822e+01, -2.7109e+01, -3.0225e+01, -2.4566e+01,
+ #          -2.9268e+01, -2.7651e+01, -3.4221e+01, -2.9026e+01, -2.6009e+01,
+ #          -3.1968e+01, -3.1747e+01, -2.8156e+01, -2.9025e+01, -2.7756e+01,
+ #          -2.8052e+01, -2.9341e+01, -2.8806e+01, -2.1636e+01, -2.3992e+01,
+ #          -2.3794e+01, -3.3743e+01, -2.8332e+01, -2.7465e+01, -1.5085e-02,
+ #          -2.9094e+01, -2.1444e+01, -2.9780e+01, -3.6046e+01, -3.7401e+01,
+ #          -3.0888e+01, -3.3172e+01, -1.8931e+01, -2.2679e+01, -3.0225e+01,
+ #          -2.4995e+01, -2.1028e+01]]), tensor([-0.0151]), tensor([94]), ['th'])
+ # The scores in the prediction[0] tensor can be interpreted as log-likelihoods that
+ # the given utterance belongs to the given language (i.e., the larger the better).
+ # The linear-scale likelihood can be retrieved using the following:
+ print(prediction[1].exp())
+ # tensor([0.9850])
+ # The identified language ISO code is given in prediction[3]
+ print(prediction[3])
+ # ['th']
+
+ # Alternatively, use the utterance embedding extractor:
+ emb = language_id.encode_batch(signal)
+ print(emb.shape)
+ # torch.Size([1, 1, 256])
+ ```
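
When more than the single best guess is needed, the per-language scores can be ranked. The following is a minimal sketch, assuming the scores have already been copied into a plain Python list and that a matching list of language codes is available. The helper `top_k_languages`, the four-element score list and the placeholder codes `xx`, `yy` and `zz` are hypothetical; only the Thai score is taken from the output above.

```python
import math


def top_k_languages(log_scores, labels, k=3):
    """Return the k best (label, linear-scale score) pairs, best first."""
    if len(log_scores) != len(labels):
        raise ValueError("log_scores and labels must have the same length")
    ranked = sorted(zip(labels, log_scores), key=lambda pair: pair[1], reverse=True)
    return [(label, math.exp(score)) for label, score in ranked[:k]]


# Shortened, hypothetical score vector; a real one has 107 entries.
scores = [-28.646, -4.2016, -0.015085, -18.352]
labels = ["xx", "yy", "th", "zz"]  # placeholder codes, except "th"

best = top_k_languages(scores, labels, k=2)
for label, likelihood in best:
    print(f"{label}: {likelihood:.4f}")
```

The actual label order is whatever the model's label encoder defines, so the codes should be taken from the loaded model rather than hard-coded.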
+
+ #### Limitations and bias
+
+ Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:
+
+ - Its accuracy on smaller languages is probably quite limited
+ - It probably works worse on female speech than on male speech (because the YouTube data includes much more male speech)
+ - Based on subjective experiments, it doesn't work well on speech with a foreign accent
+ - It probably doesn't work well on children's speech or on speech from persons with speech disorders
+
+
+ ## Training data
+
+ The model is trained on [VoxLingua107](http://bark.phon.ioc.ee/voxlingua107/).
+
+ VoxLingua107 is a speech dataset for training spoken language identification models.
+ The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.
+
+ VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
+ The average amount of data per language is 62 hours. However, the actual amount per language varies widely. There is also a separate development set containing 1609 speech segments from 33 languages, each validated by at least two volunteers as actually containing the given language.
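
As a quick check of the figures above, the averages can be recomputed from the stated totals. This is plain arithmetic on numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted in the dataset description.
total_train_hours = 6628
num_languages = 107
dev_segments = 1609
dev_languages = 33

# Mean amount of training speech per language, in hours.
mean_hours_per_language = total_train_hours / num_languages

# Mean number of development segments per covered language.
mean_dev_segments = dev_segments / dev_languages

# Share of the 107 languages that the development set covers.
dev_coverage = dev_languages / num_languages

print(f"{mean_hours_per_language:.1f} hours per language on average")
print(f"{mean_dev_segments:.1f} development segments per covered language")
print(f"{dev_coverage:.0%} of languages have development data")
```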
+
+ ## Training procedure
+
+ We used [SpeechBrain](https://github.com/speechbrain/speechbrain) to train the model.
+ The training recipe will be published soon.
+
+ ## Evaluation results
+
+ Error rate: 7% on the development dataset
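
The README does not spell out how the error rate is computed. A minimal sketch, assuming it is the fraction of development segments whose top-1 predicted language differs from the validated label, could look like this. The `error_rate` helper and the five-item toy lists are hypothetical.

```python
def error_rate(predicted, reference):
    """Fraction of items whose predicted label differs from the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute an error rate on an empty set")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Toy example with one mistake in five segments (hypothetical data).
predicted = ["th", "et", "fi", "en", "et"]
reference = ["th", "et", "et", "en", "et"]
rate = error_rate(predicted, reference)
print(f"{rate:.0%}")

# Under the same assumption, a rounded 7% on 1609 segments corresponds to
# roughly this many misclassified segments.
approx_errors = round(0.07 * 1609)
print(approx_errors)
```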
+
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{valk2021slt,
+ title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
+ author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
+ booktitle={Proc. IEEE SLT Workshop},
+ year={2021},
+ }
+ ```