---
language: en
datasets:
- common_voice
- voxpopuli
multilinguality:
- multilingual
tags:
- speech
license: apache-2.0
---

# M-CTC-T

Massively multilingual speech recognizer from Meta AI. The model is a 1B-parameter transformer encoder with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It is first trained on Common Voice (version 6.1, December 2020 release) together with VoxPopuli, and then fine-tuned on Common Voice only. The labels are unnormalized character-level transcripts (punctuation and capitalization are kept). The model takes as input Mel filterbank features computed from a 16 kHz audio signal.
![model image](https://raw.githubusercontent.com/cwkeam/scientific-images/main/MCTCT/mctct-arch.png)

The original Flashlight code, model checkpoints, and Colab notebook can be found at https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl.
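
As an illustration of the Mel filterbank input described above, here is a minimal sketch that computes log-Mel filterbank features from a 16 kHz signal with torchaudio. The 80 bins and 25 ms / 10 ms windowing are illustrative assumptions, not necessarily the model's exact settings; in practice, `MCTCTProcessor` (used in the usage section below) performs feature extraction for you.

```python
import torch
import torchaudio

# dummy one-second mono waveform at 16 kHz; real inputs would come from audio files
waveform = torch.randn(1, 16_000)

# log-Mel filterbank features in the Kaldi/Flashlight style; the 80 bins and
# 25 ms / 10 ms windowing here are illustrative defaults
features = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=16000.0,
    num_mel_bins=80,
    frame_length=25.0,  # window length in milliseconds
    frame_shift=10.0,   # hop length in milliseconds
)

print(features.shape)  # (num_frames, num_mel_bins), e.g. (98, 80)
```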
## Citation

[Paper](https://arxiv.org/abs/2111.00161)

Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert
```
@article{lugosch2021pseudo,
  title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
  author={Lugosch, Loren and Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan},
  journal={ICASSP},
  year={2022}
}
```

## Contribution

A huge thanks to [Chan Woo Kim](https://huggingface.co/cwkeam) for porting the model from Flashlight C++ to PyTorch.

## Training method

![model image](https://raw.githubusercontent.com/cwkeam/scientific-images/main/MCTCT/mctct-slimipl.png)

For more information on how the model was trained, please take a look at the [official paper](https://arxiv.org/abs/2111.00161).
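
At a high level, pseudo-labeling alternates between training on labeled data, transcribing unlabeled audio with the current model, and reusing those transcripts as training targets. The sketch below is a simplified schematic of that idea, not the paper's exact procedure; `train_step` and `transcribe` are hypothetical helpers.

```python
# Schematic pseudo-labeling loop. `train_step` and `transcribe` are hypothetical
# stand-ins; the actual procedure in the paper is more involved.
def pseudo_label_training(model, labeled_data, unlabeled_audio, num_rounds=3):
    for _ in range(num_rounds):
        # 1. train (or fine-tune) the model on the currently labeled data
        for audio, transcript in labeled_data:
            train_step(model, audio, transcript)

        # 2. transcribe the unlabeled audio with the current model
        pseudo_labeled = [(audio, transcribe(model, audio)) for audio in unlabeled_audio]

        # 3. add the pseudo-labeled pairs to the training set for the next round
        labeled_data = list(labeled_data) + pseudo_labeled
    return model
```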
## Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:
```python
import torch
from datasets import load_dataset
from transformers import MCTCTForCTC, MCTCTProcessor

model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# load a dummy dataset and read the audio files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# feature extraction (Mel filterbank features)
input_features = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt").input_features

# retrieve logits
with torch.no_grad():
    logits = model(input_features).logits

# greedy decoding: take the argmax over character labels and let the CTC
# tokenizer collapse repeats and remove blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
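
The dummy dataset above is already sampled at 16 kHz. For your own recordings, which may use a different sampling rate, here is a minimal sketch (with `audio.wav` as a placeholder path) that resamples with torchaudio before feature extraction:

```python
import torch
import torchaudio
from transformers import MCTCTForCTC, MCTCTProcessor

model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# "audio.wav" is a placeholder path for your own recording
waveform, sample_rate = torchaudio.load("audio.wav")

# mix down to mono and resample to the 16 kHz rate the model expects
waveform = waveform.mean(dim=0)
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

input_features = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt").input_features
with torch.no_grad():
    logits = model(input_features).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))
print(transcription)
```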
 
Results for Common Voice, averaged over all languages:

*Character error rate (CER)*:

| Valid | Test |
|---|---|
| 21.4 | 23.3 |
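
For reference, CER is the character-level edit (Levenshtein) distance between the hypothesis and the reference transcript, divided by the reference length; a minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edit distance / reference length."""
    r, h = reference, hypothesis
    # standard dynamic-programming Levenshtein distance
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(cer("hello world", "helo world"))  # 1 deletion / 11 chars ≈ 0.091
```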

## Questions & Help

If you have questions regarding this model or need help, please consider opening a discussion or pull request on this repo and tag @lorenlugosch, @cwkeam or @patrickvonplaten.