mpc001 committed
Commit 7b215b2
1 Parent(s): d319e26

Delete README.md

Files changed (1): README.md +0 -315

README.md DELETED

<p align="center"><img width="160" src="doc/lip_white.png" alt="logo"></p>
<h1 align="center">Visual Speech Recognition for Multiple Languages</h1>

<div align="center">

[📘Introduction](#Introduction) |
[🛠️Preparation](#Preparation) |
[📊Benchmark](#Benchmark-evaluation) |
[🔮Inference](#Speech-prediction) |
[🐯Model zoo](#Model-Zoo) |
[📝License](#License)
</div>

## Authors

[Pingchuan Ma](https://mpc001.github.io/), [Alexandros Haliassos](https://dblp.org/pid/257/3052.html), [Adriana Fernandez-Lopez](https://scholar.google.com/citations?user=DiVeQHkAAAAJ), [Honglie Chen](https://scholar.google.com/citations?user=HPwdvwEAAAAJ), [Stavros Petridis](https://ibug.doc.ic.ac.uk/people/spetridis), [Maja Pantic](https://ibug.doc.ic.ac.uk/people/mpantic).

## Update

`2023-03-27`: We have released our AutoAVSR models for LRS3, see [here](#autoavsr-models).

## Introduction

This is the repository for [Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels](https://arxiv.org/abs/2303.14307) and [Visual Speech Recognition for Multiple Languages](https://arxiv.org/abs/2202.13084), the successor of [End-to-End Audio-Visual Speech Recognition with Conformers](https://arxiv.org/abs/2102.06657). With this repository, you can reach 19.1%, 1.0%, and 0.9% WER on LRS3 for visual-only, audio-only, and audio-visual speech recognition (VSR, ASR, and AV-ASR), respectively.

## Tutorial

We provide a tutorial [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jfb6e4xxhXHbmQf-nncdLno1u0b4j614) showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

## Demo

| English -> Mandarin -> Spanish | French -> Portuguese -> Italian |
|:------------------------------:|:-------------------------------:|
| <img src='doc/vsr_1.gif' title='vsr1' style='max-width:320px'></img> | <img src='doc/vsr_2.gif' title='vsr2' style='max-width:320px'></img> |

<div align="center">

[YouTube](https://youtu.be/FIau-6JA9Po) |
[Bilibili](https://www.bilibili.com/video/BV1Wu411D7oP)
</div>

## Preparation
1. Clone the repository and enter it locally:

```Shell
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
```

2. Set up the conda environment:
```Shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

3. Install PyTorch, torchvision, and torchaudio by following the instructions [here](https://pytorch.org/get-started/), then install the remaining packages:

```Shell
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

4. Download and extract a pre-trained model and/or language model from the [model zoo](#Model-Zoo) to:

- `./benchmarks/${dataset}/models`

- `./benchmarks/${dataset}/language_models`

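For example, if the LRS3 visual-only checkpoint and its language model were downloaded as zip archives, they could be unpacked as in the sketch below; the archive names are placeholders, not the actual file names served by the model zoo:

```Shell
# Placeholder archive names -- substitute the files you actually downloaded from the model zoo.
mkdir -p ./benchmarks/LRS3/models ./benchmarks/LRS3/language_models
unzip visual_only_lrs3.zip -d ./benchmarks/LRS3/models
unzip language_model_lrs3.zip -d ./benchmarks/LRS3/language_models
```
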
69
- 5. [For VSR and AV-ASR] Install [RetinaFace](./tools) or [MediaPipe](https://pypi.org/project/mediapipe/) tracker.
70
-
71
- ### Benchmark evaluation
72
-
```Shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```

- `[config_filename]` is the model configuration path, located in `./configs`.

- `[labels_filename]` is the labels path, located in `${lipreading_root}/benchmarks/${dataset}/labels`.

- `[data_dir]` and `[landmarks_dir]` are the directories of the original dataset and the corresponding landmarks.

- `gpu_idx=-1` can be added to switch from `cuda:0` to `cpu`.

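As a concrete illustration, an LRS3 visual-only evaluation on CPU might look like the sketch below. The config and label file names are hypothetical placeholders; use the files that actually ship in `./configs` and `./benchmarks/LRS3/labels`, together with your own dataset paths:

```Shell
# All file names and paths below are illustrative placeholders.
python eval.py config_filename=./configs/lrs3_vsr.ini \
               labels_filename=./benchmarks/LRS3/labels/lrs3_test.csv \
               data_dir=/data/LRS3 \
               landmarks_dir=/data/LRS3_landmarks \
               gpu_idx=-1
```
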
### Speech prediction

```Shell
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```

- `data_filename` is the path to the audio/video file.

- `detector=mediapipe` can be added to switch from the RetinaFace to the MediaPipe tracker.

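For example, running VSR on a single video with the MediaPipe tracker might look like the following; the config name and video path are placeholders:

```Shell
# Illustrative placeholders for the config file and the input video.
python infer.py config_filename=./configs/lrs3_vsr.ini \
                data_filename=./example/clip.mp4 \
                detector=mediapipe
```
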
### Mouth ROIs cropping

```Shell
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```

- `dst_filename` is the path where the cropped mouth video will be saved.

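A minimal invocation could look like this, where both paths are placeholders:

```Shell
# Illustrative placeholders for the input video and the output mouth-ROI video.
python crop_mouth.py data_filename=./example/clip.mp4 dst_filename=./example/clip_mouth_roi.mp4
```
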
## Model zoo

### Overview

We support a number of datasets for speech recognition:
- [x] [Lip Reading Sentences 2 (LRS2)](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html)
- [x] [Lip Reading Sentences 3 (LRS3)](https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html)
- [x] [Chinese Mandarin Lip Reading (CMLR)](https://www.vipazoo.cn/CMLR.html)
- [x] [CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)](http://immortal.multicomp.cs.cmu.edu/cache/multilingual)
- [x] [GRID](http://spandh.dcs.shef.ac.uk/gridcorpus)
- [x] [Lombard GRID](http://spandh.dcs.shef.ac.uk/avlombard)
- [x] [TCD-TIMIT](https://sigmedia.tcd.ie)

### AutoAVSR models

<details open>

<summary>Lip Reading Sentences 3 (LRS3)</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| - | 19.1 |[GoogleDrive](http://bit.ly/40EAtyX) or [BaiduDrive](https://bit.ly/3ZjbrV5)(key: dqsy) | 891 |
| **Audio-only** |
| - | 1.0 |[GoogleDrive](http://bit.ly/3ZSdh0l) or [BaiduDrive](http://bit.ly/3Z1TlGU)(key: dvf2) | 860 |
| **Audio-visual** |
| - | 0.9 |[GoogleDrive](http://bit.ly/3yRSXAn) or [BaiduDrive](http://bit.ly/3LAxcMY)(key: sai5) | 1540 |
| **Language models** |
| - | - |[GoogleDrive](http://bit.ly/3FE4XsV) or [BaiduDrive](http://bit.ly/3yRI5SY)(key: t9ep) | 191 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/33rEsax) or [BaiduDrive](https://bit.ly/3rwQSph)(key: mi3c) | 18577 |

</details>

### VSR for multiple languages models

<details open>

<summary>Lip Reading Sentences 2 (LRS2)</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| - | 26.1 |[GoogleDrive](https://bit.ly/3I25zrH) or [BaiduDrive](https://bit.ly/3BAHBkH)(key: 48l1) | 186 |
| **Language models** |
| - | - |[GoogleDrive](https://bit.ly/3qzWKit) or [BaiduDrive](https://bit.ly/3KgAL7T)(key: 59u2) | 180 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/3jSMMoz) or [BaiduDrive](https://bit.ly/3BuIwBB)(key: 53rc) | 9358 |

</details>


<details open>

<summary>Lip Reading Sentences 3 (LRS3)</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| - | 32.3 |[GoogleDrive](https://bit.ly/3Bp4gjV) or [BaiduDrive](https://bit.ly/3rIzLCn)(key: 1b1s) | 186 |
| **Language models** |
| - | - |[GoogleDrive](https://bit.ly/3qzWKit) or [BaiduDrive](https://bit.ly/3KgAL7T)(key: 59u2) | 180 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/33rEsax) or [BaiduDrive](https://bit.ly/3rwQSph)(key: mi3c) | 18577 |

</details>

<details open>

<summary>Chinese Mandarin Lip Reading (CMLR)</summary>

<p> </p>

| Components | CER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| - | 8.0 |[GoogleDrive](https://bit.ly/3fR8RkU) or [BaiduDrive](https://bit.ly/3IyACLB)(key: 7eq1) | 195 |
| **Language models** |
| - | - |[GoogleDrive](https://bit.ly/3fPxXAJ) or [BaiduDrive](https://bit.ly/3rEcErr)(key: k8iv) | 187 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/3bvetPL) or [BaiduDrive](https://bit.ly/3o2u53d)(key: 1ret) | 3721 |

</details>

<details open>

<summary>CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| Spanish | 44.5 |[GoogleDrive](https://bit.ly/34MjWBW) or [BaiduDrive](https://bit.ly/33rMq3a)(key: m35h) | 186 |
| Portuguese | 51.4 |[GoogleDrive](https://bit.ly/3HjXCgo) or [BaiduDrive](https://bit.ly/3IqbbMg)(key: wk2h) | 186 |
| French | 58.6 |[GoogleDrive](https://bit.ly/3Ik6owb) or [BaiduDrive](https://bit.ly/35msiQG)(key: t1hf) | 186 |
| **Language models** |
| Spanish | - |[GoogleDrive](https://bit.ly/3rppyJN) or [BaiduDrive](https://bit.ly/3nA3wCN)(key: 0mii) | 180 |
| Portuguese | - |[GoogleDrive](https://bit.ly/3gPvneF) or [BaiduDrive](https://bit.ly/33vL8Es)(key: l6ag) | 179 |
| French | - |[GoogleDrive](https://bit.ly/3LDChSn) or [BaiduDrive](https://bit.ly/3sNnNql)(key: 6tan) | 179 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/34Cf6ak) or [BaiduDrive](https://bit.ly/3BiFG4c)(key: vsic) | 3040 |

</details>

<details open>

<summary>GRID</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| Overlapped | 1.2 |[GoogleDrive](https://bit.ly/3Aa6PWn) or [BaiduDrive](https://bit.ly/3IdamGh)(key: d8d2) | 186 |
| Unseen | 4.8 |[GoogleDrive](https://bit.ly/3patMVh) or [BaiduDrive](https://bit.ly/3t6459A)(key: ttsh) | 186 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/2Yzu1PF) or [BaiduDrive](https://bit.ly/30fucjG)(key: 16l9) | 1141 |

You can include `data_ext=.mpg` in your command line to match the video file extension in the GRID dataset.

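For example, a GRID evaluation command might append that option as in the sketch below; the config, label, and data paths are illustrative placeholders:

```Shell
# data_ext=.mpg makes the script look for .mpg files; all other values are illustrative placeholders.
python eval.py config_filename=./configs/grid_vsr.ini \
               labels_filename=./benchmarks/GRID/labels/test.csv \
               data_dir=/data/GRID \
               landmarks_dir=/data/GRID_landmarks \
               data_ext=.mpg
```
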
</details>


<details open>

<summary>Lombard GRID</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| Unseen (Front Plain) | 4.9 |[GoogleDrive](https://bit.ly/3H5zkGQ) or [BaiduDrive](https://bit.ly/3LE1xI6)(key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 |[GoogleDrive](https://bit.ly/3BsGOSO) or [BaiduDrive](https://bit.ly/3sRZYNY)(key: k6m0) | 186 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/354YOH0) or [BaiduDrive](https://bit.ly/3oWUCA4)(key: cusv) | 309 |

You can include `data_ext=.mov` in your command line to match the video file extension in the Lombard GRID dataset.

</details>

<details open>

<summary>TCD-TIMIT</summary>

<p> </p>

| Components | WER | url | size (MB) |
|:----------------------|:----:|:---------------------------------------------------------------------------------------:|:-----------:|
| **Visual-only** |
| Overlapped | 16.9 |[GoogleDrive](https://bit.ly/3Fv7u61) or [BaiduDrive](https://bit.ly/33rPlZN)(key: jh65) | 186 |
| Unseen | 21.8 |[GoogleDrive](https://bit.ly/3530d0N) or [BaiduDrive](https://bit.ly/3nxZjzC)(key: n2gr) | 186 |
| **Language models** |
| - | - |[GoogleDrive](https://bit.ly/3qzWKit) or [BaiduDrive](https://bit.ly/3KgAL7T)(key: 59u2) | 180 |
| **Landmarks** |
| - | - |[GoogleDrive](https://bit.ly/3HYmifr) or [BaiduDrive](https://bit.ly/3JFJ6RH)(key: bnm8) | 930 |

</details>

## Citation

If you use the AutoAVSR models, please consider citing the following paper:

```bibtex
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
}
```

If you use the VSR models for multiple languages, please consider citing the following paper:

```bibtex
@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}
```

## License

Note that the code can only be used for comparative or benchmarking purposes. The code is supplied under a [License](./LICENSE) and may be used for non-commercial purposes only.

## Contact

```
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)
```