---
license: mit
language:
- zh
- en
metrics:
- cer
- bleu
tags:
- asr
- automatic-speech-recognition
- automatic-speech-translation
- speech-translation
- speech-recognition
---

# MooER (摩耳): LLM-based Speech Recognition and Translation Models from Moore Threads

## 📖 Introduction

We introduce **MooER (摩耳)**: LLM-based speech recognition and translation models developed by Moore Threads. With the *MooER* framework, you can transcribe speech into text (automatic speech recognition, ASR) and translate it into other languages (automatic speech translation, AST) in an end-to-end manner. The performance of *MooER* is demonstrated in the sections below, and our insights into model configurations, training strategies, and more are provided in our [technical report](https://arxiv.org/abs/2408.05101).

For the usage of the model files, please refer to our [GitHub](https://github.com/MooreThreads/MooER).
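
If you only need the checkpoint files themselves, they can also be fetched programmatically. Below is a minimal sketch using the `huggingface_hub` client; the repository id is a placeholder, so substitute the id of this model repository:

```python
# Sketch: download the MooER checkpoint files from the Hugging Face Hub.
# "<this-repo-id>" is a placeholder for this model repository's id;
# setup and inference are driven by the scripts in the MooER GitHub repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<this-repo-id>",         # placeholder, e.g. "<org>/<model-name>"
    local_dir="./mooer_checkpoints",  # where to place the files
)
print(f"Model files downloaded to: {local_dir}")
```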

<br>
<p align="center">
    <img src="assets/framework.png" width="600"/>
</p>
<br>

## 🥊 Evaluation Results

We present the training data and the evaluation results below. For more comprehensive information, please refer to our [report](https://arxiv.org/pdf/2408.05101).

### Training data

We utilize 5k hours of data (MT5K) to train our basic *MooER-5K* model. The data sources include:

| Dataset       | Duration |
|---------------|----------|
| aishell2      | 137h     |
| librispeech   | 131h     |
| multi_cn      | 100h     |
| wenetspeech   | 1361h    |
| in-house data | 3274h    |

Note that the data from the open-source datasets were randomly selected from the full training sets. The in-house data, collected internally without transcriptions, were transcribed using a third-party ASR service.

Since all of the above datasets were originally designed only for the speech recognition task, no translation references are available. To train our speech translation model, we used a third-party machine translation service to generate pseudo-labels. No data filtering techniques were applied.

At this moment, we are also developing a new model trained with 80K hours of data.

### Speech Recognition

The performance of speech recognition is evaluated using WER/CER.

<table>
  <tr>
    <th>Language</th><th>Testset</th><th>Paraformer-large</th><th>SenseVoice-small</th><th>Qwen-audio</th><th>Whisper-large-v3</th><th>SeamlessM4T-v2</th><th>MooER-5K</th><th>MooER-80K</th>
  </tr>
  <tr>
    <td rowspan="7">Chinese</td>
    <td>aishell1</td><td>1.93</td><td>3.03</td><td>1.43</td><td>7.86</td><td>4.09</td><td>1.93</td><td>1.25</td>
  </tr>
  <tr>
    <td>aishell2_ios</td><td>2.85</td><td>3.79</td><td>3.57</td><td>5.38</td><td>4.81</td><td>3.17</td><td>2.67</td>
  </tr>
  <tr>
    <td>test_magicdata</td><td>3.66</td><td>3.81</td><td>5.31</td><td>8.36</td><td>9.69</td><td>3.48</td><td>2.52</td>
  </tr>
  <tr>
    <td>test_thchs</td><td>3.99</td><td>5.17</td><td>4.86</td><td>9.06</td><td>7.14</td><td>4.11</td><td>3.14</td>
  </tr>
  <tr>
    <td>fleurs cmn_dev</td><td>5.56</td><td>6.39</td><td>10.54</td><td>4.54</td><td>7.12</td><td>5.81</td><td>5.23</td>
  </tr>
  <tr>
    <td>fleurs cmn_test</td><td>6.92</td><td>7.36</td><td>11.07</td><td>5.24</td><td>7.66</td><td>6.77</td><td>6.18</td>
  </tr>
  <tr>
    <td>average</td><td><strong>4.15</strong></td><td><strong>4.93</strong></td><td><strong>6.13</strong></td><td><strong>6.74</strong></td><td><strong>6.75</strong></td><td><strong>4.21</strong></td><td><strong>3.50</strong></td>
  </tr>
  <tr>
    <td rowspan="7">English</td>
    <td>librispeech test_clean</td><td>14.15</td><td>4.07</td><td>2.15</td><td>3.42</td><td>2.77</td><td>7.78</td><td>4.11</td>
  </tr>
  <tr>
    <td>librispeech test_other</td><td>22.99</td><td>8.26</td><td>4.68</td><td>5.62</td><td>5.25</td><td>15.25</td><td>9.99</td>
  </tr>
  <tr>
    <td>fleurs eng_dev</td><td>24.93</td><td>12.92</td><td>22.53</td><td>11.63</td><td>11.36</td><td>18.89</td><td>13.32</td>
  </tr>
  <tr>
    <td>fleurs eng_test</td><td>26.81</td><td>13.41</td><td>22.51</td><td>12.57</td><td>11.82</td><td>20.41</td><td>14.97</td>
  </tr>
  <tr>
    <td>gigaspeech dev</td><td>24.23</td><td>19.44</td><td>12.96</td><td>19.18</td><td>28.01</td><td>23.46</td><td>16.92</td>
  </tr>
  <tr>
    <td>gigaspeech test</td><td>23.07</td><td>16.65</td><td>13.26</td><td>22.34</td><td>28.65</td><td>22.09</td><td>16.64</td>
  </tr>
  <tr>
    <td>average</td><td><strong>22.70</strong></td><td><strong>12.46</strong></td><td><strong>13.02</strong></td><td><strong>12.46</strong></td><td><strong>14.64</strong></td><td><strong>17.98</strong></td><td><strong>12.66</strong></td>
  </tr>
</table>
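
For reference, WER and CER can be computed with the open-source `jiwer` package. This is a minimal scoring sketch with toy sentences, not necessarily the exact pipeline used for the numbers above:

```python
# Sketch: score ASR hypotheses against references. jiwer implements both
# WER (word level, used for English) and CER (character level, used for
# Chinese). The sentences below are toy examples, not test-set data.
import jiwer

# English: word error rate
ref_en = "the quick brown fox jumps over the lazy dog"
hyp_en = "the quick brown fox jumped over a lazy dog"
print(f"WER: {jiwer.wer(ref_en, hyp_en):.2%}")

# Chinese: character error rate (no word boundaries, so score characters)
ref_zh = "今天天气很好"
hyp_zh = "今天天气真好"
print(f"CER: {jiwer.cer(ref_zh, hyp_zh):.2%}")
```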

### Speech Translation (zh -> en)

For speech translation, the performance is evaluated using the BLEU score.

| Testset       | Speech-LLaMA | Whisper-large-v3 | Qwen-audio | Qwen2-audio | SeamlessM4T-v2 | MooER-5K | MooER-5K-MTL |
|---------------|--------------|------------------|------------|-------------|----------------|----------|--------------|
| CoVoST1 zh2en | -            | 13.5             | 13.5       | -           | 25.3           | -        | **30.2**     |
| CoVoST2 zh2en | 12.3         | 12.2             | 15.7       | 24.4        | 22.2           | 23.4     | **25.2**     |
| CCMT2019 dev  | -            | 15.9             | 12.0       | -           | 14.8           | -        | **19.6**     |
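
For reference, BLEU can be computed with the `sacrebleu` package. A minimal sketch with toy sentences; we do not claim it mirrors the exact evaluation settings behind the table above:

```python
# Sketch: corpus-level BLEU with sacrebleu. The hypotheses/references are
# toy examples, not MooER outputs or test-set references.
import sacrebleu

hypotheses = ["the weather is nice today", "he bought a new car"]
# One reference stream, parallel to the hypotheses list.
references = [["the weather is good today", "he purchased a new car"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```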


## 🏁 Getting Started

Please visit our [GitHub](https://github.com/MooreThreads/MooER) for the setup and usage.


## 🧾 License

Please see the [LICENSE](LICENSE).


## 💖 Citation

If you find MooER useful for your research, please 🌟 this repo and cite our work using the following BibTeX:

```bibtex
@article{liang2024mooer,
  title   = {MooER: LLM-based Speech Recognition and Translation Models from Moore Threads},
  author  = {Zhenlin Liang and Junhao Xu and Yi Liu and Yichao Hu and Jian Li and Yajun Zheng and Meng Cai and Hua Wang},
  journal = {arXiv preprint arXiv:2408.05101},
  url     = {https://arxiv.org/abs/2408.05101},
  year    = {2024}
}
```

## 📧 Contact

If you encounter any problems, feel free to create a discussion.

Moore Threads Website: **https://www.mthreads.com/**

<br>
<p align="left">
    <img src="assets/MTLogo.png" width="300"/>
</p>
<br>