---
license: apache-2.0
datasets:
- google/cvss
language:
- en
- fr
metrics:
- bleu
---
# NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model
<p align="center">
  <img src="https://github.com/ictnlp/NAST-S2x/assets/43530347/02d6dea6-5887-459e-9938-bc510b6c850c"/>
</p>

## Features
* 🤖 **An end-to-end model without intermediate text decoding**
* 💪 **Supports offline and streaming decoding of all modalities**
* ⚡️ **28× faster inference compared to autoregressive models**

## Examples
#### We present examples of French-to-English translation with chunk sizes of 320 ms and 2560 ms, as well as in the offline condition.
* With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
* In the simultaneous interpretation examples, the left audio channel carries the input streaming speech and the right audio channel carries the simultaneous translation.
> [!NOTE]
> For a better experience, please wear headphones.

Chunk Size 320ms | Chunk Size 2560ms | Offline
:-------------------------:|:-------------------------:|:-------------------------:
<video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/52f2d5c4-43ad-49cb-844f-09575ef048e0" width="100"></video> | <video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/56475dee-1649-40d9-9cb6-9fe033f6bb32"></video> | <video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/b6fb1d09-b418-45f0-84e9-e6ed3a2cea48"></video>

Source Speech Transcript | Reference Text Translation
:-------------------------:|:-------------------------:
Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne

> [!NOTE]
> For more examples, please check https://nast-s2x.github.io/.

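If you want to listen to the two sides of an interpretation example separately, you can split the stereo file into its channels. A minimal sketch using `torchaudio`; the file name `example.wav` is a placeholder, not a shipped asset:

```python
import torchaudio

# Stereo interpretation example: left = streaming source, right = simultaneous translation
wav, sr = torchaudio.load("example.wav")   # wav: (channels, samples); file name is hypothetical
source, translation = wav[0:1], wav[1:2]   # split the two channels

torchaudio.save("source.wav", source, sr)
torchaudio.save("translation.wav", translation, sr)
```
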
## Performance

* ⚡️ **Lightning Fast**: 28× faster inference and competitive quality in offline speech-to-speech translation
* 👩‍💼 **Simultaneous**: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
* 🤖 **Unified Framework**: Supports end-to-end text & speech generation in one model

**Check Details** 👇
Offline-S2S | Simul-S2S | Simul-S2T
:-------------------------:|:-------------------------:|:-------------------------:
![image](https://github.com/ictnlp/NAST-S2x/assets/43530347/abf6931f-c6be-4870-8f58-3a338e3b2b5c)| ![image](https://github.com/ictnlp/NAST-S2x/assets/43530347/9a57bf02-c606-4a78-af3e-1c0d1f25d27e) | ![image](https://github.com/ictnlp/NAST-S2x/assets/43530347/6ecfe401-770c-4dc0-9c50-e76a8c20b84b)

## Architecture
<p align="center">
  <img src="https://github.com/ictnlp/NAST-S2x/assets/43530347/404cdd56-a9d9-4c10-96aa-64f0c7605248" width="800" />
</p>

* **Fully Non-autoregressive:** Trained with the **CTC-based non-monotonic latent alignment loss [(Shao and Feng, 2022)](https://arxiv.org/abs/2210.03953)** and the **glancing mechanism [(Qian et al., 2021)](https://arxiv.org/abs/2008.07905)**.
* **Minimal Human Design:** Seamlessly switch between offline translation and simultaneous interpretation **by adjusting the chunk size**, as illustrated in the sketch below.
* **End-to-End:** Generates target speech **without** intermediate target text decoding.

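To make "chunk size" concrete: at a 16 kHz sampling rate, a 320 ms chunk is 5,120 samples, and offline decoding is simply the chunk size covering the whole utterance. A minimal sketch of fixed-size chunking, assuming 16 kHz mono input (the actual model call lives in the repo's inference scripts):

```python
import torch

SAMPLE_RATE = 16_000                          # assumed input rate
CHUNK_MS = 320                                # 320 ms / 1280 ms / 2560 ms, ...
chunk_len = SAMPLE_RATE * CHUNK_MS // 1000    # 5120 samples per 320 ms chunk

def stream_chunks(wav: torch.Tensor):
    """Yield fixed-duration chunks of a mono waveform (shape: [samples])."""
    for start in range(0, wav.numel(), chunk_len):
        yield wav[start : start + chunk_len]

# Each chunk is fed to the model incrementally; offline decoding corresponds
# to a single chunk spanning the entire input.
```
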
# Sources and Usage
## Model
> [!NOTE]
> We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models in your desired languages by following the instructions provided below.

[🤗 Model card](https://huggingface.co/ICTNLP/NAST-S2X)

| Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
| --- | --- | --- | --- | --- |
| 320ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_320ms.pt) | 19.67 | 24.90 | -393ms |
| 1280ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_1280ms.pt) | 20.20 | 25.71 | 3330ms |
| 2560ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_2560ms.pt) | 24.88 | 26.14 | 4976ms |
| Offline | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/Offline.pt) | 25.82 | - | - |

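The Average Lagging column reports speech-side latency in milliseconds; a negative value (as in the 320 ms row) means emissions run ahead of an ideal wait-0 system on average. As a rough sketch of the standard metric introduced with STACL (Ma et al., 2019) — our reported numbers come from the SimulEval fork's latency scorers described below — where `delays_ms[i]` is the amount of source audio consumed when target unit `i` is emitted:

```python
def average_lagging(delays_ms, src_duration_ms, num_target_units):
    """Average Lagging: mean lag of each emission behind an ideal wait-0 system.

    delays_ms[i] = milliseconds of source audio read when target unit i is emitted.
    """
    # Emission rate of the ideal system: target units per millisecond of source
    gamma = num_target_units / src_duration_ms
    # Stop averaging at the first emission made after the full source was read
    tau = next((i for i, d in enumerate(delays_ms) if d >= src_duration_ms),
               len(delays_ms) - 1)
    return sum(d - i / gamma for i, d in enumerate(delays_ms[: tau + 1])) / (tau + 1)
```
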
| Vocoder |
| --- |
| [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/tree/main/vocoder) |

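To fetch the checkpoints programmatically, the standard `huggingface_hub` client works; a minimal sketch (file names match the tables above):

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Single streaming checkpoint, as listed in the table above
ckpt_path = hf_hub_download(repo_id="ICTNLP/NAST-S2X", filename="chunk_320ms.pt")

# Vocoder directory
vocoder_dir = snapshot_download(repo_id="ICTNLP/NAST-S2X", allow_patterns=["vocoder/*"])
```
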
## Inference
> [!WARNING]
> Before executing the provided shell scripts, make sure to replace the variables in each file with the paths specific to your machine.

### Offline Inference
* **Data Preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **Generate Acoustic Units**: Execute [``offline_s2u_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/offline_s2u_infer.sh)
* **Generate Waveforms**: Execute [``offline_wav_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/offline_wav_infer.sh)
* **Evaluation**: Use Fairseq's [ASR-BLEU evaluation toolkit](https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_speech/asr_bleu); the sketch below shows the final scoring step.
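ASR-BLEU transcribes the generated target speech with an ASR model and scores the transcripts against the text references. That final step is plain corpus BLEU; a minimal sketch with `sacrebleu`, where both file paths are placeholders for the pipeline's outputs:

```python
import sacrebleu

# Hypotheses: ASR transcripts of the generated speech; references: text translations.
with open("asr_transcripts.txt") as f:
    hyps = [line.strip() for line in f]
with open("references.txt") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"ASR-BLEU: {bleu.score:.2f}")
```
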
### Simultaneous Inference
* We use our customized fork of [``SimulEval: b43a7c``](https://github.com/Paulmzr/SimulEval/tree/b43a7c7a9f20bb4c2ff48cf1bc573b4752d7081e) to evaluate the model in simultaneous inference. The fork is built upon the official [``SimulEval: a1435b``](https://github.com/facebookresearch/SimulEval/tree/a1435b65331cac9d62ea8047fe3344153d7e7dac) and adds extra latency scorers.
* **Data Preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **Streaming Generation and Evaluation**: Execute [``streaming_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/streaming_infer.sh)

## Train your own NAST-S2X
* **Data Preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **CTC Pretraining**: Execute [``train_ctc.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/train_scripts/train_ctc.sh); a sketch of the underlying CTC objective follows this list.
* **NMLA Training**: Execute [``train_nmla.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/train_scripts/train_nmla.sh)

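For orientation, the pretraining stage optimizes a CTC objective over the non-autoregressive decoder's outputs; the NMLA loss and glancing mechanism are then applied on top (see the training scripts for the actual configuration). A minimal PyTorch sketch of plain CTC with toy shapes — all sizes are illustrative, not the model's real dimensions:

```python
import torch
import torch.nn.functional as F

T, N, C, S = 200, 4, 1000, 50   # illustrative: output steps, batch, vocab size, target length
ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

# (T, N, C) log-probabilities from the non-autoregressive decoder
log_probs = F.log_softmax(torch.randn(T, N, C, requires_grad=True), dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # targets contain no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```
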
## Citing

Please kindly cite us if you find our papers or code useful.

```
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024},
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024},
}
```