# Simultaneous Speech Translation (SimulST) on MuST-C
This is a tutorial for training and evaluating a transformer *wait-k* simultaneous model on the MuST-C English-German dataset, from [SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation](https://www.aclweb.org/anthology/2020.aacl-main.58.pdf).
[MuST-C](https://www.aclweb.org/anthology/N19-1202) is a multilingual speech-to-text translation corpus with 8-language translations of English TED talks.
## Data Preparation
This section introduces the data preparation for training and evaluation.
If you only want to evaluate the model, please jump to [Inference & Evaluation](#inference--evaluation).
[Download](https://ict.fbk.eu/must-c) and unpack the MuST-C data to a path
`${MUSTC_ROOT}/en-${TARGET_LANG_ID}`, then preprocess it with:
```bash
# Additional Python packages for S2T data processing/model training
pip install pandas torchaudio sentencepiece
# Generate TSV manifests, features, vocabulary,
# global cepstral mean and variance (CMVN) statistics,
# and a configuration file for each language
cd fairseq
python examples/speech_to_text/prep_mustc_data.py \
--data-root ${MUSTC_ROOT} --task asr \
--vocab-type unigram --vocab-size 10000 \
--cmvn-type global
python examples/speech_to_text/prep_mustc_data.py \
--data-root ${MUSTC_ROOT} --task st \
--vocab-type unigram --vocab-size 10000 \
--cmvn-type global
```
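After preprocessing, `${MUSTC_ROOT}/en-de` should contain the extracted features, the TSV manifests for each split (e.g. `train_st.tsv`, `dev_st.tsv`), the sentencepiece model and dictionary (e.g. `spm_unigram10000_st.model`, `spm_unigram10000_st.txt`), the global CMVN statistics (`gcmvn.npz`), and the `config_asr.yaml` and `config_st.yaml` files used by the commands below (exact file names may vary across fairseq versions).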
## ASR Pretraining
We need a pretrained offline ASR model. Assume the save directory of the ASR model is `${ASR_SAVE_DIR}`.
The following command (and the subsequent training commands in this tutorial) assumes training on 1 GPU; `--update-freq 8` accumulates gradients over 8 batches to simulate 8-GPU training (if you do train on 8 GPUs, remove the `--update-freq 8` option).
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
--config-yaml config_asr.yaml --train-subset train_asr --valid-subset dev_asr \
--save-dir ${ASR_SAVE_DIR} --num-workers 4 --max-tokens 40000 --max-update 100000 \
--task speech_to_text --criterion label_smoothed_cross_entropy --report-accuracy \
--arch convtransformer_espnet --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 --clip-norm 10.0 --seed 1 --update-freq 8
```
A pretrained ASR checkpoint can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1_en_de_pretrained_asr).
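If you skip ASR pretraining and use the downloaded checkpoint instead, place it at `${ASR_SAVE_DIR}/checkpoint_best.pt` so that the ST training commands below can find it via `--load-pretrained-encoder-from`.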
## Simultaneous Speech Translation Training
### Wait-K with fixed pre-decision module
A fixed pre-decision module means the model applies its simultaneous policy at the boundaries of fixed-size chunks of encoder states.
Here is an example with a fixed pre-decision ratio of 7 (a simultaneous decision is made every 7 encoder states) and
a wait-3 policy. Assume the save directory is `${ST_SAVE_DIR}`:
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
--config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
--save-dir ${ST_SAVE_DIR} --num-workers 8 \
--optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
--criterion label_smoothed_cross_entropy \
--warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
--load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
--task speech_to_text \
--arch convtransformer_simul_trans_espnet \
--simul-type waitk_fixed_pre_decision \
--waitk-lagging 3 \
--fixed-pre-decision-ratio 7 \
--update-freq 8
```
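To make the policy concrete, here is a minimal, self-contained sketch of how a wait-k policy with a fixed pre-decision ratio turns the encoder-state count into READ/WRITE decisions. This is not fairseq code; the function name and signature are hypothetical.

```python
# Illustrative sketch of wait-k with a fixed pre-decision module
# (hypothetical helper, not part of fairseq).
def waitk_action(num_encoder_states, num_written_tokens,
                 waitk_lagging=3, pre_decision_ratio=7,
                 source_finished=False):
    """Return "WRITE" if the next target token may be emitted, else "READ".

    A decision point occurs every `pre_decision_ratio` encoder states.
    Under wait-k, the (n+1)-th target token is written once at least
    `waitk_lagging + n` decision points have been read.
    """
    if source_finished:
        return "WRITE"  # source exhausted: flush the rest of the translation
    decision_points = num_encoder_states // pre_decision_ratio
    if decision_points - num_written_tokens >= waitk_lagging:
        return "WRITE"
    return "READ"

# With k=3 and ratio=7, the first token can only be written after
# 3 decision points, i.e. 21 encoder states, have been read.
assert waitk_action(20, 0) == "READ"
assert waitk_action(21, 0) == "WRITE"
```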
### Monotonic multihead attention with fixed pre-decision module
```bash
fairseq-train ${MUSTC_ROOT}/en-de \
--config-yaml config_st.yaml --train-subset train_st --valid-subset dev_st \
--save-dir ${ST_SAVE_DIR} --num-workers 8 \
--optimizer adam --lr 0.0001 --lr-scheduler inverse_sqrt --clip-norm 10.0 \
--warmup-updates 4000 --max-update 100000 --max-tokens 40000 --seed 2 \
--load-pretrained-encoder-from ${ASR_SAVE_DIR}/checkpoint_best.pt \
--task speech_to_text \
--criterion latency_augmented_label_smoothed_cross_entropy \
--latency-weight-avg 0.1 \
--arch convtransformer_simul_trans_espnet \
--simul-type infinite_lookback_fixed_pre_decision \
--fixed-pre-decision-ratio 7 \
--update-freq 8
```
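Here the policy is learned rather than fixed: monotonic multihead attention with infinite lookback decides when to read and write, and the `latency_augmented_label_smoothed_cross_entropy` criterion adds a latency term, weighted by `--latency-weight-avg`, to the translation loss. Below is a minimal sketch of the objective's shape; it is our own simplification, not the exact fairseq criterion.

```python
# Shape of the latency-augmented objective (simplified sketch):
# the usual label-smoothed cross-entropy plus a weighted expected
# latency computed from the monotonic attention during training.
def latency_augmented_loss(label_smoothed_ce, expected_avg_latency,
                           latency_weight_avg=0.1):
    return label_smoothed_ce + latency_weight_avg * expected_avg_latency
```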
## Inference & Evaluation
[SimulEval](https://github.com/facebookresearch/SimulEval) is used for evaluation.
Install it and run the evaluation with the following commands:
```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
simuleval \
--agent ${FAIRSEQ}/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent.py \
--source ${SRC_LIST_OF_AUDIO} \
--target ${TGT_FILE} \
--data-bin ${MUSTC_ROOT}/en-de \
--config config_st.yaml \
--model-path ${ST_SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
```
The source file `${SRC_LIST_OF_AUDIO}` is a list of paths to audio files. Assuming your audio files are stored at `/home/user/data`,
it should look like this:
```bash
/home/user/data/audio-1.wav
/home/user/data/audio-2.wav
```
Each line of the target file `${TGT_FILE}` is the translation for the corresponding audio file:
```bash
Translation_1
Translation_2
```
The evaluation runs on the original MuST-C segmentation.
The following command generates the wav list and text file for an evaluation set `${SPLIT}` (chosen from `dev`, `tst-COMMON` and `tst-HE`) in MuST-C and writes them to `${EVAL_DATA}`:
```bash
python ${FAIRSEQ}/examples/speech_to_text/seg_mustc_data.py \
--data-root ${MUSTC_ROOT} --lang de \
--split ${SPLIT} --task st \
--output ${EVAL_DATA}
```
The `--data-bin` and `--config` arguments should be the same as in the previous section if you prepared the data from scratch.
If you only want to run evaluation, a prepared data directory can be downloaded [here](https://dl.fbaipublicfiles.com/simultaneous_translation/must_c_v1.0_en_de_databin.tgz). It contains:
- `spm_unigram10000_st.model`: a sentencepiece model binary.
- `spm_unigram10000_st.txt`: the dictionary file generated by the sentencepiece model.
- `gcmvn.npz`: the global cepstral mean and variance statistics.
- `config_st.yaml`: the config yaml file, which looks like the example below.

If you use the downloaded data directory, you will need to set `sentencepiece_model` and `stats_npz_path` to absolute paths.
```yaml
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ABS_PATH_TO_SENTENCEPIECE_MODEL
global_cmvn:
  stats_npz_path: ABS_PATH_TO_GCMVN_FILE
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocab_filename: spm_unigram10000_st.txt
```
Notice that once `--data-bin` is set, `--config` is the base name of the config yaml, not the full path.
Set `--model-path` to the model checkpoint.
A pretrained checkpoint can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/convtransformer_wait5_pre7); it is a wait-5 model with a pre-decision interval of 280 ms.
The result of this model on `tst-COMMON` is:
```bash
{
"Quality": {
"BLEU": 13.94974229366959
},
"Latency": {
"AL": 1751.8031870037803,
"AL_CA": 2338.5911762796536,
"AP": 0.7931395378788959,
"AP_CA": 0.9405103863210942,
"DAL": 1987.7811616943081,
"DAL_CA": 2425.2751560926167
}
}
```
If the `--output ${OUTPUT}` option is used, the detailed log and scores will be stored under the `${OUTPUT}` directory.
Quality is measured by detokenized BLEU, so make sure that the predicted words sent to the server are detokenized.
The latency metrics are
* Average Proportion (AP)
* Average Lagging (AL)
* Differentiable Average Lagging (DAL)

They, too, are computed on detokenized text. The `_CA` suffixes in the results above denote the computation-aware versions of these metrics, which also account for the model's actual inference time.
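For intuition, here is a small sketch of how Differentiable Average Lagging is computed from per-token delays, following the usual definition of DAL; the code and function name are our own, not SimulEval's implementation.

```python
def differentiable_average_lagging(delays, src_len):
    """Sketch of DAL for one sentence (hypothetical helper).

    `delays[i]` is the amount of source consumed (e.g. ms of speech)
    before target token i is emitted; `src_len` is the total source
    length in the same units.
    """
    tgt_len = len(delays)
    gamma = tgt_len / src_len  # target tokens per source unit
    total, d_prev = 0.0, None
    for i, d in enumerate(delays):
        # Enforce a minimum gap of 1/gamma between successive delays so
        # that writing many tokens at once cannot mask later lagging.
        d_prime = d if d_prev is None else max(d, d_prev + 1.0 / gamma)
        total += d_prime - i / gamma
        d_prev = d_prime
    return total / tgt_len

# A perfectly paced policy (one token per 1/gamma source units,
# starting after the first unit) incurs a small constant lag:
print(differentiable_average_lagging([1000, 2000, 3000, 4000], 4000))  # 1000.0
```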