Baseline Simultaneous Translation
These are instructions for training and evaluating a wait-k simultaneous translation LSTM model on the MuST-C English-German dataset.
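In a wait-k policy, the decoder first reads k source tokens and then alternates between writing one target token and reading one more source token until the source is exhausted. The sketch below illustrates the read/write schedule only; it is not the fairseq implementation.
# Illustration of the wait-k read/write schedule (not the fairseq code).
def waitk_schedule(src_len, tgt_len, k):
    actions = []
    read, written = 0, 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

print(waitk_schedule(src_len=6, tgt_len=5, k=2))
# ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE']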
Requirements
Install fairseq (make sure to use the correct branch):
git clone --branch simulastsharedtask git@github.com:pytorch/fairseq.git
cd fairseq
pip install -e .
Assuming that fairseq is installed in a directory called FAIRSEQ.
Install SentencePiece. One easy way is to use anaconda:
conda install -c powerai sentencepiece
Download the MuST-C data for English-German available at https://ict.fbk.eu/must-c/.
We will assume that the data is downloaded in a directory called DATA_ROOT.
Text-to-text Model
Data Preparation
Train a SentencePiece model:
for lang in en de; do
python $FAIRSEQ/examples/simultaneous_translation/data/train_spm.py \
--data-path $DATA_ROOT/data \
--vocab-size 10000 \
--max-frame 3000 \
--model-type unigram \
--lang $lang \
--out-path .
done
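Under the hood this amounts to training a standard SentencePiece unigram model on the MuST-C training transcripts. A rough Python equivalent for English (a sketch, not the exact helper script; the unigram-en-10000-3000 output prefix is assumed from the paths used below):
# Sketch of SentencePiece unigram training on the English training text.
import os
import sentencepiece as spm

DATA_ROOT = os.environ["DATA_ROOT"]
out_dir = "unigram-en-10000-3000"                  # naming assumed from the paths below
os.makedirs(out_dir, exist_ok=True)
spm.SentencePieceTrainer.train(
    input=f"{DATA_ROOT}/data/train/txt/train.en",  # raw English training transcripts
    model_prefix=f"{out_dir}/spm",                 # writes spm.model and spm.vocab
    vocab_size=10000,
    model_type="unigram",
)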
Process the data with the SentencePiece model:
proc_dir=proc
mkdir -p $proc_dir
for split in train dev tst-COMMON tst-HE; do
for lang in en de; do
spm_encode \
--model unigram-$lang-10000-3000/spm.model \
< $DATA_ROOT/data/$split/txt/$split.$lang \
> $proc_dir/$split.spm.$lang
done
done
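If spm_encode is not on your PATH, the same encoding can be done from Python. A sketch for the English dev split; the other splits and languages follow the same pattern:
# Encode raw text into SentencePiece pieces, mirroring the spm_encode call above.
import os
import sentencepiece as spm

DATA_ROOT = os.environ["DATA_ROOT"]
sp = spm.SentencePieceProcessor(model_file="unigram-en-10000-3000/spm.model")
with open(f"{DATA_ROOT}/data/dev/txt/dev.en") as fin, open("proc/dev.spm.en", "w") as fout:
    for line in fin:
        pieces = sp.encode(line.rstrip("\n"), out_type=str)  # subword pieces as strings
        fout.write(" ".join(pieces) + "\n")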
Binarize the data:
proc_dir=proc
fairseq-preprocess \
--source-lang en --target-lang de \
--trainpref $proc_dir/train.spm \
--validpref $proc_dir/dev.spm \
--testpref $proc_dir/tst-COMMON.spm \
--thresholdtgt 0 \
--thresholdsrc 0 \
--workers 20 \
--destdir ./data-bin/mustc_en_de
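A quick sanity check after preprocessing is to load the dictionaries that fairseq-preprocess wrote (a sketch; the vocabulary sizes should be close to the 10000 SentencePiece pieces plus special symbols):
# Load and inspect the dictionaries produced by fairseq-preprocess.
from fairseq.data import Dictionary

src_dict = Dictionary.load("data-bin/mustc_en_de/dict.en.txt")
tgt_dict = Dictionary.load("data-bin/mustc_en_de/dict.de.txt")
print(f"source vocab size: {len(src_dict)}, target vocab size: {len(tgt_dict)}")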
Training
mkdir -p checkpoints
CUDA_VISIBLE_DEVICES=1 python $FAIRSEQ/train.py data-bin/mustc_en_de \
--save-dir checkpoints \
--arch berard_simul_text_iwslt \
--simul-type waitk \
--waitk-lagging 2 \
--optimizer adam \
--max-epoch 100 \
--lr 0.001 \
--clip-norm 5.0 \
--batch-size 128 \
--log-format json \
--log-interval 10 \
--criterion cross_entropy_acc \
--user-dir $FAIRSEQ/examples/simultaneous_translation
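With --log-format json, training progress is printed as JSON records, so it can be tracked with a few lines of Python. The sketch below assumes stdout was redirected to a file named train.log (e.g. via tee); the exact field names depend on the criterion:
# Parse JSON training logs and print loss over time.
import json

with open("train.log") as f:
    for line in f:
        start = line.find("{")
        if start == -1:
            continue                          # skip non-JSON lines
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if "loss" in record:
            print(record.get("epoch"), record.get("num_updates"), record["loss"])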
Speech-to-text Model
Data Preparation
First, segment the wav files:
python $FAIRSEQ/examples/simultaneous_translation/data/segment_wav.py \
--datapath $DATA_ROOT
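The segmentation relies on the YAML files shipped with MuST-C, which give a per-utterance offset and duration into each talk's long wav file. Roughly, it does something like the following (a sketch, not the actual segment_wav.py; the YAML location and keys are as distributed with MuST-C):
# Cut individual utterances out of the per-talk wav files using the MuST-C YAML.
import os
import yaml
import soundfile as sf

DATA_ROOT = os.environ["DATA_ROOT"]
split = "dev"
with open(f"{DATA_ROOT}/data/{split}/txt/{split}.yaml") as f:
    segments = yaml.safe_load(f)

for i, seg in enumerate(segments[:3]):                      # first few utterances only
    wav_path = f"{DATA_ROOT}/data/{split}/wav/{seg['wav']}"
    sr = sf.info(wav_path).samplerate
    audio, _ = sf.read(wav_path,
                       start=int(seg["offset"] * sr),
                       frames=int(seg["duration"] * sr))
    sf.write(f"{split}_{i}.wav", audio, sr)                  # one file per utterance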
Similar to the text-to-text model, train a SentencePiece model, but only on German:
python $FAIRSEQ/examples/simultaneous_translation/data/train_spm.py \
--data-path $DATA_ROOT/data \
--vocab-size 10000 \
--max-frame 3000 \
--model-type unigram \
--lang de \
--out-path .
Training
mkdir -p checkpoints
CUDA_VISIBLE_DEVICES=1 python $FAIRSEQ/train.py data-bin/mustc_en_de \
--save-dir checkpoints \
--arch berard_simul_text_iwslt \
--waitk-lagging 2 \
--waitk-stride 10 \
--input-feat-per-channel 40 \
--encoder-hidden-size 512 \
--output-layer-dim 128 \
--decoder-num-layers 3 \
--task speech_translation \
--optimizer adam \
--max-epoch 100 \
--lr 0.001 \
--clip-norm 5.0 \
--batch-size 128 \
--log-format json \
--log-interval 10 \
--criterion cross_entropy_acc \
--user-dir $FAIRSEQ/examples/simultaneous_translation
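The --input-feat-per-channel 40 flag corresponds to 40-dimensional log-mel filterbank features. For reference, such features can be computed as in the sketch below (using torchaudio; the speech_translation task's own feature pipeline may differ in details, and the filename is only illustrative):
# Compute 40-dimensional log-mel filterbank features for one segmented utterance.
import torchaudio

waveform, sample_rate = torchaudio.load("dev_0.wav")  # a segmented utterance (illustrative filename)
features = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=40,                                  # matches --input-feat-per-channel 40
    sample_frequency=sample_rate,
)
print(features.shape)                                 # (num_frames, 40)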
Evaluation
Evaluation Server
For text translation models, the server is set up as follows, given an input file and a reference file:
python ./eval/server.py \
--hostname localhost \
--port 12321 \
--src-file $DATA_ROOT/data/dev/txt/dev.en \
--ref-file $DATA_ROOT/data/dev/txt/dev.de
For speech translation models, the input is the data directory:
python ./eval/server.py \
--hostname localhost \
--port 12321 \
--ref-file $DATA_ROOT \
--data-type speech
Decode and Evaluate with Client
Once the server is set up, run the client to evaluate translation quality and latency.
# TEXT
python $FAIRSEQ/examples/simultaneous_translation/evaluate.py \
data-bin/mustc_en_de \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--src-spm unigram-en-10000-3000/spm.model \
--tgt-spm unigram-de-10000-3000/spm.model \
-s en -t de \
--path checkpoints/checkpoint_best.pt
# SPEECH
python $FAIRSEQ/examples/simultaneous_translation/evaluate.py \
data-bin/mustc_en_de \
--user-dir $FAIRSEQ/examples/simultaneous_translation \
--data-type speech \
--tgt-spm unigram-de-10000-3000/spm.model \
-s en -t de \
--path checkpoints/checkpoint_best.pt
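The client reports translation quality together with latency. The latency metric most commonly reported for wait-k models is Average Lagging (AL); the sketch below follows its standard definition, with delays measured in source tokens (the evaluation client computes its own metrics, this is only for reference):
# Average Lagging (AL) for a sequence of per-token delays.
def average_lagging(delays, src_len, tgt_len):
    # delays[t]: number of source tokens read before emitting target token t+1
    gamma = tgt_len / src_len
    # tau: first target position emitted after the whole source has been read
    tau = next((t for t, d in enumerate(delays) if d >= src_len), len(delays) - 1)
    return sum(delays[t] - t / gamma for t in range(tau + 1)) / (tau + 1)

# Wait-2 schedule on a 6-token source and 5-token target: AL = 1.6
print(average_lagging([2, 3, 4, 5, 6], src_len=6, tgt_len=5))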