Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling
Introduction
- Yee et al. (2019) introduce a simple and effective noisy channel modeling approach for neural machine translation. However, the noisy channel online decoding approach introduced in this paper is too slow to be practical.
- To address this, Bhosale et al. (2020) introduce three simple approximations that make this approach fast and practical without much loss in accuracy.
- This README provides instructions on how to run online decoding or generation with the noisy channel modeling approach, including ways to make it very fast without much loss in accuracy.
Noisy Channel Modeling
Yee et al. (2019) apply Bayes' rule to predict P(y|x), the probability of the target y given the source x.
P(y|x) = P(x|y) * P(y) / P(x)
- P(x|y) predicts the source x given the target y and is referred to as the channel model
- P(y) is a language model over the target y
- P(x) is generally not modeled since it is constant for all y.
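As a small illustration of why P(x) can be dropped, the sketch below builds a toy joint distribution (the numbers are invented for illustration only) and checks that ranking targets by P(x|y) * P(y) picks the same target as ranking by P(y|x).

```python
# Toy joint distribution P(x, y); the values are made up and sum to 1.
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.25, ("x1", "y3"): 0.05,
    ("x2", "y1"): 0.30, ("x2", "y2"): 0.10, ("x2", "y3"): 0.20,
}
targets = ["y1", "y2", "y3"]


def p_x(x):
    """Marginal probability of the source x."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def p_y(y):
    """Marginal probability of the target y (the 'language model')."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def channel(x, y):
    """Channel model P(x|y)."""
    return joint[(x, y)] / p_y(y)


def direct(x, y):
    """Direct model P(y|x)."""
    return joint[(x, y)] / p_x(x)


x = "x1"
# Full Bayes' rule, including the P(x) normalizer.
bayes = {y: channel(x, y) * p_y(y) / p_x(x) for y in targets}
# Same quantity without P(x); it is constant across y, so the ranking is unchanged.
unnormalized = {y: channel(x, y) * p_y(y) for y in targets}

best_direct = max(targets, key=lambda y: direct(x, y))
best_unnormalized = max(targets, key=unnormalized.get)
print(best_direct, best_unnormalized)
```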
We use Transformer models to parameterize the direct model P(y|x), the channel model P(x|y) and the language model P(y).
During online decoding with beam search, we generate the top K2 candidates per beam and score them with the following linear combination of the channel model, language model and direct model scores.
(1 / t) * log(P(y|x)) + (1 / s) * ( λ1 * log(P(x|y)) + λ2 * log(P(y)) )
- t - Target Prefix Length
- s - Source Length
- λ1 - Channel Model Weight
- λ2 - Language Model Weight
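The linear combination above can be sketched as a small Python function. This is only an illustration of the formula as written here, not fairseq's implementation; the function name and argument names are hypothetical (`ch_wt` and `lm_wt` simply mirror λ1 and λ2), and the log-probabilities are assumed to be natural logs summed over the tokens scored so far.

```python
import math


def combined_score(direct_lprob, channel_lprob, lm_lprob,
                   tgt_prefix_len, src_len, ch_wt, lm_wt):
    """Combine direct, channel and LM log-probabilities for one candidate.

    direct_lprob   -- log P(y|x) of the target prefix
    channel_lprob  -- log P(x|y) of the source given the target prefix
    lm_lprob       -- log P(y) of the target prefix
    tgt_prefix_len -- t, number of target tokens generated so far
    src_len        -- s, number of source tokens
    ch_wt, lm_wt   -- lambda1 and lambda2 in the formula
    """
    if tgt_prefix_len <= 0 or src_len <= 0:
        raise ValueError("lengths must be positive")
    direct_term = direct_lprob / tgt_prefix_len
    channel_lm_term = (ch_wt * channel_lprob + lm_wt * lm_lprob) / src_len
    return direct_term + channel_lm_term


# Example with made-up probabilities for a 4-token prefix and a 5-token source.
score = combined_score(
    direct_lprob=math.log(0.2),
    channel_lprob=math.log(0.05),
    lm_lprob=math.log(0.01),
    tgt_prefix_len=4,
    src_len=5,
    ch_wt=0.3,
    lm_wt=0.5,
)
print(score)
```

Normalizing the direct score by t and the channel and language model scores by s keeps the terms on comparable scales as the prefix grows.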
The top beam_size candidates based on the above combined scores are chosen to continue the beams in beam search. In beam search with a direct model alone, the scores from the direct model P(y|x) are used to choose the top candidates in beam search.
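One decoding step of this selection can be sketched as follows. It is a simplified illustration under assumed data structures (a list of candidate dicts per beam, with hypothetical keys `token`, `direct`, `channel`, `lm` and `prefix_len`), not the fairseq beam search code. The toy numbers are chosen so that the combined score picks a different set of candidates than the direct model alone.

```python
import heapq


def select_candidates(candidates_per_beam, beam_size, src_len, ch_wt, lm_wt):
    """Score every candidate from every beam and keep the best beam_size.

    Returns a list of (combined_score, beam_index, token) tuples, best first.
    """
    scored = []
    for beam_idx, candidates in enumerate(candidates_per_beam):
        for cand in candidates:
            score = (cand["direct"] / cand["prefix_len"]
                     + (ch_wt * cand["channel"] + lm_wt * cand["lm"]) / src_len)
            scored.append((score, beam_idx, cand["token"]))
    return heapq.nlargest(beam_size, scored, key=lambda item: item[0])


# Two beams, K2 = 2 candidates each; all log-probabilities are invented.
candidates_per_beam = [
    [
        {"token": "a", "direct": -1.0, "channel": -10.0, "lm": -8.0, "prefix_len": 2},
        {"token": "b", "direct": -1.2, "channel": -2.0, "lm": -3.0, "prefix_len": 2},
    ],
    [
        {"token": "c", "direct": -1.5, "channel": -4.0, "lm": -4.0, "prefix_len": 2},
        {"token": "d", "direct": -3.0, "channel": -9.0, "lm": -9.0, "prefix_len": 2},
    ],
]

chosen = select_candidates(candidates_per_beam, beam_size=2, src_len=4,
                           ch_wt=0.3, lm_wt=0.5)
chosen_tokens = [token for _, _, token in chosen]

# For comparison: ranking by the direct model score alone.
all_cands = [c for beam in candidates_per_beam for c in beam]
direct_only_tokens = [c["token"] for c in
                      sorted(all_cands, key=lambda c: c["direct"], reverse=True)[:2]]
print(chosen_tokens, direct_only_tokens)
```

Here candidate "a" has the best direct score but poor channel and language model scores, so it is dropped once the three scores are combined.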
This framework provides a great way to utilize strong target language models trained on large amounts of unlabeled data. Language models can prefer targets unrelated to the source, so we also need a channel model whose role is to ensure that the target preferred by the language model also translates back to the source.
Training Translation Models and Language Models
For training Transformer models in fairseq for machine translation, refer to instructions here
For training Transformer models in fairseq for language modeling, refer to instructions here
Generation with Language Model for German-English translation with fairseq
Here are instructions to generate using a direct model and a target-side language model.
Note:
- Download and install fairseq as per instructions here
- Preprocess and binarize the dataset as per instructions in section Test Data Preprocessing
binarized_data=data_dir/binarized
direct_model=de_en_seed4.pt
lm_model=en_lm.pt
lm_data=lm_data
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/direct_models/seed4.pt -O ${direct_model}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/transformer_lm.pt -O ${lm_model}
mkdir -p ${lm_data}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/lm_dict/dict.txt -O ${lm_data}/dict.txt
k2=10
lenpen=0.16
lm_wt=0.14
fairseq-generate ${binarized_data} \
--user-dir examples/fast_noisy_channel \
--beam 5 \
--path ${direct_model} \
--lm-model ${lm_model} \
--lm-data ${lm_data} \
--k2 ${k2} \
--combine-method lm_only \
--task noisy_channel_translation \
--lenpen ${lenpen} \
--lm-wt ${lm_wt} \
--gen-subset valid \
--remove-bpe \
--fp16 \
--batch-size 10
Noisy Channel Generation for German-English translation with fairseq
Here are instructions for noisy channel generation with a direct model, channel model and language model as explained in section Noisy Channel Modeling.
Note:
- Download and install fairseq as per instructions here
- Preprocess and binarize the dataset as per instructions in section Test Data Preprocessing
binarized_data=data_dir/binarized
direct_model=de_en_seed4.pt
lm_model=en_lm.pt
lm_data=lm_data
ch_model=en_de.big.seed4.pt
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/direct_models/seed4.pt -O ${direct_model}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/transformer_lm.pt -O ${lm_model}
mkdir -p ${lm_data}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/lm_dict/dict.txt -O ${lm_data}/dict.txt
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/channel_models/big.seed4.pt -O ${ch_model}
k2=10
lenpen=0.21
lm_wt=0.50
bw_wt=0.30
fairseq-generate ${binarized_data} \
--user-dir examples/fast_noisy_channel \
--beam 5 \
--path ${direct_model} \
--lm-model ${lm_model} \
--lm-data ${lm_data} \
--channel-model ${ch_model} \
--k2 ${k2} \
--combine-method noisy_channel \
--task noisy_channel_translation \
--lenpen ${lenpen} \
--lm-wt ${lm_wt} \
--ch-wt ${bw_wt} \
--gen-subset test \
--remove-bpe \
--fp16 \
--batch-size 1
Fast Noisy Channel Modeling
Bhosale et al. (2020) introduce three approximations that speed up online noisy channel decoding:
- Smaller channel models (Transformer Base with 1 encoder and decoder layer each vs. Transformer Big)
  - This involves training a channel model that is possibly smaller and less accurate in terms of BLEU than a channel model of the same size as the direct model.
  - Since the role of the channel model is mainly to assign low scores to generations from the language model if they don't translate back to the source, we may not need the most accurate channel model for this purpose.
- Smaller output vocabulary size for the channel model (~30,000 -> ~1000)
  - The channel model doesn't need to score the full output vocabulary; it only needs to score the source tokens, which are completely known.
  - This is specified using the arguments --channel-scoring-type src_vocab --top-k-vocab 500
  - This means that the output vocabulary for the channel model will be the source tokens for all examples in the batch and the top-K most frequent tokens in the vocabulary
  - This significantly reduces the memory needed to store channel model scores
- Smaller number of candidates (k2) scored per beam
  - This is specified by reducing the argument --k2
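The reduced channel output vocabulary described above can be sketched as a set union. This is a minimal illustration, not the fairseq implementation: the function name is hypothetical, tokens are represented as integer ids, and `vocab_by_frequency` is assumed to be a list of ids ordered from most to least frequent.

```python
def channel_output_vocab(batch_src_tokens, vocab_by_frequency, top_k):
    """Return the sorted token ids the channel model needs to score.

    batch_src_tokens   -- list of source sentences, each a list of token ids
    vocab_by_frequency -- token ids ordered from most to least frequent (assumed)
    top_k              -- how many of the most frequent tokens to always include
    """
    vocab = set(vocab_by_frequency[:top_k])
    for sentence in batch_src_tokens:
        vocab.update(sentence)
    return sorted(vocab)


# Toy batch of two source sentences over a 2,000-token vocabulary.
batch = [[5, 900, 7], [5, 1200]]
reduced = channel_output_vocab(batch, vocab_by_frequency=list(range(2000)), top_k=3)
print(reduced, len(reduced))
```

Because the reduced vocabulary grows with the number of distinct source tokens in the batch, its size depends on batch composition rather than on the full vocabulary size.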
Fast Noisy Channel Generation for German-English translation with fairseq
Here are instructions for fast noisy channel generation with a direct model, channel model and language model as explained in section Fast Noisy Channel Modeling. The main differences are that we use a smaller channel model, reduce --k2, set --channel-scoring-type src_vocab --top-k-vocab 500 and increase the --batch-size.
Note:
- Download and install fairseq as per instructions here
- Preprocess and binarize the dataset as per instructions in section Test Data Preprocessing
binarized_data=data_dir/binarized
direct_model=de_en_seed4.pt
lm_model=en_lm.pt
lm_data=lm_data
small_ch_model=en_de.base_1_1.seed4.pt
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/direct_models/seed4.pt -O ${direct_model}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/transformer_lm.pt -O ${lm_model}
mkdir -p ${lm_data}
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/lm_model/lm_dict/dict.txt -O ${lm_data}/dict.txt
wget https://dl.fbaipublicfiles.com/fast_noisy_channel/de_en/channel_models/base_1_1.seed4.pt -O ${small_ch_model}
k2=3
lenpen=0.23
lm_wt=0.58
bw_wt=0.26
fairseq-generate ${binarized_data} \
--user-dir examples/fast_noisy_channel \
--beam 5 \
--path ${direct_model} \
--lm-model ${lm_model} \
--lm-data ${lm_data} \
--channel-model ${small_ch_model} \
--k2 ${k2} \
--combine-method noisy_channel \
--task noisy_channel_translation \
--lenpen ${lenpen} \
--lm-wt ${lm_wt} \
--ch-wt ${bw_wt} \
--gen-subset test \
--remove-bpe \
--fp16 \
--batch-size 50 \
--channel-scoring-type src_vocab --top-k-vocab 500
Test Data Preprocessing
For preprocessing and binarizing the test sets for Romanian-English and German-English translation, we use the following script:
FAIRSEQ=/path/to/fairseq
cd $FAIRSEQ
SCRIPTS=$FAIRSEQ/mosesdecoder/scripts
if [ ! -d "${SCRIPTS}" ]; then
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
fi
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
NORMALIZE=$SCRIPTS/tokenizer/normalize-punctuation.perl
s=de
t=en
test=wmt18
mkdir -p data_dir
# Tokenization
if [ $s == "ro" ] ; then
# Note: Get normalise-romanian.py and remove-diacritics.py from
# https://github.com/rsennrich/wmt16-scripts/tree/master/preprocess
sacrebleu -t $test -l $s-$t --echo src | \
$NORMALIZE -l $s | \
python normalise-romanian.py | \
python remove-diacritics.py | \
$TOKENIZER -l $s -a -q > data_dir/$test.$s-$t.$s
else
sacrebleu -t $test -l $s-$t --echo src | perl $NORMALIZE -l $s | perl $TOKENIZER -threads 8 -a -l $s > data_dir/$test.$s-$t.$s
fi
sacrebleu -t $test -l $s-$t --echo ref | perl $NORMALIZE -l $t | perl $TOKENIZER -threads 8 -a -l $t > data_dir/$test.$s-$t.$t
# Applying BPE
src_bpe_code=/path/to/source/language/bpe/code
tgt_bpe_code=/path/to/target/language/bpe/code
src_dict=/path/to/source/language/dict
tgt_dict=/path/to/target/language/dict
FASTBPE=$FAIRSEQ/fastBPE
if [ ! -d "${FASTBPE}" ] ; then
git clone https://github.com/glample/fastBPE.git
# Follow compilation instructions at https://github.com/glample/fastBPE
# Build the binary inside ${FASTBPE} so that ${FASTBPE}/fast below resolves
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fastBPE/fast
fi
${FASTBPE}/fast applybpe data_dir/bpe.$test.$s-$t.$s data_dir/$test.$s-$t.$s ${src_bpe_code}
${FASTBPE}/fast applybpe data_dir/bpe.$test.$s-$t.$t data_dir/$test.$s-$t.$t ${tgt_bpe_code}
fairseq-preprocess -s $s -t $t \
--testpref data_dir/bpe.$test.$s-$t \
--destdir data_dir/binarized \
--srcdict ${src_dict} \
--tgtdict ${tgt_dict}
Calculating BLEU
DETOKENIZER=$SCRIPTS/tokenizer/detokenizer.perl
cat ${generation_output} | grep -P "^H" | sort -V | cut -f 3- | $DETOKENIZER -l $t -q -a | sacrebleu -t $test -l $s-$t
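For readers who want to inspect the hypotheses before scoring, the extraction part of the pipeline above (the `grep`, `sort` and `cut` steps) can be approximated in Python. This sketch assumes hypothesis lines are tab-separated with the form `H-<id>`, score, text; the function name is hypothetical, and detokenization and BLEU scoring are still left to the tools in the command above.

```python
def extract_hypotheses(lines):
    """Return hypothesis texts ordered by sentence id.

    Keeps only lines starting with "H-", sorts them by the numeric id after
    the prefix, and drops the first two tab-separated fields (id and score).
    """
    hyps = []
    for line in lines:
        if not line.startswith("H-"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line: no hypothesis text
        sent_id = int(fields[0][2:])
        hyps.append((sent_id, "\t".join(fields[2:])))
    hyps.sort(key=lambda item: item[0])
    return [text for _, text in hyps]


# Toy generation output with invented sentences and scores.
sample_output = [
    "S-1\tein Beispiel",
    "H-10\t-0.5\tten",
    "H-2\t-0.3\ttwo",
    "T-2\treference",
    "H-1\t-0.1\tone",
]
hypotheses = extract_hypotheses(sample_output)
print(hypotheses)
```

Sorting by sentence id matters because generation output is not guaranteed to be in test-set order, while the BLEU scorer compares hypotheses to references line by line.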
Romanian-English Translation
The direct and channel models are trained using bitext data (WMT16) combined with backtranslated data. The monolingual data used for backtranslation comes from http://data.statmt.org/rsennrich/wmt16_backtranslations/ (Sennrich et al., 2016c).
The backtranslated data is generated using an ensemble of 3 English-Romanian models trained on bitext training data (WMT16) with unrestricted sampling.
BPE Codes and Dictionary
We learn a joint BPE vocabulary of 18K types on the bitext training data which is used for both the source and target.
Path | |
---|---|
BPE Code | joint_bpe_18k |
Dictionary | dict |
Direct Models
For Ro-En with backtranslation, the direct and channel models use a Transformer-Big architecture.
Seed | Model |
---|---|
2 | ro_en_seed2.pt |
4 | ro_en_seed4.pt |
6 | ro_en_seed6.pt |
Channel Models
For channel models, we follow the same steps as for the direct models, except that backtranslated data is generated in the opposite direction using this Romanian monolingual data. The best lenpen, LM weight and CH weight are obtained by sweeping over the validation set (wmt16/dev) using beam 5.
Model Size | Lenpen | LM Weight | CH Weight | Seed 2 | Seed 4 | Seed 6 |
---|---|---|---|---|---|---|
big | 0.84 | 0.64 | 0.56 | big.seed2.pt | big.seed2.pt | big.seed2.pt |
base_1_1 | 0.63 | 0.40 | 0.37 | base_1_1.seed2.pt | base_1_1.seed4.pt | base_1_1.seed6.pt |
Language Model
The model is trained on de-duplicated English Newscrawl data from 2007-2018 comprising 186 million sentences or 4.5B words after normalization and tokenization.
Path | |
---|---|
--lm-model | transformer_en_lm |
--lm-data | lm_data |
German-English Translation
BPE Codes and Dictionaries
Path | |
---|---|
Source BPE Code | de_bpe_code_24K |
Target BPE Code | en_bpe_code_24K |
Source Dictionary | de_dict |
Target Dictionary | en_dict |
Direct Models
We train on WMT’19 training data. Following Ng et al., 2019, we apply language identification filtering and remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. This results in 26.8M sentence pairs. We use the Transformer-Big architecture for the direct model.
Seed | Model |
---|---|
4 | de_en_seed4.pt |
5 | de_en_seed5.pt |
6 | de_en_seed6.pt |
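The length and ratio filtering described for the training data can be sketched as below. This is an illustration under stated assumptions, not the authors' filtering script: tokens are approximated by whitespace splitting, the ratio is applied symmetrically (longer side divided by shorter side), and the language identification step is omitted. The function name and defaults are hypothetical apart from the 250 and 1.5 thresholds quoted above.

```python
def keep_pair(src, tgt, max_len=250, max_ratio=1.5):
    """Return True if a sentence pair passes the length and ratio filters."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # empty side: drop the pair
    if src_len > max_len or tgt_len > max_len:
        return False  # one side is longer than the maximum length
    # Symmetric length ratio (an assumption; see the note above).
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


# Toy sentence pairs, invented for illustration.
pairs = [
    ("das ist ein Test", "this is a test"),             # kept: ratio 1.0
    ("ja", "yes indeed it certainly is"),               # dropped: ratio 5.0
    ("wort " * 300, "word " * 300),                     # dropped: too long
]
filtered = [pair for pair in pairs if keep_pair(*pair)]
print(len(filtered))
```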
Channel Models
We train on WMT’19 training data. Following Ng et al., 2019, we apply language identification filtering and remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. This results in 26.8M sentence pairs.
Model Size | Seed 4 | Seed 5 | Seed 6 |
---|---|---|---|
big | big.seed4.pt | big.seed5.pt | big.seed6.pt |
big_1_1 | big_1_1.seed4.pt | big_1_1.seed5.pt | big_1_1.seed6.pt |
base | base.seed4.pt | base.seed5.pt | base.seed6.pt |
base_1_1 | base_1_1.seed4.pt | base_1_1.seed5.pt | base_1_1.seed6.pt |
half | half.seed4.pt | half.seed5.pt | half.seed6.pt |
half_1_1 | half_1_1.seed4.pt | half_1_1.seed5.pt | half_1_1.seed6.pt |
quarter | quarter.seed4.pt | quarter.seed5.pt | quarter.seed6.pt |
quarter_1_1 | quarter_1_1.seed4.pt | quarter_1_1.seed5.pt | quarter_1_1.seed6.pt |
8th | 8th.seed4.pt | 8th.seed5.pt | 8th.seed6.pt |
8th_1_1 | 8th_1_1.seed4.pt | 8th_1_1.seed5.pt | 8th_1_1.seed6.pt |
16th | 16th.seed4.pt | 16th.seed5.pt | 16th.seed6.pt |
16th_1_1 | 16th_1_1.seed4.pt | 16th_1_1.seed5.pt | 16th_1_1.seed6.pt |
Language Model
The model is trained on de-duplicated English Newscrawl data from 2007-2018 comprising 186 million sentences or 4.5B words after normalization and tokenization.
Path | |
---|---|
--lm-model | transformer_en_lm |
--lm-data | lm_data |
Citation
@inproceedings{bhosale2020language,
title={Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling},
author={Shruti Bhosale and Kyra Yee and Sergey Edunov and Michael Auli},
booktitle={Proceedings of the Fifth Conference on Machine Translation (WMT)},
year={2020},
}
@inproceedings{yee2019simple,
title={Simple and Effective Noisy Channel Modeling for Neural Machine Translation},
author={Yee, Kyra and Dauphin, Yann and Auli, Michael},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={5700--5705},
year={2019}
}