File size: 3,416 Bytes
9b43833
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
# Neural Machine Translation with Byte-Level Subwords

https://arxiv.org/abs/1909.03341

We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as
example.

## Data
Get data and generate fairseq binary dataset:
```bash
bash ./get_data.sh
```

## Model Training
Train Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
```bash
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384
```
```bash
fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
    --batch-size 100 --max-update 100000 --update-freq 2
```

## Generation
`fairseq-generate` requires bytes (BBPE) decoder to convert byte-level representation back to characters:
```bash
# BPE=--bpe bytes
# BPE=--bpe characters
BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model
# BPE=--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model
# BPE=--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model
```

```bash
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
    --tokenizer moses --moses-target-lang en ${BPE}
```
When using `fairseq-interactive`, bytes (BBPE) encoder/decoder is required to tokenize input data and detokenize model predictions:
```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
    --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```

## Results
| Vocabulary    | Model  | BLEU |
|:-------------:|:-------------:|:-------------:|
| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |


## Citation
```
@misc{wang2019neural,
    title={Neural Machine Translation with Byte-Level Subwords},
    author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
    year={2019},
    eprint={1909.03341},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```


## Contact
Changhan Wang ([changhan@fb.com](mailto:changhan@fb.com)),
Kyunghyun Cho ([kyunghyuncho@fb.com](mailto:kyunghyuncho@fb.com)),
Jiatao Gu ([jgu@fb.com](mailto:jgu@fb.com))