	# MBART: Multilingual Denoising Pre-training for Neural Machine Translation
	[https://arxiv.org/abs/2001.08210]

	## Introduction

	mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. It is one of the first methods to pre-train a complete sequence-to-sequence model by denoising full texts in multiple languages; previous approaches focused only on the encoder, the decoder, or reconstructing parts of the text.

	## Pre-trained models

	Model | Description | # params | Download
	---|---|---|---
	`mbart.CC25` | mBART model with 12 encoder and decoder layers, trained on monolingual corpora in 25 languages | 610M | [mbart.cc25.v2.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz)
	`mbart.ft.ro_en` | mBART cc25 model fine-tuned on the ro-en language pair | 610M | [mbart.cc25.ft.enro.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz)

	## Results

	[WMT16 EN-RO](https://www.statmt.org/wmt16/translation-task.html)

	_(test set, no additional data used)_

	Model | en-ro | ro-en
	---|---|---
	`Random` | 34.3 | 34.0
	`mbart.cc25` | 37.7 | 37.8
	`mbart.enro.bilingual` | 38.5 | 38.5

	## BPE data
	Download and extract the pre-trained model, install SentencePiece (SPM) from [here](https://github.com/google/sentencepiece), then encode the raw text with the model's BPE vocabulary.

	```bash
	# Download and extract the pre-trained model
	wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
	tar -xzvf mbart.cc25.v2.tar.gz

	# Encode every split; DATA, TRAIN, VALID, TEST, SRC, and TGT must already be set
	SPM=/path/to/sentencepiece/build/src/spm_encode
	MODEL=sentence.bpe.model
	${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${SRC} > ${DATA}/${TRAIN}.spm.${SRC} &
	${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${TGT} > ${DATA}/${TRAIN}.spm.${TGT} &
	${SPM} --model=${MODEL} < ${DATA}/${VALID}.${SRC} > ${DATA}/${VALID}.spm.${SRC} &
	${SPM} --model=${MODEL} < ${DATA}/${VALID}.${TGT} > ${DATA}/${VALID}.spm.${TGT} &
	${SPM} --model=${MODEL} < ${DATA}/${TEST}.${SRC} > ${DATA}/${TEST}.spm.${SRC} &
	${SPM} --model=${MODEL} < ${DATA}/${TEST}.${TGT} > ${DATA}/${TEST}.spm.${TGT} &
	wait # block until all background encoding jobs have finished
	```
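
The encoding commands rely on shell variables that this README never defines. A minimal sketch of one possible setup follows. Every value is an assumption chosen for illustration (an EN-RO dataset under a hypothetical `./en_ro_data` directory, with language codes taken from the `--langs` list used for fine-tuning), not something this README prescribes. The loop only prints which raw file maps to which encoded file, so it can be run before SentencePiece is installed.

```bash
# Hypothetical values (assumptions, not from this README); adjust to your layout.
DATA=./en_ro_data   # directory holding the raw parallel text
TRAIN=train         # file prefix of each split
VALID=valid
TEST=test
SRC=en_XX           # language codes as they appear in the --langs list
TGT=ro_RO

# Print, for every split and language, the raw input file and the encoded output file.
for split in "${TRAIN}" "${VALID}" "${TEST}"; do
  for lang in "${SRC}" "${TGT}"; do
    echo "${DATA}/${split}.${lang} -> ${DATA}/${split}.spm.${lang}"
  done
done
```

With these values the raw training source would be expected at `./en_ro_data/train.en_XX` and its encoded counterpart would be written to `./en_ro_data/train.spm.en_XX`.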

	## Preprocess data

	```bash
	DICT=dict.txt
	fairseq-preprocess \
	--source-lang ${SRC} \
	--target-lang ${TGT} \
	--trainpref ${DATA}/${TRAIN}.spm \
	--validpref ${DATA}/${VALID}.spm \
	--testpref ${DATA}/${TEST}.spm \
	--destdir ${DEST}/${NAME} \
	--thresholdtgt 0 \
	--thresholdsrc 0 \
	--srcdict ${DICT} \
	--tgtdict ${DICT} \
	--workers 70
	```
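
Before starting a long fine-tuning run, it can help to confirm that preprocessing wrote everything the trainer will look for. The sketch below assumes fairseq's usual output naming (`<split>.<src>-<tgt>.<lang>.bin` and `.idx`, plus one `dict.<lang>.txt` per language) and the default splits `train`, `valid`, and `test`. The function name `missing_binarized_files` is a hypothetical helper, not part of fairseq.

```bash
# Hypothetical helper (not part of fairseq): print every expected file that is
# missing from a binarized data directory. Prints nothing if all are present.
missing_binarized_files() {
  local dir src tgt split lang ext f
  dir=$1
  src=$2
  tgt=$3
  for split in train valid test; do
    for lang in "$src" "$tgt"; do
      for ext in bin idx; do
        # Assumed naming scheme: <split>.<src>-<tgt>.<lang>.<ext>
        f="${dir}/${split}.${src}-${tgt}.${lang}.${ext}"
        [ -f "$f" ] || echo "$f"
      done
    done
  done
  for lang in "$src" "$tgt"; do
    f="${dir}/dict.${lang}.txt"
    [ -f "$f" ] || echo "$f"
  done
}

# Example call, using the variables from the preprocessing command:
#   missing_binarized_files "${DEST}/${NAME}" "${SRC}" "${TGT}"
```

An empty output suggests the directory is complete. Any printed path points at a split or dictionary that was not written.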

	## Finetune on EN-RO
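	Fine-tune the pre-trained mbart.cc25 checkpoint on EN-RO. `path_2_data` is the directory of binarized data written by `fairseq-preprocess`.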

	```bash
	PRETRAIN=mbart.cc25 # fix if you moved the downloaded checkpoint
	langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

	fairseq-train path_2_data \
	--encoder-normalize-before --decoder-normalize-before \
	--arch mbart_large --layernorm-embedding \
	--task translation_from_pretrained_bart \
	--source-lang en_XX --target-lang ro_RO \
	--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
	--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
	--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
	--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
	--max-tokens 1024 --update-freq 2 \
	--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
	--seed 222 --log-format simple --log-interval 2 \
	--restore-file $PRETRAIN \
	--reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
	--langs $langs \
	--ddp-backend legacy_ddp
	```
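
The batch size in the command above is set indirectly: `--max-tokens` caps the tokens per batch on each GPU, and `--update-freq` accumulates gradients over that many batches before each optimizer step. The sketch below works out the resulting upper bound on tokens per update. The GPU count is an assumption, since this README does not state the hardware used. Real batches are usually somewhat smaller because sentences do not fill the token budget exactly.

```bash
MAX_TOKENS=1024   # from --max-tokens in the fine-tuning command
UPDATE_FREQ=2     # from --update-freq in the fine-tuning command
NUM_GPUS=8        # assumption: the README does not state the GPU count

# Upper bound on tokens contributing to one optimizer update.
TOKENS_PER_UPDATE=$(( MAX_TOKENS * UPDATE_FREQ * NUM_GPUS ))
echo "at most ${TOKENS_PER_UPDATE} tokens per optimizer update"
```

When training on fewer GPUs, raising `--update-freq` by the same factor keeps this bound unchanged.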

	## Generate on EN-RO
	Compute the SacreBLEU score of the fine-tuned en-ro model.

	Get the tokenizer [here](https://github.com/rsennrich/wmt16-scripts), then download and extract the fine-tuned model:
	```bash
	wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz
	tar -xzvf mbart.cc25.ft.enro.tar.gz
	```

	```bash
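	model_dir=MBART_finetuned_enro # fix if you moved the checkpoint
	# langs must hold the same language list as in the fine-tuning command;
	# TOKENIZER must point to the tokenizer script obtained above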

	fairseq-generate path_2_data \
	--path $model_dir/model.pt \
	--task translation_from_pretrained_bart \
	--gen-subset test \
	-t ro_RO -s en_XX \
	--bpe 'sentencepiece' --sentencepiece-model $model_dir/sentence.bpe.model \
	--sacrebleu --remove-bpe 'sentencepiece' \
	--batch-size 32 --langs $langs > en_ro

	cat en_ro | grep -P "^H" | sort -V | cut -f 3- | sed 's/\[ro_RO\]//g' | $TOKENIZER ro > en_ro.hyp
	cat en_ro | grep -P "^T" | sort -V | cut -f 2- | sed 's/\[ro_RO\]//g' | $TOKENIZER ro > en_ro.ref
	sacrebleu -tok 'none' -s 'none' en_ro.ref < en_ro.hyp
	```
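
The post-processing pipeline can be easier to follow on a tiny input. The sketch below runs the same `grep | sort -V | cut | sed` steps on a made-up sample in the tab-separated layout that `fairseq-generate` prints (`H-<id>`, score, hypothesis, and `T-<id>`, reference). The sample sentences and scores are invented for illustration, and the tokenizer step is left out. Plain `grep` is used instead of `grep -P`, which behaves the same for this simple pattern.

```bash
# Made-up sample output (assumption: invented lines, not real model output).
sample=$(printf 'H-10\t-0.4\ttenth hypothesis[ro_RO]\nT-10\ttenth reference[ro_RO]\nH-2\t-0.2\tsecond hypothesis[ro_RO]\nT-2\tsecond reference[ro_RO]\n')

# Hypotheses: keep H lines, sort by sentence id, drop id and score, strip the tag.
hyp=$(printf '%s\n' "$sample" | grep "^H" | sort -V | cut -f 3- | sed 's/\[ro_RO\]//g')

# References: keep T lines, sort by sentence id, drop the id, strip the tag.
ref=$(printf '%s\n' "$sample" | grep "^T" | sort -V | cut -f 2- | sed 's/\[ro_RO\]//g')

printf '%s\n' "$hyp"
printf '%s\n' "$ref"
```

`sort -V` orders the ids numerically, so sentence 2 comes before sentence 10. This keeps hypotheses and references aligned line by line for scoring.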

	## Citation

	```bibtex
	@article{liu2020multilingual,
	title={Multilingual Denoising Pre-training for Neural Machine Translation},
	author={Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer},
	year={2020},
	eprint={2001.08210},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```