Spaces:

OFA-Sys
/

OFA-OCR

Runtime error

App Files Files Community

OFA-OCR / fairseq /examples /cross_lingual_language_model /README.md

JustinLin610

first commit

ee21b96 almost 2 years ago

preview code

raw

history blame

3.05 kB

	# Cross-Lingual Language Model Pre-training

	Below are some details for training Cross-Lingual Language Models (XLM) - similar to the ones presented in [Lample & Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf) - in Fairseq. The current implementation only supports the Masked Language Model (MLM) from the paper above.

	## Downloading and Tokenizing Monolingual Data

	Pointers to the monolingual data from wikipedia, used for training the XLM-style MLM model as well as details on processing (tokenization and BPE) it can be found in the [XLM Github Repository](https://github.com/facebookresearch/XLM#download--preprocess-monolingual-data).

	Let's assume the following for the code snippets in later sections to work
	- Processed data is in the folder: monolingual_data/processed
	- Each language has 3 files for train, test and validation. For example we have the following files for English:
	train.en, valid.en
	- We are training a model for 5 languages: Arabic (ar), German (de), English (en), Hindi (hi) and French (fr)
	- The vocabulary file is monolingual_data/processed/vocab_mlm


	## Fairseq Pre-processing and Binarization

	Pre-process and binarize the data with the MaskedLMDictionary and cross_lingual_lm task

	```bash
	# Ensure the output directory exists
	DATA_DIR=monolingual_data/fairseq_processed
	mkdir -p "$DATA_DIR"

	for lg in ar de en hi fr
	do

	fairseq-preprocess \
	--task cross_lingual_lm \
	--srcdict monolingual_data/processed/vocab_mlm \
	--only-source \
	--trainpref monolingual_data/processed/train \
	--validpref monolingual_data/processed/valid \
	--testpref monolingual_data/processed/test \
	--destdir monolingual_data/fairseq_processed \
	--workers 20 \
	--source-lang $lg

	# Since we only have a source language, the output file has a None for the
	# target language. Remove this

	for stage in train test valid

	sudo mv "$DATA_DIR/$stage.$lg-None.$lg.bin" "$stage.$lg.bin"
	sudo mv "$DATA_DIR/$stage.$lg-None.$lg.idx" "$stage.$lg.idx"

	done

	done
	```

	## Train a Cross-lingual Language Model similar to the XLM MLM model

	Use the following command to train the model on 5 languages.

	```
	fairseq-train \
	--task cross_lingual_lm monolingual_data/fairseq_processed \
	--save-dir checkpoints/mlm \
	--max-update 2400000 --save-interval 1 --no-epoch-checkpoints \
	--arch xlm_base \
	--optimizer adam --lr-scheduler reduce_lr_on_plateau \
	--lr-shrink 0.5 --lr 0.0001 --stop-min-lr 1e-09 \
	--dropout 0.1 \
	--criterion legacy_masked_lm_loss \
	--max-tokens 2048 --tokens-per-sample 256 --attention-dropout 0.1 \
	--dataset-impl lazy --seed 0 \
	--masked-lm-only \
	--monolingual-langs 'ar,de,en,hi,fr' --num-segment 5 \
	--ddp-backend=legacy_ddp
	```

	Some Notes:
	- Using tokens_per_sample greater than 256 can cause OOM (out-of-memory) issues. Usually since MLM packs in streams of text, this parameter doesn't need much tuning.
	- The Evaluation workflow for computing MLM Perplexity on test data is in progress.
	- Finetuning this model on a downstream task is something which is not currently available.