# Cross-Lingual Language Model Pre-training
Below are some details for training Cross-Lingual Language Models (XLM) in Fairseq, similar to the ones presented in [Lample & Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf). The current implementation only supports the Masked Language Model (MLM) objective from the paper above.
## Downloading and Tokenizing Monolingual Data
Pointers to the monolingual Wikipedia data used for training the XLM-style MLM model, as well as details on processing it (tokenization and BPE), can be found in the [XLM GitHub Repository](https://github.com/facebookresearch/XLM#download--preprocess-monolingual-data).
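For convenience, here is a minimal sketch of that download-and-tokenization step. It assumes the helper scripts shipped with the XLM repository (install-tools.sh and get-data-wiki.sh); check that repository's README for the exact script names and options, as they may change.

```bash
# Sketch only: fetch and tokenize Wikipedia data with the XLM repository's
# helper scripts (script names may differ; see the XLM README).
git clone https://github.com/facebookresearch/XLM.git
cd XLM
./install-tools.sh           # installs tokenizers and fastBPE
for lg in ar de en hi fr
do
  ./get-data-wiki.sh $lg     # downloads and tokenizes the Wikipedia dump for $lg
done
```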
Let's assume the following so that the code snippets in the later sections work:
- Processed data is in the folder: monolingual_data/processed
- Each language has 3 files, for train, validation and test. For example, we have the following files for English:
  train.en, valid.en, test.en
- We are training a model for 5 languages: Arabic (ar), German (de), English (en), Hindi (hi) and French (fr)
- The vocabulary file is monolingual_data/processed/vocab_mlm
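Before preprocessing, it's worth sanity-checking that this layout is in place. A minimal check, using exactly the paths and languages assumed above:

```bash
# Verify that every language has all three splits and that the shared
# vocabulary file exists (paths follow the assumptions above).
for lg in ar de en hi fr
do
  for split in train valid test
  do
    [ -f "monolingual_data/processed/$split.$lg" ] \
      || echo "missing: monolingual_data/processed/$split.$lg"
  done
done
[ -f monolingual_data/processed/vocab_mlm ] \
  || echo "missing: monolingual_data/processed/vocab_mlm"
```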
## Fairseq Pre-processing and Binarization
Pre-process and binarize the data with the MaskedLMDictionary and the cross_lingual_lm task:
```bash
# Ensure the output directory exists
DATA_DIR=monolingual_data/fairseq_processed
mkdir -p "$DATA_DIR"

for lg in ar de en hi fr
do
  fairseq-preprocess \
    --task cross_lingual_lm \
    --srcdict monolingual_data/processed/vocab_mlm \
    --only-source \
    --trainpref monolingual_data/processed/train \
    --validpref monolingual_data/processed/valid \
    --testpref monolingual_data/processed/test \
    --destdir "$DATA_DIR" \
    --workers 20 \
    --source-lang $lg

  # Since we only have a source language, the output file names contain "None"
  # for the target language. Rename the files to drop it.
  for stage in train test valid
  do
    mv "$DATA_DIR/$stage.$lg-None.$lg.bin" "$DATA_DIR/$stage.$lg.bin"
    mv "$DATA_DIR/$stage.$lg-None.$lg.idx" "$DATA_DIR/$stage.$lg.idx"
  done
done
```
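After the loop finishes, the destination directory should contain a fairseq dictionary plus a binarized .bin/.idx pair per split for every language. A quick way to confirm:

```bash
# Expected contents per language, e.g. for English:
#   dict.en.txt
#   train.en.bin  train.en.idx
#   valid.en.bin  valid.en.idx
#   test.en.bin   test.en.idx
ls "$DATA_DIR"
```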
## Train a Cross-lingual Language Model similar to the XLM MLM model
Use the following command to train the model on the 5 languages:
```bash
fairseq-train monolingual_data/fairseq_processed \
  --task cross_lingual_lm \
  --save-dir checkpoints/mlm \
  --max-update 2400000 --save-interval 1 --no-epoch-checkpoints \
  --arch xlm_base \
  --optimizer adam --lr-scheduler reduce_lr_on_plateau \
  --lr-shrink 0.5 --lr 0.0001 --stop-min-lr 1e-09 \
  --dropout 0.1 \
  --criterion legacy_masked_lm_loss \
  --max-tokens 2048 --tokens-per-sample 256 --attention-dropout 0.1 \
  --dataset-impl lazy --seed 0 \
  --masked-lm-only \
  --monolingual-langs 'ar,de,en,hi,fr' --num-segment 5 \
  --ddp-backend=legacy_ddp
```
Some notes:
- Using a --tokens-per-sample value greater than 256 can cause OOM (out-of-memory) issues. Since MLM packs streams of text into fixed-length samples, this parameter usually doesn't need much tuning; if you do hit OOM, see the sketch after these notes.
- The evaluation workflow for computing MLM perplexity on test data is in progress.
- Fine-tuning this model on a downstream task is not currently supported.
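If the training command above does run out of memory on your hardware, one standard fairseq workaround (a sketch, not part of the original recipe) is to lower --max-tokens and compensate with gradient accumulation via --update-freq, keeping the effective batch size unchanged:

```bash
# Same training run, but with half the per-GPU token budget and gradients
# accumulated over 2 steps. --max-tokens and --update-freq are standard
# fairseq flags; all other flags match the command above.
fairseq-train monolingual_data/fairseq_processed \
  --task cross_lingual_lm \
  --save-dir checkpoints/mlm \
  --max-update 2400000 --save-interval 1 --no-epoch-checkpoints \
  --arch xlm_base \
  --optimizer adam --lr-scheduler reduce_lr_on_plateau \
  --lr-shrink 0.5 --lr 0.0001 --stop-min-lr 1e-09 \
  --dropout 0.1 \
  --criterion legacy_masked_lm_loss \
  --max-tokens 1024 --update-freq 2 --tokens-per-sample 256 \
  --attention-dropout 0.1 \
  --dataset-impl lazy --seed 0 \
  --masked-lm-only \
  --monolingual-langs 'ar,de,en,hi,fr' --num-segment 5 \
  --ddp-backend=legacy_ddp
```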