# InfoXLM
**Cross-Lingual Language Model Pre-training**
## Overview
This repo provides the code for pretraining various cross-lingual language models, including:
- **InfoXLM** (NAACL 2021, [paper](https://arxiv.org/pdf/2007.07834.pdf), [repo](https://github.com/microsoft/unilm/tree/master/infoxlm), [model](https://huggingface.co/microsoft/infoxlm-base)) InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training.
- **XLM-E** (arXiv 2021, [paper](https://arxiv.org/pdf/2106.16138.pdf)) XLM-E: Cross-lingual Language Model Pre-training via ELECTRA.
- **XLM-Align** (ACL 2021, [paper](https://aclanthology.org/2021.acl-long.265/), [repo](https://github.com/CZWin32768/XLM-Align), [model](https://huggingface.co/microsoft/xlm-align-base)) Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment.
- **mBERT** Pretraining BERT on multilingual text with the masked language modeling (MLM) task.
- **XLM** Pretraining a Transformer encoder with masked language modeling (MLM) and translation language modeling (TLM).

The following models will also be added to this repo soon:
- **XNLG** (AAAI 2020, [paper](https://arxiv.org/pdf/1909.10481.pdf), [repo](https://github.com/CZWin32768/XNLG)) A multilingual/cross-lingual pre-trained model for natural language generation, e.g., finetuning XNLG with English abstractive summarization (AS) data and directly performing French AS or even Chinese-French AS.
- **mT6** ([paper](https://arxiv.org/abs/2104.08692)) mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs.
## How to Use
### From the Hugging Face model hub
We provide the models in Hugging Face format, so you can load them directly with the Hugging Face Transformers API:
**XLM-Align**
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("microsoft/xlm-align-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/xlm-align-base")
```
**InfoXLM-base**
```python
model = AutoModel.from_pretrained("microsoft/infoxlm-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/infoxlm-base")
```
**InfoXLM-large**
```python
model = AutoModel.from_pretrained("microsoft/infoxlm-large")
tokenizer = AutoTokenizer.from_pretrained("microsoft/infoxlm-large")
```
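For a quick sanity check, you can encode a sentence and inspect the hidden states (a minimal sketch using the standard Transformers API; any of the checkpoints above can be substituted):
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/infoxlm-base")
model = AutoModel.from_pretrained("microsoft/infoxlm-base")

# Encode a sentence and use the hidden state of the <s> token as a sentence representation.
inputs = tokenizer("This is just an example.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
sentence_vector = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)
print(sentence_vector.shape)
```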
### Finetuning on end tasks
Our models use the same vocabulary, tokenizer, and architecture as XLM-RoBERTa, so you can directly reuse existing XLM-R finetuning code **simply by replacing the model name `xlm-roberta-base` with `microsoft/xlm-align-base`, `microsoft/infoxlm-base`, or `microsoft/infoxlm-large`**.
For example, you can evaluate our models with [xTune](https://github.com/bozheng-hit/xTune) [3] on the XTREME benchmark.
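As a concrete illustration, the sketch below finetunes one of the checkpoints on English XNLI with the Transformers `Trainer`. It assumes the Hugging Face `datasets` XNLI loader and uses hyperparameters chosen only for illustration; the point is that the model name is a drop-in replacement for `xlm-roberta-base`:
```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/infoxlm-base"  # or microsoft/xlm-align-base, microsoft/infoxlm-large
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# English XNLI; cross-lingual evaluation on the other languages works the same way.
dataset = load_dataset("xnli", "en")
def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True, max_length=128)
dataset = dataset.map(preprocess, batched=True)

args = TrainingArguments(output_dir="infoxlm-xnli", learning_rate=2e-5,
                         per_device_train_batch_size=32, num_train_epochs=2)
Trainer(model=model, args=args, train_dataset=dataset["train"],
        eval_dataset=dataset["validation"], tokenizer=tokenizer).train()
```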
## Pretraining
### Environment
The recommended way to run the code is inside Docker:
```bash
docker run -it --rm --runtime=nvidia --ipc=host --privileged pytorch/pytorch:1.4-cuda10.1-cudnn7-devel bash
```
The container is then initialized by:
```bash
. .bashrc
apt-get update
apt-get install -y vim wget ssh
PWD_DIR=$(pwd)
cd $(mktemp -d)
# install apex
git clone -q https://github.com/NVIDIA/apex.git
cd apex
git reset --hard 11faaca7c8ff7a7ba6d55854a9ee2689784f7ca5
python setup.py install --user --cuda_ext --cpp_ext
cd ..
cd $PWD_DIR
git clone https://github.com/microsoft/unilm
cd unilm/infoxlm
# install fairseq https://github.com/CZWin32768/fairseq/tree/czw
pip install --user --editable ./fairseq
# install infoxlm
pip install --user --editable ./src-infoxlm
```
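After the installation finishes, a quick way to check that the main dependencies are importable (a minimal sketch; it only verifies the environment, not the training code):
```python
# Quick environment check: all of these imports should succeed inside the container.
import torch, fairseq, apex  # noqa: F401

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("fairseq and apex imported successfully")
```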
### Prepare Training Data
All training data are preprocessed into the fairseq `mmap` format.
**Prepare MLM data**
The MLM training data should be preprocessed into token blocks of length 512.
**Step 1**: Prepare the training data as a text file with one sentence per line. The file should contain multilingual unlabeled text.
Example:
```
This is just an example.
Bonjour!
今天天气怎么样?
...
```
**Step 2**: Convert the text into token blocks of length 512 in fairseq `mmap` format.
Example:
```
<s> This is just an example . </s> Bonjour ! </s> 今天 天气 怎么样 ? </s>
...
```
Command:
```bash
python ./tools/txt2bin.py \
--model_name microsoft/xlm-align-base \
--input /path/to/text.txt \
--output /path/to/output/dir
```
**Step 3**: Put the `dict.txt` file in the data directory. (Note: InfoXLM and XLM-Align use the same `dict.txt` as [the dict file of XLM-R](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).)
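To make the block format above concrete, the sketch below tokenizes a few lines and packs them into 512-token blocks. It is only an illustration of what the binarized data looks like, not a replacement for `txt2bin.py`:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/xlm-align-base")
block_size = 512

# Sentences are concatenated with </s> separators and then cut into fixed-size blocks.
lines = ["This is just an example.", "Bonjour!", "今天天气怎么样?"]
ids = [tokenizer.bos_token_id]
for line in lines:
    ids += tokenizer.encode(line, add_special_tokens=False) + [tokenizer.eos_token_id]
blocks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
print(tokenizer.convert_ids_to_tokens(blocks[0]))
```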
**Prepare TLM data**
**Step 1**: Prepare parallel data as two text files with one sentence per line.
Example:
In `en-zh.en.txt`:
```
This is just an example.
Hello world!
...
```
In `en-zh.zh.txt`:
```
这只是一个例子。
你好世界!
...
```
**Step 2**: Concatenate the parallel sentences and convert them into fairseq `mmap` format.
Example:
```
<s> This is just an example . </s> 这 只是 一个 例 子 。 </s>
<s> Hello world ! </s> 你好 世界 ! </s>
...
```
Command:
```bash
python ./tools/para2bin.py \
--model_name microsoft/xlm-align-base \
--input_src /path/to/src-trg.src.txt \
--input_trg /path/to/src-trg.trg.txt \
--output /path/to/output/dir
```
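The sketch below shows what a single concatenated TLM example looks like after tokenization (an illustration of the format only, not the `para2bin.py` script itself):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/xlm-align-base")

# One TLM example: <s> source sentence </s> target sentence </s>
src, trg = "This is just an example.", "这只是一个例子。"
ids = ([tokenizer.bos_token_id]
       + tokenizer.encode(src, add_special_tokens=False) + [tokenizer.eos_token_id]
       + tokenizer.encode(trg, add_special_tokens=False) + [tokenizer.eos_token_id])
print(tokenizer.convert_ids_to_tokens(ids))
```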
**Prepare XlCo data**
**Step 1**: Prepare parallel data as two text files with one sentence per line (same format as the TLM data).
**Step 2**: Alternately store the token indices of the two input files, so that the two sentences of each translation pair become consecutive examples, and save the resulting dataset in fairseq `mmap` format.
Example:
```
<s> This is just an example . </s>
<s> 这 只是 一个 例 子 。 </s>
<s> Hello world ! </s>
<s> 你好 世界 ! </s>
...
```
Command:
```bash
python ./tools/para2bin.py \
--model_name microsoft/xlm-align-base \
--input_src /path/to/src-trg.src.txt \
--input_trg /path/to/src-trg.trg.txt \
--output /path/to/output/dir
```
### Pretrain InfoXLM
Continue-train InfoXLM-base from XLM-R-base:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--task infoxlm --criterion xlco \
--tlm_data ${TLM_DATA_DIR} \
--xlco_data ${XLCO_DATA_DIR} \
--arch infoxlm_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 \
--total-num-update 200000 --max-update 200000 \
--dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
--max-sentences 16 --update-freq 16 \
--log-format simple --log-interval 1 --disable-validation \
--save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR}/ \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--roberta-model-path /path/to/model.pt \
--num-workers 4 --ddp-backend=c10d --distributed-no-spawn \
--xlco_layer 8 --xlco_queue_size 131072 --xlco_lambda 1.0 \
--xlco_momentum constant,0.9999 --use_proj
```
- `${MLM_DATA_DIR}`: directory of the MLM training data.
- `${SAVE_DIR}`: checkpoints are saved in this folder.
- `--max-sentences 16`: batch size per GPU.
- `--update-freq 16`: gradient accumulation steps (total batch size = NUM_GPU x max-sentences x update-freq = 8 x 16 x 16 = 2048).
- `--roberta-model-path`: path to an existing RoBERTa-style checkpoint used to initialize the current model. For pretraining from scratch, remove this line. The `model.pt` file of XLM-R can be downloaded from [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
- `--xlco_layer`: the layer at which cross-lingual contrast (XlCo) is performed.
- `--xlco_lambda`: the weight of the XlCo loss (see the sketch after this list).
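For intuition, cross-lingual contrast (XlCo) is an InfoNCE-style objective: the two sides of a translation pair should be close in representation space, while a large queue of momentum-encoded negatives (cf. `--xlco_queue_size` and `--xlco_momentum`) provides the contrast. The sketch below shows such a loss in isolation; it is an illustration under these assumptions, not the criterion implemented in this repo:
```python
import torch
import torch.nn.functional as F

def info_nce_with_queue(q, k, queue, temperature=0.07):
    """InfoNCE over a translation pair (q, k) against a queue of negative keys.

    q, k:   (batch, dim) sentence representations of the two sides of a pair
    queue:  (queue_size, dim) momentum-encoded negatives
    """
    q, k, queue = (F.normalize(x, dim=-1) for x in (q, k, queue))
    pos = (q * k).sum(dim=-1, keepdim=True)            # (batch, 1) positive logits
    neg = q @ queue.t()                                 # (batch, queue_size) negative logits
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive is always index 0
    return F.cross_entropy(logits, labels)

# Toy usage: 4 translation pairs, hidden size 768, queue of 1024 negatives.
loss = info_nce_with_queue(torch.randn(4, 768), torch.randn(4, 768), torch.randn(1024, 768))
print(loss.item())
```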
### Pretrain XLM-Align
Continue-train XLM-Align-base from XLM-R-base:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--task xlm_align --criterion dwa_mlm_tlm \
--tlm_data ${TLM_DATA_DIR} \
--arch xlm_align_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 \
--clip-norm 1.0 --lr-scheduler polynomial_decay --lr 0.0002 \
--warmup-updates 10000 --total-num-update 200000 --max-update 200000 \
--dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
--max-sentences 16 --update-freq 16 --log-format simple \
--log-interval 1 --disable-validation --save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR} \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--roberta-model-path /path/to/model.pt \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn \
--wa_layer 10 --wa_max_count 2 --sinkhorn_iter 2
```
- `${MLM_DATA_DIR}`: directory of the MLM training data.
- `${SAVE_DIR}`: checkpoints are saved in this folder.
- `--max-sentences 16`: batch size per GPU.
- `--update-freq 16`: gradient accumulation steps (total batch size = NUM_GPU x max-sentences x update-freq = 8 x 16 x 16 = 2048).
- `--roberta-model-path`: path to an existing RoBERTa-style checkpoint used to initialize the current model. For pretraining from scratch, remove this line.
- `--wa_layer`: the layer at which word-alignment self-labeling is performed.
- `--wa_max_count`: the maximum number of iterative alignment-filtering steps.
- `--sinkhorn_iter`: the number of iterations of Sinkhorn's algorithm (see the sketch after this list).
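For intuition, the self-labeling step treats word alignment as an optimal-transport problem over the similarity matrix between source and target token representations, and Sinkhorn's algorithm alternately normalizes the rows and columns of that matrix. The sketch below illustrates the iteration; it is an illustration only, not the code used by the `dwa_mlm_tlm` criterion:
```python
import torch

def sinkhorn_soft_alignment(sim, num_iter=2, temperature=0.1):
    """Turn a (src_len, trg_len) similarity matrix into soft alignments by
    alternately normalizing rows and columns (Sinkhorn iterations)."""
    attn = torch.exp(sim / temperature)
    for _ in range(num_iter):
        attn = attn / attn.sum(dim=1, keepdim=True)  # normalize over target positions
        attn = attn / attn.sum(dim=0, keepdim=True)  # normalize over source positions
    return attn

# Toy usage: hidden states of a 5-token source and a 6-token target sentence.
src, trg = torch.randn(5, 768), torch.randn(6, 768)
alignment = sinkhorn_soft_alignment(src @ trg.t())
print(alignment.argmax(dim=1))  # most likely target position for each source token
```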
### Pretrain MLM
Continue-train MLM / mBERT from XLM-R-base:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--task mlm --criterion masked_lm \
--arch reload_roberta_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 \
--total-num-update 200000 --max-update 200000 \
--dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
--max-sentences 32 --update-freq 8 \
--log-format simple --log-interval 1 --disable-validation \
--save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR}/ \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--roberta-model-path /path/to/model.pt \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn
```
Pretraining MLM / mBERT from scratch:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--task mlm --criterion masked_lm \
--arch reload_roberta_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 0.0001 --warmup-updates 10000 \
--total-num-update 1000000 --max-update 1000000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences 32 --update-freq 1 \
--log-format simple --log-interval 1 --disable-validation \
--save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR}/ \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn
```
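For reference, the `masked_lm` criterion uses the standard BERT-style corruption: roughly 15% of positions are selected, and of those 80% are replaced with the mask token, 10% with a random token, and 10% left unchanged. A minimal sketch of that corruption (an illustration only, not fairseq's implementation):
```python
import torch

def bert_style_corruption(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Select ~15% of positions; replace 80% of them with <mask>, 10% with a
    random token, and leave 10% unchanged. Returns corrupted inputs and labels
    (-100 marks positions that do not contribute to the MLM loss)."""
    selected = torch.rand(token_ids.shape) < mask_prob
    labels = torch.where(selected, token_ids, torch.full_like(token_ids, -100))
    corrupted = token_ids.clone()
    decision = torch.rand(token_ids.shape)
    corrupted[selected & (decision < 0.8)] = mask_id
    use_random = selected & (decision >= 0.8) & (decision < 0.9)
    corrupted[use_random] = torch.randint(vocab_size, token_ids.shape)[use_random]
    return corrupted, labels

# Toy usage: a batch of 2 sequences with 16 tokens each.
inputs, labels = bert_style_corruption(torch.randint(5, 1000, (2, 16)), mask_id=4, vocab_size=1000)
```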
### Pretrain MLM+TLM
Continue-train MLM+TLM from XLM-R-base:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--tlm_data ${TLM_DATA_DIR} \
--task tlm --criterion masked_lm \
--arch reload_roberta_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 \
--total-num-update 200000 --max-update 200000 \
--dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
--max-sentences 32 --update-freq 8 \
--log-format simple --log-interval 1 --disable-validation \
--save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR}/ \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--roberta-model-path /path/to/model.pt \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn
```
Pretraining MLM+TLM from scratch:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python src-infoxlm/train.py ${MLM_DATA_DIR} \
--tlm_data ${TLM_DATA_DIR} \
--task tlm --criterion masked_lm \
--arch reload_roberta_base --sample-break-mode complete --tokens-per-sample 512 \
--optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
--lr-scheduler polynomial_decay --lr 0.0001 --warmup-updates 10000 \
--total-num-update 1000000 --max-update 1000000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-sentences 32 --update-freq 1 \
--log-format simple --log-interval 1 --disable-validation \
--save-interval-updates 5000 --no-epoch-checkpoints \
--fp16 --fp16-init-scale 128 --fp16-scale-window 128 --min-loss-scale 0.0001 \
--seed 1 \
--save-dir .${SAVE_DIR}/ \
--tensorboard-logdir .${SAVE_DIR}/tb-log \
--num-workers 2 --ddp-backend=c10d --distributed-no-spawn
```
## References
Please cite the papers if you find the resources in this repository useful.
[1] **XLM-Align** (ACL 2021, [paper](https://aclanthology.org/2021.acl-long.265/), [repo](https://github.com/CZWin32768/XLM-Align), [model](https://huggingface.co/microsoft/xlm-align-base)) Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment
```
@inproceedings{xlmalign,
  title = "Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment",
  author = "Zewen Chi and Li Dong and Bo Zheng and Shaohan Huang and Xian-Ling Mao and Heyan Huang and Furu Wei",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-long.265",
  doi = "10.18653/v1/2021.acl-long.265",
  pages = "3418--3430",
}
```
[2] **InfoXLM** (NAACL 2021, [paper](https://arxiv.org/pdf/2007.07834.pdf), [repo](https://github.com/microsoft/unilm/tree/master/infoxlm), [model](https://huggingface.co/microsoft/infoxlm-base)) InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
```
@inproceedings{chi-etal-2021-infoxlm,
  title = "{I}nfo{XLM}: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training",
  author = "Chi, Zewen and Dong, Li and Wei, Furu and Yang, Nan and Singhal, Saksham and Wang, Wenhui and Song, Xia and Mao, Xian-Ling and Huang, Heyan and Zhou, Ming",
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = jun,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.280",
  doi = "10.18653/v1/2021.naacl-main.280",
  pages = "3576--3588",
}
```
[3] **xTune** (ACL 2021, [paper](https://arxiv.org/pdf/2106.08226.pdf), [repo](https://github.com/bozheng-hit/xTune)) Consistency Regularization for Cross-Lingual Fine-Tuning
```
@inproceedings{zheng-etal-2021-consistency,
  title = "Consistency Regularization for Cross-Lingual Fine-Tuning",
  author = "Bo Zheng and Li Dong and Shaohan Huang and Wenhui Wang and Zewen Chi and Saksham Singhal and Wanxiang Che and Ting Liu and Xia Song and Furu Wei",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  month = aug,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-long.264",
  doi = "10.18653/v1/2021.acl-long.264",
  pages = "3403--3417",
}
```
[4] **XLM-E** (arXiv 2021, [paper](https://arxiv.org/pdf/2106.16138.pdf)) XLM-E: Cross-lingual Language Model Pre-training via ELECTRA
```
@misc{chi2021xlme,
  title = {XLM-E: Cross-lingual Language Model Pre-training via ELECTRA},
  author = {Zewen Chi and Shaohan Huang and Li Dong and Shuming Ma and Saksham Singhal and Payal Bajaj and Xia Song and Furu Wei},
  year = {2021},
  eprint = {2106.16138},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
### Contact Information
For help or issues using InfoXLM, please submit a GitHub issue.
For other communications related to InfoXLM, please contact Li Dong (`lidong1@microsoft.com`) or Furu Wei (`fuwei@microsoft.com`).