Spaces:
Runtime error
Runtime error
File size: 1,731 Bytes
ee21b96 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 |
# Cross-lingual Retrieval for Iterative Self-Supervised Training
https://arxiv.org/pdf/2006.09526.pdf
## Introduction
CRISS is a multilingual sequence-to-sequnce pretraining method where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time.
## Requirements:
* faiss: https://github.com/facebookresearch/faiss
* mosesdecoder: https://github.com/moses-smt/mosesdecoder
* flores: https://github.com/facebookresearch/flores
* LASER: https://github.com/facebookresearch/LASER
## Unsupervised Machine Translation
##### 1. Download and decompress CRISS checkpoints
```
cd examples/criss
wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
tar -xf criss_checkpoints.tar.gz
```
##### 2. Download and preprocess Flores test dataset
Make sure to run all scripts from examples/criss directory
```
bash download_and_preprocess_flores_test.sh
```
##### 3. Run Evaluation on Sinhala-English
```
bash unsupervised_mt/eval.sh
```
## Sentence Retrieval
##### 1. Download and preprocess Tatoeba dataset
```
bash download_and_preprocess_tatoeba.sh
```
##### 2. Run Sentence Retrieval on Tatoeba Kazakh-English
```
bash sentence_retrieval/sentence_retrieval_tatoeba.sh
```
## Mining
##### 1. Install faiss
Follow instructions on https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
##### 2. Mine pseudo-parallel data between Kazakh and English
```
bash mining/mine_example.sh
```
## Citation
```bibtex
@article{tran2020cross,
title={Cross-lingual retrieval for iterative self-supervised training},
author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
journal={arXiv preprint arXiv:2006.09526},
year={2020}
}
```
|