# LASER: Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).
This folder contains source code for training LASER embeddings.
## Prepare data and configuration file
Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).
Create a JSON config file with this format:
```
{
  "src_vocab": "/path/to/spm.src.cvocab",
  "tgt_vocab": "/path/to/spm.tgt.cvocab",
  "train": [
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang1-tgtlang0/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang1-tgtlang1/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
    },
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang2-tgtlang0/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang2-tgtlang1/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
    },
    ...
  ],
  "valid": [
    {
      "type": "translation",
      "id": 0,
      "src": "/unused",
      "tgt": "/unused"
    }
  ]
}
```
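Here, each `src` and `tgt` path points to a binarized, indexed fairseq dataset file.
`id` is the target language ID: entries with the same target language share the same `id` (in the example above, `tgtlang0` is 0 and `tgtlang1` is 1).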
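
When there are many language pairs, a config in this format can be generated rather than written by hand. The following is a minimal sketch, not part of LASER or fairseq: the helper `build_laser_config`, its parameters, and the directory layout `<data_root>/<src>-<tgt>/train.<lang>` are assumptions modeled on the example paths above.

```python
import json


def build_laser_config(data_root, pairs, src_vocab, tgt_vocab):
    """Build a LASER training config dict (hypothetical helper).

    pairs: list of (src_lang, tgt_lang) tuples.
    Assumes binarized data lives under <data_root>/<src>-<tgt>/train.<lang>.
    """
    # Assign one integer id per target language, in order of first appearance.
    tgt_ids = {}
    for _, tgt in pairs:
        tgt_ids.setdefault(tgt, len(tgt_ids))

    train = []
    for src, tgt in pairs:
        pair_dir = f"{data_root}/{src}-{tgt}"
        train.append({
            "type": "translation",
            "id": tgt_ids[tgt],
            "src": f"{pair_dir}/train.{src}",
            "tgt": f"{pair_dir}/train.{tgt}",
        })

    # Placeholder validation entry, mirroring the example config.
    valid = [{"type": "translation", "id": 0, "src": "/unused", "tgt": "/unused"}]
    return {
        "src_vocab": src_vocab,
        "tgt_vocab": tgt_vocab,
        "train": train,
        "valid": valid,
    }


config = build_laser_config(
    "/path/to",
    [("srclang1", "tgtlang0"), ("srclang1", "tgtlang1"),
     ("srclang2", "tgtlang0"), ("srclang2", "tgtlang1")],
    src_vocab="/path/to/spm.src.cvocab",
    tgt_vocab="/path/to/spm.tgt.cvocab",
)
print(json.dumps(config, indent=2))
```

Writing the result with `json.dump` to a file should give a config equivalent to the hand-written example, provided your data follows the assumed layout.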
## Training Command Line Example
```
fairseq-train \
  /path/to/configfile_described_above.json \
  --user-dir examples/laser/laser_src \
  --log-interval 100 --log-format simple \
  --task laser --arch laser_lstm \
  --save-dir . \
  --optimizer adam \
  --lr 0.001 \
  --lr-scheduler inverse_sqrt \
  --clip-norm 5 \
  --warmup-updates 90000 \
  --update-freq 2 \
  --dropout 0.0 \
  --encoder-dropout-out 0.1 \
  --max-tokens 2000 \
  --max-epoch 50 \
  --encoder-bidirectional \
  --encoder-layers 5 \
  --encoder-hidden-size 512 \
  --decoder-layers 1 \
  --decoder-hidden-size 2048 \
  --encoder-embed-dim 320 \
  --decoder-embed-dim 320 \
  --decoder-lang-embed-dim 32 \
  --warmup-init-lr 0.001 \
  --disable-validation
```
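
For intuition about the learning-rate flags: with `--lr-scheduler inverse_sqrt`, fairseq warms the rate up linearly from `--warmup-init-lr` to `--lr` over `--warmup-updates` steps, then decays it in proportion to the inverse square root of the update number. Because the command above sets both rates to 0.001, the rate should stay flat for the first 90000 updates and only decay afterwards. The sketch below is a standalone reimplementation based on a reading of that scheduler; the function `inverse_sqrt_lr` is illustrative and is not fairseq's API.

```python
import math


def inverse_sqrt_lr(num_updates, lr=0.001, warmup_init_lr=0.001,
                    warmup_updates=90000):
    """Illustrative inverse-square-root schedule (not fairseq's API)."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        step = (lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup: decay proportionally to 1 / sqrt(num_updates),
    # scaled so the rate equals lr exactly at the end of warmup.
    decay_factor = lr * math.sqrt(warmup_updates)
    return decay_factor / math.sqrt(num_updates)


# Print the rate at a few update counts using the values from the command.
for n in (0, 45000, 90000, 360000):
    print(n, inverse_sqrt_lr(n))
```

Under these assumptions, the rate halves each time the update count quadruples after the warmup phase.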
## Applications
We showcase several applications of multilingual sentence embeddings
with code to reproduce our results (in the directory "tasks").
* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the
  [*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix):
  Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the
  [*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli)
  using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed):
  an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages.

**For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.**
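
To illustrate the similarity-search use case, the sketch below finds the candidate closest to a query embedding by cosine similarity. It is a toy example in plain Python: the function names and the three-dimensional vectors are made-up placeholders, not LASER output, and the code in the `tasks` directory may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Placeholder vectors standing in for sentence embeddings.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [-1.0, 0.0, -1.0]]
index, score = nearest(query, candidates)
print(index, round(score, 3))
```

Because the same encoder is used for every language, the query and the candidates in such a search can be sentences in different languages.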
## References
[1] Holger Schwenk and Matthijs Douze,
[*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619),
ACL workshop on Representation Learning for NLP, 2017.

[2] Holger Schwenk and Xian Li,
[*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf),
LREC, pages 3548-3551, 2018.

[3] Holger Schwenk,
[*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037),
ACL, July 2018.

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov,
[*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269),
EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk,
[*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136),
arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk,
[*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464),
arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
[*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791),
arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin,
[*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944)