# LASER Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).
This folder contains source code for training LASER embeddings.
## Prepare data and configuration file
Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).
Create a JSON config file with the following format:
```
{
"src_vocab": "/path/to/spm.src.cvocab",
"tgt_vocab": "/path/to/spm.tgt.cvocab",
"train": [
{
"type": "translation",
"id": 0,
"src": "/path/to/srclang1-tgtlang0/train.srclang1",
"tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
},
{
"type": "translation",
"id": 1,
"src": "/path/to/srclang1-tgtlang1/train.srclang1",
"tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
},
{
"type": "translation",
"id": 0,
"src": "/path/to/srclang2-tgtlang0/train.srclang2",
"tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
},
{
"type": "translation",
"id": 1,
"src": "/path/to/srclang2-tgtlang1/train.srclang2",
"tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
},
...
],
"valid": [
{
"type": "translation",
"id": 0,
"src": "/unused",
"tgt": "/unused"
}
]
}
```
where the `src` and `tgt` paths point to binarized, indexed fairseq dataset files,
and `id` is the ID of the target language.
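For many language pairs, writing this file by hand gets repetitive. The following is a minimal sketch of generating it, assuming the directory layout used in the example paths above. The helper name `build_laser_config`, its arguments, and the choice to use a target language's list position as its `id` are illustrative assumptions, not part of fairseq or LASER.

```python
import json


def build_laser_config(root, src_langs, tgt_langs, src_vocab, tgt_vocab):
    """Build a config dict in the format shown above.

    Hypothetical helper: the layout <root>/<src>-<tgt>/train.<lang> mirrors
    the example paths and is an assumption, not a requirement.
    """
    train = []
    for src in src_langs:
        # Assumption: a target language's position in tgt_langs is its id.
        for tgt_id, tgt in enumerate(tgt_langs):
            pair_dir = f"{root}/{src}-{tgt}"
            train.append(
                {
                    "type": "translation",
                    "id": tgt_id,
                    "src": f"{pair_dir}/train.{src}",
                    "tgt": f"{pair_dir}/train.{tgt}",
                }
            )
    # Same placeholder validation entry as in the example config.
    valid = [{"type": "translation", "id": 0, "src": "/unused", "tgt": "/unused"}]
    return {
        "src_vocab": src_vocab,
        "tgt_vocab": tgt_vocab,
        "train": train,
        "valid": valid,
    }


config = build_laser_config(
    root="/path/to",
    src_langs=["srclang1", "srclang2"],
    tgt_langs=["tgtlang0", "tgtlang1"],
    src_vocab="/path/to/spm.src.cvocab",
    tgt_vocab="/path/to/spm.tgt.cvocab",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

With two source and two target languages, this yields the four `train` entries listed in the example, in the same order. Write `config_json` to a file to pass it to the training command.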
## Training Command Line Example
```
fairseq-train \
/path/to/configfile_described_above.json \
--user-dir examples/laser/laser_src \
--log-interval 100 --log-format simple \
--task laser --arch laser_lstm \
--save-dir . \
--optimizer adam \
--lr 0.001 \
--lr-scheduler inverse_sqrt \
--clip-norm 5 \
--warmup-updates 90000 \
--update-freq 2 \
--dropout 0.0 \
--encoder-dropout-out 0.1 \
--max-tokens 2000 \
--max-epoch 50 \
--encoder-bidirectional \
--encoder-layers 5 \
--encoder-hidden-size 512 \
--decoder-layers 1 \
--decoder-hidden-size 2048 \
--encoder-embed-dim 320 \
--decoder-embed-dim 320 \
--decoder-lang-embed-dim 32 \
--warmup-init-lr 0.001 \
--disable-validation
```
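To make the learning-rate flags in this command easier to read, here is a minimal sketch of an inverse-square-root schedule with linear warmup. It assumes that `--lr-scheduler inverse_sqrt` follows the usual rule (linear warmup from `--warmup-init-lr` to `--lr` over `--warmup-updates`, then decay proportional to the inverse square root of the update number). The function `inverse_sqrt_lr` is a standalone illustration, not fairseq's implementation.

```python
import math


def inverse_sqrt_lr(step, lr=0.001, warmup_init_lr=0.001, warmup_updates=90000):
    """Learning rate at update `step` (illustrative, not fairseq code)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return lr * math.sqrt(warmup_updates / step)


for step in (1, 45000, 90000, 360000):
    print(step, inverse_sqrt_lr(step))
```

Under this assumption, because `--warmup-init-lr` equals `--lr` in the command above, the rate stays flat at 0.001 for the first 90000 updates and only then starts to decay. It halves once the update count reaches four times the warmup length.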
## Applications
We showcase several applications of multilingual sentence embeddings
with code to reproduce our results (in the directory "tasks").
* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the
[*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix):
mining 135M parallel sentences in 1620 language pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the
[*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli)
using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed):
an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages.
**For all tasks, we use exactly the same multilingual encoder, without any task-specific optimization or fine-tuning.**
## References
[1] Holger Schwenk and Matthijs Douze,
[*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619),
ACL workshop on Representation Learning for NLP, 2017.
[2] Holger Schwenk and Xian Li,
[*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf),
LREC, pages 3548-3551, 2018.
[3] Holger Schwenk,
[*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037),
ACL, July 2018.
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov,
[*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269),
EMNLP, 2018.
[5] Mikel Artetxe and Holger Schwenk,
[*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136),
arXiv, Nov 3 2018.
[6] Mikel Artetxe and Holger Schwenk,
[*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464),
arXiv, Dec 26 2018.
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
[*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791),
arXiv, July 11 2019.
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin,
[*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944)