Easy-Translate is a script for translating large text files on your machine using the [M2M100 models](https://arxiv.org/pdf/2010.11125.pdf) and [NLLB200 models](https://research.facebook.com/publications/no-language-left-behind/) from Facebook/Meta AI. We also provide a [script](#evaluate-translations) for Easy-Evaluation of your translations 🥳
Easy-Translate is built on top of 🤗HuggingFace's [Transformers](https://huggingface.co/docs/transformers/index) and 🤗HuggingFace's [Accelerate](https://huggingface.co/docs/accelerate/index) library.
We currently support:
- CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
- BF16 / FP16 / FP32 precision.
- Automatic batch size finder: Forget CUDA OOM errors. Set an initial batch size; if it doesn't fit, we will automatically adjust it.
- Sharded Data Parallel to load huge models sharded on multiple GPUs.
- Greedy decoding / Beam Search decoding / Multinomial Sampling / Beam-Search Multinomial Sampling
>Test the 🔌 Online Demo here:
## Supported languages
See the [Supported languages table](supported_languages.md) for the supported languages and their IDs.
## Supported Models
### M2M100
**M2M100** is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation introduced in this [paper](https://arxiv.org/abs/2010.11125) and first released in [this](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) repository.
>M2M100 can directly translate between 9,900 directions of 100 languages.
- **facebook/m2m100_418M**:
- **facebook/m2m100_1.2B**:
- **facebook/m2m100_12B**:
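
The 9,900 figure quoted above follows from counting ordered language pairs: each of the 100 languages can be translated into any of the other 99. A quick check in Python (the helper name is ours, for illustration only):

```python
def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target."""
    return num_languages * (num_languages - 1)


print(translation_directions(100))  # 9900
```
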
### NLLB200
**No Language Left Behind (NLLB)** open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences. It was introduced in this [paper](https://research.facebook.com/publications/no-language-left-behind/) and first released in [this](https://github.com/facebookresearch/fairseq/tree/nllb) repository.
>NLLB can directly translate between 40,000+ directions of 200+ languages.
- **facebook/nllb-200-3.3B**:
- **facebook/nllb-200-1.3B**:
- **facebook/nllb-200-distilled-1.3B**:
- **facebook/nllb-200-distilled-600M**:
Any other `ModelForSeq2SeqLM` model from the HuggingFace Hub should work with this library.
## Citation
If you use this software, please cite:
````
@inproceedings{garcia-ferrero-etal-2022-model,
title = "Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings",
author = "Garc{\'\i}a-Ferrero, Iker and
Agerri, Rodrigo and
Rigau, German",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.478",
pages = "6403--6416",
}
````
## Requirements
```
PyTorch >= 1.10.0
See: https://pytorch.org/get-started/locally/
Accelerate >= 0.12.0
pip install --upgrade accelerate
HuggingFace Transformers
pip install --upgrade transformers
If you find errors using NLLB200, try installing transformers from source:
pip install git+https://github.com/huggingface/transformers.git
```
## Translate a file
Run `python translate.py -h` for more info.
#### Using a single CPU / GPU
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
```
#### Multi-GPU
See the Accelerate documentation for more information (multi-node, TPU, sharded models...).
You can use the Accelerate CLI to configure the Accelerate environment (run `accelerate config` in your terminal) instead of using the `--multi_gpu` and `--num_processes` flags.
```bash
# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
```
#### Automatic batch size finder
We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (you can change it with the `--starting_batch_size` flag, e.g. `--starting_batch_size 128`). If we hit an out-of-memory error, we will automatically decrease the batch size until we find one that works.
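
The retry loop described above can be sketched in plain Python. This is only an illustration of the idea, not the code used by `translate.py`: the `SimulatedOOM` exception, the `fake_translate` function and the choice to halve the batch size on each failure are all assumptions made for this example. Accelerate provides a `find_executable_batch_size` utility built around the same idea.

```python
class SimulatedOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical, for illustration)."""


def find_working_batch_size(run_batch, starting_batch_size=128):
    """Shrink the batch size until `run_batch(batch_size)` stops raising OOM."""
    batch_size = starting_batch_size
    while batch_size >= 1:
        try:
            run_batch(batch_size)
            return batch_size
        except SimulatedOOM:
            batch_size //= 2  # assumption: halve on every failure
    raise RuntimeError("Even a batch size of 1 does not fit in memory.")


def fake_translate(batch_size, max_fit=40):
    # Pretend the device can hold at most `max_fit` sentences at once.
    if batch_size > max_fit:
        raise SimulatedOOM(f"batch of {batch_size} does not fit")


print(find_working_batch_size(fake_translate))  # 32
```

With a starting batch size of 128 and a device that fits 40 sentences, the loop tries 128, then 64, and settles on 32.
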
#### Choose precision
Use the `--precision` flag to choose the precision of the model. You can choose between `bf16`, `fp16` and `32`.
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--precision fp16
```
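
As a rough guide to what the precision choice means for memory, the sketch below estimates the size of the model weights alone. It assumes about 1.2 billion parameters (taken from the model name), 2 bytes per parameter for `bf16`/`fp16` and 4 bytes for `32`; activations, beam-search state and framework overhead are not included, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "32": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) taken by the weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("bf16", "fp16", "32"):
    print(f"{precision}: ~{weight_memory_gb(1.2e9, precision):.1f} GB")
```

Under these assumptions, half precision roughly halves the memory taken by the weights compared with full precision.
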
### Decoding/Sampling strategies
You can choose the decoding/sampling strategy and the number of candidate translations to output for each input sentence. By default we use beam search with `num_beams` set to 5 and output the most likely candidate translation, but you can change this behavior:
##### Greedy decoding
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1
```
##### Multinomial Sampling
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8 \
--num_return_sequences 1
```
##### Beam-Search decoding **(DEFAULT)**
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1
```
##### Beam-Search Multinomial Sampling
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8
```
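
The four commands above differ only in two settings: whether `--num_beams` is greater than 1 and whether `--do_sample` is set. The helper below is a hypothetical summary of that mapping written for this README; it is not part of `translate.py`.

```python
def decoding_strategy(num_beams: int = 5, do_sample: bool = False) -> str:
    """Name the strategy selected by a given flag combination."""
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    if num_beams == 1:
        return "multinomial sampling" if do_sample else "greedy decoding"
    return "beam-search multinomial sampling" if do_sample else "beam-search decoding"


print(decoding_strategy())                      # beam-search decoding
print(decoding_strategy(num_beams=1))           # greedy decoding
print(decoding_strategy(1, do_sample=True))     # multinomial sampling
print(decoding_strategy(5, do_sample=True))     # beam-search multinomial sampling
```
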
## Evaluate translations
To run the evaluation script you need to install [bert_score](https://github.com/Tiiiger/bert_score) (`pip install bert_score`) and 🤗HuggingFace's [Datasets](https://huggingface.co/docs/datasets/index) library (`pip install datasets`).
The evaluation script will calculate the following metrics:
- [SacreBLEU](https://github.com/huggingface/datasets/tree/master/metrics/sacrebleu)
- [BLEU](https://github.com/huggingface/datasets/tree/master/metrics/bleu)
- [ROUGE](https://github.com/huggingface/datasets/tree/master/metrics/rouge)
- [METEOR](https://github.com/huggingface/datasets/tree/master/metrics/meteor)
- [TER](https://github.com/huggingface/datasets/tree/master/metrics/ter)
- [BertScore](https://github.com/huggingface/datasets/tree/master/metrics/bertscore)
Run the following command to evaluate the translations:
```bash
accelerate launch eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt
```
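
Before evaluating, it can help to confirm that the prediction and reference files line up. The sketch below assumes both files contain one sentence per line; the helper names are hypothetical and the demo uses throwaway files rather than the real `sample_text` data.

```python
import tempfile
from pathlib import Path


def read_sentences(path):
    """Return the lines of a file, one sentence per line (assumed format)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_aligned(pred_path, gold_path):
    """Return the sentence count, or raise ValueError if the files differ in length."""
    preds, golds = read_sentences(pred_path), read_sentences(gold_path)
    if len(preds) != len(golds):
        raise ValueError(f"{len(preds)} predictions vs {len(golds)} references")
    return len(preds)


# Demo with temporary files standing in for --pred_path and --gold_path.
with tempfile.TemporaryDirectory() as tmp:
    pred = Path(tmp) / "pred.txt"
    gold = Path(tmp) / "gold.txt"
    pred.write_text("Hola mundo\nBuenos días\n", encoding="utf-8")
    gold.write_text("Hola, mundo\nBuenos días\n", encoding="utf-8")
    print(check_aligned(pred, gold))  # 2
```
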
If you want to save the results to a file use the `--output_path` flag.
See [sample_text/en2es.m2m100_1.2B.json](sample_text/en2es.m2m100_1.2B.json) for a sample output.