<p align="center">
<br>
<img src="images/title.png" width="900"/>
<br>
<a href="https://twitter.com/intent/tweet?text=Wow:&url=https%3A%2F%2Fgithub.com%2Fikergarcia1996%2FEasy-Translate"><img alt="Twitter" src="https://img.shields.io/twitter/url?style=social&url=https%3A%2F%2Fgithub.com%2Fikergarcia1996%2FEasy-Translate"></a>
<a href="https://github.com/ikergarcia1996/Easy-Translate/blob/main/LICENSE.md"><img alt="License" src="https://img.shields.io/github/license/ikergarcia1996/Easy-Translate"></a>
<a href="https://huggingface.co/docs/transformers/index"><img alt="Transformers" src="https://img.shields.io/badge/-%F0%9F%A4%97Transformers%20-grey"></a>
<a href="https://huggingface.co/docs/accelerate/index/"><img alt="Accelerate" src="https://img.shields.io/badge/-%F0%9F%A4%97Accelerate%20-grey"></a>
<a href="https://ikergarcia1996.github.io/Iker-Garcia-Ferrero/"><img alt="Author" src="https://img.shields.io/badge/Author-Iker García Ferrero-ff69b4"></a>
<br>
<br>
</p>
Easy-Translate is a script for translating large text files on your machine using the [M2M100 models](https://arxiv.org/pdf/2010.11125.pdf) and [NLLB200 models](https://research.facebook.com/publications/no-language-left-behind/) from Facebook/Meta AI. We also provide a [script](#evaluate-translations) for Easy-Evaluation of your translations 🥳
Easy-Translate is built on top of 🤗HuggingFace's [Transformers](https://huggingface.co/docs/transformers/index) and 🤗HuggingFace's [Accelerate](https://huggingface.co/docs/accelerate/index) library.
We currently support:
- CPU / multi-CPU / GPU / multi-GPU / TPU acceleration
- BF16 / FP16 / FP32 precision
- Automatic batch size finder: forget CUDA OOM errors. Set an initial batch size; if it doesn't fit, we will automatically decrease it until it does.
- Sharded Data Parallel to load huge models sharded on multiple GPUs (See: <https://huggingface.co/docs/accelerate/fsdp>).
- Greedy decoding / Beam Search decoding / Multinomial Sampling / Beam-Search Multinomial Sampling
>Test the 🔌 Online Demo here: <https://huggingface.co/spaces/Iker/Translate-100-languages>
## Supported languages
See the [Supported languages table](supported_languages.md) for the full list of supported languages and their IDs.
## Supported Models
### M2M100
**M2M100** is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation introduced in this [paper](https://arxiv.org/abs/2010.11125) and first released in [this](https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) repository.
>M2M100 can translate directly between any pair of its 100 languages, covering 9,900 translation directions.
- **Facebook/m2m100_418M**: <https://huggingface.co/facebook/m2m100_418M>
- **Facebook/m2m100_1.2B**: <https://huggingface.co/facebook/m2m100_1.2B>
- **Facebook/m2m100_12B**: <https://huggingface.co/facebook/m2m100-12B-avg-5-ckpt>
### NLLB200
**No Language Left Behind (NLLB)** open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences. It was introduced in this [paper](https://research.facebook.com/publications/no-language-left-behind/) and first released in [this](https://github.com/facebookresearch/fairseq/tree/nllb) repository.
>NLLB can translate directly across 40,000+ translation directions between 200+ languages.
- **facebook/nllb-200-3.3B**: <https://huggingface.co/facebook/nllb-200-3.3B>
- **facebook/nllb-200-1.3B**: <https://huggingface.co/facebook/nllb-200-1.3B>
- **facebook/nllb-200-distilled-1.3B**: <https://huggingface.co/facebook/nllb-200-distilled-1.3B>
- **facebook/nllb-200-distilled-600M**: <https://huggingface.co/facebook/nllb-200-distilled-600M>
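Note that the NLLB200 checkpoints expect FLORES-200 language codes (e.g. `eng_Latn`, `spa_Latn`) rather than the two-letter codes used by M2M100; see the supported languages table. As a minimal sketch (the output filename is illustrative):

```bash
# NLLB200 uses FLORES-200 language codes such as eng_Latn / spa_Latn
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.nllb200-3.3B.txt \
--source_lang eng_Latn \
--target_lang spa_Latn \
--model_name facebook/nllb-200-3.3B
```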
Any other `ModelForSeq2SeqLM` from the HuggingFace Hub should also work with this script, as sketched below: <https://huggingface.co/models?pipeline_tag=text2text-generation>
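For example, a hypothetical run with mBART-50 (an assumption for illustration: mBART-50 uses its own language codes, `en_XX` / `es_XX`, and the output filename is made up here):

```bash
# Hypothetical: pass any seq2seq Hub model via --model_name
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.mbart50.txt \
--source_lang en_XX \
--target_lang es_XX \
--model_name facebook/mbart-large-50-many-to-many-mmt
```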
## Citation
If you use this software, please cite:
```bibtex
@inproceedings{garcia-ferrero-etal-2022-model,
title = "Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings",
author = "Garc{\'\i}a-Ferrero, Iker and
Agerri, Rodrigo and
Rigau, German",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.478",
pages = "6403--6416",
}
```
## Requirements
```bash
# PyTorch >= 1.10.0
# See: https://pytorch.org/get-started/locally/

# Accelerate >= 0.12.0
pip install --upgrade accelerate

# HuggingFace Transformers
pip install --upgrade transformers

# If you find errors using NLLB200, try installing transformers from source:
pip install git+https://github.com/huggingface/transformers.git
```
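To verify the environment after installing, a quick sanity check:

```bash
# Print the installed versions to confirm the requirements are met
python -c "import torch; print(torch.__version__)"
python -c "import transformers, accelerate; print(transformers.__version__, accelerate.__version__)"
```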
## Translate a file
Run `python translate.py -h` for more info.
#### Using a single CPU / GPU
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
```
#### Multi-GPU
See the Accelerate documentation for more information (multi-node, TPU, sharded models, ...): <https://huggingface.co/docs/accelerate/index>
You can use the Accelerate CLI to configure the Accelerate environment (run `accelerate config` in your terminal) instead of using the `--multi_gpu` and `--num_processes` flags.
```bash
# Use 2 GPUs
accelerate launch --multi_gpu --num_processes 2 --num_machines 1 translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B
```
#### Automatic batch size finder
We will automatically find a batch size that fits in your GPU memory. The default initial batch size is 128 (you can set it with the `--starting_batch_size 128` flag). If we hit an out-of-memory error, we will automatically decrease the batch size until we find one that works.
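For example, to start the search from a smaller batch size on a memory-constrained GPU (a sketch reusing the command from above):

```bash
# Begin the automatic batch size search at 32 instead of the default 128
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--starting_batch_size 32
```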
#### Choose precision
Use the `--precision` flag to choose the precision of the model. You can choose between `bf16`, `fp16`, and `32` (full fp32 precision).
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--precision fp16
```
### Decoding/Sampling strategies
You can choose the decoding/sampling strategy to use and the number of candidate translations to output for each input sentence. By default, we use beam search with `num_beams` set to 5 and output the most likely candidate translation, but you can change this behavior:
##### Greedy decoding
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1
```
##### Multinomial Sampling
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8 \
--num_return_sequences 1
```
##### Beam-Search decoding **(DEFAULT)**
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1
```
##### Beam-Search Multinomial Sampling
```bash
accelerate launch translate.py \
--sentences_path sample_text/en.txt \
--output_path sample_text/en2es.translation.m2m100_1.2B.txt \
--source_lang en \
--target_lang es \
--model_name facebook/m2m100_1.2B \
--num_beams 5 \
--num_return_sequences 1 \
--do_sample \
--temperature 0.5 \
--top_k 100 \
--top_p 0.8
```
## Evaluate translations
To run the evaluation script you need to install [bert_score](https://github.com/Tiiiger/bert_score): `pip install bert_score` and 🤗HuggingFace's [Datasets](https://huggingface.co/docs/datasets/index) library: `pip install datasets`.
The evaluation script will calculate the following metrics:
- [SacreBLEU](https://github.com/huggingface/datasets/tree/master/metrics/sacrebleu)
- [BLEU](https://github.com/huggingface/datasets/tree/master/metrics/bleu)
- [ROUGE](https://github.com/huggingface/datasets/tree/master/metrics/rouge)
- [METEOR](https://github.com/huggingface/datasets/tree/master/metrics/meteor)
- [TER](https://github.com/huggingface/datasets/tree/master/metrics/ter)
- [BertScore](https://github.com/huggingface/datasets/tree/master/metrics/bertscore)
Run the following command to evaluate the translations:
```bash
accelerate launch eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt
```
If you want to save the results to a file, use the `--output_path` flag.
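For example (the output filename here follows the sample file referenced below):

```bash
accelerate launch eval.py \
--pred_path sample_text/en2es.translation.m2m100_1.2B.txt \
--gold_path sample_text/es.txt \
--output_path sample_text/en2es.m2m100_1.2B.json
```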
See [sample_text/en2es.m2m100_1.2B.json](sample_text/en2es.m2m100_1.2B.json) for a sample output.