Iker committed on
Commit 9dcafee
1 Parent(s): f88323f

Implement SeamlessM4T

README.md CHANGED
@@ -1,4 +1,3 @@
-
 <p align="center">
 <br>
 <img src="images/title.png" width="900"/>
@@ -29,23 +28,19 @@ We currently support:
 - BF16 / FP16 / FP32 / 8 Bits / 4 Bits precision.
 - Automatic batch size finder: Forget CUDA OOM errors. Set an initial batch size, if it doesn't fit, we will automatically adjust it.
 - Multiple decoding strategies: Greedy Search, Beam Search, Top-K Sampling, Top-p (nucleus) sampling, etc. See [Decoding Strategies](#decodingsampling-strategies) for more information.
-- :new: Load huge models in a single GPU with 8-bits / 4-bits quantization and support for splitting the model between GPU and CPU. See [Loading Huge Models](#loading-huge-models) for more information.
-- :new: LoRA models support
-- :new: Support for any Seq2SeqLM or CausalLM model from HuggingFace's Hub.
-- :new: Prompt support! See [Prompting](#prompting) for more information.
+- Load huge models in a single GPU with 8-bits / 4-bits quantization and support for splitting the model between GPU and CPU. See [Loading Huge Models](#loading-huge-models) for more information.
+- LoRA models support
+- Support for any Seq2SeqLM or CausalLM model from HuggingFace's Hub.
+- Prompt support! See [Prompting](#prompting) for more information.
+- :new: Support for [SeamlessM4T](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t)!
 
 >Test the 🔌 Online Demo here: <https://huggingface.co/spaces/Iker/Translate-100-languages>
 
 
-
-## Supported languages
-
-See the [Supported languages table](supported_languages.md) for a table of the supported languages and their ids.
-
 ## Supported Models
 
 💥 EasyTranslate now supports any Seq2SeqLM (m2m100, nllb200, small100, mbart, MarianMT, T5, FlanT5, etc.) and any CausalLM (GPT2, LLaMA, Vicuna, Falcon) model from 🤗 Hugging Face's Hub!!
-We still recommend you to use M2M100 or NLLB200 for the best results, but you can experiment with any other MT model, as well as prompting LLMs to generate translations (See [Prompting Section](#prompting) for more details).
+We still recommend using M2M100, NLLB200 or SeamlessM4T for the best results, but you can experiment with any other MT model, as well as prompting LLMs to generate translations (see the [Prompting Section](#prompting) for more details).
 You can also see [the examples folder](examples) for examples of how to use EasyTranslate with different models.
 
 ### M2M100
@@ -73,13 +68,23 @@ You can also see [the examples folder](examples) for examples of how to use Easy
 
 - **facebook/nllb-200-distilled-600M**: <https://huggingface.co/facebook/nllb-200-distilled-600M>
 
+### SeamlessM4T
+
+**SeamlessM4T** is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It was introduced in this [paper](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) and first released in [this](https://github.com/facebookresearch/seamless_communication) repository.
+>SeamlessM4T can directly translate between 196 languages for text input/output.
+
+- **facebook/hf-seamless-m4t-medium**: <https://huggingface.co/facebook/hf-seamless-m4t-medium> (Requires transformers 4.35.0)
+
+- **facebook/hf-seamless-m4t-large**: <https://huggingface.co/facebook/hf-seamless-m4t-large> (Requires transformers 4.35.0)
+
+
 ### Other MT Models supported
 We support every MT model in the 🤗 Hugging Face's Hub. If you find a model that doesn't work, please open an issue for us to fix it or a PR with the fix. This includes, among many others:
 - **Small100**: <https://huggingface.co/alirezamsh/small100>
 - **Mbart many-to-many / many-to-one**: <https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt>
 - **Opus MT**: <https://huggingface.co/Helsinki-NLP/opus-mt-es-en>
 
-
+See the [Supported languages table](supported_languages.md) for a table of the supported languages and their ids.
 
 ## Citation
 If you use this software please cite
@@ -110,6 +115,7 @@ pip install accelerate
 
 HuggingFace Transformers
 If you plan to use NLLB200, please use >= 4.28.0, as an important bug was fixed in this version.
+If you plan to use SeamlessM4T, please use >= 4.35.0.
 pip install --upgrade transformers
 
 BitsAndBytes (Optional, required for 8-bits / 4-bits quantization)
@@ -135,6 +141,20 @@ python3 translate.py \
 --model_name facebook/m2m100_1.2B
 ```
 
+If you want to translate all the files in a directory, use the `--sentences_dir` flag instead of `--sentences_path`.
+```bash
+# We use --files_extension txt to translate only files with this extension.
+# Use an empty string to translate all files in the directory.
+
+python3 translate.py \
+--sentences_dir sample_text/ \
+--output_path sample_text/translations \
+--files_extension txt \
+--source_lang en \
+--target_lang es \
+--model_name facebook/m2m100_1.2B
+```
+
 #### Multi-GPU
 
 See Accelerate documentation for more information (multi-node, TPU, Sharded model...): <https://huggingface.co/docs/accelerate/index>
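For reference, the SeamlessM4T support added here builds on the text-to-text classes introduced in transformers 4.35. A minimal sketch of translating a sentence directly with that API (the model name is one of the checkpoints listed above; the call itself is our illustration, not part of this commit):

```python
# Minimal sketch using the transformers 4.35 SeamlessM4T text-to-text API.
from transformers import AutoProcessor, SeamlessM4TForTextToText

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")

# SeamlessM4T uses three-letter language codes, e.g. "eng" and "spa".
inputs = processor(
    text="Hello, world, my name is Iker!", src_lang="eng", return_tensors="pt"
)
output_tokens = model.generate(**inputs, tgt_lang="spa")
print(processor.decode(output_tokens[0].tolist(), skip_special_tokens=True))
```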
model.py CHANGED
@@ -14,8 +14,6 @@ from transformers.models.auto.modeling_auto import (
 
 from typing import Optional, Tuple
 
-import os
-
 import torch
 
 import json
@@ -27,6 +25,7 @@ def load_model_for_inference(
     lora_weights_name_or_path: Optional[str] = None,
     torch_dtype: Optional[str] = None,
     force_auto_device_map: bool = False,
+    trust_remote_code: bool = False,
 ) -> Tuple[PreTrainedModel, PreTrainedTokenizerBase]:
     """
     Load any Decoder model for inference.
@@ -50,6 +49,8 @@
             Whether to force the use of the auto device map. If set to True, the model will be split across
             GPUs and CPU to fit the model in memory. If set to False, a full copy of the model will be loaded
             into each GPU. Defaults to False.
+        trust_remote_code (`bool`, optional):
+            Trust the remote code from HuggingFace model hub. Defaults to False.
 
     Returns:
         `Tuple[PreTrainedModel, PreTrainedTokenizerBase]`:
@@ -64,19 +65,8 @@
 
     print(f"Loading model from {weights_path}")
 
-    MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.update(
-        {
-            "mpt": "MPTForCausalLM",
-            "RefinedWebModel": "RWForCausalLM",
-            "RefinedWeb": "RWForCausalLM",
-        }
-    )  # MPT and Falcon are not in transformers yet
-
     config = AutoConfig.from_pretrained(
-        weights_path,
-        trust_remote_code=True
-        if ("mpt" in weights_path or "falcon" in weights_path)
-        else False,
+        weights_path, trust_remote_code=trust_remote_code
     )
 
     torch_dtype = (
@@ -84,20 +74,40 @@
     )
 
     if "small100" in weights_path:
+        import transformers
+
+        if transformers.__version__ > "4.34.0":
+            raise ValueError(
+                "Small100 tokenizer is not supported in transformers > 4.34.0. Please "
+                "use transformers <= 4.34.0 if you want to use small100"
+            )
+
         print(f"Loading custom small100 tokenizer for utils.tokenization_small100")
         from utils.tokenization_small100 import SMALL100Tokenizer as AutoTokenizer
     else:
         from transformers import AutoTokenizer
 
     tokenizer: PreTrainedTokenizerBase = AutoTokenizer.from_pretrained(
-        weights_path,
-        add_eos_token=True,
-        trust_remote_code=True
-        if ("mpt" in weights_path or "falcon" in weights_path)
-        else False,
+        weights_path, add_eos_token=True, trust_remote_code=trust_remote_code
     )
 
+    if tokenizer.pad_token_id is None:
+        if "<|padding|>" in tokenizer.get_vocab():
+            # StableLM specific fix
+            tokenizer.add_special_tokens({"pad_token": "<|padding|>"})
+        elif tokenizer.unk_token is not None:
+            print(
+                "Tokenizer does not have a pad token, we will use the unk token as pad token."
+            )
+            tokenizer.pad_token_id = tokenizer.unk_token_id
+        else:
+            print(
+                "Tokenizer does not have a pad token. We will use the eos token as pad token."
+            )
+            tokenizer.pad_token_id = tokenizer.eos_token_id
+
     quant_args = {}
+
     if quantization is not None:
         quant_args = (
             {"load_in_4bit": True} if quantization == 4 else {"load_in_8bit": True}
@@ -107,16 +117,17 @@
                 load_in_4bit=True,
                 bnb_4bit_use_double_quant=True,
                 bnb_4bit_quant_type="nf4",
-                bnb_4bit_compute_dtype=torch.bfloat16,
+                bnb_4bit_compute_dtype=torch.bfloat16
+                if torch_dtype in ["auto", None]
+                else torch_dtype,
             )
-            torch_dtype = torch.bfloat16
 
         else:
             bnb_config = BitsAndBytesConfig(
                 load_in_8bit=True,
            )
         print(
-            f"Bits and Bytes config: {json.dumps(bnb_config.to_dict(),indent=4,ensure_ascii=False)}"
+            f"Bits and Bytes config: {json.dumps(bnb_config.to_dict(), indent=4, ensure_ascii=False)}"
         )
     else:
         print(f"Loading model with dtype: {torch_dtype}")
@@ -131,6 +142,7 @@
             device_map="auto" if force_auto_device_map else None,
             torch_dtype=torch_dtype,
             quantization_config=bnb_config,
+            trust_remote_code=trust_remote_code,
             **quant_args,
         )
 
@@ -142,9 +154,7 @@
             pretrained_model_name_or_path=weights_path,
             device_map="auto" if force_auto_device_map else None,
             torch_dtype=torch_dtype,
-            trust_remote_code=True
-            if ("mpt" in weights_path or "falcon" in weights_path)
-            else False,
+            trust_remote_code=trust_remote_code,
             quantization_config=bnb_config,
             **quant_args,
         )
@@ -159,21 +169,6 @@
         f"CausalLM: {MODEL_FOR_CAUSAL_LM_MAPPING_NAMES}\n"
     )
 
-    if tokenizer.pad_token_id is None:
-        if "<|padding|>" in tokenizer.get_vocab():
-            # StableLM specific fix
-            tokenizer.add_special_tokens({"pad_token": "<|padding|>"})
-        elif tokenizer.unk_token is not None:
-            print(
-                "Model does not have a pad token, we will use the unk token as pad token."
-            )
-            tokenizer.pad_token_id = tokenizer.unk_token_id
-        else:
-            print(
-                "Model does not have a pad token. We will use the eos token as pad token."
-            )
-            tokenizer.pad_token_id = tokenizer.eos_token_id
-
     if lora_weights_name_or_path:
         from peft import PeftModel
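A minimal sketch of how the updated loader is meant to be called after this change (keyword names are taken from the diff above; the values are illustrative, not part of this commit):

```python
from model import load_model_for_inference

# trust_remote_code now comes from the caller instead of being inferred
# from the model name, so custom-code models must opt in explicitly.
model, tokenizer = load_model_for_inference(
    weights_path="facebook/hf-seamless-m4t-medium",
    quantization=None,            # or 4 / 8 for BitsAndBytes quantization
    torch_dtype=None,             # default dtype handling happens inside the loader
    force_auto_device_map=False,
    trust_remote_code=False,      # set True for models that ship custom code (e.g. MPT, Falcon)
)
```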
sample_text/en2es.seamless-m4t-large.json ADDED
@@ -0,0 +1,60 @@
+{
+    "path": "sample_text/en2es.translation.seamless-m4t-large.txt",
+    "sacrebleu": {
+        "score": 36.315142112223896,
+        "counts": [
+            20334,
+            12742,
+            8758,
+            6156
+        ],
+        "totals": [
+            31021,
+            30021,
+            29021,
+            28021
+        ],
+        "precisions": [
+            65.54914412817124,
+            42.44362279737517,
+            30.178146859170944,
+            21.969237357696013
+        ],
+        "bp": 0.9854077938820913,
+        "sys_len": 31021,
+        "ref_len": 31477
+    },
+    "rouge": {
+        "rouge1": 0.6330701226501922,
+        "rouge2": 0.4284215608900075,
+        "rougeL": 0.5852948888167713,
+        "rougeLsum": 0.5852893813466102
+    },
+    "bleu": {
+        "bleu": 0.36315142112223897,
+        "precisions": [
+            0.6554914412817124,
+            0.4244362279737517,
+            0.30178146859170946,
+            0.21969237357696014
+        ],
+        "brevity_penalty": 0.9854077938820913,
+        "length_ratio": 0.9855132318835975,
+        "translation_length": 31021,
+        "reference_length": 31477
+    },
+    "meteor": {
+        "meteor": 0.5988659867679048
+    },
+    "ter": {
+        "score": 53.42233524051706,
+        "num_edits": 15126,
+        "ref_length": 28314.0
+    },
+    "bert_score": {
+        "precision": 0.8355873214006424,
+        "recall": 0.8343284497857094,
+        "f1": 0.8346186644434929,
+        "hashcode": "microsoft/deberta-xlarge-mnli_L40_no-idf_version=0.3.12(hug_trans=4.35.2)_fast-tokenizer"
+    }
+}
sample_text/en2es.seamless-m4t-medium.json ADDED
@@ -0,0 +1,60 @@
+{
+    "path": "sample_text/en2es.translation.seamless-m4t-medium.txt",
+    "sacrebleu": {
+        "score": 32.86110838375764,
+        "counts": [
+            19564,
+            11721,
+            7752,
+            5264
+        ],
+        "totals": [
+            30811,
+            29811,
+            28811,
+            27812
+        ],
+        "precisions": [
+            63.49680308980559,
+            39.31770151957331,
+            26.90638992051647,
+            18.92708183517906
+        ],
+        "bp": 0.978616287348328,
+        "sys_len": 30811,
+        "ref_len": 31477
+    },
+    "rouge": {
+        "rouge1": 0.609193205717968,
+        "rouge2": 0.3944070815557623,
+        "rougeL": 0.558841464797821,
+        "rougeLsum": 0.5594046328281417
+    },
+    "bleu": {
+        "bleu": 0.3286110838375765,
+        "precisions": [
+            0.6349680308980559,
+            0.3931770151957331,
+            0.2690638992051647,
+            0.1892708183517906
+        ],
+        "brevity_penalty": 0.978616287348328,
+        "length_ratio": 0.9788416939352543,
+        "translation_length": 30811,
+        "reference_length": 31477
+    },
+    "meteor": {
+        "meteor": 0.5707261528520716
+    },
+    "ter": {
+        "score": 55.88754679663771,
+        "num_edits": 15824,
+        "ref_length": 28314.0
+    },
+    "bert_score": {
+        "precision": 0.8278114783763886,
+        "recall": 0.824702616840601,
+        "f1": 0.8259151731133461,
+        "hashcode": "microsoft/deberta-xlarge-mnli_L40_no-idf_version=0.3.12(hug_trans=4.35.2)_fast-tokenizer"
+    }
+}
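The two JSON files above follow the metric layout produced by the Hugging Face `evaluate` library. A sketch of how one such entry can be generated (our illustration; the repo's evaluation script is not part of this commit, and the file paths are hypothetical):

```python
# Sketch: computing a sacrebleu entry like the ones above with `evaluate`.
import evaluate

# Hypothetical hypothesis/reference files, one sentence per line.
predictions = open("sample_text/en2es.translation.seamless-m4t-medium.txt").read().splitlines()
references = [[ref] for ref in open("sample_text/es.txt").read().splitlines()]

sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=references))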
sample_text/en2es.translation.seamless-m4t-large.txt ADDED
The diff for this file is too large to render. See raw diff
 
sample_text/en2es.translation.seamless-m4t-medium.txt ADDED
The diff for this file is too large to render. See raw diff
 
tests/__init__.py ADDED
File without changes
tests/test_translation.py ADDED
@@ -0,0 +1,548 @@
+# Run with 'python -m unittest tests.test_translation'
+
+import unittest
+import tempfile
+import os
+from translate import main
+import transformers
+
+
+class Inputs(unittest.TestCase):
+    def test_m2m100_inputs(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang="en",
+                target_lang="es",
+                starting_batch_size=32,
+                model_name="facebook/m2m100_418M",
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision=None,
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=None,
+                sentences_dir=tmpdirname,
+                files_extension="txt",
+                output_path=os.path.join(tmpdirname, "target"),
+                source_lang="en",
+                target_lang="es",
+                starting_batch_size=32,
+                model_name="facebook/m2m100_418M",
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision=None,
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+
+class Translations(unittest.TestCase):
+    def test_m2m100(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "facebook/m2m100_418M"
+            src_lang = "en"
+            tgt_lang = "es"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+    def test_nllb200(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "facebook/nllb-200-distilled-600M"
+            src_lang = "eng_Latn"
+            tgt_lang = "spa_Latn"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+    def test_mbart(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "facebook/mbart-large-50"
+            src_lang = "en_XX"
+            tgt_lang = "es_XX"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+    def test_opus(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "Helsinki-NLP/opus-mt-en-es"
+            src_lang = None
+            tgt_lang = None
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=False,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=False,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+    @unittest.skipIf(
+        transformers.__version__ > "4.34.0",
+        "Small100 tokenizer is not supported in transformers > 4.34.0. Please use transformers <= 4.34.0 if you want to use small100",
+    )
+    def test_small100(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "alirezamsh/small100"
+            src_lang = None
+            tgt_lang = "es"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+    def test_seamless(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "facebook/hf-seamless-m4t-medium"
+            src_lang = "eng"
+            tgt_lang = "spa"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=src_lang,
+                target_lang=tgt_lang,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=False,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=None,
+            )
+
+
+class Prompting(unittest.TestCase):
+    def test_llama(self):
+        # Create a temporary directory
+        with tempfile.TemporaryDirectory() as tmpdirname:
+            # Create a temporary file
+
+            input_path = os.path.join(tmpdirname, "source.txt")
+            output_path = os.path.join(tmpdirname, "target.txt")
+
+            with open(
+                os.path.join(tmpdirname, "source.txt"), "w", encoding="utf8"
+            ) as f:
+                print("Hello, world, my name is Iker!", file=f)
+
+            model_name = "stas/tiny-random-llama-2"
+            prompt = "Translate English to Spanish: %%SENTENCE%%"
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=None,
+                target_lang=None,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="bf16",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=True,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=prompt,
+            )
+
+            main(
+                sentences_path=input_path,
+                sentences_dir=None,
+                files_extension="txt",
+                output_path=output_path,
+                source_lang=None,
+                target_lang=None,
+                starting_batch_size=32,
+                model_name=model_name,
+                lora_weights_name_or_path=None,
+                force_auto_device_map=True,
+                precision="4",
+                max_length=64,
+                num_beams=2,
+                num_return_sequences=1,
+                do_sample=True,
+                temperature=1.0,
+                top_k=50,
+                top_p=1.0,
+                keep_special_tokens=False,
+                keep_tokenization_spaces=False,
+                repetition_penalty=None,
+                prompt=prompt,
+            )
translate.py CHANGED
@@ -1,6 +1,7 @@
 import os
 import math
 import argparse
+import glob
 
 import torch
 from torch.utils.data import DataLoader
@@ -18,6 +19,8 @@ from dataset import DatasetReader, count_lines
 
 from accelerate import Accelerator, DistributedType, find_executable_batch_size
 
+from typing import Optional
+
 
 def encode_string(text):
     return text.replace("\r", r"\r").replace("\n", r"\n").replace("\t", r"\t")
@@ -31,7 +34,12 @@ def get_dataloader(
     max_length: int,
     prompt: str,
 ) -> DataLoader:
-    dataset = DatasetReader(filename, tokenizer, max_length, prompt)
+    dataset = DatasetReader(
+        filename=filename,
+        tokenizer=tokenizer,
+        max_length=max_length,
+        prompt=prompt,
+    )
     if accelerator.distributed_type == DistributedType.TPU:
         data_collator = DataCollatorForSeq2Seq(
             tokenizer,
@@ -59,16 +67,18 @@
 
 
 def main(
-    sentences_path: str,
+    sentences_path: Optional[str],
+    sentences_dir: Optional[str],
+    files_extension: str,
     output_path: str,
-    source_lang: str,
-    target_lang: str,
+    source_lang: Optional[str],
+    target_lang: Optional[str],
     starting_batch_size: int,
     model_name: str = "facebook/m2m100_1.2B",
     lora_weights_name_or_path: str = None,
     force_auto_device_map: bool = False,
     precision: str = None,
-    max_length: int = 128,
+    max_length: int = 256,
     num_beams: int = 4,
     num_return_sequences: int = 1,
     do_sample: bool = False,
@@ -79,9 +89,8 @@ def main(
     keep_tokenization_spaces: bool = False,
    repetition_penalty: float = None,
     prompt: str = None,
+    trust_remote_code: bool = False,
 ):
-    os.makedirs(os.path.abspath(os.path.dirname(output_path)), exist_ok=True)
-
     accelerator = Accelerator()
 
     if force_auto_device_map and starting_batch_size >= 64:
@@ -92,6 +101,16 @@
             f"inference. You should consider using a smaller batch size, i.e '--starting_batch_size 8'"
         )
 
+    if sentences_path is None and sentences_dir is None:
+        raise ValueError(
+            "You must specify either --sentences_path or --sentences_dir. Use --help for more details."
+        )
+
+    if sentences_path is not None and sentences_dir is not None:
+        raise ValueError(
+            "You must specify either --sentences_path or --sentences_dir, not both. Use --help for more details."
+        )
+
     if precision is None:
         quantization = None
         dtype = None
@@ -118,11 +137,17 @@
         lora_weights_name_or_path=lora_weights_name_or_path,
         torch_dtype=dtype,
         force_auto_device_map=force_auto_device_map,
+        trust_remote_code=trust_remote_code,
     )
 
     is_translation_model = hasattr(tokenizer, "lang_code_to_id")
+    lang_code_to_idx = None
 
-    if is_translation_model and (source_lang is None or target_lang is None):
+    if (
+        is_translation_model
+        and (source_lang is None or target_lang is None)
+        and "small100" not in model_name
+    ):
         raise ValueError(
             f"The model you are using requires a source and target language. "
             f"Please specify them with --source-lang and --target-lang. "
@@ -169,8 +194,32 @@
         # We don't need to force the BOS token, so we set is_translation_model to False
         is_translation_model = False
 
+    if model.config.model_type == "seamless_m4t":
+        # Loading a seamless_m4t model, we need to set a few things to ensure compatibility
+
+        supported_langs = tokenizer.additional_special_tokens
+        supported_langs = [lang.replace("__", "") for lang in supported_langs]
+
+        if source_lang is None or target_lang is None:
+            raise ValueError(
+                f"The model you are using requires a source and target language. "
+                f"Please specify them with --source-lang and --target-lang. "
+                f"The supported languages are: {supported_langs}"
+            )
+
+        if source_lang not in supported_langs:
+            raise ValueError(
+                f"Language {source_lang} not found in tokenizer. Available languages: {supported_langs}"
+            )
+        if target_lang not in supported_langs:
+            raise ValueError(
+                f"Language {target_lang} not found in tokenizer. Available languages: {supported_langs}"
+            )
+
+        tokenizer.src_lang = source_lang
+
     gen_kwargs = {
-        "max_length": max_length,
+        "max_new_tokens": max_length,
         "num_beams": num_beams,
         "num_return_sequences": num_return_sequences,
         "do_sample": do_sample,
@@ -182,12 +231,17 @@
     if repetition_penalty is not None:
         gen_kwargs["repetition_penalty"] = repetition_penalty
 
-    total_lines: int = count_lines(sentences_path)
+    if is_translation_model:
+        gen_kwargs["forced_bos_token_id"] = lang_code_to_idx
+
+    if model.config.model_type == "seamless_m4t":
+        gen_kwargs["tgt_lang"] = target_lang
 
     if accelerator.is_main_process:
         print(
             f"** Translation **\n"
             f"Input file: {sentences_path}\n"
+            f"Sentences dir: {sentences_dir}\n"
             f"Output file: {output_path}\n"
             f"Source language: {source_lang}\n"
             f"Target language: {target_lang}\n"
@@ -211,10 +265,12 @@
     print("\n")
 
     @find_executable_batch_size(starting_batch_size=starting_batch_size)
-    def inference(batch_size):
-        nonlocal model, tokenizer, sentences_path, max_length, output_path, lang_code_to_idx, gen_kwargs, precision, prompt, is_translation_model
+    def inference(batch_size, sentences_path, output_path):
+        nonlocal model, tokenizer, max_length, gen_kwargs, precision, prompt, is_translation_model
+
+        print(f"Translating {sentences_path} with batch size {batch_size}")
 
-        print(f"Translating with batch size {batch_size}")
+        total_lines: int = count_lines(sentences_path)
 
         data_loader = get_dataloader(
             accelerator=accelerator,
@@ -243,9 +299,6 @@
 
             generated_tokens = accelerator.unwrap_model(model).generate(
                 **batch,
-                forced_bos_token_id=lang_code_to_idx
-                if is_translation_model
-                else None,
                 **gen_kwargs,
             )
 
@@ -286,24 +339,60 @@
 
                 pbar.update(len(tgt_text) // gen_kwargs["num_return_sequences"])
 
-    inference()
+        print(f"Translation done. Output written to {output_path}\n")
+
+    if sentences_path is not None:
+        os.makedirs(os.path.abspath(os.path.dirname(output_path)), exist_ok=True)
+        inference(sentences_path=sentences_path, output_path=output_path)
+
+    if sentences_dir is not None:
+        print(
+            f"Translating all files in {sentences_dir}, with extension {files_extension}"
+        )
+        os.makedirs(os.path.abspath(output_path), exist_ok=True)
+        for filename in glob.glob(
+            os.path.join(
+                sentences_dir, f"*.{files_extension}" if files_extension else "*"
+            )
+        ):
+            output_filename = os.path.join(output_path, os.path.basename(filename))
+            inference(sentences_path=filename, output_path=output_filename)
+
     print(f"Translation done.\n")
 
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Run the translation experiments")
-    parser.add_argument(
+    input_group = parser.add_mutually_exclusive_group(required=True)
+    input_group.add_argument(
         "--sentences_path",
+        default=None,
         type=str,
-        required=True,
         help="Path to a txt file containing the sentences to translate. One sentence per line.",
     )
 
+    input_group.add_argument(
+        "--sentences_dir",
+        type=str,
+        default=None,
+        help="Path to a directory containing the sentences to translate. "
+        "Sentences must be in .txt files containing one sentence per line.",
+    )
+
+    parser.add_argument(
+        "--files_extension",
+        type=str,
+        default="txt",
+        help="If sentences_dir is specified, extension of the files to translate. Defaults to txt. "
+        "If set to an empty string, we will translate all files in the directory.",
+    )
+
     parser.add_argument(
         "--output_path",
        type=str,
         required=True,
-        help="Path to a txt file where the translated sentences will be written.",
+        help="Path to a txt file where the translated sentences will be written. If the input is a directory, "
+        "the output will be a directory with the same structure.",
     )
 
     parser.add_argument(
@@ -355,7 +444,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--max_length",
         type=int,
-        default=128,
+        default=256,
        help="Maximum number of tokens in the source sentence and generated sentence. "
         "Increase this value to translate longer sentences, at the cost of increasing memory usage.",
     )
@@ -438,10 +527,18 @@
         "It must include the special token %%SENTENCE%% which will be replaced by the sentence to translate.",
     )
 
+    parser.add_argument(
+        "--trust_remote_code",
+        action="store_true",
+        help="If set we will trust remote code in HuggingFace models. This is required for some models.",
+    )
+
    args = parser.parse_args()
 
     main(
         sentences_path=args.sentences_path,
+        sentences_dir=args.sentences_dir,
+        files_extension=args.files_extension,
         output_path=args.output_path,
         source_lang=args.source_lang,
         target_lang=args.target_lang,
@@ -459,4 +556,5 @@ if __name__ == "__main__":
         keep_tokenization_spaces=args.keep_tokenization_spaces,
         repetition_penalty=args.repetition_penalty,
         prompt=args.prompt,
+        trust_remote_code=args.trust_remote_code,
     )
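Putting the pieces together, the new SeamlessM4T path through `translate.py` can be exercised exactly as the added `test_seamless` does. A minimal sketch mirroring that test (the input/output paths are illustrative):

```python
# Sketch: translating a file with SeamlessM4T through the updated main(),
# mirroring tests/test_translation.py::Translations.test_seamless.
from translate import main

main(
    sentences_path="sample_text/source.txt",  # hypothetical input file, one sentence per line
    sentences_dir=None,
    files_extension="txt",
    output_path="sample_text/target.txt",     # hypothetical output file
    source_lang="eng",                        # SeamlessM4T three-letter language codes
    target_lang="spa",
    starting_batch_size=32,
    model_name="facebook/hf-seamless-m4t-medium",
    lora_weights_name_or_path=None,
    force_auto_device_map=False,
    precision="bf16",
    max_length=256,
    num_beams=4,
    num_return_sequences=1,
    do_sample=False,
    temperature=1.0,
    top_k=50,
    top_p=1.0,
    keep_special_tokens=False,
    keep_tokenization_spaces=False,
    repetition_penalty=None,
    prompt=None,
    trust_remote_code=False,
)
```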