[Request] Add support for CTranslate2 integration

#1
by solaoi - opened

I attempted to use CTranslate2 with a HuggingFace model, but encountered an issue.
Here's what I tried:

  1. Converted the model to CTranslate2 format:
ct2-transformers-converter --model Mitsua/elan-mt-bt-ja-en --output_dir elan-mt-bt-ja-en
  2. Installed the required packages:
pip install ctranslate2 transformers sacremoses
  3. Used the following code for inference:
import ctranslate2
import transformers

translator = ctranslate2.Translator("./elan-mt-bt-ja-en")

tokenizer = transformers.AutoTokenizer.from_pretrained("Mitsua/elan-mt-bt-ja-en")

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("こんにちは、solaoiと申します。お会いできて、とても嬉しいです。"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

Unfortunately, this resulted in an invalid output:

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
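To flag this kind of output automatically when comparing models, a check along the following lines may help. This is only a minimal sketch: the function name `looks_degenerate` and the 0.9 threshold are my own assumptions, not part of CTranslate2 or transformers.

```python
from collections import Counter


def looks_degenerate(text: str, threshold: float = 0.9) -> bool:
    """Return True when a single character dominates the non-space output.

    Hypothetical helper: the 0.9 threshold is an arbitrary assumption.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Treat empty output as a failure as well.
        return True
    most_common_count = Counter(chars).most_common(1)[0][1]
    return most_common_count / len(chars) >= threshold


print(looks_degenerate("," * 64))  # True
print(looks_degenerate("Hello, my name is Solaoi. I'm very happy to see you."))  # False
```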

However, with the staka/fugumt-ja-en model, both conversion and inference worked:

  1. Converted the model to CTranslate2 format:
ct2-transformers-converter --model staka/fugumt-ja-en --output_dir fugumt-ja-en
  2. Used the following code for inference:
import ctranslate2
import transformers

translator = ctranslate2.Translator("./fugumt-ja-en")

tokenizer = transformers.AutoTokenizer.from_pretrained("staka/fugumt-ja-en")

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("こんにちは、solaoiと申します。お会いできて、とても嬉しいです。"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

This produced the expected output:

Hello, my name is Solaoi. I'm very happy to see you.
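For anyone who wants to reproduce the comparison, the two runs above differ only in the converted model directory and the tokenizer name, so they can be wrapped in one helper. This is a sketch of the same steps, not a fix: `translate_text` and `load_and_translate` are hypothetical names I made up, and the library calls are the ones already used in the snippets above.

```python
def translate_text(translator, tokenizer, text):
    """Run the same encode -> translate_batch -> decode steps as above."""
    # CTranslate2 expects string tokens, so convert the IDs back to tokens.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    results = translator.translate_batch([source])
    # Take the best hypothesis for the single input sentence.
    target = results[0].hypotheses[0]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))


def load_and_translate(model_dir, tokenizer_name, text):
    """Hypothetical convenience wrapper; loads both objects on each call."""
    # Imported lazily so translate_text can be tested without these packages.
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir)
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_name)
    return translate_text(translator, tokenizer, text)
```

Calling `load_and_translate("./elan-mt-bt-ja-en", "Mitsua/elan-mt-bt-ja-en", text)` and `load_and_translate("./fugumt-ja-en", "staka/fugumt-ja-en", text)` with the same `text` should reproduce the two results reported here.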
