[Request] Add support for CTranslate2 integration
#1 opened by solaoi
I attempted to use CTranslate2 with a HuggingFace model, but encountered an issue.
Here's what I tried:
- Converted the model to CTranslate2 format:
ct2-transformers-converter --model Mitsua/elan-mt-bt-ja-en --output_dir elan-mt-bt-ja-en
- Installed the required packages:
pip install ctranslate2 transformers sacremoses
- Used the following code for inference:
import ctranslate2
import transformers
translator = ctranslate2.Translator("./elan-mt-bt-ja-en")
tokenizer = transformers.AutoTokenizer.from_pretrained("Mitsua/elan-mt-bt-ja-en")
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("γγγ«γ‘γ―γsolaoiγ¨η³γγΎγγγδΌγγ§γγ¦γγ¨γ¦γε¬γγγ§γγ"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
Unfortunately, this produced invalid output:
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
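A failure like this, where the hypothesis collapses into one repeated token, can be flagged automatically when testing several converted models. The following is only a minimal sketch: `is_degenerate`, `max_share`, and `min_length` are hypothetical names and thresholds introduced for illustration, not part of CTranslate2 or transformers.

```python
from collections import Counter


def is_degenerate(tokens, max_share=0.9, min_length=8):
    """Return True if a single token makes up almost the whole hypothesis.

    tokens: list of token strings, e.g. results[0].hypotheses[0].
    max_share and min_length are illustrative thresholds (assumptions).
    """
    # Very short outputs are not judged; repetition there can be legitimate.
    if len(tokens) < min_length:
        return False
    # Count how often the most frequent token occurs.
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens) >= max_share
```

Passing the hypothesis token list from `translate_batch` to this function would flag the comma-only output above, while a normal sentence would pass.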
However, with the staka/fugumt-ja-en model, the same conversion and inference steps worked:
- Converted the model to CTranslate2 format:
ct2-transformers-converter --model staka/fugumt-ja-en --output_dir fugumt-ja-en
- Used the following code for inference:
import ctranslate2
import transformers
translator = ctranslate2.Translator("./fugumt-ja-en")
tokenizer = transformers.AutoTokenizer.from_pretrained("skata/fugumt-ja-en")
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("γγγ«γ‘γ―γsolaoiγ¨η³γγΎγγγδΌγγ§γγ¦γγ¨γ¦γε¬γγγ§γγ"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
This produced the expected output:
Hello, my name is Solaoi. I'm very happy to see you.
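Since both runs use the same steps and differ only in the model directory and tokenizer name, the inference code can be wrapped in one helper so that both models go through exactly the same path. This is a sketch of the steps above rather than an official recipe; `translate_with_ct2` is a hypothetical name, and it assumes a translator and tokenizer created as shown in the post.

```python
def translate_with_ct2(translator, tokenizer, text):
    """Translate one sentence with a CTranslate2 translator and an HF tokenizer.

    translator: a ctranslate2.Translator loaded from a converted model directory.
    tokenizer:  the matching transformers tokenizer.
    """
    # Encode to ids, then map the ids back to token strings for CTranslate2.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    # translate_batch takes a batch (list) of token lists.
    results = translator.translate_batch([source])
    # Take the best hypothesis for the first (and only) input.
    target = results[0].hypotheses[0]
    # Convert the output tokens back to ids and decode them to text.
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target))


# Usage, with the objects created as in the post:
#   translator = ctranslate2.Translator("./fugumt-ja-en")
#   tokenizer = transformers.AutoTokenizer.from_pretrained("staka/fugumt-ja-en")
#   print(translate_with_ct2(translator, tokenizer, "こんにちは"))
```

Calling it once per model with the same input sentence makes it easier to confirm that the only difference between the working and failing cases is the model itself.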