CTranslate2 BLEU comparison to MarianMT fine-tuned model

#1
by Adeptschneider - opened

Hello @gaudi I see you have noted in your model card description that CTranslate2 models have a BLEU score equally as good as their equivalent Opus models loaded in PyTorch. In my case, my model's BLEU score drops when I load it with CTranslate2, and I'd appreciate feedback on that. The model achieves a BLEU score of 9.23. I'm working on a machine translation task for Dyula to French; Dyula is a low-resource language. I also see you're building quite a few machine translation models with CTranslate2. Are you looking to put these models into production? What are you cooking?

Owner

Hey @Adeptschneider, I hope all is well on your end!

BLEU scores do tend to degrade when the checkpoint is quantized. In the CTranslate2 conversion command, float16 quantization is applied via the "--quantization float16" flag. That said, the degradation should typically only be about 1.0 BLEU point or so (based on some past issues raised in CTranslate2's GitHub repo). If you're seeing greater degradation, I may have to look at the config files generated by CTranslate2 and see if there is anything to tweak there. The original model checkpoint can also be reconverted with different flags to maintain precision and potentially a better score. The command I used to convert the original checkpoint is in the README (it was a command I found in one of michaelfeil's repos); that may be a good starting point for reconverting the original checkpoint.
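For reference, a rough sketch of reconverting the checkpoint with and without quantization via CTranslate2's Python converter API (the model ID below is just a placeholder for your fine-tuned Dyula-French checkpoint, not a real repo):

```python
# Sketch: convert a MarianMT checkpoint to CTranslate2 format,
# once at full precision and once with float16 quantization.
import ctranslate2

model_id = "your-username/marianmt-dyu-fr"  # placeholder for your checkpoint

# Full-precision conversion (no quantization applied):
converter = ctranslate2.converters.TransformersConverter(model_id)
converter.convert("ct2-dyu-fr-float32", force=True)

# float16 conversion, equivalent to the "--quantization float16" flag:
converter = ctranslate2.converters.TransformersConverter(model_id)
converter.convert("ct2-dyu-fr-float16", quantization="float16", force=True)
```

Comparing the two converted models on the same test set should tell you how much of the drop is actually due to quantization.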

The BLEU scores in the README are the generic scores listed in CTranslate2's GitHub repository; unfortunately, they're not specific to this model. Since Dyula is a low-resource language, the BLEU scores may be much lower than what is posted there to start with. Do you know what the original checkpoint's BLEU score is on the data you're benchmarking with?
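If it helps, here's a minimal sketch of how you could score the original MarianMT checkpoint and the converted model on the same test set with sacrebleu (the model ID, directory, and test sentences are placeholders):

```python
# Sketch: compare BLEU of the original MarianMT checkpoint vs. its CTranslate2 conversion.
import ctranslate2
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

model_id = "your-username/marianmt-dyu-fr"   # placeholder: original checkpoint
ct2_dir = "ct2-dyu-fr-float16"               # placeholder: converted model directory

tokenizer = MarianTokenizer.from_pretrained(model_id)
sources = ["..."]      # Dyula test sentences
references = ["..."]   # French reference translations

# Baseline: original checkpoint loaded in PyTorch.
model = MarianMTModel.from_pretrained(model_id)
inputs = tokenizer(sources, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
hyp_pt = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Converted checkpoint loaded with CTranslate2.
translator = ctranslate2.Translator(ct2_dir, device="cpu")
tokenized = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sources]
results = translator.translate_batch(tokenized)
hyp_ct2 = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0]), skip_special_tokens=True)
    for r in results
]

print("MarianMT BLEU:   ", sacrebleu.corpus_bleu(hyp_pt, [references]).score)
print("CTranslate2 BLEU:", sacrebleu.corpus_bleu(hyp_ct2, [references]).score)
```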

We aren't currently using these models in production, but we are experimenting with several of them for production use cases. Our challenge is scale (the volume of translation requests): we're trying to identify the machine translation solution that balances fast inference with translation quality (to some degree). I set up a pipeline that automatically pulls down the Opus models, converts them to CTranslate2 models, and then pushes them back up as new HF repos. I was originally pushing these as private repos, but I figured they may be helpful for others to leverage as well; hence the volume of repos. :)
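As a rough sketch, that kind of convert-and-republish pipeline looks something like this with the huggingface_hub API (the repo names below are placeholders, not the actual repos):

```python
# Sketch: convert an Opus checkpoint to CTranslate2 and republish it as its own HF repo.
import ctranslate2
from huggingface_hub import HfApi

source_model = "Helsinki-NLP/opus-mt-en-fr"      # example Opus checkpoint
target_repo = "your-username/opus-mt-en-fr-ct2"  # placeholder destination repo
output_dir = "opus-mt-en-fr-ct2"

# Convert the Opus checkpoint to CTranslate2 format with float16 quantization.
converter = ctranslate2.converters.TransformersConverter(source_model)
converter.convert(output_dir, quantization="float16", force=True)

# Push the converted model up as a new Hugging Face repo.
api = HfApi()
api.create_repo(target_repo, exist_ok=True)   # pass private=True for a private repo
api.upload_folder(folder_path=output_dir, repo_id=target_repo)
```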

I hope this provides at least some help! When I get the chance, I can pull down this checkpoint as well and see what I can identify! Hopefully others in the HF community can provide some insight here as well!
