Base model


Hello!

Great job here! I wanted to let you know that I think you'll get even stronger results if you take a base model that has a better understanding of Arabic (and has a tokenizer that works better with it!). Some examples:

The former uses an Arabic-specific tokenizer (see the vocabulary here), and the latter uses the tokenizer from xlm-roberta, which is suited for many languages, including Arabic. The current base model, https://huggingface.co/tomaarsen/mpnet-base-all-nli-triplet, is based on the primarily English tokenizer from mpnet-base. You can look at its vocabulary here. As you can see, it has some Arabic tokens, so models using this tokenizer can learn to understand Arabic, but they might struggle more than with, e.g., the AraBERT tokenizer.
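As a minimal sketch of the tokenizer difference, assuming the standard Hub checkpoints microsoft/mpnet-base and aubmindlab/bert-base-arabertv02 for mpnet-base and AraBERT respectively:

```python
from transformers import AutoTokenizer

# An example Arabic sentence (roughly: "Machine learning is an important field")
sentence = "التعلم الآلي مجال مهم"

# Primarily English tokenizer used by mpnet-base (assumed checkpoint: microsoft/mpnet-base)
mpnet_tok = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
# Arabic-specific tokenizer from AraBERT (assumed checkpoint: aubmindlab/bert-base-arabertv02)
arabert_tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

print("mpnet-base:", mpnet_tok.tokenize(sentence))
print("AraBERT   :", arabert_tok.tokenize(sentence))
```

The mpnet-base tokenizer will typically split Arabic text into many short subword fragments (or unknown tokens), while the AraBERT tokenizer tends to produce far fewer, more meaningful pieces.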

  • Tom Aarsen

Hello Tom,

Thank you for your message.

Yes, I have tried some base models specifically designed to understand Arabic and achieved impressive results, with similarity scores ranging between 85% and 87%. However, I'm currently exploring whether models trained on English data can also learn and perform well in a new language. So far, the results have been promising, as you have seen. Essentially, I am conducting an investigation into the capabilities of these models.
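For reference, a similarity evaluation of this kind is typically the Spearman correlation between the model's cosine similarities and gold STS scores. A minimal sketch with sentence-transformers, using two hypothetical Arabic sentence pairs in place of the real evaluation data and one of the fine-tuned models mentioned later in this thread:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Hypothetical Arabic sentence pairs with gold similarity scores in [0, 1];
# a real evaluation would use the full Arabic STS test set instead.
sentences1 = ["القطة تجلس على السجادة", "الطقس مشمس اليوم"]
sentences2 = ["قطة تستلقي على البساط", "أحب قراءة الكتب"]
gold_scores = [0.85, 0.05]

# Any of the fine-tuned models from the discussion could be plugged in here.
model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka")

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each pair, then Spearman correlation against the gold scores.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()
print("Spearman:", spearmanr(cosine_scores, gold_scores).correlation)
```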

Again, thanks for your recommendation. I will definitely go and try it.

Hello Tom,

Regarding our previous discussion: as I review all the models I have used, I am discovering some interesting insights, such as certain multilingual models achieving results similar to the English models despite having only a few Arabic tokens, while base models that are proficient in Arabic performed well during fine-tuning. I am keen to hear your thoughts on this.

Collection: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e
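Since the collection is about Matryoshka embeddings, a quick sketch of loading one of its models at a reduced embedding size (truncate_dim is available in recent sentence-transformers releases; 256 is just an example dimension):

```python
from sentence_transformers import SentenceTransformer

# Load one of the collection's models, truncating embeddings to the first 256
# dimensions (Matryoshka models are trained so the leading dimensions stay useful).
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka",
    truncate_dim=256,
)

embeddings = model.encode(["مثال على جملة عربية"])  # "An example of an Arabic sentence"
print(embeddings.shape)  # (1, 256)
```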

Very interesting results! Certainly not what I had expected.
The good performance of https://huggingface.co/Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka is reasonable, though, because its base model, https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2, was trained specifically for paraphrases, which is quite similar to the STS task that you're evaluating with here, and it was trained multilingually. LaBSE is a similar case, I believe. I'm mostly surprised at the relatively poor performance of the Arabic-only models like AraBERT.

Something to consider is that my https://huggingface.co/tomaarsen/mpnet-base-all-nli-triplet and https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 were both already trained on the English AllNLI, and I think your Arabic NLI is just a translation, right? (I'm just guessing, as it has roughly the same number of training samples.)
That might correlate with the improved performance, although I'm not sure how much that says about whether the model will generalize well to other Arabic texts.

  • Tom Aarsen

Thanks for your feedback!

Yes, the data was translated using NMT, and the training set size is very close to the original. Also, I will soon share the results of testing the base models vs. the fine-tuned ones; they are nice for almost all models.
