Base model


Hello!

Great job here! I wanted to let you know that I think you'll get even stronger results if you take a base model that has a better understanding of Arabic (and has a tokenizer that works better with it!). Some examples:

The former uses an Arabic-specific tokenizer (see the vocabulary here), and the latter uses the tokenizer from xlm-roberta, which is suited for many languages, including Arabic. The current base model, https://huggingface.co/tomaarsen/mpnet-base-all-nli-triplet, is based on the primarily English tokenizer from mpnet-base. You can look at its vocabulary here. As you can see, it has some Arabic tokens, so models using this tokenizer can learn to understand Arabic, but they might struggle more than with, e.g., the AraBERT tokenizer.
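As a minimal sketch of the tokenizer difference, assuming the standard Hub checkpoints microsoft/mpnet-base and aubmindlab/bert-base-arabertv02 for mpnet-base and AraBERT respectively:

```python
from transformers import AutoTokenizer

# An example Arabic sentence (roughly: "Machine learning is an important field")
sentence = "التعلم الآلي مجال مهم"

# Primarily English tokenizer used by mpnet-base (assumed checkpoint: microsoft/mpnet-base)
mpnet_tok = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
# Arabic-specific tokenizer from AraBERT (assumed checkpoint: aubmindlab/bert-base-arabertv02)
arabert_tok = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

print("mpnet-base:", mpnet_tok.tokenize(sentence))
print("AraBERT   :", arabert_tok.tokenize(sentence))
```

The mpnet-base tokenizer will typically split Arabic text into many short subword fragments (or unknown tokens), while the AraBERT tokenizer tends to produce far fewer, more meaningful pieces.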

  • Tom Aarsen

Hello Tom,

Thank you for your message.

Yes, I have tried some base models specifically designed to understand Arabic and achieved impressive results, with similarity scores ranging between 85% and 87%. However, I'm currently exploring whether models trained on English data can also learn and perform well in a new language. So far, the results have been promising, as you have seen. Essentially, I am conducting an investigation into the capabilities of these models.
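For reference, a similarity evaluation of this kind is typically the Spearman correlation between the model's cosine similarities and gold STS scores. A minimal sketch with sentence-transformers, using two hypothetical Arabic sentence pairs in place of the real evaluation data and one of the fine-tuned models mentioned later in this thread:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Hypothetical Arabic sentence pairs with gold similarity scores in [0, 1];
# a real evaluation would use the full Arabic STS test set instead.
sentences1 = ["القطة تجلس على السجادة", "الطقس مشمس اليوم"]
sentences2 = ["قطة تستلقي على البساط", "أحب قراءة الكتب"]
gold_scores = [0.85, 0.05]

# Any of the fine-tuned models from the discussion could be plugged in here.
model = SentenceTransformer("Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka")

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each pair, then Spearman correlation against the gold scores.
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()
print("Spearman:", spearmanr(cosine_scores, gold_scores).correlation)
```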

Again, thanks for your recommendation. I will definitely go and try it.

Hello Tom,

Regarding our previous discussion: as I review all the models I have used, I am discovering some interesting insights, such as certain multilingual models achieving results similar to the English models despite having only a few Arabic tokens, while base models that are proficient in Arabic performed well during fine-tuning. I am keen to hear your thoughts on this.

Collection: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e
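Since the collection is about Matryoshka embeddings, a quick sketch of loading one of its models at a reduced embedding size (truncate_dim is available in recent sentence-transformers releases; 256 is just an example dimension):

```python
from sentence_transformers import SentenceTransformer

# Load one of the collection's models, truncating embeddings to the first 256
# dimensions (Matryoshka models are trained so the leading dimensions stay useful).
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka",
    truncate_dim=256,
)

embeddings = model.encode(["مثال على جملة عربية"])  # "An example of an Arabic sentence"
print(embeddings.shape)  # (1, 256)
```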

Very interesting results! Certainly not what I had expected.
The good performance of https://huggingface.co/Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka is reasonable, though, because its base model, https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2, was trained specifically for paraphrases, which is quite similar to the STS task that you're evaluating with here, and it was trained multilingually. LaBSE is a similar case, I believe. I'm mostly surprised at the relatively poor performance of the Arabic-only models like AraBERT.

Something to consider is that my https://huggingface.co/tomaarsen/mpnet-base-all-nli-triplet and https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2 were both already trained on the English AllNLI, and I think your Arabic NLI is just a translation, right? (I'm just guessing, as it has roughly the same number of training samples.)
That might correlate with the improved performance, although I'm not sure how much that says about whether the model will generalize well to other Arabic texts.

  • Tom Aarsen

Thanks for your feedback!

Yes, the data was translated using NMT, and the training set size is very close to the original. Also, I will soon share the results of testing the base models vs. the fine-tuned ones; they are nice for almost all models.
