Credits

#1
by roberto-viviani - opened

Hello,

I am inquiring about how to credit the use of this model in a manuscript. I understand that the method is described in arXiv:2004.09813, but I am not clear whether any further work went into the model that I should acknowledge.

Thank you in advance.

Hello!

This model was trained with this script: https://github.com/UKPLab/sentence-transformers/blob/v3.0-pre-release/examples/training/multilingual/make_multilingual.py, but with max_sentences_per_language set to 5000. This almost corresponds to https://arxiv.org/pdf/2004.09813, but with a slight variation: rather than applying the MSE Loss between

  • the Teacher EN embeddings & the Student EN embeddings, and
  • the Teacher EN embeddings & the Student non-EN embeddings,

this model was only trained by applying MSE Loss between

  • the Teacher EN embeddings & the Student non-EN embeddings.


This is actually an oversight in the implementation, which I still have to fix.
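For concreteness, here is a rough sketch of the difference between the two objectives. This is illustrative only: the teacher and student model names and the sentence pair below are placeholders rather than the actual models and data behind this model, and it only shows which MSE terms enter the loss, not the full training loop from make_multilingual.py.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Placeholder teacher/student models (not the ones actually used for this model)
teacher = SentenceTransformer("sentence-transformers/paraphrase-mpnet-base-v2")
student = SentenceTransformer("xlm-roberta-base")

# One parallel English / non-English sentence pair (placeholder data)
en = ["A cat sits on the mat."]
de = ["Eine Katze sitzt auf der Matte."]

teacher_en = teacher.encode(en, convert_to_tensor=True)
student_en = student.encode(en, convert_to_tensor=True)
student_de = student.encode(de, convert_to_tensor=True)

# arXiv:2004.09813: MSE on both the EN -> EN and the EN -> non-EN pairs
loss_paper = F.mse_loss(student_en, teacher_en) + F.mse_loss(student_de, teacher_en)

# This model (due to the oversight above): only the EN -> non-EN term
loss_this_model = F.mse_loss(student_de, teacher_en)
```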

So, in short:

  • This model uses a (presumably) weaker version of the method from arXiv:2004.09813.
  • This model is only trained on ~30k pairs. As a result, it's not going to be great. You can see this in the STS17 scores (about 0.45 to 0.55, whereas you can expect 0.70-0.85 with a lot more training). Further training would substantially improve the performance.
  • This model is not representative of the performance of arXiv:2004.09813.

And if you're looking to cite this model (though I think you perhaps should not), then you can use:

  • arXiv:2004.09813 for the method
  • arXiv:1908.10084 for the implementation

Feel free to ask if you have some more questions.

  • Tom Aarsen

Thank you for your quick and extensive reply.

From your reply I understand that you yourself fitted the model. With your permission, I would mention your name as the author of the model card in the manuscript.

In my application, your model performs much like the multilingual BERT from sentence-transformers/paraphrase-multilingual-mpnet-base-v2, but a bit better, so I am inclined to use it, also because there are few multilingual models. The best models in my application are, perhaps unsurprisingly, OpenAI's large embedding model and Google's Gecko.

However, my application is not really a useful benchmark. We have a dataset of 490 German-speaking individuals who responded to two rating scales (with 5 and 6 subscales, respectively). We look at whether the pattern of response correlations between subscales and item pairs is explained by the semantic similarity given by the cosine distance of the embeddings. Users are not rating semantic similarity here; they are just assessing whether the statements reflect their own experience of themselves. However, there are genuine response correlations that are not explained by semantic similarity, so there is no ground truth.
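For illustration, this is roughly how we set up that comparison; the item texts, response matrix, and model name below are placeholders, not our actual data:

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Placeholder item texts and Likert responses (participants x items)
items = ["Ich fühle mich oft angespannt.", "Ich mache mir viele Sorgen.", "Ich bin meist entspannt."]
responses = np.random.randint(1, 6, size=(490, len(items)))

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
emb = model.encode(items, normalize_embeddings=True)

semantic_sim = emb @ emb.T                             # cosine similarity between item embeddings
response_corr = np.corrcoef(responses, rowvar=False)   # observed item-by-item response correlations

# Compare the two matrices over the unique item pairs (upper triangle)
iu = np.triu_indices(len(items), k=1)
rho, p = spearmanr(semantic_sim[iu], response_corr[iu])
print(f"Spearman rho between semantic similarity and response correlations: {rho:.2f}")
```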

> From your reply I understand that you yourself fitted the model. With your permission, I would mention your name as the author of the model card in the manuscript.

Indeed, I created the training script & trained the model myself. Feel free to name me as the author.

Very interesting to hear that this model performs well compared to paraphrase-multilingual-mpnet-base-v2; I would not have expected it. Perhaps one explanation is that the German portion of the training data for paraphrase-multilingual-mpnet-base-v2 is proportionally much smaller than for this model (German makes up roughly 1/6th of this model's training data).

That is quite an interesting experiment, thanks for sharing!

I don't mean to overstep, but in case you are looking for more (open) models that might work well, you might get good results from:

  • BAAI/bge-m3
  • intfloat/multilingual-e5-large-instruct
  • jinaai/jina-embeddings-v2-base-de

The first two are my go-to for multilingual embedding models (alongside paraphrase-multilingual-mpnet-base-v2), and the last was specifically designed for German. I suspect that each will outperform both this model and paraphrase-multilingual-mpnet-base-v2.

  • Tom Aarsen

Thank you for the suggestions -- I now tested the embeddings of these models.

As I said, my application has no "ground truth", but I assess the model by the extent to which it predicts the correlations of responses of items in psychological scales. Your model clearly outperformed jina, was somewhat superior to bge-m3 (whose performance was nevertheless good, the same as paraphrase-multilingual-mpnet-base-v2), and achieved the same performance as multilingual-e5-large-instruct. It might be worth testing more formally.

Best,
Roberto Viviani

Thanks for giving those models a try. Best of luck regarding your upcoming manuscript!

  • Tom Aarsen
