
How is it able to give similarity scores for Russian?

#28
by drmeir - opened

I tried some Russian words and got similarity scores. Given that this model was trained on English datasets, how is it able to give similarity scores for Russian? Not only are the words not in English, the letters are not even from the English alphabet...

So did you understand how that works? I wonder how that happens too.

Sentence Transformers org

Hello!

This is because this model relies on the BERT tokenizer, which has tokens for non-English characters too. For example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tokenized = model.tokenizer("Привет, как поживаешь?")
# {'input_ids': [101, 1194, 16856, 10325, 25529, 15290, 22919, 1010, 1189, 10260, 23925, 1194, 14150, 29743, 10325, 25529, 10260, 15290, 29753, 23742, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# So, as you can see, it has tokens for the Russian text

model.tokenizer.decode(tokenized["input_ids"])
# '[CLS] привет, как поживаешь? [SEP]'
# Decoding the IDs recovers the original text (lowercased, since this tokenizer is uncased)
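
To see what those IDs correspond to, you can also look at the subword pieces directly (a quick sketch reusing the model from above; the exact pieces depend on the vocabulary, but here each Cyrillic character ends up as its own WordPiece):

model.tokenizer.tokenize("Привет, как поживаешь?")
# Illustrative output: one WordPiece per Cyrillic character, e.g.
# ['п', '##р', '##и', '##в', '##е', '##т', ',', 'к', '##а', '##к', ...]
# which is why the 20 non-special input_ids above line up with the
# 20 characters of the sentence.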

So, the model can keep Russian words/tokens apart, but it was never taught that "Привет, как поживаешь?" is similar to "Привет, как ты?", and dissimilar to "Сегодня утром он поехал на работу".

from sentence_transformers.util import cos_sim

embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1])) # should be high
# tensor([[0.7669]])
print(cos_sim(embeddings[0], embeddings[2])) # should be low
# tensor([[0.5084]])

Interestingly, NLP models trained on one language tend to pick up some mild details from other languages that they have seen a little of during (pre-)training, so perhaps that's why the similar sentences are indeed more similar according to this model. It could also be arbitrary.

Either way, you should use a model that was trained for Russian, e.g. https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1]))
# tensor([[0.9786]])
print(cos_sim(embeddings[0], embeddings[2]))
# tensor([[0.2380]])

Much better!

  • Tom Aarsen
tomaarsen changed discussion status to closed
