How is it able to give similarity scores for Russian?
I tried some Russian words and got similarity scores. Given that this model was trained on English datasets, how is it able to give similarity scores for Russian? Not only are the words not in English, the letters aren't even from the English alphabet...
So did you understand how that works? I wonder how that happens too.
Hello!
This is because this model relies on the BERT tokenizer, which has tokens for non-English characters too. For example:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tokenized = model.tokenizer("Привет, как поживаешь?")
# {'input_ids': [101, 1194, 16856, 10325, 25529, 15290, 22919, 1010, 1189, 10260, 23925, 1194, 14150, 29743, 10325, 25529, 10260, 15290, 29753, 23742, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# So, as you can see, it has tokens for the Russian text
model.tokenizer.decode(tokenized["input_ids"])
# '[CLS] привет, как поживаешь? [SEP]'
# Decoding the token IDs recovers the original text
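Under the hood, the BERT tokenizer uses WordPiece: each word is greedily split into the longest matching subwords from a fixed vocabulary, falling back to [UNK] only when no piece matches at all. A minimal sketch of that idea, with a tiny made-up vocabulary (the real BERT vocab has ~30k entries, including many single Cyrillic characters, which is why Russian text survives tokenization):

```python
# Toy greedy longest-match subword tokenizer, illustrating the WordPiece idea.
# This vocabulary is invented for the example; it is not BERT's real vocab.
VOCAB = {"при", "##вет", "п", "##р", "##и", "##в", "##е", "##т", "[UNK]"}

def wordpiece(word):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest vocabulary entry matching at this position;
        # non-initial pieces carry the "##" continuation prefix.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece("привет"))  # ['при', '##вет']
```

Because the real vocabulary contains individual non-Latin characters, even a word it has never seen as a whole can still be broken into known pieces rather than [UNK].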
So, the model can keep Russian words/tokens apart, but it was never taught that "Привет, как поживаешь?" is similar to "Привет, как ты?", and dissimilar to "Сегодня утром он поехал на работу".
from sentence_transformers.util import cos_sim
embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1])) # should be high
# tensor([[0.7669]])
print(cos_sim(embeddings[0], embeddings[2])) # should be low
# tensor([[0.5084]])
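For completeness, cos_sim is just cosine similarity: the dot product of the two embeddings divided by the product of their norms. A minimal pure-Python sketch of the same computation (the sentence_transformers version operates on batched tensors):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

The score depends only on the angle between the embedding vectors, not their magnitudes, which is why it is a convenient similarity measure for sentence embeddings.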
Interestingly, NLP models trained primarily on one language tend to pick up some signal from other languages they saw in small amounts during (pre-)training, so perhaps that's why the similar sentences do score as more similar here. It could also be arbitrary.
Either way, you should use a model that was trained for Russian, e.g. https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1]))
# tensor([[0.9786]])
print(cos_sim(embeddings[0], embeddings[2]))
# tensor([[0.2380]])
Much better!
- Tom Aarsen