
How is it able to give similarity scores for Russian?

#28
by drmeir - opened

I tried some Russian words and got similarity scores. Given that this model was trained on English datasets, how is it able to give similarity scores for Russian? Not only are the words not in English, the letters are not even from the English alphabet...

So did you understand how that works? I wonder how that happens too.

Sentence Transformers org

Hello!

This is because this model relies on the BERT tokenizer, which has tokens for non-English characters too. For example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tokenized = model.tokenizer("Привет, как поживаешь?")
# {'input_ids': [101, 1194, 16856, 10325, 25529, 15290, 22919, 1010, 1189, 10260, 23925, 1194, 14150, 29743, 10325, 25529, 10260, 15290, 29753, 23742, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# So, as you can see, it has tokens for the Russian text

model.tokenizer.decode(tokenized["input_ids"])
# '[CLS] привет, как поживаешь? [SEP]'
# Decoding the IDs recovers the original text (lowercased, since this tokenizer is uncased)
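
To see what those IDs correspond to, you can also look at the subword pieces directly (a quick sketch reusing the model from above; the exact pieces depend on the vocabulary, but here each Cyrillic character ends up as its own WordPiece):

model.tokenizer.tokenize("Привет, как поживаешь?")
# Illustrative output: one WordPiece per Cyrillic character, e.g.
# ['п', '##р', '##и', '##в', '##е', '##т', ',', 'к', '##а', '##к', ...]
# which is why the 20 non-special input_ids above line up with the
# 20 characters of the sentence.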

So, the model can keep Russian words/tokens apart, but it was never taught that "Привет, как поживаешь?" is similar to "Привет, как ты?", and dissimilar to "Сегодня утром он поехал на работу".

from sentence_transformers.util import cos_sim

embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1])) # should be high
# tensor([[0.7669]])
print(cos_sim(embeddings[0], embeddings[2])) # should be low
# tensor([[0.5084]])

Interestingly, NLP models trained on one language tend to pick up some mild details from other languages that they have seen a little of during (pre-)training, so perhaps that's why the similar sentences are indeed more similar according to this model. It could also be arbitrary.

Either way, you should use a model that was trained for Russian, e.g. https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["Привет, как поживаешь?", "Привет, как ты?", "Сегодня утром он поехал на работу"])
print(cos_sim(embeddings[0], embeddings[1]))
# tensor([[0.9786]])
print(cos_sim(embeddings[0], embeddings[2]))
# tensor([[0.2380]])

Much better!

  • Tom Aarsen
tomaarsen changed discussion status to closed
