sergeyzh committed on
Commit
66b677b
1 Parent(s): 0959ed2

Update README.md

Files changed (1)
  1. README.md +129 -133
README.md CHANGED
@@ -1,133 +1,129 @@
- ---
- language:
- - ru
-
- pipeline_tag: sentence-similarity
-
- tags:
- - russian
- - pretraining
- - embeddings
- - feature-extraction
- - sentence-similarity
- - sentence-transformers
- - transformers
-
- license: mit
- base_model: cointegrated/LaBSE-en-ru
-
- ---
-
- ## Base BERT for semantic text similarity (STS) on GPU
-
- A high-quality BERT model for computing Russian sentence embeddings. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and comparable speed. It is the second model in the BERT-STS series and the best in quality.
-
- On STS and related tasks (PI, NLI, SA, TI) for Russian, it competes in quality with [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) while using 77% less memory and running about 80% faster.
-
- ## Choosing a model from the BERT-STS series (quality/speed)
- | Recommended model | CPU <br> (STS; snt/s) | GPU <br> (STS; snt/s) |
- |:---------------------------------|:---------:|:---------:|
- | Fast model (speed) | [rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) <br> (0.797; 1190) | - |
- | Base model (quality) | [rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) <br> (0.815; 539) | **LaBSE-ru-sts <br> (0.845; 1894)** |
-
- ## The best model for use in RAG LLMs with GPU inference:
- - high quality on fuzzy queries (excellent metrics on the STS, PI, and NLI tasks);
- - low influence of a text's emotional tone on its embedding (average scores on the SA and TI tasks);
- - easy extension of the text document base (GPU throughput of > 1k sentences per second);
- - faster kNN matching (reduced embedding dimension of 768);
- - ease of use (compatible with [SentenceTransformer](https://github.com/UKPLab/sentence-transformers)).
-
- ## Using the model with the `transformers` library:
-
- ```python
- # pip install transformers sentencepiece
- import torch
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
- model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
- # model.cuda()  # uncomment if you have a GPU
-
- def embed_bert_cls(text, model, tokenizer):
-     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
-     with torch.no_grad():
-         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
-     # take the [CLS] token embedding and L2-normalize it
-     embeddings = model_output.last_hidden_state[:, 0, :]
-     embeddings = torch.nn.functional.normalize(embeddings)
-     return embeddings[0].cpu().numpy()
-
- print(embed_bert_cls('привет мир', model, tokenizer).shape)
- # (768,)
- ```
-
- ## Using with `sentence_transformers`:
- ```python
- from sentence_transformers import SentenceTransformer, util
-
- model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
-
- sentences = ["привет мир", "hello world", "здравствуй вселенная"]
- embeddings = model.encode(sentences)
- print(util.dot_score(embeddings, embeddings))
- ```
-
- ## Metrics
- Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | STS | PI | NLI | SA | TI |
- |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
- | **sergeyzh/LaBSE-ru-sts** | **0.845** | **0.737** | **0.481** | **0.805** | **0.957** |
- | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
-
- **Tasks:**
-
- - Semantic text similarity (**STS**);
- - Paraphrase identification (**PI**);
- - Natural language inference (**NLI**);
- - Sentiment analysis (**SA**);
- - Toxicity identification (**TI**).
-
- ## Speed and size
-
- On the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
- |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
- | **sergeyzh/LaBSE-ru-sts** |**42.835** | **8.561** | **490** | **768** | **512** | **55083** |
- | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
-
-
-
- When using batches with `sentence_transformers`:
-
- ```python
- from sentence_transformers import SentenceTransformer
-
- model_name = 'sergeyzh/LaBSE-ru-sts'
- model = SentenceTransformer(model_name, device='cpu')
- sentences = ["CPU speed test on a Ryzen 7 3800X: batch = 50"] * 50
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 882 ms ± 104 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 50/0.882 = 57 snt/s
-
- model = SentenceTransformer(model_name, device='cuda')
- sentences = ["GPU speed test on an RTX 3060: batch = 1500"] * 1500
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 792 ms ± 29 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 1500/0.792 = 1894 snt/s
- ```
-
- ## Related resources
- Questions about using the model are discussed in the [Russian-language NLP chat](https://t.me/natural_language_processing).
-
 
+ ---
+ language:
+ - ru
+
+ pipeline_tag: sentence-similarity
+
+ tags:
+ - russian
+ - pretraining
+ - embeddings
+ - feature-extraction
+ - sentence-similarity
+ - sentence-transformers
+ - transformers
+
+ license: mit
+ base_model: cointegrated/LaBSE-en-ru
+
+ ---
+
+ ## Base BERT for semantic text similarity (STS) on GPU
+
+ A high-quality BERT model for computing Russian sentence embeddings. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and comparable speed.
+
+ ## Using the model with the `transformers` library:
+
+ ```python
+ # pip install transformers sentencepiece
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
+ model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
+ # model.cuda()  # uncomment if you have a GPU
+
+ def embed_bert_cls(text, model, tokenizer):
+     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+     with torch.no_grad():
+         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+     # take the [CLS] token embedding and L2-normalize it
+     embeddings = model_output.last_hidden_state[:, 0, :]
+     embeddings = torch.nn.functional.normalize(embeddings)
+     return embeddings[0].cpu().numpy()
+
+ print(embed_bert_cls('привет мир', model, tokenizer).shape)
+ # (768,)
+ ```
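Because the embeddings are L2-normalized, the dot product of two of them equals their cosine similarity. A minimal dependency-free sketch of that comparison step, using made-up 768-dimensional vectors in place of real model outputs:

```python
import math
import random

# made-up 768-dim vectors standing in for two sentence embeddings
random.seed(0)
a = [random.gauss(0, 1) for _ in range(768)]
b = [random.gauss(0, 1) for _ in range(768)]

def l2_normalize(v):
    """Scale a vector to unit length, as torch.nn.functional.normalize does."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = l2_normalize(a), l2_normalize(b)

# for unit vectors the dot product is exactly the cosine similarity
cos_sim = sum(x * y for x, y in zip(a, b))
print(round(cos_sim, 3))
```

With real embeddings the same dot product is what `util.dot_score` computes below.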
+
+ ## Using with `sentence_transformers`:
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
+
+ sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+ embeddings = model.encode(sentences)
+ print(util.dot_score(embeddings, embeddings))
+ ```
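The `dot_score` matrix above gives pairwise similarities; ranking a corpus by those scores is the nearest-neighbour lookup used in retrieval. A toy pure-Python sketch of that ranking logic (hand-made 3-dim vectors stand in for real 768-dim `model.encode` outputs; `top_k` is a hypothetical helper, not part of the library):

```python
# toy stand-in "embeddings" (3-dim instead of 768) keyed by sentence
corpus = {
    "привет мир":           [1.0, 0.0, 0.0],
    "hello world":          [0.9, 0.1, 0.0],
    "здравствуй вселенная": [0.5, 0.5, 0.0],
}

def top_k(query_vec, corpus, k=2):
    # rank sentences by dot product with the query vector
    # (equals cosine similarity when all vectors are L2-normalized)
    scored = sorted(corpus, key=lambda s: -sum(q * c for q, c in zip(query_vec, corpus[s])))
    return scored[:k]

print(top_k([1.0, 0.0, 0.0], corpus))  # → ['привет мир', 'hello world']
```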
+
+ ## Metrics
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | STS | PI | NLI | SA | TI |
+ |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
+ | **sergeyzh/LaBSE-ru-sts** | **0.845** | **0.737** | **0.481** | **0.805** | **0.957** |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
+
+ **Tasks:**
+
+ - Semantic text similarity (**STS**);
+ - Paraphrase identification (**PI**);
+ - Natural language inference (**NLI**);
+ - Sentiment analysis (**SA**);
+ - Toxicity identification (**TI**).
+
+ ## Speed and size
+
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
+ |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
+ | **sergeyzh/LaBSE-ru-sts** |**42.835** | **8.561** | **490** | **768** | **512** | **55083** |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
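Raw batch throughput (sentences per second) can be estimated with a simple wall-clock loop. A sketch with a hypothetical `throughput` helper; a stand-in `encode` is timed here so the snippet runs without downloading the model, and a real `SentenceTransformer('sergeyzh/LaBSE-ru-sts').encode` can be substituted to benchmark it:

```python
import time

def throughput(encode, sentences, runs=3):
    """Return sentences per second, taking the best of several timed runs."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        encode(sentences)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / best

# stand-in encoder: replace with a real model's .encode to benchmark it
dummy_encode = lambda batch: [[0.0] * 768 for _ in batch]
print(round(throughput(dummy_encode, ["тест"] * 50)))
```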
+
+
+ Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:
+
+ |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
+ |CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | **0.448** |
+ |GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | **0.497** |
+ |GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | **0.644** | 0.586 | 0.545 | 0.605 |
+ |HeadlineClassification | Accuracy | 0.772 | **0.793** | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
+ |InappropriatenessClassification | Accuracy | **0.646** | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
+ |KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | **0.566** |
+ |RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | **0.807** |
+ |RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
+ |RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | **0.741** |
+ |RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | **0.653** |
+ |RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
+ |RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | **0.582** |
+ |RuSciBenchGRNTIClusteringP2P | V-measure | **0.522** | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
+ |RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | **0.445** |
+ |RuSciBenchOECDClusteringP2P | V-measure | **0.473** | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
+ |SensitiveTopicsClassification | Accuracy | **0.285** | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
+ |TERRaClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
+
+ |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
+ |Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | **0.588** |
+ |Clustering | V-measure | 0.526 | 0.519 | 0.513 | **0.538** | 0.513 | 0.503 | 0.525 |
+ |MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | **0.361** | 0.314 | 0.329 | 0.353 |
+ |PairClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
+ |Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
+ |Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | **0.774** |
+ |STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
+ |Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | **0.630** |
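The Average row is the plain mean of the seven category scores above; checking it for LaBSE-ru-sts:

```python
# per-category ruMTEB scores for LaBSE-ru-sts, copied from the summary table
scores = {
    "Classification": 0.524,
    "Clustering": 0.513,
    "MultiLabelClassification": 0.340,
    "PairClassification": 0.587,
    "Reranking": 0.688,
    "Retrieval": 0.637,
    "STS": 0.788,
}
average = sum(scores.values()) / len(scores)
print(round(average, 3))  # → 0.582, matching the Average row
```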
+
+
+
+