sergeyzh committed on
Commit
66b677b
1 Parent(s): 0959ed2

Update README.md

Files changed (1)
  1. README.md +129 -133
README.md CHANGED
@@ -1,133 +1,129 @@
- ---
- language:
- - ru
-
- pipeline_tag: sentence-similarity
-
- tags:
- - russian
- - pretraining
- - embeddings
- - feature-extraction
- - sentence-similarity
- - sentence-transformers
- - transformers
-
- license: mit
- base_model: cointegrated/LaBSE-en-ru
-
- ---
-
- ## Base BERT for semantic text similarity (STS) on GPU
-
- A high-quality BERT model for computing Russian sentence embeddings. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and comparable speed. It is the second model in the BERT-STS series and the best in quality.
-
- On STS and related tasks (PI, NLI, SA, TI) for Russian, it competes in quality with [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) while using 77% less memory and running about 80% faster.
-
- ## Choosing a model from the BERT-STS series (quality/speed)
- | Recommended model | CPU <br> (STS; snt/s) | GPU <br> (STS; snt/s) |
- |:---------------------------------|:---------:|:---------:|
- | Fast model (speed) | [rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) <br> (0.797; 1190) | - |
- | Base model (quality) | [rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) <br> (0.815; 539) | **LaBSE-ru-sts <br> (0.845; 1894)** |
-
- ## The best model for use in RAG LLMs with GPU inference:
- - high quality on fuzzy queries (excellent metrics on the STS, PI, and NLI tasks);
- - low influence of a text's emotional tone on its embedding (average scores on the SA and TI tasks);
- - easy extension of the text document base (GPU throughput of > 1k sentences per second);
- - faster kNN matching (reduced embedding dimension of 768);
- - ease of use (compatible with [SentenceTransformer](https://github.com/UKPLab/sentence-transformers)).
-
- ## Using the model with the `transformers` library:
-
- ```python
- # pip install transformers sentencepiece
- import torch
- from transformers import AutoTokenizer, AutoModel
- tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
- model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
- # model.cuda()  # uncomment if you have a GPU
-
- def embed_bert_cls(text, model, tokenizer):
-     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
-     with torch.no_grad():
-         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
-     # take the [CLS] token embedding and L2-normalize it
-     embeddings = model_output.last_hidden_state[:, 0, :]
-     embeddings = torch.nn.functional.normalize(embeddings)
-     return embeddings[0].cpu().numpy()
-
- print(embed_bert_cls('привет мир', model, tokenizer).shape)
- # (768,)
- ```
-
- ## Using with `sentence_transformers`:
- ```python
- from sentence_transformers import SentenceTransformer, util
-
- model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
-
- sentences = ["привет мир", "hello world", "здравствуй вселенная"]
- embeddings = model.encode(sentences)
- print(util.dot_score(embeddings, embeddings))
- ```
-
- ## Metrics
- Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | STS | PI | NLI | SA | TI |
- |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
- | **sergeyzh/LaBSE-ru-sts** | **0.845** | **0.737** | **0.481** | **0.805** | **0.957** |
- | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
-
- **Tasks:**
-
- - Semantic text similarity (**STS**);
- - Paraphrase identification (**PI**);
- - Natural language inference (**NLI**);
- - Sentiment analysis (**SA**);
- - Toxicity identification (**TI**).
-
- ## Speed and size
-
- On the [encodechka](https://github.com/avidale/encodechka) benchmark:
-
- | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
- |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
- | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
- | **sergeyzh/LaBSE-ru-sts** |**42.835** | **8.561** | **490** | **768** | **512** | **55083** |
- | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
- | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
- | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
- | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
- | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
-
-
-
- When using batches with `sentence_transformers`:
-
- ```python
- from sentence_transformers import SentenceTransformer
-
- model_name = 'sergeyzh/LaBSE-ru-sts'
- model = SentenceTransformer(model_name, device='cpu')
- sentences = ["CPU speed test on a Ryzen 7 3800X: batch = 50"] * 50
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 882 ms ± 104 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 50/0.882 = 57 snt/s
-
- model = SentenceTransformer(model_name, device='cuda')
- sentences = ["GPU speed test on an RTX 3060: batch = 1500"] * 1500
- %timeit -n 5 -r 3 model.encode(sentences)
-
- # 792 ms ± 29 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
- # 1500/0.792 = 1894 snt/s
- ```
-
- ## Related resources
- Questions about using the model are discussed in the [Russian-language NLP chat](https://t.me/natural_language_processing).
-
 
+ ---
+ language:
+ - ru
+
+ pipeline_tag: sentence-similarity
+
+ tags:
+ - russian
+ - pretraining
+ - embeddings
+ - feature-extraction
+ - sentence-similarity
+ - sentence-transformers
+ - transformers
+
+ license: mit
+ base_model: cointegrated/LaBSE-en-ru
+
+ ---
+
+ ## Base BERT for semantic text similarity (STS) on GPU
+
+ A high-quality BERT model for computing Russian sentence embeddings. It is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and has the same context length (512), embedding dimension (768), and comparable speed.
+
+ ## Using the model with the `transformers` library:
+
+ ```python
+ # pip install transformers sentencepiece
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
+ model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
+ # model.cuda()  # uncomment if you have a GPU
+
+ def embed_bert_cls(text, model, tokenizer):
+     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+     with torch.no_grad():
+         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+     # take the [CLS] token embedding and L2-normalize it
+     embeddings = model_output.last_hidden_state[:, 0, :]
+     embeddings = torch.nn.functional.normalize(embeddings)
+     return embeddings[0].cpu().numpy()
+
+ print(embed_bert_cls('привет мир', model, tokenizer).shape)
+ # (768,)
+ ```
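Because the embeddings are L2-normalized, the dot product of two of them equals their cosine similarity. A minimal dependency-free sketch of that comparison step, using made-up 768-dimensional vectors in place of real model outputs:

```python
import math
import random

# made-up 768-dim vectors standing in for two sentence embeddings
random.seed(0)
a = [random.gauss(0, 1) for _ in range(768)]
b = [random.gauss(0, 1) for _ in range(768)]

def l2_normalize(v):
    """Scale a vector to unit length, as torch.nn.functional.normalize does."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = l2_normalize(a), l2_normalize(b)

# for unit vectors the dot product is exactly the cosine similarity
cos_sim = sum(x * y for x, y in zip(a, b))
print(round(cos_sim, 3))
```

With real embeddings the same dot product is what `util.dot_score` computes below.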
+
+ ## Using with `sentence_transformers`:
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')
+
+ sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+ embeddings = model.encode(sentences)
+ print(util.dot_score(embeddings, embeddings))
+ ```
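The `dot_score` matrix above gives pairwise similarities; ranking a corpus by those scores is the nearest-neighbour lookup used in retrieval. A toy pure-Python sketch of that ranking logic (hand-made 3-dim vectors stand in for real 768-dim `model.encode` outputs; `top_k` is a hypothetical helper, not part of the library):

```python
# toy stand-in "embeddings" (3-dim instead of 768) keyed by sentence
corpus = {
    "привет мир":           [1.0, 0.0, 0.0],
    "hello world":          [0.9, 0.1, 0.0],
    "здравствуй вселенная": [0.5, 0.5, 0.0],
}

def top_k(query_vec, corpus, k=2):
    # rank sentences by dot product with the query vector
    # (equals cosine similarity when all vectors are L2-normalized)
    scored = sorted(corpus, key=lambda s: -sum(q * c for q, c in zip(query_vec, corpus[s])))
    return scored[:k]

print(top_k([1.0, 0.0, 0.0], corpus))  # → ['привет мир', 'hello world']
```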
+
+ ## Metrics
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | STS | PI | NLI | SA | TI |
+ |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
+ | **sergeyzh/LaBSE-ru-sts** | **0.845** | **0.737** | **0.481** | **0.805** | **0.957** |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
+
+ **Tasks:**
+
+ - Semantic text similarity (**STS**);
+ - Paraphrase identification (**PI**);
+ - Natural language inference (**NLI**);
+ - Sentiment analysis (**SA**);
+ - Toxicity identification (**TI**).
+
+ ## Speed and size
+
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
+ |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
+ | **sergeyzh/LaBSE-ru-sts** |**42.835** | **8.561** | **490** | **768** | **512** | **55083** |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
+ | [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
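Raw batch throughput (sentences per second) can be estimated with a simple wall-clock loop. A sketch with a hypothetical `throughput` helper; a stand-in `encode` is timed here so the snippet runs without downloading the model, and a real `SentenceTransformer('sergeyzh/LaBSE-ru-sts').encode` can be substituted to benchmark it:

```python
import time

def throughput(encode, sentences, runs=3):
    """Return sentences per second, taking the best of several timed runs."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        encode(sentences)
        best = min(best, time.perf_counter() - start)
    return len(sentences) / best

# stand-in encoder: replace with a real model's .encode to benchmark it
dummy_encode = lambda batch: [[0.0] * 768 for _ in batch]
print(round(throughput(dummy_encode, ["тест"] * 50)))
```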
+
+
+ Model scores on the [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/) benchmark:
+
+ |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
+ |CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | **0.448** |
+ |GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | **0.497** |
+ |GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | **0.644** | 0.586 | 0.545 | 0.605 |
+ |HeadlineClassification | Accuracy | 0.772 | **0.793** | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
+ |InappropriatenessClassification | Accuracy | **0.646** | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
+ |KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | **0.566** |
+ |RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | **0.807** |
+ |RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
+ |RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | **0.741** |
+ |RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | **0.653** |
+ |RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
+ |RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | **0.582** |
+ |RuSciBenchGRNTIClusteringP2P | V-measure | **0.522** | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
+ |RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | **0.445** |
+ |RuSciBenchOECDClusteringP2P | V-measure | **0.473** | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
+ |SensitiveTopicsClassification | Accuracy | **0.285** | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
+ |TERRaClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
+
+ |Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
+ |:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
+ |Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | **0.588** |
+ |Clustering | V-measure | 0.526 | 0.519 | 0.513 | **0.538** | 0.513 | 0.503 | 0.525 |
+ |MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | **0.361** | 0.314 | 0.329 | 0.353 |
+ |PairClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
+ |Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
+ |Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | **0.774** |
+ |STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
+ |Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | **0.630** |
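The Average row is the plain mean of the seven category scores above; checking it for LaBSE-ru-sts:

```python
# per-category ruMTEB scores for LaBSE-ru-sts, copied from the summary table
scores = {
    "Classification": 0.524,
    "Clustering": 0.513,
    "MultiLabelClassification": 0.340,
    "PairClassification": 0.587,
    "Reranking": 0.688,
    "Retrieval": 0.637,
    "STS": 0.788,
}
average = sum(scores.values()) / len(scores)
print(round(average, 3))  # → 0.582, matching the Average row
```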
+
+
+
+