---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- pretraining
- embeddings
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
base_model: cointegrated/LaBSE-en-ru
---

## Base BERT for semantic text similarity (STS) on GPU

A high-quality BERT model for computing sentence embeddings in Russian. The model is based on [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) and matches it in context length (512), embedding dimension (768), and inference speed.

## Usage with the `transformers` library

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    # Tokenize, run the encoder, and return the normalized [CLS] embedding.
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (768,)
```
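Since `embed_bert_cls` returns L2-normalized vectors, the similarity of two texts is simply the dot product of their embeddings. A minimal sketch (the sentence pairs are illustrative; the helper is repeated so the snippet runs on its own):

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")

def embed_bert_cls(text, model, tokenizer):
    # Normalized [CLS] embedding, as in the snippet above.
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = torch.nn.functional.normalize(model_output.last_hidden_state[:, 0, :])
    return embeddings[0].cpu().numpy()

a = embed_bert_cls('привет мир', model, tokenizer)
b = embed_bert_cls('здравствуй, мир', model, tokenizer)
c = embed_bert_cls('сегодня холодная погода', model, tokenizer)

# Dot product of unit vectors == cosine similarity;
# the paraphrase pair should score noticeably higher.
print(float(a @ b), float(a @ c))
```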

## Usage with `sentence_transformers`

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/LaBSE-ru-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
```

## Metrics

Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model | STS | PI | NLI | SA | TI |
|:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
| **sergeyzh/LaBSE-ru-sts** | **0.845** | **0.737** | **0.481** | **0.805** | **0.957** |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |

**Tasks:**

- Semantic text similarity (**STS**);
- Paraphrase identification (**PI**);
- Natural language inference (**NLI**);
- Sentiment analysis (**SA**);
- Toxicity identification (**TI**).

## Speed and size

Model measurements on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model | CPU | GPU | size | dim | n_ctx | n_vocab |
|:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
| **sergeyzh/LaBSE-ru-sts** | **42.835** | **8.561** | **490** | **768** | **512** | **55083** |
| [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
92 |
+
|
93 |
+
|
94 |
+
Оценки модели на бенчмарке [ruMTEB](https://habr.com/ru/companies/sberdevices/articles/831150/):
|
95 |
+
|
96 |
+
|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|
97 |
+
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|---------------------:|----------------------:|
|
98 |
+
|CEDRClassification | Accuracy | 0.368 | 0.358 | 0.418 | 0.451 | 0.401 | 0.423 | **0.448** |
|
99 |
+
|GeoreviewClassification | Accuracy | 0.397 | 0.400 | 0.406 | 0.438 | 0.447 | 0.461 | **0.497** |
|
100 |
+
|GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.626 | **0.644** | 0.586 | 0.545 | 0.605 |
|
101 |
+
|HeadlineClassification | Accuracy | 0.772 | **0.793** | 0.633 | 0.688 | 0.732 | 0.757 | 0.758 |
|
102 |
+
|InappropriatenessClassification | Accuracy | **0.646** | 0.625 | 0.599 | 0.615 | 0.592 | 0.588 | 0.616 |
|
103 |
+
|KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.496 | 0.521 | 0.500 | 0.509 | **0.566** |
|
104 |
+
|RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.651 | 0.694 | 0.700 | 0.702 | **0.807** |
|
105 |
+
|RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
|
106 |
+
|RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.622 | 0.657 | 0.685 | 0.696 | **0.741** |
|
107 |
+
|RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.599 | 0.632 | 0.612 | 0.630 | **0.653** |
|
108 |
+
|RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
|
109 |
+
|RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.529 | 0.569 | 0.550 | 0.563 | **0.582** |
|
110 |
+
|RuSciBenchGRNTIClusteringP2P | V-measure | **0.522** | 0.504 | 0.486 | 0.517 | 0.511 | 0.516 | 0.520 |
|
111 |
+
|RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.406 | 0.440 | 0.427 | 0.423 | **0.445** |
|
112 |
+
|RuSciBenchOECDClusteringP2P | V-measure | **0.473** | 0.464 | 0.426 | 0.452 | 0.443 | 0.448 | 0.450 |
|
113 |
+
|SensitiveTopicsClassification | Accuracy | **0.285** | 0.280 | 0.262 | 0.272 | 0.228 | 0.234 | 0.257 |
|
114 |
+
|TERRaClassification | Average Precision | 0.520 | 0.502 | **0.587** | 0.585 | 0.551 | 0.550 | 0.584 |
|
115 |
+
|
116 |
+
|Model Name | Metric | sbert_large_ mt_nlu_ru | sbert_large_ nlu_ru | LaBSE-ru-sts | [LaBSE-ru-turbo](https://huggingface.co/sergeyzh/LaBSE-ru-turbo) | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
|
117 |
+
|:----------------------------------|:--------------------|-----------------------:|--------------------:|----------------:|------------------:|----------------------:|----------------------:|---------------------:|
|
118 |
+
|Classification | Accuracy | 0.554 | 0.552 | 0.524 | 0.558 | 0.551 | 0.561 | **0.588** |
|
119 |
+
|Clustering | V-measure | 0.526 | 0.519 | 0.513 | **0.538** | 0.513 | 0.503 | 0.525 |
|
120 |
+
|MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.340 | **0.361** | 0.314 | 0.329 | 0.353 |
|
121 |
+
|PairClassification | Average Precision | 0.520 | 0.502 | 0.587 | **0.585** | 0.551 | 0.550 | 0.584 |
|
122 |
+
|Reranking | MAP@10 | 0.561 | 0.468 | 0.688 | 0.687 | 0.715 | 0.720 | **0.756** |
|
123 |
+
|Retrieval | NDCG@10 | 0.256 | 0.118 | 0.637 | 0.675 | 0.697 | 0.699 | **0.774** |
|
124 |
+
|STS | Pearson correlation | 0.712 | 0.588 | 0.788 | 0.822 | 0.781 | 0.796 | **0.831** |
|
125 |
+
|Average | Average | 0.494 | 0.438 | 0.582 | 0.604 | 0.588 | 0.594 | **0.630** |
|
126 |
+
|
127 |
+
|
128 |
+
|
129 |
+
|
|
|
|
|
|
|
|