Egor Spirin committed on
Commit
ad33c33
·
1 Parent(s): f135393

Upload model, tokenizer, readme

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false,
7
+ "pooling_mode_weightedmean_tokens": false,
8
+ "pooling_mode_lasttoken": false,
9
+ "include_prompt": true
10
+ }
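The pooling config above enables only `pooling_mode_mean_tokens`, so the sentence vector is the average of the token vectors that are not padding. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy inputs are illustrative assumptions, not code from this commit or from sentence-transformers.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy example: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padded position is skipped entirely, which is why its large values do not affect the result.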
README.md CHANGED
@@ -1,3 +1,180 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ library_name: sentence-transformers
3
+ pipeline_tag: sentence-similarity
4
+ tags:
5
+ - sentence-transformers
6
+ - feature-extraction
7
+ - sentence-similarity
8
+
9
+ ---
10
+
11
+ # USER-base
12
+
13
+ **U**niversal **S**entence **E**ncoder for **R**ussian (USER) is a [sentence-transformer](https://www.SBERT.net) model for extracting embeddings exclusively for the Russian language.
14
+ It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
15
+
16
+ This model is initialized from [`deepvk/deberta-v1-base`](https://huggingface.co/deepvk/deberta-v1-base) and trained to work exclusively with the Russian language. Its quality on other languages was not evaluated.
17
+
18
+
19
+ ## Usage
20
+
21
+ Using this model becomes easy when you have [`sentence-transformers`](https://www.SBERT.net) installed:
22
+
23
+ ```
24
+ pip install -U sentence-transformers
25
+ ```
26
+
27
+ Then you can use the model like this:
28
+
29
+ ```python
30
+ from sentence_transformers import SentenceTransformer
31
+
32
+ # Each input text should start with "query: " or "passage: ".
33
+ # For tasks other than retrieval, you can simply use the "query: " prefix.
34
+ input_texts = [
35
+     "query: Когда был спущен на воду первый миноносец «Спокойный»?",
36
+     "query: Есть ли нефть в Удмуртии?",
37
+     "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
38
+     "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
39
+ ]
40
+
41
+ model = SentenceTransformer("deepvk/USER-base")
42
+ embeddings = model.encode(input_texts, normalize_embeddings=True)
43
+ ```
44
+
45
+ You can also use the model directly with [`transformers`](https://huggingface.co/docs/transformers/en/index):
46
+
47
+ ```python
48
+ import torch.nn.functional as F
49
+ from torch import Tensor, inference_mode
50
+ from transformers import AutoTokenizer, AutoModel
51
+
52
+ def average_pool(
53
+     last_hidden_states: Tensor,
54
+     attention_mask: Tensor
55
+ ) -> Tensor:
56
+     last_hidden = last_hidden_states.masked_fill(
57
+         ~attention_mask[..., None].bool(), 0.0
58
+     )
59
+     return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
60
+
61
+ # Each input text should start with "query: " or "passage: ".
62
+ # For tasks other than retrieval, you can simply use the "query: " prefix.
63
+ input_texts = [
64
+     "query: Когда был спущен на воду первый миноносец «Спокойный»?",
65
+     "query: Есть ли нефть в Удмуртии?",
66
+     "passage: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года.",
67
+     "passage: Нефтепоисковые работы в Удмуртии были начаты сразу после Второй мировой войны в 1945 году и продолжаются по сей день. Добыча нефти началась в 1967 году."
68
+ ]
69
+
70
+ tokenizer = AutoTokenizer.from_pretrained("deepvk/USER-base")
71
+ model = AutoModel.from_pretrained("deepvk/USER-base")
72
+
73
+ batch_dict = tokenizer(
74
+     input_texts, padding=True, truncation=True, return_tensors="pt"
75
+ )
76
+ with inference_mode():
77
+     outputs = model(**batch_dict)
78
+     embeddings = average_pool(
79
+         outputs.last_hidden_state, batch_dict["attention_mask"]
80
+     )
81
+     embeddings = F.normalize(embeddings, p=2, dim=1)
82
+
83
+ # Scores for query-passage
84
+ scores = (embeddings[:2] @ embeddings[2:].T) * 100
85
+ # [[55.86, 30.95],
86
+ # [22.82, 59.46]]
87
+ print(scores.round(decimals=2))
88
+ ```
89
+
90
+ ⚠️ **Attention** ⚠️
91
+
92
+ Each input text should start with "query: " or "passage: ".
93
+ For tasks other than retrieval, you can simply use the "query: " prefix.
94
+
95
+ ## Training Details
96
+
97
+ We aimed to follow the [`bge-base-en`](https://huggingface.co/BAAI/bge-base-en) model training algorithm, but we made several improvements along the way.
98
+
99
+ **Initialization:** [`deepvk/deberta-v1-base`](https://huggingface.co/deepvk/deberta-v1-base)
100
+
101
+ **First-stage:** Contrastive pre-training with weak supervision on the Russian part of the [mMARCO corpus](https://github.com/unicamp-dl/mMARCO).
102
+
103
+ **Second-stage:** Supervised fine-tuning of two different models, split by data symmetry, followed by merging via [`LM-Cocktail`](https://arxiv.org/abs/2311.13534):
104
+
105
+ 1. We modified the instruction design by simplifying the multilingual approach to facilitate easier inference.
106
+ For symmetric data `(S1, S2)`, we used the instructions: `"query: S1"` and `"query: S2"`, and for asymmetric data, we used `"query: S1"` with `"passage: S2"`.
107
+
108
+ 2. Since we split the data, we could additionally apply the [AnglE loss](https://arxiv.org/abs/2309.12871) to the symmetric model, which enhances performance on symmetric tasks.
109
+
110
+ 3. Finally, we combined the two models, tuning the weights for the merger using `LM-Cocktail` to produce the final model, **USER**.
111
+
112
+ ### Dataset
113
+
114
+ During model development, we additionally collected two datasets:
115
+ [`deepvk/ru-HNP`](https://huggingface.co/datasets/deepvk/ru-HNP) and
116
+ [`deepvk/ru-WANLI`](https://huggingface.co/datasets/deepvk/ru-WANLI).
117
+
118
+ | Symmetric Dataset | Size | Asymmetric Dataset | Size |
119
+ |-------------------|-------|--------------------|------|
120
+ | **AllNLI** | 282 644 | [**MIRACL**](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 10 000 |
121
+ | [MedNLI](https://github.com/jgc128/mednli) | 3 699 | [MLDR](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 1 864 |
122
+ | [RCB](https://huggingface.co/datasets/RussianNLP/russian_super_glue) | 392 | [Lenta](https://github.com/yutkin/Lenta.Ru-News-Dataset) | 185 972 |
123
+ | [Terra](https://huggingface.co/datasets/RussianNLP/russian_super_glue) | 1 359 | [Mlsum](https://huggingface.co/datasets/reciTAL/mlsum) | 51 112 |
124
+ | [Tapaco](https://huggingface.co/datasets/tapaco) | 91 240 | [Mr-TyDi](https://huggingface.co/datasets/Shitao/bge-m3-data/tree/main) | 536 600 |
125
+ | [Opus100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | 1 000 000 | [Panorama](https://huggingface.co/datasets/its5Q/panorama) | 11 024 |
126
+ | [BiblePar](https://huggingface.co/datasets/Helsinki-NLP/bible_para) | 62 195 | [PravoIsrael](https://huggingface.co/datasets/TarasHu/pravoIsrael) | 26 364 |
127
+ | [RudetoxifierDataDetox](https://huggingface.co/datasets/d0rj/rudetoxifier_data_detox) | 31 407 | [Xlsum](https://huggingface.co/datasets/csebuetnlp/xlsum) | 124 486 |
128
+ | [RuParadetox](https://huggingface.co/datasets/s-nlp/ru_paradetox) | 11 090 | [Fialka-v1](https://huggingface.co/datasets/0x7o/fialka-v1) | 130 000 |
129
+ | [**deepvk/ru-WANLI**](https://huggingface.co/datasets/deepvk/ru-WANLI) | 35 455 | [RussianKeywords](https://huggingface.co/datasets/Milana/russian_keywords) | 16 461 |
130
+ | [**deepvk/ru-HNP**](https://huggingface.co/datasets/deepvk/ru-HNP) | 500 000 | [Gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) | 121 928 |
131
+ | | | [Gsm8k-ru](https://huggingface.co/datasets/d0rj/gsm8k-ru) | 7 470 |
132
+ | | | [DSumRu](https://huggingface.co/datasets/bragovo/dsum_ru) | 27 191 |
133
+ | | | [SummDialogNews](https://huggingface.co/datasets/CarlBrendt/Summ_Dialog_News) | 75 700 |
134
+
135
+
136
+ **Total positive pairs:** 3,352,653
137
+ **Total negative pairs:** 792,644 (negative pairs from AllNLI, MIRACL, deepvk/ru-WANLI, deepvk/ru-HNP)
138
+
139
+ For all labeled datasets, we use only the training set for fine-tuning. For the Gazeta, Mlsum, and Xlsum datasets, the (title/text) and (title/summary) pairs are combined and used as asymmetric data. AllNLI is a combination of SNLI, MNLI, and ANLI.
140
+
141
+ ## Experiments
142
+
143
+ As baselines, we chose the current top models from the [`encodechka`](https://github.com/avidale/encodechka) leaderboard. In addition, we evaluated the models on the Russian subset of [`MTEB`](https://github.com/embeddings-benchmark/mteb), which includes 10 tasks. Unfortunately, we could not evaluate `bge-m3` on some MTEB tasks, specifically clustering, because of the excessive computational resources required. Besides these two benchmarks, we also evaluated the models on [`MIRACL`](https://github.com/project-miracl/miracl). All experiments were conducted on an NVIDIA Tesla A100 40 GB GPU. We used the validation scripts from the official repositories for each task.
144
+
145
+ | Model | Size (w/o Embeddings) | [**Encodechka**](https://github.com/avidale/encodechka) (*Mean S*) | [**MTEB**](https://github.com/embeddings-benchmark/mteb) (*Mean Ru*) | [**Miracl**](http://miracl.ai/) (*Recall@100*) |
146
+ |-----------------------------------------|-------|-----------------------------|------------------------|--------------------------------|
147
+ | [`bge-m3`](https://huggingface.co/BAAI/bge-m3) | 303 | **0.786** | **0.694** | **0.959** |
148
+ | [`multilingual-e5-large`](https://huggingface.co/intfloat/multilingual-e5-large) | 303 | 0.78 | 0.665 | 0.927 |
149
+ | `USER` (this model) | 85 | <u>0.772</u> | <u>0.666</u> | 0.763 |
150
+ | [`paraphrase-multilingual-mpnet-base-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 85 | 0.76 | 0.625 | 0.149 |
151
+ | [`multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) | 85 | 0.756 | 0.645 | <u>0.915</u> |
152
+ | [`LaBSE-en-ru`](https://huggingface.co/cointegrated/LaBSE-en-ru) | 85 | 0.74 | 0.599 | 0.327 |
153
+ | [`sn-xlm-roberta-base-snli-mnli-anli-xnli`](https://huggingface.co/symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli) | 85 | 0.74 | 0.593 | 0.08 |
154
+
155
+ Model sizes are shown, with larger models visually distinct from the others.
156
+ Absolute leaders in each metric are highlighted in bold, and the leaders among models of our size are underlined.
157
+
158
+ Our solution thus outperforms all other models of the same size on both Encodechka and MTEB. Because the model underperforms existing solutions on retrieval (MIRACL Recall@100 of 0.763 versus 0.915 for `multilingual-e5-base`), we aim to address this in future research.
159
+
160
+ ## FAQ
161
+
162
+ **Do I need to add the prefix "query: " and "passage: " to input texts?**
163
+
164
+ Yes, this is how the model was trained; without the prefixes you will see a performance degradation.
165
+ Here are some rules of thumb:
166
+ - Use `"query: "` and `"passage: "` respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
167
+ - Use `"query: "` prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
168
+ - Use `"query: "` prefix if you want to use embeddings as features, such as linear probing classification, clustering.
169
+
170
+ ## Citations
171
+
172
+ ```
173
+ @misc{deepvk2024user,
174
+ title={USER: Universal Sentence Encoder for Russian},
175
+ author={Malashenko, Boris and Zemerov, Anton and Spirin, Egor},
176
+ url={https://huggingface.co/deepvk/USER-base},
177
+ publisher={Hugging Face},
178
+ year={2024},
179
+ }
180
+ ```
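The second training stage in the README merges a symmetric and an asymmetric model with LM-Cocktail. As we understand that technique, the merge is a weighted average of matching parameters. The sketch below shows only that averaging step on toy data; `merge_parameters`, the parameter name `layer.weight`, and the 0.5/0.5 weights are hypothetical placeholders, since the README does not state the tuned merge weights.

```python
import math
from typing import Dict, List, Sequence


def merge_parameters(
    state_dicts: Sequence[Dict[str, List[float]]],
    weights: Sequence[float],
) -> Dict[str, List[float]]:
    """Return the weighted average of same-named parameter vectors."""
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need one weight per model")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    merged: Dict[str, List[float]] = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for sd, w in zip(state_dicts, weights))
            for i in range(len(reference))
        ]
    return merged


# Toy "models" with a single two-element parameter each (assumed values).
symmetric = {"layer.weight": [1.0, 2.0]}
asymmetric = {"layer.weight": [3.0, 6.0]}
print(merge_parameters([symmetric, asymmetric], [0.5, 0.5]))
# {'layer.weight': [2.0, 4.0]}
```

Real checkpoints would hold tensors rather than lists, but the per-parameter arithmetic is the same.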
config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "architectures": [
3
+ "DebertaModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "hidden_act": "gelu",
7
+ "hidden_dropout_prob": 0.1,
8
+ "hidden_size": 768,
9
+ "initializer_range": 0.02,
10
+ "intermediate_size": 3072,
11
+ "layer_norm_eps": 1e-07,
12
+ "max_position_embeddings": 512,
13
+ "max_relative_positions": -1,
14
+ "model_type": "deberta",
15
+ "num_attention_heads": 12,
16
+ "num_hidden_layers": 12,
17
+ "pad_token_id": 0,
18
+ "pooler_dropout": 0,
19
+ "pooler_hidden_act": "gelu",
20
+ "pooler_hidden_size": 768,
21
+ "pos_att_type": null,
22
+ "position_biased_input": true,
23
+ "relative_attention": false,
24
+ "torch_dtype": "float32",
25
+ "transformers_version": "4.38.2",
26
+ "type_vocab_size": 0,
27
+ "vocab_size": 50265
28
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.5.1",
4
+ "transformers": "4.38.2",
5
+ "pytorch": "2.2.0a0+81ea7a4"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null
9
+ }
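In this config `prompts` is empty and `default_prompt_name` is null, so, as far as we can tell, sentence-transformers will not add the `"query: "` or `"passage: "` prefix automatically and callers have to prepend it themselves. A minimal sketch of such a helper follows; the name `add_prefix` and its `role` argument are hypothetical, and only the two prefix strings come from the README.

```python
from typing import List

# Prefix strings taken from the README; everything else here is illustrative.
PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_prefix(texts: List[str], role: str = "query") -> List[str]:
    """Prepend the role prefix unless the text already starts with it."""
    if role not in PREFIXES:
        raise ValueError(f"role must be one of {sorted(PREFIXES)}")
    prefix = PREFIXES[role]
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["Есть ли нефть в Удмуртии?"])
passages = add_prefix(["Добыча нефти началась в 1967 году."], role="passage")
print(queries + passages)
```

The resulting lists can be passed to `model.encode(...)` as in the README's usage example. Defaulting to the query role matches the README's advice to use `"query: "` for everything that is not retrieval.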
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f4c56eb54eecc47a8274ca8b09bcdf834b0c7b5d15f27d4e8fe32b7218f3c6bd
3
+ size 496192016
modules.json ADDED
@@ -0,0 +1,20 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ },
14
+ {
15
+ "idx": 2,
16
+ "name": "2",
17
+ "path": "",
18
+ "type": "sentence_transformers.models.Normalize"
19
+ }
20
+ ]
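The module list above ends with a `Normalize` step, so the vectors produced by the pipeline should already have unit length. For unit vectors the dot product equals the cosine similarity, which is what the README's `embeddings[:2] @ embeddings[2:].T` computes before scaling by 100. A pure-Python sketch of that last stage, with hypothetical helper names and toy two-dimensional vectors standing in for real embeddings:

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]


def score(query_vec: List[float], passage_vec: List[float]) -> float:
    """Dot product of two unit vectors, scaled by 100 as in the README."""
    return 100.0 * sum(q * p for q, p in zip(query_vec, passage_vec))


# Toy vectors, not real model outputs.
query = l2_normalize([3.0, 4.0])
passage = l2_normalize([4.0, 3.0])
print(round(score(query, passage), 2))  # 96.0
```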
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 512,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<s>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "1": {
13
+ "content": "<pad>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "2": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "3": {
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "4": {
37
+ "content": "<mask>",
38
+ "lstrip": true,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ }
44
+ },
45
+ "bos_token": "<s>",
46
+ "clean_up_tokenization_spaces": true,
47
+ "cls_token": "<s>",
48
+ "eos_token": "</s>",
49
+ "errors": "replace",
50
+ "mask_token": "<mask>",
51
+ "max_length": 512,
52
+ "model_max_length": 512,
53
+ "pad_to_multiple_of": null,
54
+ "pad_token": "<pad>",
55
+ "pad_token_type_id": 0,
56
+ "padding_side": "right",
57
+ "sep_token": "</s>",
58
+ "stride": 0,
59
+ "tokenizer_class": "RobertaTokenizer",
60
+ "trim_offsets": true,
61
+ "truncation_side": "right",
62
+ "truncation_strategy": "longest_first",
63
+ "unk_token": "<unk>"
64
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff