mp committed on
Commit
556651f
1 Parent(s): c3c777b

updated HF repo model card from confluence master document

Files changed (1)
  1. README.md +106 -103
README.md CHANGED
@@ -9,12 +9,9 @@ license_link: LICENSE
  This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
  developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
  For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
- instruction fine-tuning and resource usage we refer to the model card of Pharia-1-LLM-7B-control.
- Due to being trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized
- embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
- data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
- Furthermore it shows good cross-lingual performance allowing for prompting and text to be embedded written
- in different languages. The finetuning was always performed using English instructions.


  ## Model Overview
  ## Model Overview
@@ -22,7 +19,7 @@ in different languages. The finetuning was always performed using English instru
  - **Developed by:** Aleph Alpha Research
  <!--- **Funded by [optional]:** [More Information Needed]-->
  <!--- **Shared by [optional]:** [More Information Needed]-->
- - **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
  instruction-tuning (inspired by the approach of GritLM).
  - **Language(s) (NLP):** Trained on English, German, French, Spanish.
  <!--- **License:** [More Information Needed]-->
@@ -42,25 +39,12 @@ in different languages. The finetuning was always performed using English instru
  ### Model Access

  We provide access to our models through the channels listed below.
- - On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights
- and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment.
- We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
- Please refer to the changelog for updates to the models served. We do not deprecate officially released versions
- of old model generations when we release newer versions, so users can continue to have access to available models.
- No prompt data is stored when using our systems, which means that we do not
- collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions.
- We do not log user inputs to the models. We do not train on user data.
- - **Note:** The same models are made available to users regardless of their geographic location,
- and the input language but subject to sanction regimes, technology export regulations, and other restrictions that may apply.
- The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.
-
-
- <!-- Provide the basic links for the model.
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
- -->

  ### Intended Use

@@ -78,8 +62,9 @@ including those related to military or nuclear applications, and activities not
  technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
  The utilization of our technology is always governed by, and may be limited in accordance with,
  our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
  For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
- our dedicated contact address violations@aleph-alpha.com to communicate with us.

  Customers and partners can use our [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
@@ -89,50 +74,52 @@ system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appea

  Beyond the risks & limitations stated in
  the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
- Pharia-1-Embedding-4608-control has been optimized on embedding
  computation only. Therefore, we do not recommend usage for text generation purposes.

  ## How to Use


- ### Use with scaling inference code base

- To perform inference with the original model files, you’ll first need to install the
- [Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file.
- After installation, download the model weights and use the Scaling inference module to load the
- checkpoint, vocabulary, and configuration files.

  ```
- from pathlib import Path
  from torch.nn import CosineSimilarity
- from scaling.transformer.inference import TransformerInferenceModule
- MODEL_PATH = "/path/to/model"
- inference_model = TransformerInferenceModule.from_checkpoint(
-     checkpoint_dir=Path(MODEL_PATH),
- )
- # embed the query:
  query = "Which country is Galileo from?"
- query_embeddings = inference_model.encode_queries(query, convert_to_tensor=True)
  print(f"Type of embeddings: {type(query_embeddings)},\n\
  shape of query embeddings: {query_embeddings.shape}")
  # embed the documents:
  document_1 = "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
- document_embeddings_1 = inference_model.encode_corpus(document_1, convert_to_tensor=True)
  document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
- document_embeddings_2 = inference_model.encode_corpus(document_2, convert_to_tensor=True)
  # customized embeddings steering the query:
  instruction = "Represent the question about TV shows to find a paragraph that answers it."
- steered_query_embeddings = inference_model.encode_queries(query,
-     instruction=instruction,
-     convert_to_tensor=True)
  # compute similarity between steered query and both documents
- cossim = CosineSimilarity(dim=1, eps=1e-6)
  sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
  sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
  print("Steered embedding causes higher similarity of query to TV show:")
  print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
  ```

- ### Explanation of the instruct embedding code example

  Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
  text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
@@ -177,11 +164,13 @@ To further improve performance you can use instructions to steer the model. Inst
  understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
  In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
  than the paragraph about the Italian polymath.

  **Step 1:**
  Embed the Query with an Instruction
  ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
  ```"input": "Which country is Galileo from?"```
  → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```

  **Step 2:**
  Compare the similarity
  We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
@@ -204,33 +193,56 @@ and ultimately lead to embeddings that are more useful for your use-case.

  ## Evaluation

- Pharia-1-Embedding-4608-control has not been optimized for [MTEB](https://github.com/embeddings-benchmark/mteb) (a generic benchmark),
- and naturally would be expected to underperform on it as we optimize instead for real-world usage and multilinguality.
- Nonetheless, for comparability we share results on a subset of tasks of the
- English MTEB benchmark. The subset contains tasks from all task types (classification, summarization, etc.) of
- the full benchmark and is therefore roughly representative of it.

- #### MTEB English
- For this evaluation we use task-specific instructions from [MEDI2](https://huggingface.co/datasets/GritLM/MEDI2).

- |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
- |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
- |Pharia-1-Embedding-4608-control|51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
- |luminous-base (symmetric) |41.94|56.17|80.42|82.49|48.1 |28.7 |28.32|90.31|86.07|56.77|31.44|69.81|**58.38**|
- |GritLM-7B |54.95|67.34|88.19|88.45|47.98|36.80|38.27|89.88|85.64|64.99|30.78|70.12|**63.62**|
- |LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised|54.74|65.19|84.92|88.05|51.2|32.96|22.82|89.58|88.05|73.90|31.01|88.79|**64.27**|

- #### Ablation for “No Instruction” case
- We ablate how performance changes when not using task-specific instructions for the embeddings.

- |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
- |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
- |Instruction |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
- |No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
- |Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|

  #### Methodology for Multilingual Evaluations (European languages)
  * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
  have N subsets on different languages. Subsets themselves can also contain N languages, e.g. translation-related tasks. Base script
@@ -253,12 +265,18 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
  - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)

  #### Europe by task

  | Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
  |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
- | Pharia-1-Embedding-4608-control | 0.724946 | 0.991884 | 0.865101 | 0.755763 | 0.982374 | 0.876741 | 0.866135 |
- | GritLM-7B | 0.766381 | 0.994298 | 0.864504 | 0.789334 | 0.984593 | 0.880716 | 0.879971 |

  #### Europe by language
@@ -266,48 +284,33 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
  |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
  | Pharia-1-Embedding-4608-control | 0.925309 | 0.902113 | 0.937961 | 0.953719 | 0.942352 | 0.945642 | 0.934516 |
  | GritLM-7B | 0.934603 | 0.905669 | 0.942364 | 0.962042 | 0.949731 | 0.947428 | 0.940306 |

- #### Evaluations on cross-lingual capabilities
- There are important use cases where one wants to retrieve multiple documents on a topic or answering questions that are formulated in a
- different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual capabilities
- we evaluated Pharia-1-Embedding-4608-control, GritLM and Nvidia-Embed-v2 on the MLQA-V1 datasets (Facebook) for German/English and
- English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset providing correct and adversarial translations
- of a sentence in the corresponding pair language. In order to check quality over a larger range of sample size we did the accuracy
- computations for varying number of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the
- full set of data (2900 samples available).
-

- #### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
-
- |# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
- |:---:|:---:|:---:|:---:|:---:|
- |1000|86.0%|82.5%|77.0%|87.0%|
- |2000|79.5%|73.4%|69.4%|76.8%|
- |4000|65.3%|59.2%|56.0%|62.7%|
- |6000|54.3%|48.6%|45.6%|52.6%|
- |10000|38.6%|32.8%|32.8%|39.4%|

- #### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models

- |# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
- |:---:|:---:|:---:|:---:|:---:|
- |1000|87.5%|82.0%|81.5%|87.0%|
- |2000|78.5%|73.9%|70.7%|77.0%|
- |4000|65.5%|59.3%|56.9%|64.2%|
- |6000|55.3%|49.2%|46.2%|53.4%|
- |10000|41.7%|35.5%|33.2%|40.0%|

- #### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models

- |Model Name | accuracy |
- |:-----------------------------:|:--------------------------------:|
- |Pharia-1-Embedding-4608-control|95.1% |
- |GritLM-7B |94.2% |
- |Nvidia-Embed-v2 |93.4% |
- |BGE-Gemma2 |95.4% |

  ## Training Details
 
  This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
  developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
  For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
+ instruction fine-tuning and resource usage we refer to the model card of [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
+
+ Due to being trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. Furthermore, it shows strong cross-lingual performance, allowing prompts and texts to be embedded in different languages. The finetuning was always performed using English instructions.


  ## Model Overview
 
  - **Developed by:** Aleph Alpha Research
  <!--- **Funded by [optional]:** [More Information Needed]-->
  <!--- **Shared by [optional]:** [More Information Needed]-->
+ - **Model type/architecture:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
  instruction-tuning (inspired by the approach of GritLM).
  - **Language(s) (NLP):** Trained on English, German, French, Spanish.
  <!--- **License:** [More Information Needed]-->
 
  ### Model Access

  We provide access to our models through the channels listed below.
+ - On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment. We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
+ - Downloadable from Huggingface: An HF-adapted version of our model can be found in our Huggingface repo (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf) together with code snippets that make the model easy to use.
+ Please refer to the changelog for updates to the models served. We do not deprecate officially released versions of old model generations when we release newer versions, so users can continue to have access to available models.
+ No prompt data is stored when using our systems, which means that we do not collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions. We do not log user inputs to the models. We do not train on user data.
+ - **Note**: The same models are made available to users regardless of their geographic location and input language, but subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.
+

  ### Intended Use

  technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
  The utilization of our technology is always governed by, and may be limited in accordance with,
  our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
+
  For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
+ our dedicated contact address [violations@aleph-alpha.com](mailto:violations@aleph-alpha.com) to communicate with us.

  Customers and partners can use our [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
 

  Beyond the risks & limitations stated in
  the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
+ - Pharia-1-Embedding-4608-control has been optimized on embedding
  computation only. Therefore, we do not recommend usage for text generation purposes.

  ## How to Use
+ We provide two access pathways for our Pharia4608 embedding model. The first one leverages the HF ecosystem and can be found here: https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf. The code snippet in the box below demonstrates its use. As soon as the model class is invoked, the model will be loaded from the repo and is ready for use. The other access pathway is through our public Scaling code base. In this version the model weights were not converted to HF format and the repo https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control can be cloned as is. The model path has to be adjusted to the local path where the model was downloaded. The model cards in the corresponding repositories contain only the code snippet that applies to the specific repo.

+ ### Use with Huggingface

  ```
  from torch.nn import CosineSimilarity
+ from transformers import AutoConfig, AutoModel
+ from transformers import PreTrainedTokenizerFast
+ MODEL_PATH = 'Aleph-Alpha/Pharia-1-Embedding-4608-control-hf'
+ # the checkpoint ships custom modeling code, hence trust_remote_code=True
+ config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_PATH)
+ model = AutoModel.from_pretrained(MODEL_PATH,
+                                   trust_remote_code=True,
+                                   config=config,
+                                   tokenizer=tokenizer).cuda()
+ # embed the query:
  query = "Which country is Galileo from?"
+ query_embeddings = model.encode_queries(query, convert_to_tensor=True)
  print(f"Type of embeddings: {type(query_embeddings)},\n\
  shape of query embeddings: {query_embeddings.shape}")
  # embed the documents:
  document_1 = "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
+ document_embeddings_1 = model.encode_corpus(document_1, convert_to_tensor=True)
  document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
+ document_embeddings_2 = model.encode_corpus(document_2, convert_to_tensor=True)
  # customized embeddings steering the query:
  instruction = "Represent the question about TV shows to find a paragraph that answers it."
+ steered_query_embeddings = model.encode_queries(
+     query,
+     instruction=instruction,
+     convert_to_tensor=True
+ )
  # compute similarity between steered query and both documents
+ cossim = CosineSimilarity(dim=0, eps=1e-6)
  sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
  sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
  print("Steered embedding causes higher similarity of query to TV show:")
  print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
  ```
+ Disclaimer: For the official evaluation scores we used the Scaling-compatible checkpoint available under Pharia-1-Embedding-4608-control (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control).
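+
+ For the Scaling pathway, a minimal sketch of loading and embedding, mirroring the snippet from the Scaling-based repo (assumptions: the [Scaling library](https://github.com/Aleph-Alpha/scaling) is installed and the repo has been cloned locally; `/path/to/model` is a placeholder):
+
+ ```
+ from pathlib import Path
+ from scaling.transformer.inference import TransformerInferenceModule
+ # load checkpoint, vocabulary and configuration files from the local clone
+ inference_model = TransformerInferenceModule.from_checkpoint(
+     checkpoint_dir=Path("/path/to/model"),
+ )
+ # same encode_queries / encode_corpus interface as the HF-adapted model above
+ query_embeddings = inference_model.encode_queries("Which country is Galileo from?", convert_to_tensor=True)
+ ```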

+ ### Example for instruction embedding

  Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
  text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
 
  understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
  In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
  than the paragraph about the Italian polymath.
+
  **Step 1:**
  Embed the Query with an Instruction
  ```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
  ```"input": "Which country is Galileo from?"```
  → Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
+
  **Step 2:**
  Compare the similarity
  We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
 

  ## Evaluation

+ ### Evaluations on cross-lingual capabilities

+ There are important use cases where one wants to retrieve multiple documents on a topic or answer questions that are formulated
+ in a different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual
+ capabilities we evaluated Pharia-1-Embedding-4608-control, GritLM, Nvidia-Embed-v2 and BGE-Multilingual-Gemma2
+ on the MLQA-V1 datasets (Facebook) for German/English and English/Spanish language pairings. For German/French we
+ used the CLSD-WMT19 dataset providing correct and adversarial translations of a sentence in the corresponding pair language.
+ In order to check quality over a larger range of sample sizes we did the accuracy computations for varying numbers of samples
+ taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the full set of data (2900 samples available).
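+
+ As an illustration, a minimal sketch of both accuracy computations (assumptions: for MLQA a pair counts as correct when the aligned document is the top-1 match by cosine similarity, and for CLSD-WMT19 when the correct translation scores higher than the adversarial one; the helper names are hypothetical, not our exact harness):
+
+ ```
+ import torch
+
+ def crosslingual_accuracy(query_embs, doc_embs):
+     # queries[i] and documents[i] are aligned cross-lingual pairs
+     q = torch.nn.functional.normalize(query_embs, dim=1)
+     d = torch.nn.functional.normalize(doc_embs, dim=1)
+     sims = q @ d.T  # cosine similarity matrix over all pairs
+     # a pair is correct when the aligned document is ranked first
+     top1 = sims.argmax(dim=1)
+     return (top1 == torch.arange(len(q))).float().mean().item()
+
+ def clsd_accuracy(src_embs, correct_embs, adversarial_embs):
+     # CLSD-style decision: pick the translation with the higher cosine similarity
+     cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
+     return (cos(src_embs, correct_embs) > cos(src_embs, adversarial_embs)).float().mean().item()
+ ```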

+ #### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
+
+ |# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
+ |:---:|:---:|:---:|:---:|:---:|
+ |1000|86.0%|82.5%|77.0%|87.0%|
+ |2000|79.5%|73.4%|69.4%|76.8%|
+ |4000|65.3%|59.2%|56.0%|62.7%|
+ |6000|54.3%|48.6%|45.6%|52.6%|
+ |10000|38.6%|32.8%|32.8%|39.4%|

+ #### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models

+ |# of samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
+ |:---:|:---:|:---:|:---:|:---:|
+ |1000|87.5%|82.0%|81.5%|87.0%|
+ |2000|78.5%|73.9%|70.7%|77.0%|
+ |4000|65.5%|59.3%|56.9%|64.2%|
+ |6000|55.3%|49.2%|46.2%|53.4%|
+ |10000|41.7%|35.5%|33.2%|40.0%|
+
+ #### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models

+ |Model Name | accuracy |
+ |:-----------------------------:|:--------------------------------:|
+ |Pharia-1-Embedding-4608-control|95.1% |
+ |GritLM-7B |94.2% |
+ |Nvidia-Embed-v2 |93.4% |
+ |BGE-Gemma2 |95.4% |
+
+
+ ### Evaluations on MTEB tasks
+
+ To evaluate our model's multilingual capabilities we compare it against other source-available, high-performing embedding models listed in the
+ MTEB leaderboard. For the following evaluations we compare the following models:
+ - NVEmbed-V2: The highest scoring model in the MTEB leaderboard at the time of release
+ - BGE-Multilingual-Gemma2: The highest scoring multilingual model in the MTEB leaderboard at the time of release.
+ - GritLM: A generative representational instruction-tuned language model.
+
  #### Methodology for Multilingual Evaluations (European languages)
  * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
  have N subsets on different languages. Subsets themselves can also contain N languages, e.g. translation-related tasks. Base script

  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
  - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)

+ We used the official scores reported in the MTEB Leaderboard where available, but for some models and subsets we computed the scores ourselves with the official Huggingface checkpoints and
+ instructions referenced in the paper or model card.
+
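+ A minimal sketch of running such a task subset with the [mteb](https://github.com/embeddings-benchmark/mteb) package (assumptions: a recent mteb version and a SentenceTransformer-compatible model; the checkpoint name is a placeholder, and this illustrates the task selection, not our exact evaluation harness):
+
+ ```
+ import mteb
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("your-embedding-model")  # placeholder checkpoint
+ # restrict the selected tasks to the European-language subsets of interest
+ tasks = mteb.get_tasks(
+     tasks=["BUCC.v2", "STS17", "MassiveScenarioClassification"],
+     languages=["eng", "deu", "fra", "spa", "ita"],
+ )
+ evaluation = mteb.MTEB(tasks=tasks)
+ results = evaluation.run(model, output_folder="results")
+ ```
+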
  #### Europe by task

  | Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
  |-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
+ | Pharia-1-Embedding-4608-control | 72.49 | 99.19 | 86.51 | 75.58 | 98.24 | 87.67 | 86.61 |
+ | GritLM-7B | 76.64 | 99.43 | 86.45 | 78.93 | 98.46 | 88.07 | 87.99 |
+ | BGE-Multilingual-Gemma2 | 69.72 | 99.38 | 86.90 | 78.57 | 98.58 | 86.69 | 86.64 |
+ | Nvidia-Embed-v2 | 70.72 | 99.14 | 73.22 | 75.21 | 96.65 | 87.36 | 83.72 |
+

  #### Europe by language

  |-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
  | Pharia-1-Embedding-4608-control | 92.53 | 90.21 | 93.80 | 95.37 | 94.24 | 94.56 | 93.45 |
  | GritLM-7B | 93.46 | 90.57 | 94.24 | 96.20 | 94.97 | 94.74 | 94.03 |
+ | BGE-Multilingual-Gemma2 | 93.07 | 92.17 | 94.91 | 94.64 | 96.28 | 94.94 | 94.35 |
+ | Nvidia-Embed-v2 | 91.58 | 88.85 | 90.51 | 93.94 | 95.08 | 93.78 | 92.29 |
+ #### MTEB – English only

+ |Model Name|Retrieval|Classification|STS|Summarization|PairClassification|Clustering|Reranking|Average|
+ |---|--|--|--|--|--|--|--|--|
+ |Nvidia-Embed-v2|62.65|90.37|84.31|30.7|88.67|58.46|60.65|72.31|
+ |BGE-Multilingual-Gemma2|59.24|88.08|83.88|31.2|85.84|54.65|59.72|69.88|
+ |GritLM-7B|57.36|78.65|83.35|30.39|87.29|50.61|60.48|66.58|
+ |Pharia-1-Embedding-4608-control|39.15|74.40|82.7|30.95|81.73|46.23|57.45|58.94|

+ #### Ablation for “No Instruction” case
+ We ablate how performance changes when not using task-specific instructions for the embeddings.

+ |Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
+ |--|--|--|--|--|--|--|--|--|--|--|--|--|--|
+ |Instruction |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
+ |No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
+ |Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
+
+ We observe slightly reduced performance across most tasks when not using task-specific instructions, with an average loss in performance of roughly 1%.

  ## Training Details