mp
committed on
Commit
•
556651f
1
Parent(s):
c3c777b
updated HF repo model card from confluence master document
Browse files
README.md
CHANGED
@@ -9,12 +9,9 @@ license_link: LICENSE
|
|
9 |
This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
|
10 |
developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
|
11 |
For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
|
12 |
-
instruction fine-tuning and resource usage we refer to the model card of Pharia-1-LLM-7B-control.
|
13 |
-
|
14 |
-
embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
|
15 |
-
data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
|
16 |
-
Furthermore it shows good cross-lingual performance allowing for prompting and text to be embedded written
|
17 |
-
in different languages. The finetuning was always performed using English instructions.
|
18 |
|
19 |
|
20 |
## Model Overview
|
@@ -22,7 +19,7 @@ in different languages. The finetuning was always performed using English instru
|
|
22 |
- **Developed by:** Aleph Alpha Research
|
23 |
<!--- **Funded by [optional]:** [More Information Needed]-->
|
24 |
<!--- **Shared by [optional]:** [More Information Needed]-->
|
25 |
-
- **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
|
26 |
instruction-tuning (inspired by the approach of GritLM).
|
27 |
- **Language(s) (NLP):** Trained on English, German, French, Spanish.
|
28 |
<!--- **License:** [More Information Needed]-->
|
@@ -42,25 +39,12 @@ in different languages. The finetuning was always performed using English instru
|
|
42 |
### Model Access
|
43 |
|
44 |
We provide access to our models through the channels listed below.
|
45 |
-
- On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
of
|
50 |
-
|
51 |
-
collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions.
|
52 |
-
We do not log user inputs to the models. We do not train on user data.
|
53 |
-
- **Note:** The same models are made available to users regardless of their geographic location,
|
54 |
-
and the input language but subject to sanction regimes, technology export regulations, and other restrictions that may apply.
|
55 |
-
The same offering is provided to all countries within and external to the European Union if no legal restrictions apply.
|
56 |
-
|
57 |
-
|
58 |
-
<!-- Provide the basic links for the model.
|
59 |
-
|
60 |
-
- **Repository:** [More Information Needed]
|
61 |
-
- **Paper [optional]:** [More Information Needed]
|
62 |
-
- **Demo [optional]:** [More Information Needed]
|
63 |
-
-->
|
64 |
|
65 |
### Intended Use
|
66 |
|
@@ -78,8 +62,9 @@ including those related to military or nuclear applications, and activities not
|
|
78 |
technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
|
79 |
The utilization of our technology is always governed by, and may be limited in accordance with,
|
80 |
our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
|
|
|
81 |
For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
|
82 |
-
our dedicated contact address violations@aleph-alpha.com to communicate with us.
|
83 |
|
84 |
Customers and partners are enabled to use our ticketing
|
85 |
[system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
|
@@ -89,50 +74,52 @@ system [ticketing system](https://servicedesk.aleph-alpha.de/external) for appea
|
|
89 |
|
90 |
Beyond the risks & limitations stated in
|
91 |
the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
|
92 |
-
Pharia-1-Embedding-4608-control has been optimized on embedding
|
93 |
computation only. Therefore, we do not recommend usage for text generation purposes.
|
94 |
|
95 |
## How to Use
|
|
|
96 |
|
97 |
-
### Use with
|
98 |
|
99 |
-
To perform inference with the original model files, you’ll first need to install the
|
100 |
-
[Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file.
|
101 |
-
After installation, download the model weights and use the Scaling inference module to load the
|
102 |
-
checkpoint, vocabulary, and configuration files.
|
103 |
|
104 |
```
|
105 |
-
from pathlib import Path
|
106 |
from torch.nn import CosineSimilarity
|
107 |
-
from
|
108 |
-
|
109 |
-
|
110 |
-
|
111 |
-
)
|
112 |
-
|
113 |
query = "Which country is Galileo from?"
|
114 |
-
query_embeddings =
|
115 |
print(f"Type of embeddings: {type(query_embeddings)},\n\
|
116 |
shape of query embeddings: {query_embeddings.shape}")
|
117 |
# embed the documents:
|
118 |
document_1 = "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
|
119 |
-
document_embeddings_1 =
|
120 |
document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
|
121 |
-
document_embeddings_2 =
|
122 |
# customized embeddings steering the query:
|
123 |
instruction = "Represent the question about TV shows to find a paragraph that answers it."
|
124 |
-
steered_query_embeddings =
|
125 |
-
|
126 |
-
|
|
|
|
|
127 |
# compute similarity between steered query and both documents
|
128 |
-
cossim = CosineSimilarity(dim=
|
129 |
sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
|
130 |
sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
|
131 |
print("Steered embedding causes higher similarity of query to TV show:")
|
132 |
print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
|
133 |
```
|
|
|
134 |
|
135 |
-
###
|
136 |
|
137 |
Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
|
138 |
text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
|
@@ -177,11 +164,13 @@ To further improve performance you can use instructions to steer the model. Inst
|
|
177 |
understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
|
178 |
In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
|
179 |
than the paragraph about the Italian polymath.
|
|
|
180 |
**Step 1:**
|
181 |
Embed the Query with an Instruction
|
182 |
```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
|
183 |
```"input": "Which country is Galileo from?"```
|
184 |
→ Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
|
|
|
185 |
**Step 2:**
|
186 |
Compare the similarity
|
187 |
We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
|
@@ -204,33 +193,56 @@ and ultimately lead to embeddings that are more useful for your use-case.
|
|
204 |
|
205 |
## Evaluation
|
206 |
|
207 |
-
|
208 |
-
and naturally would be expected to underperform on it as we optimize instead for real-world usage and multilinguality.
|
209 |
-
Nonetheless, for comparability we share results on a subset of tasks of the
|
210 |
-
English MTEB benchmark. The subset contains tasks from all task types (classification, summarization, etc.) of
|
211 |
-
the full benchmark and is therefore roughly representative of it.
|
212 |
|
213 |
-
|
214 |
-
|
215 |
|
216 |
-
|
217 |
-
|
218 |
-
|
|
219 |
-
|
220 |
-
|
|
221 |
-
|
222 |
|
223 |
|
224 |
-
####
|
225 |
-
We ablate how performance changes when not using task-specific instructions for the embeddings.
|
226 |
|
227 |
-
|
228 |
-
|
229 |
-
|
|
230 |
-
|
|
231 |
-
|
232 |
|
233 |
|
234 |
#### Methodology for Multilingual Evaluations (European languages)
|
235 |
* Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
|
236 |
have N subsets in different languages. Subsets themselves can also contain N languages, e.g. translation-related tasks. Base script
|
@@ -253,12 +265,18 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
|
|
253 |
- i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
|
254 |
- this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
|
255 |
|
256 |
#### Europe by task
|
257 |
|
258 |
| Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
|
259 |
|-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
|
260 |
-
| Pharia-1-Embedding-4608-control |
|
261 |
-
| GritLM-7B |
|
262 |
|
263 |
#### Europe by language
|
264 |
|
@@ -266,48 +284,33 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
|
|
266 |
|-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
|
267 |
| Pharia-1-Embedding-4608-control | 0.925309 | 0.902113 | 0.937961 | 0.953719 | 0.942352 | 0.945642 | 0.934516 |
|
268 |
| GritLM-7B | 0.934603 | 0.905669 | 0.942364 | 0.962042 | 0.949731 | 0.947428 | 0.940306 |
|
|
|
|
|
269 |
|
270 |
|
271 |
-
#### Evaluations on cross-lingual capabilities
|
272 |
-
There are important use cases where one wants to retrieve multiple documents on a topic or answering questions that are formulated in a
|
273 |
-
different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual capabilities
|
274 |
-
we evaluated Pharia-1-Embedding-4608-control, GritLM and Nvidia-Embed-v2 on the MLQA-V1 datasets (Facebook) for German/English and
|
275 |
-
English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset providing correct and adversarial translations
|
276 |
-
of a sentence in the corresponding pair language. In order to check quality over a larger range of sample size we did the accuracy
|
277 |
-
computations for varying number of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the
|
278 |
-
full set of data (2900 samples available).
|
279 |
-
|
280 |
|
281 |
-
#### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
|
282 |
-
|
283 |
-
|# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
|
284 |
-
|:---:|:---:|:---:|:---:|:---:|
|
285 |
-
|1000|86.0%|82.5%|77.0%|87.0%|
|
286 |
-
|2000|79.5%|73.4%|69.4%|76.8%|
|
287 |
-
|4000|65.3%|59.2%|56.0%|62.7%|
|
288 |
-
|6000|54.3%|48.6%|45.6%|52.6%|
|
289 |
-
|10000|38.6%|32.8%|32.8%|39.4%|
|
290 |
|
|
|
291 |
|
292 |
-
|
293 |
|
294 |
-
|# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
|
295 |
-
|:---:|:---:|:---:|:---:|:---:|
|
296 |
-
|1000|87.5%|82.0%|81.5%|87.0%|
|
297 |
-
|2000|78.5%|73.9%|70.7%|77.0%|
|
298 |
-
|4000|65.5%|59.3%|56.9%|64.2%|
|
299 |
-
|6000|55.3%|49.2%|46.2%|53.4%|
|
300 |
-
|10000|41.7%|35.5%|33.2%|40.0%|
|
301 |
|
302 |
-
#### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
|
303 |
|
|
|
|
|
304 |
|
305 |
-
|Model Name
|
306 |
-
|
307 |
-
|
|
308 |
-
|
|
309 |
-
|
|
310 |
-
|
|
|
311 |
|
312 |
|
313 |
## Training Details
|
|
|
9 |
This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
|
10 |
developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
|
11 |
For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
|
12 |
+
instruction fine-tuning and resource usage we refer to the model card of [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
|
13 |
+
|
14 |
+
Due to being trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. Furthermore, it shows strong cross-lingual performance, so that the instruction and the text to be embedded can be written in different languages. The finetuning was always performed using English instructions.
|
15 |
|
16 |
|
17 |
## Model Overview
|
|
|
19 |
- **Developed by:** Aleph Alpha Research
|
20 |
<!--- **Funded by [optional]:** [More Information Needed]-->
|
21 |
<!--- **Shared by [optional]:** [More Information Needed]-->
|
22 |
+
- **Model type/architecture:** Embedding adapter on top of Pharia-1-LLM-7B-control trained with representational
|
23 |
instruction-tuning (inspired by the approach of GritLM).
|
24 |
- **Language(s) (NLP):** Trained on English, German, French, Spanish.
|
25 |
<!--- **License:** [More Information Needed]-->
|
|
|
39 |
### Model Access
|
40 |
|
41 |
We provide access to our models through the channels listed below.
|
42 |
+
- On-premise installation: Our customers are supplied with our full LLM and Embedding model stack, including model weights and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment. We provide our customers with open access to our full model checkpoint including weights and code for commercial use.
|
43 |
+
- Downloadable from Huggingface: An HF-adapted version of our model can be found in our Huggingface repo (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf) together with code snippets that make the model easy to use.
|
44 |
+
Please refer to the changelog for updates to the models served. We do not deprecate officially released versions of old model generations when we release newer versions, so users can continue to have access to available models.
|
45 |
+
No prompt data is stored when using our systems, which means that we do not collect PII (personally identifiable information) for any of our public API users as detailed in our Terms & Conditions. We do not log user inputs to the models. We do not train on user data.
|
46 |
+
- **Note**: The same models are made available to users regardless of their geographic location and input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and outside the European Union if no legal restrictions apply.
|
47 |
+
|
48 |
|
49 |
### Intended Use
|
50 |
|
|
|
62 |
technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
|
63 |
The utilization of our technology is always governed by, and may be limited in accordance with,
|
64 |
our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
|
65 |
+
|
66 |
For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
|
67 |
+
our dedicated contact address [violations@aleph-alpha.com](mailto:violations@aleph-alpha.com) to communicate with us.
|
68 |
|
69 |
Customers and partners are enabled to use our ticketing
|
70 |
[system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.
|
|
|
74 |
|
75 |
Beyond the risks & limitations stated in
|
76 |
the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
|
77 |
+
- Pharia-1-Embedding-4608-control has been optimized on embedding
|
78 |
computation only. Therefore, we do not recommend usage for text generation purposes.
|
79 |
|
80 |
## How to Use
|
81 |
+
We provide two access pathways for our Pharia4608 embedding model. The first leverages the HF ecosystem and can be found here: https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control-hf. The code snippet in the box below demonstrates its use. As soon as the model class is invoked, the model is loaded from the repo and is ready for use. The other access pathway is through our public Scaling code base. In this version the model weights were not converted to HF format, and the repo https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control can be cloned as is. The model path has to be adjusted to the local path where the model was downloaded. The model cards in the corresponding repositories contain only the code snippet that applies to the specific repo.
|
82 |
|
83 |
+
### Use with Huggingface
|
84 |
|
85 |
|
86 |
```
|
|
|
87 |
from torch.nn import CosineSimilarity
|
88 |
+
from transformers import AutoConfig, AutoModel
|
89 |
+
from transformers import PreTrainedTokenizerFast
|
90 |
+
MODEL_PATH = 'Aleph-Alpha/Pharia-1-Embedding-4608-control-hf'
|
91 |
+
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
|
92 |
+
tokenizer = PreTrainedTokenizerFast.from_pretrained(MODEL_PATH)
|
93 |
+
model = AutoModel.from_pretrained(MODEL_PATH,
|
94 |
+
trust_remote_code=True,
|
95 |
+
config=config,
|
96 |
+
tokenizer=tokenizer).cuda()
|
97 |
query = "Which country is Galileo from?"
|
98 |
+
query_embeddings = model.encode_queries(query, convert_to_tensor=True)
|
99 |
print(f"Type of embeddings: {type(query_embeddings)},\n\
|
100 |
shape of query embeddings: {query_embeddings.shape}")
|
101 |
# embed the documents:
|
102 |
document_1 = "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
|
103 |
+
document_embeddings_1 = model.encode_corpus(document_1, convert_to_tensor=True)
|
104 |
document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
|
105 |
+
document_embeddings_2 = model.encode_corpus(document_2, convert_to_tensor=True)
|
106 |
# customized embeddings steering the query:
|
107 |
instruction = "Represent the question about TV shows to find a paragraph that answers it."
|
108 |
+
steered_query_embeddings = model.encode_queries(
|
109 |
+
query,
|
110 |
+
instruction=instruction,
|
111 |
+
convert_to_tensor=True
|
112 |
+
)
|
113 |
# compute similarity between steered query and both documents
|
114 |
+
cossim = CosineSimilarity(dim=0, eps=1e-6)
|
115 |
sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
|
116 |
sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
|
117 |
print("Steered embedding causes higher similarity of query to TV show:")
|
118 |
print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath: ({sim2})")
|
119 |
```
|
120 |
+
Disclaimer: For the official evaluation scores we used the Scaling-compatible checkpoint available under Pharia-1-Embedding-4608-control (https://huggingface.co/Aleph-Alpha/Pharia-1-Embedding-4608-control).
|
121 |
|
122 |
+
### Example for instruction embedding
|
123 |
|
124 |
Pharia-1-Embedding-4608-control is useful for any use-case that relates to estimating the similarity/relevance between
|
125 |
text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
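
As a rough illustration of how such similarity estimates feed into retrieval or re-ranking, the following minimal sketch ranks candidate documents by cosine similarity to a query. It uses small toy vectors in place of real model embeddings, and the helper names `cosine_similarity` and `rank_documents` are hypothetical and not part of any Aleph Alpha library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings (assumption).
query = [1.0, 0.0, 1.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.1, 0.9],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query, documents)
print(ranking)
```

In practice the toy vectors would be replaced by the embeddings returned by the model for the query and for each document.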
|
|
|
164 |
understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
|
165 |
In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV Show higher
|
166 |
than the paragraph about the Italian polymath.
|
167 |
+
|
168 |
**Step 1:**
|
169 |
Embed the Query with an Instruction
|
170 |
```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```
|
171 |
```"input": "Which country is Galileo from?"```
|
172 |
→ Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```
|
173 |
+
|
174 |
**Step 2:**
|
175 |
Compare the similarity
|
176 |
We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
|
|
|
193 |
|
194 |
## Evaluation
|
195 |
|
196 |
+
### Evaluations on cross-lingual capabilities
|
197 |
|
198 |
+
There are important use cases where one wants to retrieve multiple documents on a topic or answer questions that are formulated
|
199 |
+
in a different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual
|
200 |
+
capabilities we evaluated Pharia-1-Embedding-4608-control, GritLM, Nvidia-Embed-v2 and BGE-Multilingual-Gemma2
|
201 |
+
on the MLQA-V1 datasets (Facebook) for German/English and English/Spanish language pairings. For German/French we
|
202 |
+
used the CLSD-WMT19 dataset providing correct and adversarial translations of a sentence in the corresponding pair language.
|
203 |
+
To check quality over a larger range of sample sizes, we computed the accuracies for varying numbers of samples
|
204 |
+
taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the full set of data (2900 samples available).
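
The exact evaluation protocol is not spelled out here, so the following is only a sketch of one plausible reading: each query has exactly one correct counterpart in the other language, and accuracy is the share of queries whose most similar candidate is that counterpart (top-1 accuracy). The function `top1_accuracy` and the toy vectors are assumptions for illustration. The sketch also shows why accuracy tends to drop as the number of samples grows: every additional sample adds a potential distractor to the candidate pool.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top1_accuracy(query_vecs, doc_vecs):
    """Fraction of queries whose most similar document sits at the same index."""
    hits = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(doc_vecs)), key=lambda j: cosine(q, doc_vecs[j]))
        hits += int(best == i)
    return hits / len(query_vecs)


# Toy 2-dimensional "embeddings" (assumption); query i belongs to document i.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
docs = [[0.9, 0.4], [0.1, 0.9], [1.0, 0.05]]

acc_small_pool = top1_accuracy(queries[:2], docs[:2])  # two candidates
acc_large_pool = top1_accuracy(queries, docs)          # three candidates
print(acc_small_pool, acc_large_pool)
```

With two candidates every query finds its counterpart; adding the third pair introduces a document that is closer to the first query than its own counterpart, which lowers the accuracy.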
|
205 |
|
206 |
+
#### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models
|
207 |
+
|
208 |
+
|# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
|
209 |
+
|:---:|:---:|:---:|:---:|:---:|
|
210 |
+
|1000|86.0%|82.5%|77.0%|87.0%|
|
211 |
+
|2000|79.5%|73.4%|69.4%|76.8%|
|
212 |
+
|4000|65.3%|59.2%|56.0%|62.7%|
|
213 |
+
|6000|54.3%|48.6%|45.6%|52.6%|
|
214 |
+
|10000|38.6%|32.8%|32.8%|39.4%|
|
215 |
|
216 |
|
217 |
+
#### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models
|
|
|
218 |
|
219 |
+
|# samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
|
220 |
+
|:---:|:---:|:---:|:---:|:---:|
|
221 |
+
|1000|87.5%|82.0%|81.5%|87.0%|
|
222 |
+
|2000|78.5%|73.9%|70.7%|77.0%|
|
223 |
+
|4000|65.5%|59.3%|56.9%|64.2%|
|
224 |
+
|6000|55.3%|49.2%|46.2%|53.4%|
|
225 |
+
|10000|41.7%|35.5%|33.2%|40.0%|
|
226 |
+
|
227 |
+
#### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models
|
228 |
|
229 |
|
230 |
+
|Model Name | accuracy |
|
231 |
+
|:-----------------------------:|:--------------------------------:|
|
232 |
+
|Pharia-1-Embedding-4608-control|95.1% |
|
233 |
+
|GritLM-7B |94.2% |
|
234 |
+
|Nvidia-Embed-v2 |93.4% |
|
235 |
+
|BGE-Gemma2 |95.4% |
|
236 |
+
|
237 |
+
|
238 |
+
## Evaluations on MTEB tasks
|
239 |
+
|
240 |
+
To evaluate our model's multilingual capabilities, we compare it against other source-available, high-performing embedding models listed in the
|
241 |
+
MTEB leaderboard. For the following evaluations we compare the following models:
|
242 |
+
- NVEmbed-V2: The highest scoring model in the MTEB leaderboard at the time of release.
|
243 |
+
- BGE-Multilingual-Gemma2: The highest scoring multilingual model in the MTEB leaderboard at the time of release.
|
244 |
+
- GritLM: A generative representational instruction tuned language model.
|
245 |
+
|
246 |
#### Methodology for Multilingual Evaluations (European languages)
|
247 |
* Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
|
248 |
have N subsets in different languages. Subsets themselves can also contain N languages, e.g. translation-related tasks. Base script
|
|
|
265 |
- i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
|
266 |
- this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
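
The count of 18 translation pair subsets can be reproduced with a few lines of Python. This is a sketch only: Italian and German are named above, while the other three language codes are assumed for illustration.

```python
from itertools import permutations

# Five languages; "it" and "de" are named in the text, the rest are assumed.
languages = ["de", "en", "es", "fr", "it"]

# Ordered (source, target) pairs: 5 * 4 = 20.
all_pairs = list(permutations(languages, 2))

# The Italian <-> German subsets do not exist, which removes both directions.
missing = {("it", "de"), ("de", "it")}
available_pairs = [pair for pair in all_pairs if pair not in missing]

print(len(all_pairs), len(available_pairs))
```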
|
267 |
|
268 |
+
We used the official scores from the MTEB leaderboard where available; for some models and subsets we computed the scores ourselves with the official Huggingface checkpoints and
|
269 |
+
instructions referenced in the Paper or Model card.
|
270 |
+
|
271 |
#### Europe by task
|
272 |
|
273 |
| Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
|
274 |
|-------------------------------------------------------|-------------------------------------:|----------:|---------------------:|--------------------------------:|--------------------:|---------:|----------:|
|
275 |
+
| Pharia-1-Embedding-4608-control | 72.49 | 99.19 | 86.51 | 75.58 | 98.24 | 87.67 | 86.61 |
|
276 |
+
| GritLM-7B | 76.64 | 99.43 | 86.45 | 78.93 | 98.46 | 88.07 | 87.99 |
|
277 |
+
| BGE-Multilingual-Gemma2 | 69.72 | 99.38 | 86.90 | 78.57 | 98.58 | 86.69 | 86.64 |
|
278 |
+
| Nvidia-Embed-v2 | 70.72 | 99.14 | 73.22 | 75.21 | 96.65 | 87.36 | 83.72 |
|
279 |
+
|
280 |
|
281 |
#### Europe by language
|
282 |
|
|
|
284 |
|-------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|
|
285 |
| Pharia-1-Embedding-4608-control | 0.925309 | 0.902113 | 0.937961 | 0.953719 | 0.942352 | 0.945642 | 0.934516 |
|
286 |
| GritLM-7B | 0.934603 | 0.905669 | 0.942364 | 0.962042 | 0.949731 | 0.947428 | 0.940306 |
|
287 |
+
| BGE-Multilingual-Gemma2| 93.07 | 92.17 | 94.91 | 94.64 | 96.28 | 94.94 | 94.35 |
|
288 |
+
| Nvidia-Embed-v2 | 91.58 | 88.85 | 90.51 | 93.94 | 95.08 | 93.78| 92.29 |
|
289 |
|
290 |
|
291 |
|
292 |
|
293 |
+
#### MTEB – English only
|
294 |
|
295 |
+
| |Retrieval|Classification|STS|Summarization|PairClassification|Clustering|Reranking|Average|
|
296 |
+
|---|--|--|--|--|--|--|--|--|
|
297 |
+
|Nvidia-Embed-v2|62.65|90.37|84.31|30.7|88.67|58.46|60.65|72.31|
|
298 |
+
|BGE-Multilingual-Gemma2|59.24|88.08|83.88|31.2|85.84|54.65|59.72|69.88|
|
299 |
+
|GritLM-7B|57.36|78.65|83.35|30.39|87.29|50.61|60.48|66.58|
|
300 |
+
|Pharia-1-Embedding-4608-control|39.15 |74.40|82.7 |30.95 |81.73|46.23|57.45|58.94|
|
301 |
|
|
302 |
|
|
|
303 |
|
304 |
+
#### Ablation for “No Instruction” case
|
305 |
+
We ablate how performance changes when not using task-specific instructions for the embeddings.
|
306 |
|
307 |
+
|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
|
308 |
+
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|
309 |
+
|Instruction |51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.7 |30.95|70.97|**62.99**|
|
310 |
+
|No Instruction |50.23|60.31|84.45|86.36|50.6 |31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
|
311 |
+
|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|
|
312 |
+
|
313 |
+
We observe slightly reduced performance across most tasks when not using task-specific instructions with an average loss in performance of roughly 1%.
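
The ablation figures can be recomputed from the two score rows. The formula is not stated in the table; the numbers are consistent with a relative difference taken with respect to the "No Instruction" score, which is the assumption used in this sketch.

```python
# Scores copied from the ablation table, in task order.
instruction = [51.09, 61.71, 84.56, 86.37, 51.77, 34.29,
               37.82, 89.56, 87.08, 69.7, 30.95, 70.97]
no_instruction = [50.23, 60.31, 84.45, 86.36, 50.6, 31.87,
                  37.58, 88.75, 86.39, 71.28, 31.00, 68.92]


def relative_delta(with_instr, without_instr):
    """Relative change in percent, measured against the no-instruction score (assumed)."""
    return (without_instr - with_instr) / without_instr * 100


deltas = [round(relative_delta(a, b), 2) for a, b in zip(instruction, no_instruction)]

avg_instruction = sum(instruction) / len(instruction)
avg_no_instruction = sum(no_instruction) / len(no_instruction)
avg_delta = round(relative_delta(avg_instruction, avg_no_instruction), 2)

print(deltas)
print(round(avg_instruction, 2), round(avg_no_instruction, 2), avg_delta)
```

Under this assumption the per-task values and the average match the table, including the two tasks (SciFact and SummEval) where dropping the instruction gives a slightly higher score.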
|
314 |
|
315 |
|
316 |
## Training Details
|