---
license: other
license_name: open-aleph-license
license_link: LICENSE
---

# Model Card for Pharia-1-Embedding-4608-control

This model card provides an overview of Pharia-1-Embedding-4608-control, an embedding model
developed by Aleph Alpha Research*. Pharia-1-Embedding-4608-control has been built on top of Pharia-1-LLM-7B-control.
For additional training details, including architecture, tokenization, tokenizer fertility, pre-training,
instruction fine-tuning and resource usage, we refer to the model card of Pharia-1-LLM-7B-control.
Because it was trained with a diverse set of instructions, Pharia-1-Embedding-4608-control can deliver customized
embeddings at runtime without further finetuning. Pharia-1-Embedding-4608-control was trained on carefully curated
data in compliance with applicable EU and national regulations, including copyright and data privacy laws.
Furthermore, it shows good cross-lingual performance, so prompts and the texts to be embedded can be written
in different languages. Finetuning was always performed using English instructions.

## Model Overview

- **Developed by:** Aleph Alpha Research
<!--- **Funded by [optional]:** [More Information Needed]-->
<!--- **Shared by [optional]:** [More Information Needed]-->
- **Model type:** Embedding adapter on top of Pharia-1-LLM-7B-control, trained with representational instruction-tuning (inspired by the approach of GritLM).
- **Language(s) (NLP):** Trained on English, German, French, Spanish.
<!--- **License:** [More Information Needed]-->
<!--- **Finetuned from model [optional]:** [More Information Needed]-->
- **USP:** The model exhibits superior quality on pure cross-lingual tasks (German, English, French & Spanish pairings; see evaluation below).

### Model Description

|Model |Embedding Size|Description|
|--------------------------------|--------------|-----------|
|Pharia-1-Embedding-4608-control |4608|Pharia-1-Embedding-4608-control is an embedding model optimized for German, French and Spanish and designed for customizable embeddings at runtime via instructions (prompts)|

<!-- Provide a longer summary of what this model is. -->

### Model Access

We provide access to our models through the channels listed below.
- On-premise installation: Our customers are supplied with our full LLM and embedding model stack, including model weights
and inference runtime. Contact us for options to deploy Pharia-1-Embedding-4608-control in any cloud or on-premise environment.
We provide our customers with open access to our full model checkpoint, including weights and code, for commercial use.
Please refer to the changelog for updates to the models served. We do not deprecate officially released versions
of old model generations when we release newer versions, so users can continue to have access to available models.
No prompt data is stored when using our systems, which means that we do not
collect PII (personally identifiable information) for any of our public API users, as detailed in our Terms & Conditions.
We do not log user inputs to the models. We do not train on user data.
- **Note:** The same models are made available to users regardless of their geographic location
and input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply.
The same offering is provided to all countries within and outside of the European Union if no legal restrictions apply.

<!-- Provide the basic links for the model.

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]
-->

### Intended Use

Pharia-1-Embedding-4608-control is intended to be deployed as a component of AI systems or applications.
Use-cases and the model's capabilities include but are not limited to: information retrieval, semantic search, re-ranking and clustering.

#### Out-of-Scope Use

Pharia-1-Embedding-4608-control is not to be used for illegal or unlawful actions of any kind or with any illegal
or unlawful content. This includes in particular prohibited activities such as engaging in terrorism,
violence, human trafficking, illegal distribution of materials to minors, sexual solicitation, any other
criminal activities, harassment, discrimination, creating or promoting malicious code or activities risking death or harm,
including those related to military or nuclear applications, and activities not in compliance with sanction regimes,
technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards.
The utilization of our technology is always governed by, and may be limited in accordance with,
our Terms of Use, the Open Aleph License, or any specific agreement we might have established with you.
For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via
our dedicated contact address violations@aleph-alpha.com.

Customers and partners can also use our [ticketing system](https://servicedesk.aleph-alpha.de/external) for appeals, claims and feedback.

### Use limitations

Beyond the risks & limitations stated for the original [Pharia-1-LLM-7B-control](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control), the following limitation applies:
Pharia-1-Embedding-4608-control has been optimized for embedding computation only. Therefore, we do not recommend using it for text generation purposes.

## How to Use

### Use with the Scaling inference code base

To perform inference with the original model files, you’ll first need to install the
[Scaling library](https://github.com/Aleph-Alpha/scaling). Follow the installation instructions provided in the repository's README file.
After installation, download the model weights and use the Scaling inference module to load the
checkpoint, vocabulary, and configuration files.

```python
from pathlib import Path

from torch.nn import CosineSimilarity

from scaling.transformer.inference import TransformerInferenceModule

MODEL_PATH = "/path/to/model"
inference_model = TransformerInferenceModule.from_checkpoint(
    checkpoint_dir=Path(MODEL_PATH),
)

# embed the query:
query = "Which country is Galileo from?"
query_embeddings = inference_model.encode_queries(query, convert_to_tensor=True)
print(f"Type of embeddings: {type(query_embeddings)},\n\
shape of query embeddings: {query_embeddings.shape}")

# embed the documents:
document_1 = "Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql."
document_embeddings_1 = inference_model.encode_corpus(document_1, convert_to_tensor=True)
document_2 = "Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy."
document_embeddings_2 = inference_model.encode_corpus(document_2, convert_to_tensor=True)

# customized embeddings steering the query:
instruction = "Represent the question about TV shows to find a paragraph that answers it."
steered_query_embeddings = inference_model.encode_queries(
    query,
    instruction=instruction,
    convert_to_tensor=True,
)

# compute similarity between the steered query and both documents
cossim = CosineSimilarity(dim=1, eps=1e-6)
sim1 = round(cossim(document_embeddings_1, steered_query_embeddings).item(), 3)
sim2 = round(cossim(document_embeddings_2, steered_query_embeddings).item(), 3)
print("Steered embedding causes higher similarity of query to TV show:")
print(f"Similarity query/TV show ({sim1}) > similarity query/Italian polymath ({sim2})")
```

### Explanation of the instruct embedding code example

Pharia-1-Embedding-4608-control is useful for any use-case that involves estimating the similarity/relevance between
text fragments. This is relevant for use-cases such as information retrieval, semantic search, re-ranking and clustering.
We use the task of information retrieval as a guiding example where we assume the
following query: “Which country is Galileo from?” and two documents:
- Galileo is a German television program series produced and broadcast on ProSieben television network. It is also sold to broadcasters in other countries (namely Russia and Poland). The first show was broadcast in 1998, and is now stored in the Arctic World Archive in Svalbard, Norway, after being transferred to special film created by Piql.
- Galileo di Vincenzo Bonaiuti de' Galilei (15 February 1564 - 8 January 1642), commonly referred to as Galileo Galilei or mononymously as Galileo, was an Italian (Florentine) astronomer, physicist and engineer, sometimes described as a polymath. He was born in the city of Pisa, then part of the Duchy of Florence and present-day Italy.

Source: Wikipedia

For our guiding example we assume the context of this use-case is a question-answering system for movies and TV shows.

**Step 1:**

Embed the Query

```
"input": "Which country is Galileo from?"
```
→ Embedding: ```[-0.6780134, 0.61449033, 0.102911085, ...]```

**Step 2:**

Embed the Documents

```"input": "Galileo is a German television program series ..."```

→ Embedding: ```[-0.36119246, 0.7793595, -0.38735497, ...]```

```"input": "Galileo di Vincenzo Bonaiuti de' Galilei ..."```

→ Embedding: ```[-0.25108248, 1.0496024, -0.20945309, ...]```

**Step 3:**

Compare the similarity

A typical similarity measure between vectors is cosine similarity. Higher numbers
indicate more similar vectors and by extension capture the concept of relevance.
In a RAG application these scores determine the ranking during the retrieval step.
In this example, we obtain the following cosine similarities:
- Query vs. German TV show: ~0.661
- Query vs. Italian polymath: ~0.757

This implies that the paragraph about the Italian polymath would be ranked higher than the paragraph
about the German TV show, which is the one we’re interested in.

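If you ran the Scaling snippet above, this unsteered comparison can be reproduced with a few extra lines; the sketch below reuses the variables `query_embeddings`, `document_embeddings_1` and `document_embeddings_2` defined there (exact values depend on your run).

```python
# Minimal sketch: cosine similarity between the unsteered query embedding and both
# document embeddings, reusing the variables from the Scaling example above.
from torch.nn import CosineSimilarity

cossim = CosineSimilarity(dim=1, eps=1e-6)
sim_tv_show = round(cossim(document_embeddings_1, query_embeddings).item(), 3)
sim_polymath = round(cossim(document_embeddings_2, query_embeddings).item(), 3)

# Without an instruction, the Italian polymath paragraph scores higher (~0.757 vs. ~0.661).
print(f"Query vs. German TV show: {sim_tv_show}")
print(f"Query vs. Italian polymath: {sim_polymath}")
```
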
#### Customized Embeddings

To further improve performance you can use instructions to steer the model. Instructions can help the model
understand nuances of your specific data and ultimately lead to embeddings that are more useful for your use-case.
In this case, we aim to get embeddings that would lead to ranking the paragraph about the German TV show higher
than the paragraph about the Italian polymath.

**Step 1:**

Embed the Query with an Instruction

```"instruction": "Represent the question about TV shows to find a paragraph that answers it."```

```"input": "Which country is Galileo from?"```

→ Embedding: ```[-0.6310919, 1.4309896, -0.85546875, ...]```

**Step 2:**

Compare the similarity

We leave the embeddings of the documents untouched and now obtain the following cosine similarities:
- Query vs. German TV show: ~0.632
- Query vs. Italian polymath: ~0.512

These new cosine similarities imply that the ranking has indeed changed and the paragraph about the German TV show is
**now more relevant**. This shows that instructions can help the model understand nuances in the data better
and ultimately lead to embeddings that are more useful for your use-case.

#### Tips on using the model

- First try and ideally evaluate the model on your data without instructions to see whether performance aligns with your expectations out-of-the-box.
- If you decide to use an instruction with the aim of further boosting performance, we suggest using this template as a guideline:
  * ```Template: Represent the [X] to find a [Y] that [describe how the X and Y relate]```
  * Examples:
    1. Represent the newspaper paragraph to find a newspaper paragraph with the same topic
    2. Represent the sentence to find another sentence with the same meaning
- In cases where the two texts to compare are different in nature (e.g. query and document) – also called “asymmetric” – we suggest to first add an instruction to query texts only. Again, try and ideally evaluate the model in this setting. Then, if your aim is to further boost performance, we suggest that you add instructions to document texts as well, where [X] and [Y] are flipped accordingly (see the sketch after this list).

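To illustrate the last tip, the following sketch builds flipped instructions for an asymmetric query/document setup from the template above; the helper function and the concrete [X]/[Y] choices are illustrative and not part of the released tooling.

```python
# Illustrative sketch: deriving flipped instructions for an asymmetric setup from the
# template "Represent the [X] to find a [Y] that [describe how the X and Y relate]".
def build_instruction(x: str, y: str, relation: str) -> str:
    return f"Represent the {x} to find a {y} that {relation}"


# Instruction used for queries: X = question, Y = paragraph.
query_instruction = build_instruction("question", "paragraph", "answers it")
# Instruction used for documents: X and Y flipped.
document_instruction = build_instruction("paragraph", "question", "it answers")

print(query_instruction)     # Represent the question to find a paragraph that answers it
print(document_instruction)  # Represent the paragraph to find a question that it answers
```

The query-side instruction would be passed via the `instruction` argument of `encode_queries` as shown in the Scaling example above; whether document-side instructions can be passed to `encode_corpus` in the same way depends on your inference setup and is an assumption here.
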
## Evaluation

Pharia-1-Embedding-4608-control has not been optimized for [MTEB](https://github.com/embeddings-benchmark/mteb) (a generic benchmark),
and naturally would be expected to underperform on it, as we optimize instead for real-world usage and multilinguality.
Nonetheless, for comparability we share results on a subset of tasks of the
English MTEB benchmark. The subset contains tasks from all task types (classification, summarization, etc.) of
the full benchmark and is therefore roughly representative of it.

#### MTEB – English

For this evaluation we use task-specific instructions from [MEDI2](https://huggingface.co/datasets/GritLM/MEDI2).

|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|Pharia-1-Embedding-4608-control|51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.70|30.95|70.97|**62.99**|
|luminous-base (symmetric)|41.94|56.17|80.42|82.49|48.10|28.70|28.32|90.31|86.07|56.77|31.44|69.81|**58.38**|
|GritLM-7B|54.95|67.34|88.19|88.45|47.98|36.80|38.27|89.88|85.64|64.99|30.78|70.12|**63.62**|
|LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised|54.74|65.19|84.92|88.05|51.20|32.96|22.82|89.58|88.05|73.90|31.01|88.79|**64.27**|

#### Ablation for the “No Instruction” case

We ablate how performance changes when not using task-specific instructions for the embeddings.

|Model Name|ArguAna|AskUbuntuDupQuestions|BIOSSES|Banking77Classification|EmotionClassification|MedrxivClusteringS2S|NFCorpus|STS17|STSBenchmark|SciFact|SummEval|TwitterSemEval2015|Average|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|Instruction|51.09|61.71|84.56|86.37|51.77|34.29|37.82|89.56|87.08|69.70|30.95|70.97|**62.99**|
|No Instruction|50.23|60.31|84.45|86.36|50.60|31.87|37.58|88.75|86.39|71.28|31.00|68.92|**62.31**|
|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|+2.22%|+0.16%|-2.97%|**-1.09%**|

#### Methodology for Multilingual Evaluations (European languages)

* Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval etc.). Furthermore, tasks can
have multiple subsets for different languages, and subsets themselves can also contain multiple languages, e.g. for translation-related tasks. The base script
comes from [gritlm/evaluation/eval_mteb.py at main · ContextualAI/gritlm](https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py) and
includes MEDI2-style instructions for many MTEB tasks. The instructions are all in English. All evaluations use MEDI2-style instructions except for
the “no instructions” case (see above). If a task does not have MEDI2-style instructions, we skip the task. German, Italian, Spanish, Portuguese and French were used as the European languages for the MTEB tests.
* For our Multilingual Evaluations (European languages) we use the tasks
from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/task_selection/europe_tasks.csv) and then filter for tasks where there is at least one subset with at least one of the European languages.
* We skip BibleNLPBitextMining and FloresBitextMining because they don’t have ‘test’ splits, only a ‘train’ split, which we don’t want to use for evaluation (→ training data contamination likely).
* We evaluate subsets which contain at least one of the European languages → that’s why there is also an “English” language column, because there are subsets that are e.g. En ↔︎ De and are thus considered.
* The tasks that remain are
  - AmazonCounterfactualClassification
  - BUCC.v2
  - DiaBlaBitextMining
  - MassiveScenarioClassification
  - NTREXBitextMining
  - STS17
* For NTREXBitextMining the subsets are further filtered down to only pairs of the European languages, instead of at least one European language
  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages (-2 because Italian ↔︎ German doesn’t exist)
  - this is done because otherwise there are 250 translation pair subsets which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)

#### Europe by task

| Model Name | AmazonCounterfactualClassification | BUCC.v2 | DiaBlaBitextMining | MassiveScenarioClassification | NTREXBitextMining | STS17 | Average |
|----------------------------------------------|-----------------------------------:|---------:|-------------------:|------------------------------:|------------------:|---------:|---------:|
| luminous-base-symmetric | 0.710921 | 0.990569 | 0.85374 | 0.710148 | 0.971263 | 0.879475 | 0.852686 |
| Pharia-7b-2048-medi1-causal-weighted-adapter | 0.735118 | 0.984346 | 0.822481 | 0.749375 | 0.968538 | 0.852473 | 0.852055 |
| Pharia-1-Embedding-4608-control | 0.724946 | 0.991884 | 0.865101 | 0.755763 | 0.982374 | 0.876741 | 0.866135 |
| GritLM-7B | 0.766381 | 0.994298 | 0.864504 | 0.789334 | 0.984593 | 0.880716 | 0.879971 |

#### Europe by language

| Model Name | deu-Latn | eng-Latn | fra-Latn | por-Latn | ita-Latn | spa-Latn | Average |
|----------------------------------------------|---------:|---------:|---------:|---------:|---------:|---------:|---------:|
| luminous-base-symmetric | 0.913887 | 0.90055 | 0.929288 | 0.927929 | 0.932836 | 0.93469 | 0.923197 |
| Pharia-7b-2048-medi1-causal-weighted-adapter | 0.914817 | 0.876927 | 0.918247 | 0.938783 | 0.92802 | 0.934084 | 0.91848 |
| Pharia-1-Embedding-4608-control | 0.925309 | 0.902113 | 0.937961 | 0.953719 | 0.942352 | 0.945642 | 0.934516 |
| GritLM-7B | 0.934603 | 0.905669 | 0.942364 | 0.962042 | 0.949731 | 0.947428 | 0.940306 |

#### Evaluations on cross-lingual capabilities

There are important use cases where one wants to retrieve multiple documents on a topic, or answers to questions, that are formulated in a
different language than the query. This increases recall and information retrieval coverage. For testing on cross-lingual capabilities
we evaluated Pharia-1-Embedding-4608-control, GritLM and Nvidia-Embed-v2 on the MLQA-V1 datasets (Facebook) for German/English and
English/Spanish language pairings. For German/French we used the CLSD-WMT19 dataset, which provides correct and adversarial translations
of a sentence in the corresponding pair language. In order to check quality over a larger range of sample sizes we did the accuracy
computations for varying numbers of samples taken from the MLQA-V1 dataset. For the CLSD-WMT19 evaluation we employed the
full set of data (2,900 samples available).

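The exact evaluation script is not part of this model card; as a rough sketch under that caveat, one common way to compute such a retrieval accuracy is top-1 matching over the sampled pool, which also explains why the reported accuracies drop as the number of samples (and hence distractors) grows.

```python
# Hedged sketch (assumed methodology, not the released evaluation code): embed N queries
# in language A and their paired documents in language B, then count how often the paired
# document is the top-1 nearest neighbour by cosine similarity.
import torch


def top1_crosslingual_accuracy(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    # query_emb, doc_emb: (N, dim); row i of doc_emb is the gold match for query i.
    q = torch.nn.functional.normalize(query_emb, dim=1)
    d = torch.nn.functional.normalize(doc_emb, dim=1)
    sims = q @ d.T                    # (N, N) cosine similarity matrix
    predictions = sims.argmax(dim=1)  # most similar document per query
    gold = torch.arange(len(q))
    return (predictions == gold).float().mean().item()


# Random embeddings only illustrate the mechanics; a larger pool means more distractors.
print(top1_crosslingual_accuracy(torch.randn(100, 64), torch.randn(100, 64)))
```
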
#### MLQA-V1 Ger/Eng cross-lingual accuracies for the considered models

|# of samples|Pharia4608|GritLM|Nvidia-Embed-v2|BGE-Gemma2|
|:---:|:---:|:---:|:---:|:---:|
|1000|86.0%|82.5%|77.0%|87.0%|
|2000|79.5%|73.4%|69.4%|76.8%|
|4000|65.3%|59.2%|56.0%|62.7%|
|6000|54.3%|48.6%|45.6%|52.6%|
|10000|38.6%|32.8%|32.8%|39.4%|

#### MLQA-V1 Eng/Esp cross-lingual accuracies for the considered models

|# of samples|Pharia4608|GritLM|NV-Embed-v2|BGE-Gemma2|
|:---:|:---:|:---:|:---:|:---:|
|1000|87.5%|82.0%|81.5%|87.0%|
|2000|78.5%|73.9%|70.7%|77.0%|
|4000|65.5%|59.3%|56.9%|64.2%|
|6000|55.3%|49.2%|46.2%|53.4%|
|10000|41.7%|35.5%|33.2%|40.0%|

#### CLSD-WMT19 Ger/Fra (2900 samples) cross-lingual evaluation for the considered models

|Model Name|Accuracy|
|:---:|:---:|
|Pharia-1-Embedding-4608-control|95.1%|
|GritLM-7B|94.2%|
|Nvidia-Embed-v2|93.4%|
|BGE-Gemma2|95.4%|

## Training Details

### Model architecture

|Parameter|Value|
|-------|-------|
|Number of layers|27|
|Number of attention heads|36|
|Head size|128|
|Number of key-value heads|4|
|Hidden dimension size|4608|
|MLP expansion factor|4|
|MLP type|Standard|
|Vocabulary size|128,000|
|Rotary base|1,000,000|
|Total parameter count|7,041,544,704|

### Training

Pharia-1-Embedding-4608-control is an adapter on top of Pharia-1-LLM-7B-control, trained with a context window
of 2048 tokens. Pharia-1-Embedding-4608-control was trained with representational instruction-tuning (inspired by the
approach of GritLM) and a contrastive learning approach. The final layer is an embedding head with weighted mean pooling.
The training set consisted of a blend of open-source and proprietary datasets. Further postprocessing was used to optimize
for downstream use and multilinguality.

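The released checkpoint defines its own pooling head; purely as an illustration of what a weighted mean pooling layer typically looks like (an assumption in the style popularized by SGPT, not the exact implementation), here is a minimal position-weighted variant.

```python
# Hedged sketch: position-weighted mean pooling over final hidden states. Later tokens
# receive linearly increasing weights before averaging; padding positions are masked out.
import torch


def weighted_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
    weights = torch.arange(1, hidden_states.size(1) + 1, device=hidden_states.device)
    weights = weights.unsqueeze(0) * attention_mask             # zero out padded positions
    weights = weights / weights.sum(dim=1, keepdim=True)        # normalize per sequence
    return (hidden_states * weights.unsqueeze(-1)).sum(dim=1)   # (batch, hidden_dim)


# Example: pool a batch of 2 sequences of length 4 into 2 embeddings of size 8.
pooled = weighted_mean_pool(torch.randn(2, 4, 8), torch.ones(2, 4))
print(pooled.shape)  # torch.Size([2, 8])
```
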
### Tokenization

Our tokenizer has a vocabulary size of 128,000 and was trained via the Unigram algorithm, using the implementation provided by the SentencePiece library.
The tokenizer training set was a small subset of our high-quality data. After the training procedure, we performed some additional cleaning steps:
* Split whole-number tokens (e.g. 12345) into individual digit tokens
* Remove double spaces: remove tokens that contain "  " (two consecutive spaces)
* Remove tokens that contain a zero-width space (except the zero-width space token itself)
* Remove tokens with more than 3 repeated characters in a substring: bananaaaa, caaaar
* Remove any token that contains "\n" and is not exactly "\n" or "\r"

### Tokenizer fertility

Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer’s ability to
represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
The Pharia-1-LLM-7B model’s tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
therefore more cost-efficient at inference time.

|Tokenizer fertility|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
|--|--|--|--|
|de|2.011|2.546|2.241|
|fr|1.896|2.105|1.836|
|es|1.673|2.030|1.749|
|en|1.633|1.681|1.410|
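
As a minimal sketch of how this metric can be computed (the tokenizer loading and the whitespace-based word count below are simplifying assumptions, not the exact methodology behind the table above):

```python
# Minimal sketch: tokenizer fertility = number of tokens / number of words.
# Assumptions: the tokenizer is loadable via Hugging Face transformers (the repository
# may be gated or require extra arguments), and words are approximated by whitespace
# splitting.
from transformers import AutoTokenizer


def tokenizer_fertility(text: str, tokenizer) -> float:
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_words = len(text.split())
    return n_tokens / n_words


# Hypothetical usage with the Pharia-1 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")
sample = "Der schnelle braune Fuchs springt über den faulen Hund."
print(f"Fertility: {tokenizer_fertility(sample, tokenizer):.3f}")
```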