Update README.md
README.md CHANGED
@@ -340,26 +340,5 @@ for downstream use and multilinguality.
 
 ### Tokenization
 
-
-
-* Split whole number tokens (e.g. 12345) into individual digit tokens
-* Remove double spaces: remove any token that contains a double space ("  ")
-* Remove tokens that contain a zero-width space (except the zero-width-space token itself)
-* Remove tokens with more than 3 repeated characters in a substring, e.g. bananaaaa, caaaar
-* Remove any token that contains "\n" and is not "\n" or "\r" itself
-
-### Tokenizer fertility
-
-Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer's ability to
-represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
-same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
-than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
-The Pharia-1-LLM-7B model's tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
-therefore more cost-efficient at inference time.
-
-|Tokenizer fertility (tokens per word)|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
-|--|--|--|--|
-|de|2.011|2.546|2.241|
-|fr|1.896|2.105|1.836|
-|es|1.673|2.030|1.749|
-|en|1.633|1.681|1.410|
+This embedding model uses the same tokenizer as the [Pharia-1-LLM-7B-control model](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
+
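
The removed list above describes filters applied to the tokenizer vocabulary. A minimal sketch of such filtering in Python, assuming tokens are plain strings; the function names and exact regex are illustrative, not the model's actual implementation:

```python
import re

ZWSP = "\u200b"  # zero-width space

def split_whole_numbers(token: str) -> list[str]:
    # Split whole-number tokens (e.g. "12345") into individual digit tokens.
    return list(token) if token.isdigit() and len(token) > 1 else [token]

def keep_token(token: str) -> bool:
    # Apply the removal rules from the list above.
    if "  " in token:                                # contains a double space
        return False
    if ZWSP in token and token != ZWSP:              # zero-width space, except itself
        return False
    if re.search(r"(.)\1{3,}", token):               # >3 repeats, e.g. "bananaaaa"
        return False
    if "\n" in token and token not in ("\n", "\r"):  # embedded newline
        return False
    return True

vocab = ["12345", "ban", "bananaaaa", "hello\nworld", "\n", "ok"]
filtered = [t for tok in vocab if keep_token(tok) for t in split_whole_numbers(tok)]
print(filtered)  # ['1', '2', '3', '4', '5', 'ban', '\n', 'ok']
```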
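Tokenizer fertility, as defined in the removed paragraph, is simply tokens divided by words. A sketch of the computation, assuming whitespace word splitting and any Hugging Face tokenizer object:

```python
def tokenizer_fertility(tokenizer, text: str) -> float:
    # Tokens produced per whitespace-separated word; lower is better,
    # since the same text is represented with fewer tokens.
    num_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    num_words = len(text.split())
    return num_tokens / num_words
```

For example, the German fertility of 2.011 in the removed table means the tokenizer emits about two tokens per German word on average.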
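Since the updated text delegates tokenization to Pharia-1-LLM-7B-control, the tokenizer can presumably be loaded straight from that repository. A sketch assuming the repo ships a standard Hugging Face tokenizer (check the model card in case trust_remote_code or gated access is required):

```python
from transformers import AutoTokenizer

# Assumption: the embedding model shares the tokenizer published with
# Pharia-1-LLM-7B-control, as the updated README states.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")

print(tokenizer.tokenize("Tokenizer fertility measures tokens per word."))
```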