peralp24 committed on
Commit
86ebf3b
1 Parent(s): 556651f

Update README.md

Files changed (1)
  1. README.md +2 -23
README.md CHANGED
@@ -340,26 +340,5 @@ for downstream use and multilinguality.
 
 ### Tokenization
 
- Our tokenizer has a vocabulary size of 128000 and was trained via the Unigram algorithm, using the implementation provided by the SentencePiece library.
- The tokenizer training set was a small subset of our high-quality data. After the training procedure, we performed some additional cleaning steps:
- * Split whole-number tokens (e.g. 12345) into individual digit tokens
- * Remove double spaces: drop any token that contains "  " (two consecutive spaces)
- * Remove tokens that contain a zero-width space (except the zero-width-space token itself)
- * Remove tokens with more than 3 repeated characters in a substring (e.g. bananaaaa, caaaar)
- * Remove any token that contains "\n" and is not exactly "\n" or "\r"
-
- ### Tokenizer fertility
-
- Tokenizer fertility is a metric used to evaluate tokenizer performance and measures a tokenizer's ability to
- represent text. It is calculated by dividing the number of tokens in a text (after tokenization) by the number of words in that
- same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
- than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
- The Pharia-1-LLM-7B model's tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
- therefore more cost-efficient at inference time.
-
- |Tokenizer fertility (by language)|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
- |--|--|--|--|
- |de|2.011|2.546|2.241|
- |fr|1.896|2.105|1.836|
- |es|1.673|2.030|1.749|
- |en|1.633|1.681|1.410|
+ Tokenization in this embedding model fully reuses the tokenizer of the [Pharia-1-LLM-7B-control model](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
+
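As a quick illustration of the tokenizer-fertility metric described in the removed section (number of tokens after tokenization divided by the number of words in the same text), here is a minimal sketch. It assumes the tokenizer of the linked Pharia-1-LLM-7B-control model can be loaded with the Hugging Face `transformers` `AutoTokenizer` API; the model id comes from the link above, while the `fertility` helper, the sample sentence, and any loading options (e.g. `trust_remote_code`) are illustrative assumptions, not part of the commit.

```python
# Sketch only: tokenizer fertility = tokens produced / whitespace-separated words.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Return the number of tokens for `text` divided by its word count."""
    num_tokens = len(tokenizer.tokenize(text))
    num_words = len(text.split())
    return num_tokens / num_words

# Model id taken from the link in the updated README; loading details are an assumption.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")

# Illustrative German sample sentence.
sample_de = "Die Tokenisierung bestimmt, wie effizient ein Modell Text darstellt."
print(f"de fertility: {fertility(tokenizer, sample_de):.3f}")
```

A comparison like the one in the removed table would average this ratio over a large corpus per language rather than a single sentence.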