Update README.md
README.md CHANGED
@@ -340,26 +340,5 @@ for downstream use and multilinguality.
 
 ### Tokenization
 
-
-
-* Split whole number tokens (e.g. 12345) into individual digit tokens
-* Remove double spaces: remove any token that contains a double space ("  ")
-* Remove tokens that contain a zero-width space (except the zero-width-space token itself)
-* Remove tokens with more than 3 repeated characters in a substring, e.g. bananaaaa, caaaar
-* Remove any token that contains "\n" and is not "\n" or "\r" itself
-
-### Tokenizer fertility
-
-Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer's ability to
-represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
-same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
-than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
-The Pharia-1-LLM-7B model's tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
-therefore more cost-efficient at inference time.
-
-|Tokenizer fertility (tokens per word)|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
-|--|--|--|--|
-|de|2.011|2.546|2.241|
-|fr|1.896|2.105|1.836|
-|es|1.673|2.030|1.749|
-|en|1.633|1.681|1.410|
+This embedding model uses the same tokenizer as the [Pharia-1-LLM-7B-control model](https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control).
+
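
The removed list above describes filters applied to the tokenizer vocabulary. A minimal sketch of such filtering in Python, assuming tokens are plain strings; the function names and exact regex are illustrative, not the model's actual implementation:

```python
import re

ZWSP = "\u200b"  # zero-width space

def split_whole_numbers(token: str) -> list[str]:
    # Split whole-number tokens (e.g. "12345") into individual digit tokens.
    return list(token) if token.isdigit() and len(token) > 1 else [token]

def keep_token(token: str) -> bool:
    # Apply the removal rules from the list above.
    if "  " in token:                                # contains a double space
        return False
    if ZWSP in token and token != ZWSP:              # zero-width space, except itself
        return False
    if re.search(r"(.)\1{3,}", token):               # >3 repeats, e.g. "bananaaaa"
        return False
    if "\n" in token and token not in ("\n", "\r"):  # embedded newline
        return False
    return True

vocab = ["12345", "ban", "bananaaaa", "hello\nworld", "\n", "ok"]
filtered = [t for tok in vocab if keep_token(tok) for t in split_whole_numbers(tok)]
print(filtered)  # ['1', '2', '3', '4', '5', 'ban', '\n', 'ok']
```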
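Tokenizer fertility, as defined in the removed paragraph, is simply tokens divided by words. A sketch of the computation, assuming whitespace word splitting and any Hugging Face tokenizer object:

```python
def tokenizer_fertility(tokenizer, text: str) -> float:
    # Tokens produced per whitespace-separated word; lower is better,
    # since the same text is represented with fewer tokens.
    num_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    num_words = len(text.split())
    return num_tokens / num_words
```

For example, the German fertility of 2.011 in the removed table means the tokenizer emits about two tokens per German word on average.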
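Since the updated text delegates tokenization to Pharia-1-LLM-7B-control, the tokenizer can presumably be loaded straight from that repository. A sketch assuming the repo ships a standard Hugging Face tokenizer (check the model card in case trust_remote_code or gated access is required):

```python
from transformers import AutoTokenizer

# Assumption: the embedding model shares the tokenizer published with
# Pharia-1-LLM-7B-control, as the updated README states.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")

print(tokenizer.tokenize("Tokenizer fertility measures tokens per word."))
```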