---
license: gpl-3.0
datasets:
  - Mxode/IndustryCorpus-Subset-zh-en
---

# Bilingual Tokenizer

A portion of the IndustryCorpus-Subset-zh-en dataset was used for training.

This dataset consists of bilingual Chinese and English text.

10,000 samples were drawn from the portion not used in training to measure the compression rate of each tokenizer.

Compression rate formula:

$$
\text{Compression rate} = \frac{\text{length after tokenization}}{\text{character length of the original corpus}}
$$
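For reference, the compression rate can be reproduced with a short script along these lines. This is a minimal sketch: the repository id and sample texts below are placeholders for illustration, not the corpus used in the evaluation.

```python
from transformers import AutoTokenizer

def compression_rate(tokenizer, texts):
    # Total token count divided by total character count over the corpus.
    total_tokens = sum(
        len(tokenizer.encode(text, add_special_tokens=False)) for text in texts
    )
    total_chars = sum(len(text) for text in texts)
    return total_tokens / total_chars

if __name__ == "__main__":
    # Placeholder repository id and sample texts, for illustration only.
    tokenizer = AutoTokenizer.from_pretrained("Mxode/Bilingual-Tokenizer")
    samples = [
        "An English sentence about industrial data.",
        "一段关于工业数据的中文文本。",
    ]
    print(f"Compression rate: {compression_rate(tokenizer, samples):.2%}")
```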

Here are the test results (a lower compression rate means fewer tokens per character):

| Model | Tokenizer Size | Compression Rate |
| --- | ---: | ---: |
| deepseek-llm-7b-base | 100015 | 36.63% |
| deepseek-coder-33b-base | 32022 | 41.75% |
| gemma-2-27b | 256000 | 37.75% |
| glm-4-9b | 151343 | 34.26% |
| internlm2_5-7b-chat | 92550 | 35.15% |
| Llama-2-7b-hf | 32000 | 63.33% |
| Meta-Llama-3.1-8B | 128256 | 41.48% |
| Mistral-7B-Instruct-v0.3 | 32768 | 52.43% |
| Phi-3.5-mini-instruct | 32011 | 63.29% |
| Qwen2-7B-Instruct | 151646 | 35.91% |
| Yi-1.5-9B | 63992 | 36.86% |
| BilingualTokenizer-1K | 1000 | 75.61% |
| BilingualTokenizer-2K | 2000 | 62.26% |
| BilingualTokenizer-4K | 4000 | 52.81% |
| BilingualTokenizer-8K | 8000 | 45.92% |
| BilingualTokenizer-16K | 16000 | 40.94% |