---
license: gpl-3.0
datasets:
- Mxode/IndustryCorpus-Subset-zh-en
---
# Bilingual Tokenizer
The tokenizer was trained on a portion of the IndustryCorpus-Subset-zh-en dataset, which consists of Chinese-English bilingual text.
To evaluate the tokenizer's compression rate, 10,000 samples were drawn from the held-out portion of the dataset that was not used for training.
Compression rate formula (lower is better, i.e. fewer tokens produced per character of input text):

$$\text{Compression Rate} = \frac{\text{total number of tokens}}{\text{total number of characters}}$$
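As a concrete illustration, here is a minimal sketch of how such a measurement could be run with Hugging Face `transformers` and `datasets`. The `Qwen/Qwen2-7B-Instruct` tokenizer stands in for any model in the table below, and the `train` split and `text` column names are assumptions about the dataset layout; this is not the exact evaluation script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Any tokenizer from the comparison table can be substituted here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# 10,000 held-out samples, as described above. The split and column
# names ("train", "text") are assumptions about the dataset layout.
dataset = load_dataset("Mxode/IndustryCorpus-Subset-zh-en", split="train")
samples = dataset.select(range(10_000))

total_tokens = 0
total_chars = 0
for record in samples:
    text = record["text"]
    total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
    total_chars += len(text)

# Lower is better: fewer tokens needed per character of input text.
print(f"Compression rate: {total_tokens / total_chars:.2%}")
```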
Here are the test results:
| Model | Vocab Size | Compression Rate |
|---|---|---|
| deepseek-llm-7b-base | 100015 | 36.63% |
| deepseek-coder-33b-base | 32022 | 41.75% |
| gemma-2-27b | 256000 | 37.75% |
| glm-4-9b | 151343 | 34.26% |
| internlm2_5-7b-chat | 92550 | 35.15% |
| Llama-2-7b-hf | 32000 | 63.33% |
| Meta-Llama-3.1-8B | 128256 | 41.48% |
| Mistral-7B-Instruct-v0.3 | 32768 | 52.43% |
| Phi-3.5-mini-instruct | 32011 | 63.29% |
| Qwen2-7B-Instruct | 151646 | 35.91% |
| Yi-1.5-9B | 63992 | 36.86% |
| BilingualTokenizer-1K | 1000 | 75.61% |
| BilingualTokenizer-2K | 2000 | 62.26% |
| BilingualTokenizer-4K | 4000 | 52.81% |
| BilingualTokenizer-8K | 8000 | 45.92% |
| BilingualTokenizer-16K | 16000 | 40.94% |