---
license: gpl-3.0
datasets:
- Mxode/IndustryCorpus-Subset-zh-en
---
# **Bilingual Tokenizer**
The tokenizers were trained on a portion of the [IndustryCorpus-Subset-zh-en](https://huggingface.co/datasets/Mxode/IndustryCorpus-Subset-zh-en) dataset, which consists of **Chinese and English bilingual text**.
10,000 samples were drawn from the held-out (untrained) portion to measure the compression rate of each tokenizer.
Compression rate formula:
$$
\text{Compression rate} = \frac{\text{length after tokenization}}{\text{character length of the original corpus}}
$$
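For illustration, below is a minimal sketch of how such a rate can be computed with the `transformers` library. It is not the exact benchmark script; the tokenizer name and the `texts` list are placeholders.

```python
# Minimal sketch of the compression-rate measurement described above.
# Assumptions: `transformers` is installed and `texts` is an iterable of the
# 10,000 held-out bilingual samples (placeholders, not the original script).
from transformers import AutoTokenizer

def compression_rate(tokenizer_name: str, texts: list[str]) -> float:
    """Return total tokenized length divided by total character length."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    total_tokens = 0
    total_chars = 0
    for text in texts:
        # add_special_tokens=False so BOS/EOS markers do not inflate the count
        total_tokens += len(tokenizer.encode(text, add_special_tokens=False))
        total_chars += len(text)
    return total_tokens / total_chars

# Example usage (hypothetical sample list):
# texts = [...]  # held-out samples from IndustryCorpus-Subset-zh-en
# print(f"{compression_rate('Qwen/Qwen2-7B-Instruct', texts):.2%}")
```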
Here are the test results (a lower compression rate means fewer tokens per character, i.e. better compression):
| Model | Vocab Size | Compression Rate |
| :----------------------------------------------------------: | :------------: | :--------------: |
| [deepseek-llm-7b-base](https://huggingface.co/deepseek-ai/deepseek-llm-7b-base) | 100015 | 36.63% |
| [deepseek-coder-33b-base](https://huggingface.co/deepseek-ai/deepseek-coder-33b-base) | 32022 | 41.75% |
| [gemma-2-27b](https://huggingface.co/google/gemma-2-27b) | 256000 | 37.75% |
| [glm-4-9b](https://huggingface.co/THUDM/glm-4-9b) | 151343 | 34.26% |
| [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | 92550 | 35.15% |
| [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | 32000 | 63.33% |
| [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | 128256 | 41.48% |
| [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 32768 | 52.43% |
| [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) | 32011 | 63.29% |
| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | 151646 | 35.91% |
| [Yi-1.5-9B](https://huggingface.co/01-ai/Yi-1.5-9B) | 63992 | 36.86% |
| BilingualTokenizer-1K | 1000 | 75.61% |
| BilingualTokenizer-2K | 2000 | 62.26% |
| BilingualTokenizer-4K | 4000 | 52.81% |
| BilingualTokenizer-8K | 8000 | 45.92% |
| BilingualTokenizer-16K | 16000 | 40.94% |
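
A hedged usage sketch follows; the repository id and subfolder name are assumed from this model card and the table above, and may need adjusting to the actual repository layout.

```python
# Usage sketch: load one of the tokenizer variants listed in the table.
# Assumptions: the repo id is "Mxode/Bilingual-Tokenizer" and each variant
# lives in a subfolder named like its table entry (e.g. "BilingualTokenizer-8K").
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Mxode/Bilingual-Tokenizer",        # assumed repo id
    subfolder="BilingualTokenizer-8K",  # assumed subfolder name
)

ids = tokenizer.encode("今天天气不错, the weather is nice today.", add_special_tokens=False)
print(len(ids), tokenizer.decode(ids))
```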