---
title: Tokenizer Arena
emoji: 
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


## TODO

- Search bar



## Statistics


## vocabsize

- Increasing the vocabulary size improves the compression rate; the side effects are higher compute and memory cost (see "Getting the most out of your tokenizer for pre-training and domain adaptation"). A rough estimate of the embedding-parameter cost is sketched below.
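
As a rough illustration of the memory side of that trade-off, the sketch below estimates the extra embedding parameters a larger vocabulary costs. The hidden size of 4096 and the untied input/output embedding assumption are illustrative assumptions, not measurements from any model in this repo.

```python
# Sketch: extra parameters spent on embeddings as the vocabulary grows.
# Assumes a 7B-class hidden size of 4096 and untied input/output embeddings
# (both are illustrative assumptions).
hidden_size = 4096

def embedding_params(vocab_size: int, tied: bool = False) -> int:
    """Parameters in the token embedding plus the LM head (if untied)."""
    factor = 1 if tied else 2  # input embedding + output projection
    return factor * vocab_size * hidden_size

for vocab in (32_000, 64_000, 128_000, 256_000):
    print(f"vocab={vocab:>7,}: {embedding_params(vocab) / 1e6:,.0f}M embedding params")
```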


https://huggingface.co/spaces/yenniejun/tokenizers-languages


## gradio app

- https://arena.lmsys.org/
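
For orientation, here is a minimal sketch of an arena-style, side-by-side tokenizer comparison in Gradio. The tokenizer choices and the layout are illustrative assumptions, not the actual `app.py`.

```python
# Minimal sketch of a side-by-side tokenizer comparison with Gradio
# (tokenizer names and layout are illustrative, not the real app).
import gradio as gr
from transformers import AutoTokenizer

TOKENIZERS = {
    "gpt2": AutoTokenizer.from_pretrained("gpt2"),
    "bert-base-chinese": AutoTokenizer.from_pretrained("bert-base-chinese"),
}

def compare(text: str):
    """Tokenize the input with every tokenizer and return one row per tokenizer."""
    rows = []
    for name, tok in TOKENIZERS.items():
        tokens = tok.tokenize(text)
        rows.append([name, len(tokens), " ".join(tokens)])
    return rows

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(lines=3, label="Input text"),
    outputs=gr.Dataframe(headers=["tokenizer", "n_tokens", "tokens"]),
    title="Tokenizer comparison (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```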


## lang



## number



## diff






## Compress Rate

**Introduction**

We tokenize the cc-100 corpus with each tokenizer and report its compression rate. Reading the columns (roughly): `g_bytes/b_tokens` is gigabytes of raw text per billion tokens (i.e., bytes per token), `t_bytes/t_tokens` is terabytes per trillion tokens, and `b_tokens/g_bytes` is billion tokens per gigabyte (tokens per byte). Higher bytes per token means better compression.

| tokenizer                   |   vocab_size |   g_bytes/b_tokens |   t_bytes/t_tokens |   b_tokens/g_bytes |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| amber                       |        32000 |               1.84 |               1.8  |               0.54 |
| aya_101                     |       250100 |               3.89 |               3.79 |               0.26 |
| baichuan                    |        64000 |               3.92 |               3.82 |               0.26 |
| baichuan2                   |       125696 |               4.53 |               4.42 |               0.22 |
| bert_base_cased             |        28996 |               2.73 |               2.66 |               0.37 |
| bert_base_chinese           |        21128 |               2.74 |               2.67 |               0.37 |
| bert_base_uncased           |        30522 |               2.73 |               2.67 |               0.37 |
| bloom                       |       250680 |               4.28 |               4.18 |               0.23 |
| byt5_small                  |          256 |               0.93 |               0.91 |               1.08 |
| character_glm_6b            |        64794 |               4.2  |               4.1  |               0.24 |
| chatglm2_6b                 |        64794 |               4.2  |               4.1  |               0.24 |
| chatglm3_6b                 |        64798 |               4.2  |               4.1  |               0.24 |
| chatglm_6b                  |       150344 |               4.65 |               4.54 |               0.22 |
| chatyuan_large_v2           |        32128 |               4.34 |               4.24 |               0.23 |
| chinese_llama               |        49953 |               3.93 |               3.84 |               0.25 |
| chinese_llama2              |        55296 |               3.92 |               3.83 |               0.26 |
| code_davinci_002            |        50281 |               1.31 |               1.28 |               0.77 |
| crystal_coder               |        32000 |               1.86 |               1.81 |               0.54 |
| deepseek_coder_33b_instruct |        32000 |               3.4  |               3.32 |               0.29 |
| deepseek_llm_7b_base        |       100000 |               4.05 |               3.96 |               0.25 |
| falcon_180b                 |        65024 |               2.18 |               2.13 |               0.46 |
| falcon_7b                   |        65024 |               2.18 |               2.13 |               0.46 |
| fastchat_t5_3b              |        32000 |              13.7  |              13.38 |               0.07 |
| flan_t5_base                |        32100 |              14.13 |              13.8  |               0.07 |
| gemma_7b                    |       256000 |               3.82 |               3.73 |               0.26 |
| gpt2                        |        50257 |               1.31 |               1.28 |               0.77 |
| gpt2_chinese                |        21128 |               2.73 |               2.66 |               0.37 |
| gpt_35_turbo                |       100277 |               2.26 |               2.21 |               0.44 |
| gpt_4                       |       100277 |               2.26 |               2.21 |               0.44 |
| gpt_nexo_20b                |        50254 |               2.01 |               1.96 |               0.5  |
| internlm2_chat_7b           |        92544 |               4.23 |               4.13 |               0.24 |
| internlm2_math_7b           |        92544 |               4.23 |               4.13 |               0.24 |
| internlm_chat_7b            |       103168 |               4.23 |               4.14 |               0.24 |
| internlm_xcomposer_7b       |       103168 |               4.23 |               4.14 |               0.24 |
| kplug                       |        10261 |               2.72 |               2.65 |               0.37 |
| llama                       |        32000 |               1.84 |               1.8  |               0.54 |
| llama2                      |        32000 |               1.84 |               1.8  |               0.54 |
| mistral_7b                  |        32000 |               2.36 |               2.3  |               0.42 |
| mixtral_8_7b                |        32000 |               2.36 |               2.3  |               0.42 |
| mobilebert_uncased          |        30522 |               2.73 |               2.67 |               0.37 |
| moss                        |       106029 |               4.4  |               4.3  |               0.23 |
| mt5_large                   |       250100 |               3.89 |               3.79 |               0.26 |
| olmo_7b                     |        50280 |               2.01 |               1.96 |               0.5  |
| orion_14b_chat              |        84608 |               4.63 |               4.52 |               0.22 |
| phi_1                       |        50257 |               1.31 |               1.28 |               0.77 |
| phi_2                       |        50257 |               1.31 |               1.28 |               0.77 |
| pko_t5_large                |        50258 |               0.97 |               0.95 |               1.03 |
| prompt_clue                 |        32128 |               4.34 |               4.24 |               0.23 |
| qwen1_5_14b_chat            |       151643 |               4.16 |               4.06 |               0.24 |
| qwen_1_8b_chat              |       151851 |               4.16 |               4.06 |               0.24 |
| qwen_72b_chat               |       151851 |               4.16 |               4.06 |               0.24 |
| qwen_7b_chat                |       151851 |               4.16 |               4.06 |               0.24 |
| roberta_chinese_clue        |         8021 |               2.7  |               2.64 |               0.37 |
| skywork_13b_base            |        65519 |               3.69 |               3.61 |               0.27 |
| skywork_13b_math            |        65519 |               3.69 |               3.61 |               0.27 |
| solar_10_7b                 |        32000 |               2.36 |               2.3  |               0.42 |
| starchat_alpha              |        49152 |               2.78 |               2.72 |               0.36 |
| switch_c_2048               |        32100 |              14.13 |              13.8  |               0.07 |
| t5_base                     |        32100 |              14.13 |              13.8  |               0.07 |
| t5_large                    |        32100 |              14.13 |              13.8  |               0.07 |
| t5_small                    |        32100 |              14.13 |              13.8  |               0.07 |
| text_davinci_003            |        50281 |               1.31 |               1.28 |               0.77 |
| tigerbot_13b_chat_v2        |        60512 |               4.25 |               4.15 |               0.24 |
| tigerbot_70b_chat_v4_4k     |        65107 |               4.25 |               4.15 |               0.24 |
| wizardcoder_15b_v1          |        49152 |               2.78 |               2.72 |               0.36 |
| wizardcoder_python_7b_v1    |        32000 |               1.84 |               1.8  |               0.54 |
| wizardlm_7b_v1              |        32000 |               1.84 |               1.8  |               0.54 |
| wizardmath_70b_v1           |        32000 |               1.84 |               1.8  |               0.54 |
| xlm_roberta                 |       250002 |               3.96 |               3.86 |               0.25 |
| yi_34b                      |        64000 |               4.17 |               4.07 |               0.24 |
| yi_6b                       |        64000 |               4.17 |               4.07 |               0.24 |
| yi_vl34b                    |        64000 |               4.11 |               4.02 |               0.24 |
| zephyr_7b_beta              |        32000 |               2.36 |               2.3  |               0.42 |
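
A sketch of how such numbers can be computed. The tokenizer name and corpus file below are placeholders (the table above is built from cc-100), and the exact sampling and unit conventions may differ from the real measurement script.

```python
# Sketch: compression rate (bytes per token) of one tokenizer on a text corpus.
# "cc100_sample.txt" and the tokenizer name are placeholders.
from transformers import AutoTokenizer

def compression_stats(tokenizer_name: str, path: str) -> dict:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_bytes, n_tokens = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_bytes += len(line.encode("utf-8"))
            n_tokens += len(tok.encode(line, add_special_tokens=False))
    return {
        "g_bytes/b_tokens": n_bytes / n_tokens,  # bytes per token ~ GB per billion tokens
        "b_tokens/g_bytes": n_tokens / n_bytes,  # tokens per byte ~ billion tokens per GB
    }

print(compression_stats("gpt2", "cc100_sample.txt"))
```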


**Conclusion**

In this table, tokenizers with larger vocabularies generally achieve better compression (more bytes per token), e.g. baichuan2 (vocab 125,696, 4.53 g_bytes/b_tokens) vs. llama (vocab 32,000, 1.84). The trade-off, as noted above, is a larger embedding matrix and higher compute and memory cost.



## References

- Getting the most out of your tokenizer for pre-training and domain adaptation
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- https://huggingface.co/spaces/Xenova/the-tokenizer-playground