---
title: Tokenizer Arena
emoji: ⚡
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 3.41.2
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
## TODO
- Search bar
## Statistics
## Vocab size
- Increasing the vocab size raises the compression rate; the trade-off is higher compute and memory cost (see *Getting the most out of your tokenizer for pre-training and domain adaptation*). A quick comparison is sketched after the link below.
- Related Space: https://huggingface.co/spaces/yenniejun/tokenizers-languages
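As a rough illustration of the vocab-size effect, a minimal sketch (not the app's code; the two model names are only examples of a small- and a large-vocab tokenizer):

```python
from transformers import AutoTokenizer

text = (
    "Tokenizer efficiency depends heavily on vocabulary size: "
    "词表越大,同一段文本通常切出的 token 越少。"
)

# Example models only: gpt2 (~50k vocab) vs. Qwen1.5 (~152k vocab).
for name in ["gpt2", "Qwen/Qwen1.5-14B-Chat"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    # A larger vocabulary usually encodes the same text into fewer tokens.
    print(f"{name}: vocab_size={len(tok)}, tokens={len(tok.encode(text))}")
```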
## gradio app
- https://arena.lmsys.org/
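A stripped-down, hypothetical sketch of the arena-style idea (the real interface lives in app.py): load two tokenizers by name and compare their output on the same input.

```python
import gradio as gr
from transformers import AutoTokenizer

def tokenize(name: str, text: str) -> str:
    """Return the token count and token strings for one tokenizer."""
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    tokens = tok.tokenize(text)
    return f"{len(tokens)} tokens: {tokens}"

with gr.Blocks() as demo:
    text = gr.Textbox(label="Input text")
    with gr.Row():
        name_a = gr.Textbox(label="Tokenizer A", value="gpt2")
        name_b = gr.Textbox(label="Tokenizer B", value="bert-base-chinese")
    with gr.Row():
        out_a = gr.Textbox(label="Output A")
        out_b = gr.Textbox(label="Output B")
    btn = gr.Button("Compare")
    # Each click runs both tokenizers on the same input text.
    btn.click(tokenize, [name_a, text], out_a)
    btn.click(tokenize, [name_b, text], out_b)

if __name__ == "__main__":
    demo.launch()
```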
## lang
## number
## diff
## Compression Rate
**Overview**
We tokenize CC-100 text with each tokenizer and report the ratios below: g_bytes/b_tokens is gigabytes of raw text per billion tokens and t_bytes/t_tokens is terabytes per trillion tokens (both are roughly bytes per token, so higher means better compression); b_tokens/g_bytes is billion tokens per gigabyte (lower means better compression).
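A minimal sketch of how such ratios can be computed, assuming a local plain-text CC-100 sample at `cc100_sample.txt` (a hypothetical path) and reading g_bytes as 2^30 bytes and b_tokens as 10^9 tokens; the app's actual pipeline may differ:

```python
from transformers import AutoTokenizer

def compression_stats(tokenizer_name: str, path: str = "cc100_sample.txt") -> dict:
    tok = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    n_bytes = len(text.encode("utf-8"))   # raw UTF-8 bytes of the corpus sample
    n_tokens = len(tok.encode(text))      # tokens produced by this tokenizer
    return {
        "vocab_size": len(tok),
        # gigabytes of text per billion tokens: higher => better compression
        "g_bytes/b_tokens": (n_bytes / 2**30) / (n_tokens / 1e9),
        # billion tokens per gigabyte of text: lower => better compression
        "b_tokens/g_bytes": (n_tokens / 1e9) / (n_bytes / 2**30),
    }

print(compression_stats("gpt2"))
```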
| tokenizer | vocab_size | g_bytes/b_tokens | t_bytes/t_tokens | b_tokens/g_bytes |
|:----------------------------|-------------:|-------------------:|-------------------:|-------------------:|
| amber | 32000 | 1.84 | 1.8 | 0.54 |
| aya_101 | 250100 | 3.89 | 3.79 | 0.26 |
| baichuan | 64000 | 3.92 | 3.82 | 0.26 |
| baichuan2 | 125696 | 4.53 | 4.42 | 0.22 |
| bert_base_cased | 28996 | 2.73 | 2.66 | 0.37 |
| bert_base_chinese | 21128 | 2.74 | 2.67 | 0.37 |
| bert_base_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| bloom | 250680 | 4.28 | 4.18 | 0.23 |
| byt5_small | 256 | 0.93 | 0.91 | 1.08 |
| character_glm_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm2_6b | 64794 | 4.2 | 4.1 | 0.24 |
| chatglm3_6b | 64798 | 4.2 | 4.1 | 0.24 |
| chatglm_6b | 150344 | 4.65 | 4.54 | 0.22 |
| chatyuan_large_v2 | 32128 | 4.34 | 4.24 | 0.23 |
| chinese_llama | 49953 | 3.93 | 3.84 | 0.25 |
| chinese_llama2 | 55296 | 3.92 | 3.83 | 0.26 |
| code_davinci_002 | 50281 | 1.31 | 1.28 | 0.77 |
| crystal_coder | 32000 | 1.86 | 1.81 | 0.54 |
| deepseek_coder_33b_instruct | 32000 | 3.4 | 3.32 | 0.29 |
| deepseek_llm_7b_base | 100000 | 4.05 | 3.96 | 0.25 |
| falcon_180b | 65024 | 2.18 | 2.13 | 0.46 |
| falcon_7b | 65024 | 2.18 | 2.13 | 0.46 |
| fastchat_t5_3b | 32000 | 13.7 | 13.38 | 0.07 |
| flan_t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| gemma_7b | 256000 | 3.82 | 3.73 | 0.26 |
| gpt2 | 50257 | 1.31 | 1.28 | 0.77 |
| gpt2_chinese | 21128 | 2.73 | 2.66 | 0.37 |
| gpt_35_turbo | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_4 | 100277 | 2.26 | 2.21 | 0.44 |
| gpt_nexo_20b | 50254 | 2.01 | 1.96 | 0.5 |
| internlm2_chat_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm2_math_7b | 92544 | 4.23 | 4.13 | 0.24 |
| internlm_chat_7b | 103168 | 4.23 | 4.14 | 0.24 |
| internlm_xcomposer_7b | 103168 | 4.23 | 4.14 | 0.24 |
| kplug | 10261 | 2.72 | 2.65 | 0.37 |
| llama | 32000 | 1.84 | 1.8 | 0.54 |
| llama2 | 32000 | 1.84 | 1.8 | 0.54 |
| mistral_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mixtral_8_7b | 32000 | 2.36 | 2.3 | 0.42 |
| mobilebert_uncased | 30522 | 2.73 | 2.67 | 0.37 |
| moss | 106029 | 4.4 | 4.3 | 0.23 |
| mt5_large | 250100 | 3.89 | 3.79 | 0.26 |
| olmo_7b | 50280 | 2.01 | 1.96 | 0.5 |
| orion_14b_chat | 84608 | 4.63 | 4.52 | 0.22 |
| phi_1 | 50257 | 1.31 | 1.28 | 0.77 |
| phi_2 | 50257 | 1.31 | 1.28 | 0.77 |
| pko_t5_large | 50258 | 0.97 | 0.95 | 1.03 |
| prompt_clue | 32128 | 4.34 | 4.24 | 0.23 |
| qwen1_5_14b_chat | 151643 | 4.16 | 4.06 | 0.24 |
| qwen_1_8b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_72b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| qwen_7b_chat | 151851 | 4.16 | 4.06 | 0.24 |
| roberta_chinese_clue | 8021 | 2.7 | 2.64 | 0.37 |
| skywork_13b_base | 65519 | 3.69 | 3.61 | 0.27 |
| skywork_13b_math | 65519 | 3.69 | 3.61 | 0.27 |
| solar_10_7b | 32000 | 2.36 | 2.3 | 0.42 |
| starchat_alpha | 49152 | 2.78 | 2.72 | 0.36 |
| switch_c_2048 | 32100 | 14.13 | 13.8 | 0.07 |
| t5_base | 32100 | 14.13 | 13.8 | 0.07 |
| t5_large | 32100 | 14.13 | 13.8 | 0.07 |
| t5_small | 32100 | 14.13 | 13.8 | 0.07 |
| text_davinci_003 | 50281 | 1.31 | 1.28 | 0.77 |
| tigerbot_13b_chat_v2 | 60512 | 4.25 | 4.15 | 0.24 |
| tigerbot_70b_chat_v4_4k | 65107 | 4.25 | 4.15 | 0.24 |
| wizardcoder_15b_v1 | 49152 | 2.78 | 2.72 | 0.36 |
| wizardcoder_python_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardlm_7b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| wizardmath_70b_v1 | 32000 | 1.84 | 1.8 | 0.54 |
| xlm_roberta | 250002 | 3.96 | 3.86 | 0.25 |
| yi_34b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_6b | 64000 | 4.17 | 4.07 | 0.24 |
| yi_vl34b | 64000 | 4.11 | 4.02 | 0.24 |
| zephyr_7b_beta | 32000 | 2.36 | 2.3 | 0.42 |
**Conclusion**
Larger vocabulary sizes generally give higher compression rates (more bytes per token), at the cost of extra compute and memory.
## References
- Getting the most out of your tokenizer for pre-training and domain adaptation
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
- https://huggingface.co/spaces/Xenova/the-tokenizer-playground