20B
"vocab-file": "./20B_checkpoints/20B_tokenizer.json",
Raw vocab size: 50277; self.padded_vocab_size = 50304
Training log: padded vocab (size: 50277) with 27 dummy tokens (new size: 50304)
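A minimal sketch of the padding rule, assuming GPT-NeoX's default of rounding the vocab up to a multiple of make-vocab-size-divisible-by (128) times the model-parallel size:

```python
def pad_vocab_size(vocab_size: int, divisible_by: int = 128, model_parallel_size: int = 1) -> int:
    """Round vocab_size up to the nearest multiple of divisible_by * model_parallel_size."""
    multiple = divisible_by * model_parallel_size
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(50277))  # 50304, i.e. 27 dummy tokens added
```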
Vocabulary
See convert_vocab_to_txt.py
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"} 中
# multiple symbols concatenated into a single token
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"} .*]{}
special_tokens
https://huggingface.co/EleutherAI/gpt-neox-20b/blob/main/special_tokens_map.json
{"bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>", "unk_token": "<|endoftext|>"}
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
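For a quick check, the tokenizer can be loaded from the Hugging Face hub and its special tokens inspected (a sketch; requires the transformers package and access to the model files):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
# expect {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
print(tok.special_tokens_map)
print(tok.eos_token_id)  # id of <|endoftext|>
```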
Chinese support
There is essentially no OOV.
gpt-neox was trained on an ~800 GB English dataset, so why does the vocabulary still cover Chinese? Because the tokenizer is a byte-level BPE: any string can be broken down into raw UTF-8 bytes, so characters never seen as whole tokens fall back to byte sequences instead of becoming OOV. A few single Chinese characters and the token ids they encode to:
丁 [3218, 212]
七 [3218, 214]
万 [3218, 218]
诀 [11894, 211]
证 [11894, 212]
Encoding length statistics: Counter({2: 4190, 3: 1295, 1: 285}); average encoding length: 2.1750433275563257
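A sketch of how such statistics can be computed over a set of common Chinese characters (the exact counts depend on the character sample; chars.txt here is a hypothetical file with one character per line):

```python
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# chars.txt: hypothetical list of common Chinese characters, one per line
with open("chars.txt", encoding="utf-8") as f:
    chars = [line.strip() for line in f if line.strip()]

lengths = [len(tok.encode(c)) for c in chars]
print("Encoding length statistics:", Counter(lengths))
print("Average encoding length:", sum(lengths) / len(lengths))
```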
Completeness
build tokenizer
merge
Example entry from the merges list in 20B_tokenizer.json:
"ard less",
i.e. the pair ("ard", "less") is merged into the single token "ardless".