|
|
|
|
|
```

padded vocab (size: 54634) with 22 dummy tokens (new size: 54656)

Vocab size: 54634

```

Training data
|
|
|
|
|
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py |
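This is the transformers implementation of the Japanese GPT-NeoX tokenizer. A minimal sketch of loading it, assuming the abeja/gpt-neox-japanese-2.7b checkpoint (the checkpoint name is my assumption, not from these notes):

```python
# Hedged sketch: load the Japanese GPT-NeoX tokenizer via transformers.
# "abeja/gpt-neox-japanese-2.7b" is an assumed checkpoint name.
from transformers import GPTNeoXJapaneseTokenizer

tokenizer = GPTNeoXJapaneseTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
print("Vocab size:", tokenizer.vocab_size)
```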
|
|
|
|
|
## 20B |
|
|
|
[configs/20B.yml](https://github.com/EleutherAI/gpt-neox/blob/main/configs/20B.yml#L7) |
|
``` |
|
"vocab-file": "./20B_checkpoints/20B_tokenizer.json", |
|
``` |
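20B_tokenizer.json is a standard HuggingFace `tokenizers` JSON file, so it can be inspected directly; a small sketch (assuming the 20B checkpoint has been downloaded to ./20B_checkpoints):

```python
# Sketch: inspect the 20B tokenizer file with the HF tokenizers library.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./20B_checkpoints/20B_tokenizer.json")
print("Vocab size:", tok.get_vocab_size())  # expected: 50277
```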
|
|
|
Vocab size: 50277 |
|
self.padded_vocab_size = 50304 |
|
|
|
|
|
padded vocab (size: 50277) with 27 dummy tokens (new size: 50304) |
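Both padded sizes above (50277 -> 50304, 54634 -> 54656) are consistent with rounding the raw vocab size up to the next multiple of 128, gpt-neox's make-vocab-size-divisible-by default (the actual code also multiplies in the model-parallel size; this sketch assumes a plain multiple of 128):

```python
def pad_vocab_size(orig_vocab_size: int, multiple: int = 128) -> int:
    """Round the vocab size up to the next multiple before building the
    embedding table (simplified sketch of gpt-neox's padding logic)."""
    padded = ((orig_vocab_size + multiple - 1) // multiple) * multiple
    print(f"padded vocab (size: {orig_vocab_size}) with "
          f"{padded - orig_vocab_size} dummy tokens (new size: {padded})")
    return padded

pad_vocab_size(50277)  # -> 50304 (27 dummy tokens)
pad_vocab_size(54634)  # -> 54656 (22 dummy tokens)
```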
|
|
|
## Vocabulary



See convert_vocab_to_txt.py
|
|
|
``` |
|
{"id": 13609, "token": "\u00e4\u00b8\u0143", "token_decode": "\u4e2d"} 中 |
|
|
|
# multiple symbols concatenated into a single token
|
{"id": 13663, "token": ".*]{}", "token_decode": ".*]{}"} .*]{} |
|
|
|
|
|
|
``` |
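The script itself is not reproduced here; a minimal sketch of what convert_vocab_to_txt.py presumably does (dump each byte-level token next to its decoded text), assuming the EleutherAI/gpt-neox-20b tokenizer:

```python
# Hedged sketch of convert_vocab_to_txt.py; not the original script.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

with open("vocab.txt", "w", encoding="utf-8") as f:
    for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
        # Byte-level BPE stores raw bytes as printable characters;
        # convert_tokens_to_string recovers the human-readable form.
        decoded = tokenizer.convert_tokens_to_string([token])
        record = {"id": idx, "token": token, "token_decode": decoded}
        f.write(json.dumps(record, ensure_ascii=True) + " " + decoded + "\n")
```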
|
|
|
|
|
## Chinese support



Essentially no OOV.



gpt-neox was trained on an ~800 GB English dataset (the Pile), so why does its vocabulary cover Chinese? Because the tokenizer is byte-level BPE: any UTF-8 string can be broken down into byte-level tokens, so no character is ever truly out of vocabulary.
|
|
|
``` |
|
丁 [3218, 212] |
|
七 [3218, 214] |
|
万 [3218, 218] |
|
诀 [11894, 211] |
|
证 [11894, 212] |
|
``` |
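These splits can be reproduced by encoding single characters with the 20B tokenizer (a sketch, assuming the EleutherAI/gpt-neox-20b tokenizer on the Hub):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
for ch in "丁七万诀证":
    # Each character's UTF-8 bytes map to 1-3 byte-level BPE tokens.
    print(ch, tok.encode(ch))
```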
|
|
|
|
|
Encoding length statistics: Counter({2: 4190, 3: 1295, 1: 285})

Average encoding length: 2.1750433275563257
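The counts sum to 5,770 characters, presumably a common-Chinese-character list; the statistics can be computed as in this sketch, where `chars` stands in for that list:

```python
from collections import Counter
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
chars = ["丁", "七", "万", "诀", "证"]  # placeholder; the real list had 5,770 characters

lengths = [len(tok.encode(ch)) for ch in chars]
print("Encoding length statistics:", Counter(lengths))
print("Average encoding length:", sum(lengths) / len(lengths))
```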
|
|
|
|
|
|
|
|
|
|
|
|
|