## ss

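## Completeness

The following 256 byte tokens (`<0x00>` through `<0xFF>`) guarantee that the vocabulary is complete: any character without a token of its own can still be represented by its UTF-8 bytes.
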
```json
"vocab": {
  "<0x00>": 3,
  "<0x01>": 4,
  ...
  "<0xFE>": 257,
  "<0xFF>": 258,
```
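
The byte tokens in the excerpt can be read as a simple mapping: byte value `b` gets token `<0xbb>` and id `b + 3`. A minimal sketch of that fallback, assuming the offset of 3 shown in the excerpt holds for all 256 entries (the helper name `byte_fallback` is hypothetical, not part of any library):

```python
# Offset taken from the excerpt: "<0x00>" has id 3, "<0xFF>" has id 258.
BYTE_TOKEN_OFFSET = 3


def byte_fallback(char):
    """Return (token, id) pairs for the UTF-8 bytes of `char`.

    Hypothetical helper: it mimics what happens to a character that has
    no token of its own in the vocabulary.
    """
    return [(f"<0x{b:02X}>", b + BYTE_TOKEN_OFFSET) for b in char.encode("utf-8")]


pairs = byte_fallback("中")
print(pairs)  # [('<0xE4>', 231), ('<0xB8>', 187), ('<0xAD>', 176)]
```

Most common Chinese characters occupy 3 bytes in UTF-8, so each one that falls back costs 3 byte tokens instead of 1.
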

##

```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Prepend",
      "prepend": "▁"
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": "▁"
    }
  ]
},

"post_processor": {
  "type": "TemplateProcessing",
  "single": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    }
  ],
  "pair": [
    {
      "SpecialToken": {
        "id": "<s>",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "A",
        "type_id": 0
      }
    },
    {
      "Sequence": {
        "id": "B",
        "type_id": 0
      }
    }
  ],
  "special_tokens": {
    "<s>": {
      "id": "<s>",
      "ids": [
        1
      ],
      "tokens": [
        "<s>"
      ]
    }
  }
},
"decoder": {
  "type": "Sequence",
  "decoders": [
    {
      "type": "Replace",
      "pattern": {
        "String": "▁"
      },
      "content": " "
    },
    {
      "type": "ByteFallback"
    },
    {
      "type": "Fuse"
    },
    {
      "type": "Strip",
      "content": " ",
      "start": 1,
      "stop": 0
    }
  ]
},
```
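
To make the three stages in this config concrete, here is a pure-Python imitation of what they appear to do. It is a sketch under assumptions, not the `tokenizers` library itself: the function names `normalize`, `add_bos`, and `decode` are hypothetical, the handling of empty input and of invalid byte sequences is assumed, and the actual tokenization step between normalizer and post-processor is skipped.

```python
import re

BYTE_TOKEN = re.compile(r"^<0x([0-9A-F]{2})>$")
BOS_ID = 1  # id of "<s>" in the post_processor's special_tokens


def normalize(text):
    """Prepend "▁", then replace every space with "▁"."""
    if not text:  # assumption: empty input is left untouched
        return text
    return ("▁" + text).replace(" ", "▁")


def add_bos(ids_a, ids_b=None):
    """Apply the "single" or "pair" template: <s> A, or <s> A B."""
    return [BOS_ID] + list(ids_a) + (list(ids_b) if ids_b is not None else [])


def decode(tokens):
    """Replace "▁" with " ", merge byte tokens, fuse, strip one leading space."""
    pieces = []
    pending = bytearray()  # bytes collected from consecutive <0xNN> tokens
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            pending.append(int(match.group(1), 16))
            continue
        if pending:
            # assumption: invalid UTF-8 becomes replacement characters
            pieces.append(pending.decode("utf-8", errors="replace"))
            pending = bytearray()
        pieces.append(token.replace("▁", " "))
    if pending:
        pieces.append(pending.decode("utf-8", errors="replace"))
    fused = "".join(pieces)  # the "Fuse" step
    # "Strip" with start=1, stop=0: drop at most one leading space
    return fused[1:] if fused.startswith(" ") else fused


print(normalize("hello world"))
print(add_bos([10, 11]))
print(decode(["▁hello", "▁", "<0xE4>", "<0xB8>", "<0xAD>"]))
```

The leading space removed by the `Strip` step is the one introduced by the `Prepend` normalizer, which is why decoding round-trips plain text.
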
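
## issues

1. https://github.com/LianjiaTech/BELLE/issues/45

   The roughly 700 Chinese characters in the LLaMA vocabulary are only the explicitly supported ones; the number of Unicode Chinese characters it supports implicitly is far greater than 700.

   You can verify this with any BERT vocabulary. The annoying part is that each such character is then encoded into several byte-level tokens (4 or 5 by the commenter's count), so sequence length grows quickly. That is why the Chinese vocabulary extension from HIT (Harbin Institute of Technology) is the more dependable approach.

2. https://github.com/LianjiaTech/BELLE/issues/43

   Does using LLaMA on Chinese require any extra work on the vocabulary?

   It should. I measured the intersection between the LLaMA vocabulary and the 3,500 commonly used Chinese characters, and it contains only a little over 600. For extending the vocabulary, see https://github.com/ymcui/Chinese-LLaMA-Alpaca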
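
Both measurements discussed in these issues (vocabulary coverage and the length cost of byte fallback) can be sketched in a few lines. The example below uses a toy vocabulary with made-up ids. The function names are hypothetical, and the token estimate is a deliberate simplification: it assumes one token per in-vocabulary character and one token per UTF-8 byte otherwise, ignoring multi-character merges and the `▁` prefix.

```python
def vocab_coverage(vocab, chars):
    """Return the characters in `chars` that have their own token in `vocab`."""
    return {ch for ch in chars if ch in vocab}


def estimated_token_count(text, vocab):
    """Rough token count: 1 per in-vocab character, else one per UTF-8 byte."""
    return sum(1 if ch in vocab else len(ch.encode("utf-8")) for ch in text)


# Toy data for illustration only; a real check would load the "vocab"
# mapping from tokenizer.json and a list of common characters.
toy_vocab = {"<s>": 1, "中": 100, "文": 101}
common_chars = "中文词表"

covered = vocab_coverage(toy_vocab, common_chars)
print(f"{len(covered)} of {len(common_chars)} characters covered")
print(estimated_token_count("中文词表", toy_vocab))  # 1 + 1 + 3 + 3 = 8
```

Extending the vocabulary with dedicated tokens for frequent characters moves them from the 3-token path to the 1-token path, which is the motivation for projects such as Chinese-LLaMA-Alpaca.
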