Sources:
- https://github.com/THUDM/GLM/tree/main/chinese_sentencepiece
- https://huggingface.co/THUDM/glm-10b-chinese/
HF

```python
from transformers import AutoTokenizer

# trust_remote_code=True is required because the tokenizer class is
# defined in custom code (tokenization_glm.py) shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b-chinese", trust_remote_code=True)
```
Tokenizer

In `tokenizer_config.json`, the `auto_map` entry points `AutoTokenizer` to the custom tokenizer class:

```json
"auto_map": {
    "AutoTokenizer": [
        "tokenization_glm.GLMChineseTokenizer",
        null
    ]
}
```
The `GLMChineseTokenizer` class is implemented in:
- https://huggingface.co/THUDM/glm-10b-chinese/blob/main/tokenization_glm.py
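As a rough sketch of what `trust_remote_code=True` does with this `auto_map` entry (the real resolution logic lives in `transformers`' dynamic module loading; this is only a simplified, assumed illustration): the string `"tokenization_glm.GLMChineseTokenizer"` is split into a module filename and a class name, the module file is fetched from the repo, imported, and the named class is used as the tokenizer.

```python
import json

# Simplified illustration of how an `auto_map` entry in
# tokenizer_config.json is interpreted. The second slot of the
# AutoTokenizer pair is for a fast (Rust) tokenizer and is null here.
config = json.loads("""
{
    "auto_map": {
        "AutoTokenizer": [
            "tokenization_glm.GLMChineseTokenizer",
            null
        ]
    }
}
""")

slow_ref, fast_ref = config["auto_map"]["AutoTokenizer"]
module_name, class_name = slow_ref.rsplit(".", 1)

print(module_name)  # tokenization_glm  -> fetched as tokenization_glm.py from the repo
print(class_name)   # GLMChineseTokenizer -> the class AutoTokenizer instantiates
print(fast_ref)     # None -> no fast tokenizer is provided
```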
Vocabulary

The vocabulary comes from the `chinese_sentencepiece` directory linked under Sources above.