Update tokenization_chatglm.py

#3
by ksuriuri - opened

When running the following code:

from transformers import AutoTokenizer

# local path to the glm-4-9b-chat checkpoint
tokenizer = AutoTokenizer.from_pretrained("/home/oneway/ssd2t/model/ZhipuAI/glm-4-9b-chat", trust_remote_code=True)
new_str = tokenizer.decode(198)  # decode a single token id
print(new_str)

the following error is raised: TypeError: token should only be of type types or str

The cause is that the keys in the GLM-4 vocabulary are stored as bytes, and when a bytes object is iterated over inside transformers' _decode function, its elements come out as ints.
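A minimal, GLM-4-independent sketch of the Python behavior being described: iterating over a bytes object yields int values, which is how a vocabulary entry stored as bytes can reach the decode path as ints.

# iterating bytes yields ints, not length-1 bytes slices
token = b"\n"
for element in token:
    print(type(element), element)  # <class 'int'> 10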

Modifying the convert_tokens_to_string function in tokenization_chatglm.py as follows resolves the problem:

def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
    """
    Converts a sequence of tokens into a single string.
    """
    text = ""
    temp = b""
    for t in tokens:
        # ints produced by iterating a bytes token upstream; convert back to a character
        if isinstance(t, int):
            t = chr(t)
        if isinstance(t, str):
            if temp:
                # flush any pending byte run before appending plain text
                text += temp.decode("utf-8", errors="replace")
                temp = b""
            text += t
        elif isinstance(t, bytes):
            temp += t
        else:
            raise TypeError("token should only be of type int, bytes or str")
    if temp:
        text += temp.decode("utf-8", errors="replace")
    return text
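With the patch applied, the repro at the top of this discussion should run without the TypeError; as a quick sanity check (same local checkpoint path as above, adjust to your own setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/oneway/ssd2t/model/ZhipuAI/glm-4-9b-chat", trust_remote_code=True)
print(repr(tokenizer.decode(198)))  # decodes instead of raising TypeError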
zRzRzRzRzRzRzR changed pull request status to merged