Space and newline in same token

#13
by Kyriota - opened
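from transformers import AutoTokenizer

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False)
# Compare the token IDs produced for a single space and a single newline
print(tokenizer.encode(' ') == tokenizer.encode('\n'))  # prints True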

I can see both space and newline in 'added_tokens.json';
they should be 32106 and 32103 respectively.

But in my code, they are the same token.

I'm wondering why.
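
To narrow this down, it may help to look at the raw IDs and token strings instead of only the equality check. The sketch below is not taken from the model repo: `describe_encoding` and `compare_encodings` are hypothetical helper names, and it assumes only the standard `encode(..., add_special_tokens=False)` and `convert_ids_to_tokens` tokenizer methods.

```python
def describe_encoding(tokenizer, text):
    """Return (ids, tokens) for `text`, leaving out special tokens such as EOS."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(ids)
    return ids, tokens


def compare_encodings(tokenizer, a, b):
    """Encode two strings and report whether their ID lists are identical."""
    ids_a, tokens_a = describe_encoding(tokenizer, a)
    ids_b, tokens_b = describe_encoding(tokenizer, b)
    return {
        "a": (ids_a, tokens_a),
        "b": (ids_b, tokens_b),
        "same": ids_a == ids_b,
    }
```

With the tokenizer loaded as in the snippet above, `compare_encodings(tokenizer, ' ', '\n')` shows what each string actually maps to. If both ID lists come back empty, the original equality would be coming from the special tokens that `encode` appends by default rather than from a shared whitespace token. That is only a hypothesis to check, not a confirmed cause.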
