Space and Newline in same token

#13

by Kyriota - opened Dec 3, 2023

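from transformers import AutoTokenizer

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=False)
# Compare the token ids produced for a single space and a single newline
tokenizer.encode(' ') == tokenizer.encode('\n')
>>> True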

I've seen that space and newline are both listed in 'added_tokens.json';
their ids should be 32106 and 32103 respectively.

But in my code, both are encoded to the same token ids.

I'm wondering why.
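
To narrow this down, it may help to look at the actual ids and token strings behind each encoding instead of only comparing them for equality. Below is a minimal sketch; `compare_encodings` is a hypothetical helper written for this post (not part of transformers), and it only assumes the tokenizer object exposes `encode` and `convert_ids_to_tokens`, as Hugging Face tokenizers do.

```python
def compare_encodings(tokenizer, first, second):
    """Collect ids and token strings for two inputs and report whether the ids match.

    `tokenizer` is any object with `encode(text)` and `convert_ids_to_tokens(ids)`,
    e.g. the tokenizer returned by AutoTokenizer.from_pretrained(...).
    """
    report = {}
    for label, text in (("first", first), ("second", second)):
        ids = tokenizer.encode(text)
        report[label] = {
            "text": repr(text),  # repr makes ' ' and '\n' visible
            "ids": list(ids),
            "tokens": tokenizer.convert_ids_to_tokens(ids),
        }
    # True when both inputs were mapped to exactly the same id sequence
    report["same_ids"] = report["first"]["ids"] == report["second"]["ids"]
    return report
```

Calling `compare_encodings(tokenizer, ' ', '\n')` with the tokenizer loaded above and printing the result shows which ids each input maps to, which can then be checked against the 32106 and 32103 entries in 'added_tokens.json'.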
