Tokenizer incorrectly removes newline character

#4
by hellodanylo - opened

Hi,

I am seeing strange behavior where the tokenizer removes newline characters for certain inputs.
Here is the smallest reproducible example:

from transformers import CodeGenTokenizer
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-multi")

# new line (10), space (32), space (32)
text = "\n  "
print([ord(c) for c in text])
# [10, 32, 32]

encoded = tokenizer.encode(text)
print(encoded)
# [50286]

decoded = tokenizer.decode(encoded)
print([ord(c) for c in decoded])
# actual: [32, 32]
# expected: [10, 32, 32]

Note that the decoded string is missing the leading newline character.
This affects downstream use, because the LM then completes the prompt without newline characters as well.
It also seems to affect other CodeGen model variants.
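I'm not sure at which step the newline gets dropped. One way to narrow it down might be to compare the raw token strings returned by encode with the decode output, and to check whether the fast tokenizer behaves the same way. A minimal sketch (the fast-tokenizer comparison is only a guess at a possible workaround, not a confirmed fix):

from transformers import AutoTokenizer, CodeGenTokenizer

text = "\n  "

# Slow tokenizer, as in the report above.
slow = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
ids = slow.encode(text)
print(slow.convert_ids_to_tokens(ids))       # inspect the raw token string(s) for id 50286
print([ord(c) for c in slow.decode(ids)])    # currently prints [32, 32]

# Fast tokenizer, for comparison; it may or may not reproduce the issue.
fast = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi", use_fast=True)
ids_fast = fast.encode(text)
print(fast.convert_ids_to_tokens(ids_fast))
print([ord(c) for c in fast.decode(ids_fast)])

If the token string from convert_ids_to_tokens still contains the newline, the loss would seem to happen during decoding rather than encoding.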
