clean_up_tokenization_spaces=True causes formatting issues, why is it set?
#44, opened by dzhulgakov
Setting clean_up_tokenization_spaces=True in tokenizer_config.json strips spaces before some punctuation on decode, which makes the tokenizer's encode+decode round trip lossy. This is especially noticeable for code. Minimal repro:
from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
s = "foo ?? bar"
ids = t.encode(s)
# No explicit argument: decode() uses the tokenizer's configured clean_up_tokenization_spaces
s_cleanup = t.decode(ids)
# Cleanup explicitly disabled, for comparison
s_no_cleanup = t.decode(ids, clean_up_tokenization_spaces=False)
print(ids)
print(s)
print(s_cleanup)
print(s_no_cleanup)
Output:
[128000, 8134, 9602, 3703]
foo ?? bar
<|begin_of_text|>foo?? bar
<|begin_of_text|>foo ?? bar
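Notice the missing space between "foo" and "??" in the first decoded output (the one with cleanup enabled); the decode with cleanup disabled preserves it.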
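The dropped space is consistent with a plain chain of string replacements applied after decoding. Below is a standalone sketch of that idea, not the library's code: the function name cleanup_sketch is hypothetical, and the rule list is my assumption about what the cleanup step does (based on my recollection of clean_up_tokenization in transformers), so please check the library source before relying on it.

```python
# Hypothetical stand-in for the post-decode cleanup step.
# ASSUMPTION: the rules below approximate what transformers applies when
# clean_up_tokenization_spaces=True; they are not copied from the library.
ASSUMED_RULES = [
    (" .", "."),
    (" ?", "?"),
    (" !", "!"),
    (" ,", ","),
    (" ' ", "'"),
    (" n't", "n't"),
    (" 'm", "'m"),
    (" 's", "'s"),
    (" 've", "'ve"),
    (" 're", "'re"),
]


def cleanup_sketch(text: str) -> str:
    """Apply each (pattern, replacement) pair in order with str.replace."""
    for pattern, replacement in ASSUMED_RULES:
        text = text.replace(pattern, replacement)
    return text


if __name__ == "__main__":
    # The space before the first "?" is removed, matching the repro above.
    print(cleanup_sketch("foo ?? bar"))
    # Code-like input is affected too: spaces before "." and "," disappear.
    print(cleanup_sketch("x = a ? b : c"))
```

If that assumption holds, any text that legitimately contains a space before ".", "?", "!" or "," cannot survive a round trip, which would explain why code is hit hardest.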
FWIW, Llama2 had it as False.
Meta's official repo (https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) doesn't add any special handling around TikToken either, and for the text above it preserves the space. Why was this setting turned on for Llama3?