clean_up_tokenization_spaces=True causes formatting issues, why is it set?
Setting clean_up_tokenization_spaces=True in tokenizer_config.json causes odd whitespace in decoded output and makes the tokenizer's encode+decode round trip lossy. This is especially pronounced for code. Minimal repro:
from transformers import AutoTokenizer
t = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
s = "foo ?? bar"
ids = t.encode(s)
s_cleanup = t.decode(ids)
s_no_cleanup = t.decode(ids, clean_up_tokenization_spaces=False)
print(ids)
print(s)
print(s_cleanup)
print(s_no_cleanup)
Output:
[128000, 8134, 9602, 3703]
foo ?? bar
<|begin_of_text|>foo?? bar
<|begin_of_text|>foo ?? bar
Notice the missing space before ?? in the default (cleaned-up) decode.
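If I'm reading the transformers source right, the space is eaten by the clean_up_tokenization() helper that decode() applies when cleanup is enabled; among its rules it replaces " ?" with "?". A quick check with the same tokenizer as above:

# clean_up_tokenization() is the step decode() runs when
# clean_up_tokenization_spaces is enabled; its " ?" -> "?" rule removes the space.
print(t.clean_up_tokenization("foo ?? bar"))  # prints: foo?? bar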
FWIW, Llama2 had it set to False.
Meta's official repo (https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py) doesn't apply any special post-processing around tiktoken either, and for the text above it preserves the space. Why was this setting turned on for Llama3?
"bos_token": "<|begin_of_text|>",
"clean_up_tokenization_spaces": true,
"eos_token": "<|end_of_text|>",
"model_input_names": [
"input_ids",
"attention_mask"
],
"model_max_length": 1000000000000000019884624838656,
"tokenizer_class": "PreTrainedTokenizerFast"
}
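A possible workaround in the meantime, assuming (as I believe is the case) that kwargs passed to from_pretrained override the values from tokenizer_config.json:

from transformers import AutoTokenizer

# Pass the flag at load time so it overrides the value from tokenizer_config.json...
t = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    clean_up_tokenization_spaces=False,
)
# ...or flip the attribute on an already-loaded tokenizer.
t.clean_up_tokenization_spaces = False
assert t.decode(t.encode("foo ?? bar")) == "<|begin_of_text|>foo ?? bar"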
ISTA-DASLab/Meta-Llama-3.1-8B-AQLM-PV-1Bit-1x16-hf
When running the snippet below, the generated output is garbled:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16")
model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16", device_map="auto"
)
prompt = """How many helicopters can a human eat in one sitting? Reply as a thug."""
model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
input_length = model_inputs.input_ids.shape[1]  # prompt length in tokens, used to drop the prompt from the decoded output
generated_ids = model.generate(**model_inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
Flores Flores Flores Flores Flores Flores Flores Floresigneigneigneigneigneigneigneigneigneigneigneigne
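A sanity check that might help isolate the problem, assuming the degenerate repetition could come from the repo's default generation settings rather than the 1-bit quantization itself: force greedy decoding and compare.

# Assumption: overriding any default sampling with greedy decoding rules out a bad
# generation_config; if the output is still degenerate, the issue is more likely
# in the quantized weights or the prompt format.
generated_ids = model.generate(**model_inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])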