apply_chat_template method not working correctly for Llama 3 tokenizer
I noticed that apply_chat_template on the PreTrainedTokenizerBase class does not work correctly when return_assistant_tokens_mask=True. The expected return value includes a 0/1 mask for each example, where 1 marks tokens that belong to an assistant message and 0 marks everything else. This is what the Llama 2 tokenizer produces, for example. I am sharing a minimal example to reproduce the issue.
Looking deeper into the apply_chat_template method, the issue seems to be related to the char_to_token method of the tokenizers.Encoding class, and could be related to the fact that the Llama 3 tokenizer was trained with tiktoken as opposed to sentencepiece.
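For context, char_to_token on a fast tokenizer maps a character position in the input string to the index of the token covering that character. Here is a minimal sketch of the behaviour I would expect, using gpt2 purely as a stand-in fast tokenizer (not the model from this report):

from transformers import AutoTokenizer

# Any fast tokenizer is expected to behave like this; "gpt2" is used purely for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
enc = tok("Hello world")

# BatchEncoding.char_to_token(batch_index, char_index) -> index of the token covering that character
print(enc.char_to_token(0, 6))   # token covering the "w" in "world"
# Equivalent lookup on the underlying tokenizers.Encoding object
print(enc[0].char_to_token(6))

The minimal reproduction of the actual problem follows.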
from transformers import AutoTokenizer
from datasets import load_dataset
dataset_name = "m-a-p/Code-Feedback"
model_name = "meta-llama/Meta-Llama-3.1-8B" # apply_chat_template does not work correctly
#model_name = "meta-llama/Llama-2-7b-hf" # apply_chat_template works correctly
chat_template = """{% if messages[0]['role'] == 'system' %}
{% set offset = 1 %}
{% else %}
{% set offset = 0 %}
{% endif %}
{% for message in messages %}
{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif %}
{{ '### ' + message['role'] + ':\n'}}
{% if (message['role'] == 'assistant') %}
{% generation %} {{ message['content'] | trim + eos_token }} {% endgeneration %}
{% else %}
{{ message['content'] | trim + eos_token }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ '### ' + 'assistant' + ':\n' }}
{% endif %}"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = chat_template
datasets = load_dataset(dataset_name, trust_remote_code=True)
# assistant_masks is all zeros for the Llama 3 tokenizer
chat = tokenizer.apply_chat_template(
    datasets["train"][0]["messages"],
    add_generation_prompt=False,
    return_dict=True,
    tokenize=True,
    return_assistant_tokens_mask=True,
)
print("assistant_masks", chat["assistant_masks"])
If we assume that the tokenized chat is 10 tokens long and that the assistant tokens occupy positions 4-6 and 8-9 (1-indexed), the expected output would look like this:
[0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
The actual output for the Llama 3 tokenizer looks like this:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
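To make the failure easier to see, one can decode only the tokens flagged by the mask. This is just a quick check on top of the snippet above (it assumes the chat and tokenizer variables from that snippet), not part of the library:

# Decode only the tokens flagged as assistant tokens.
# For the Llama 3 tokenizer this prints an empty string, since the mask is all zeros;
# for the Llama 2 tokenizer it should print the assistant messages.
assistant_token_ids = [
    token_id
    for token_id, mask in zip(chat["input_ids"], chat["assistant_masks"])
    if mask == 1
]
print(tokenizer.decode(assistant_token_ids))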
# Executing the steps that apply_chat_template uses to build the assistant mask
# shows that the char_to_token method of the tokenizers.Encoding class does not seem to work correctly.
# One possible reason could be that the Llama 3 tokenizer was trained with tiktoken instead of sentencepiece.
compiled_template = tokenizer._compile_jinja_template(chat_template)
template_kwargs = {**tokenizer.special_tokens_map}
rendered_chat, generation_indices = tokenizer._render_with_assistant_indices(
    compiled_template=compiled_template,
    messages=datasets["train"][0]["messages"],
    tools=[],
    documents=None,
    add_generation_prompt=False,
    **template_kwargs,
)
out = tokenizer(
    rendered_chat,
    padding=False,
    truncation=False,
    max_length=None,
    add_special_tokens=False,
    return_tensors=None,
)
first_assistant_start_char, first_assistant_end_char = generation_indices[0]
# returns None for the Llama 3 tokenizer
# (same BatchEncoding.char_to_token lookup that apply_chat_template performs internally)
print("char_to_token", out.char_to_token(0, first_assistant_start_char))
Expected output of out.char_to_token(0, first_assistant_start_char):
the index of the token that encodes the character at position first_assistant_start_char in the rendered string
Actual output: None
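As a rough cross-check (not part of the library, just a sketch), the same character-to-token lookup can be done by hand with return_offsets_mapping. The char_to_token_via_offsets helper below is hypothetical and only meant to show which token index char_to_token would be expected to return for the same character position:

# Manual char -> token lookup via offset mappings, as a sanity check.
out_with_offsets = tokenizer(
    rendered_chat,
    add_special_tokens=False,
    return_offsets_mapping=True,
)

def char_to_token_via_offsets(offsets, char_index):
    # Return the index of the token whose character span covers char_index.
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

print(char_to_token_via_offsets(out_with_offsets["offset_mapping"], first_assistant_start_char))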
same issue!
same issue!
fyi, I opened an issue here: https://github.com/huggingface/transformers/issues/33091