Align tokenizer with mistral-common

#225
by Rocketknight1 HF staff - opened
No description provided.

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encode = mistral_tok.encode_chat_completion(
  ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))

Hey @Rocketknight1
Thank you for the quick fix!
It passes our assertion tests, excluding the system prompt, which was a problem before the issue anyway.
The mistral common library adds the system prompt with /n/n at the start of the first user message if it is provided.
Is there a reason why you are not adding it to the chat template, in the same manner?

Hi @ChristianPalaArtificialy , no reason! We didn't know that was the preferred behaviour, because mistral-common didn't exist when those templates were written. If you'd like, I can amend the templates to do that.

To be clear, this means that you want something like this, correct? (obviously without destructively editing the input chat)

if messages[0]['role'] == 'system':
    messages[1]['content'] = messages[0]['content'] + '\n\n' + messages[1]['content']
    messages = messages[1:]
for message in messages:
    # Render the messages as normal here

Also, is this the behaviour you want for models using both the v3 and v1 tokenizers?

Hi @Rocketknight1 ,
True, we also speculated about how the system message was added during fine-tuning before the mistral-common library was published.
There is no change in that part of the code between v3 and v1, so yes that would be our preferred behavior!

@ChristianPalaArtificialy updated the template for the v1 models! I'll work on the v3 models in a sec. Please let me know if it's passing all your tests!

v3 models also updated

Hey @Rocketknight1
/n/n is encoded as a hexadecimal <0x0A><0x0A> by the mistral-common library, so I'm getting failures on the text comparisons between the two methods now (but that's just an artifact of how the mistral-common library returns the text).

The assertions on the token comparisons are all good on our end.

All PRs should be open now - I think everything's ready to merge, but let me know if there's anything else you need to do first.

patrickvonplaten changed pull request status to merged

Sign up or log in to comment