Align tokenizer with mistral-common

#225
opened by Rocketknight1

This PR should align the Hugging Face tokenizer with the tokenization in mistral-common. You can test it with the following script:

from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

chat = [
    {"role": "system", "content": "You are a helpful bot"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Fine and you?"},
    {"role": "user", "content": "Fine thank you."},
]

mistral_tok = MistralTokenizer.v1()
hf_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", revision="pr/120")

hf_text = hf_tokenizer.apply_chat_template(chat, tokenize=False)
hf_tokens = hf_tokenizer.apply_chat_template(chat, tokenize=True)

mistral_encode = mistral_tok.encode_chat_completion(
    ChatCompletionRequest(messages=chat)
)
mistral_text = mistral_encode.text
mistral_tokens = mistral_encode.tokens

print(hf_tokens == mistral_tokens)
print(hf_text == mistral_text.replace("▁", " ").replace("<0x0A>", "\n"))

Hey @Rocketknight1
Thank you for the quick fix!
It passes our assertion tests, except for the system prompt, which was already a problem before this issue. If a system prompt is provided, the mistral-common library prepends it, followed by \n\n, to the start of the first user message.
Is there a reason you aren't adding it to the chat template in the same manner?

Hi @ChristianPalaArtificialy , no reason! We didn't know that was the preferred behaviour, because mistral-common didn't exist when those templates were written. If you'd like, I can amend the templates to do that.

To be clear, this means that you want something like this, correct? (obviously without destructively editing the input chat)

if messages[0]['role'] == 'system':
    messages[1]['content'] = messages[0]['content'] + '\n\n' + messages[1]['content']
    messages = messages[1:]
for message in messages:
    # Render the messages as normal here
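Outside of a Jinja template, the non-destructive version mentioned above could look like this. A minimal sketch; the helper name `merge_system_prompt` is ours, not part of either library:

```python
def merge_system_prompt(messages):
    """Fold a leading system message into the first user message,
    separated by \n\n, without mutating the input list."""
    if not messages or messages[0]["role"] != "system":
        return list(messages)
    # Copy each dict so the caller's chat is left untouched
    merged = [dict(m) for m in messages[1:]]
    merged[0]["content"] = messages[0]["content"] + "\n\n" + merged[0]["content"]
    return merged
```

In an actual chat template the same effect is achieved with Jinja variables rather than a Python helper, so the rendered output matches mistral-common's behavior without modifying the caller's message list.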

Also, is this the behaviour you want for models using both the v3 and v1 tokenizers?

Hi @Rocketknight1 ,
True, before the mistral-common library was published we also had to speculate about how the system message was added during fine-tuning.
That part of the code is identical between v1 and v3, so yes, that would be our preferred behavior!

@ChristianPalaArtificialy updated the template for the v1 models! I'll work on the v3 models in a sec. Please let me know if it's passing all your tests!

v3 models also updated

Hey @Rocketknight1
\n\n is rendered as the hex byte tokens <0x0A><0x0A> by the mistral-common library, so I'm now getting failures on the text comparisons between the two methods (but that's just an artifact of how mistral-common returns the text).

The assertions on the token comparisons are all good on our end.
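For reference, the text comparison in the test script above can be made robust to this artifact by normalizing mistral-common's rendered text before comparing. A minimal sketch; the helper name is ours, and it assumes the only SentencePiece artifacts present are the "▁" word marker and the "<0x0A>" newline byte token:

```python
def normalize_mistral_text(text):
    """Map SentencePiece surface forms back to plain text:
    '▁' is the word-boundary marker, '<0x0A>' is the newline byte."""
    return text.replace("▁", " ").replace("<0x0A>", "\n")
```

With this, `hf_text == normalize_mistral_text(mistral_text)` compares the two renderings even when the system prompt introduces a double newline.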

All PRs should be open now - I think everything's ready to merge, but let me know if there's anything else you need to do first.
