No [PREFIX] and [SUFFIX] in tokenizer vocab

#10
by Vokturz

Hi, I was trying to use the FIM feature with no success. After playing with the tokenizer MistralTokenizer.v3() I found that both the [PREFIX] and [SUFFIX] tokens point to <unk> (id 0):

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()
tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text
>>> '<s><unk>return▁a▁+▁b<unk>▁def▁f('

tokenizer.instruct_tokenizer.tokenizer.get_control_token('[INST]')
>>> 3
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[PREFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer.get_control_token('[SUFFIX]')
>>> 0
tokenizer.instruct_tokenizer.tokenizer._vocab[:5]
>>> ['<unk>', '<s>', '</s>', '[INST]', '[/INST]']

I found this test in the mistral/mistral-common repository:

from mistral_common.tokens.tokenizers.base import FIMRequest
from mistral_common_private.tokens.tokenizers.mistral import MistralTokenizer
tokenizer =  MistralTokenizer.v3()
tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert tokenized.text == "<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f("

There must exist a private tokenizer related to mistral_common_private 🤔. So the public tokenizer has no way to do FIM?

Mistral AI_ org

Great catch @Vokturz! We rushed that code from mistral/mistral-common a bit too much yesterday - it's indeed wrong!

The tokenizer will need to be updated as well - bear with me, should be done in 30min!

If you just process the generated text as shown here: https://huggingface.co/mistralai/Codestral-22B-v0.1#fill-in-the-middle-fim, it shouldn't have made a difference, but it's indeed better to have the correct tokens set for [SUFFIX] and [PREFIX].
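For reference, a minimal sketch of that post-processing step (the continuation token ids below are made up purely for illustration; in the real pipeline they come from the model, as in the model card example):

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()
request = FIMRequest(prompt="def f(", suffix="return a + b")
tokenized = tokenizer.encode_fim(request)

# Hypothetical continuation: stands in for whatever the model generates
# after the FIM prompt tokens.
generated = tokenized.tokens + [1049, 29493, 1055, 2262]

# The "middle" is just the decoded text of the tokens the model produced
# after the prompt, which is why the prompt's exact control-token ids don't
# show up in the final string.
middle = tokenizer.decode(generated[len(tokenized.tokens):])
print(middle)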

hey @patrickvonplaten

using the provided code:

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest
tokenizer = MistralTokenizer.v3()

print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).text)
print(tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b")).tokens)

prints

'<s><unk>return▁a▁+▁b<unk>▁def▁f('
[1, 0, 1575, 1032, 1416, 1055, 0, 1569, 1053, 29500]

By the looks of it, even the encoding is not setting the right tokens.
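Once the fixed tokenizer ships (in mistral_common and on the Hub), a quick sanity check along these lines should pass - this is just a sketch, reusing the calls from above: both control tokens should resolve to non-zero ids and the encoded FIM text should no longer contain <unk>.

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.tokenizers.base import FIMRequest

tokenizer = MistralTokenizer.v3()

# [PREFIX] and [SUFFIX] should no longer fall back to the <unk> id (0).
for tok in ("[PREFIX]", "[SUFFIX]"):
    print(tok, tokenizer.instruct_tokenizer.tokenizer.get_control_token(tok))

tokenized = tokenizer.encode_fim(FIMRequest(prompt="def f(", suffix="return a + b"))
assert "<unk>" not in tokenized.text
print(tokenized.text)  # expected: '<s>[SUFFIX]return▁a▁+▁b[PREFIX]▁def▁f('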

Mistral AI_ org

Even after the upload of the new tokenizer, is there any reason I am getting the following output when I download the latest HF commit?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
>>> tokenizer.convert_tokens_to_ids("[SUFFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[PREFIX]")
0
>>> tokenizer.convert_tokens_to_ids("[INST]")
3

because they seem to be using their own tokenizer format... tokenizer.model.v3 rather than the HF format's tokenizer.json, etc. Why? I dunno... seems strange, maybe to push people to use their code and become more dependent on Mistral...
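If anyone wants to check whether a given HF commit actually carries the tokens, one way (a sketch, assuming the repo files are in the current directory as above) is to look at the vocab directly instead of going through convert_tokens_to_ids:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
vocab = tokenizer.get_vocab()

# Tokens missing from the HF tokenizer files simply don't appear in the vocab,
# which is why convert_tokens_to_ids falls back to the unk id (0) for them.
for tok in ("[INST]", "[/INST]", "[PREFIX]", "[SUFFIX]"):
    print(tok, vocab.get(tok, "not in vocab"))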
