Tokenizer issues with "Ő" and "Ű" characters

#82
by LouiSeHU - opened

We're having a problem with Hungarian texts that contain the uppercase characters "Ő" and "Ű". The tokenizer appears to be the cause, as the following tests show. The lowercase "ő" and "ű" characters are not affected.

Test:

./llama-tokenize -m ../models/mistral-nemo-instruct-2407-q8_0.gguf -p "VEVŐ"

Results:

     1 -> '<s>'
 16578 -> 'VE'
  1086 -> 'V'
  1197 -> '?'
  1176 -> '?'

Test:

./llama-tokenize -m ../models/mistral-nemo-instruct-2407-q8_0.gguf -p "HŰSÉG"

Results:

     1 -> '<s>'
 29537 -> 'H'
   968 -> '?'
   947 -> '?'
 29503 -> 'S'
 29669 -> 'É'
 29545 -> 'G'
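
A note on the '?' entries: one possible reading, not confirmed anywhere in this thread, is that they are single-byte tokens. "Ő" and "Ű" each take two bytes in UTF-8, and one byte of such a pair is not valid text on its own, so a tool that prints tokens one at a time may have nothing to show but a placeholder. The sketch below uses only the Python standard library; the helper name `decode_bytes_separately` is made up for illustration.

```python
# Show how the affected letters are stored in UTF-8.
for ch in ("Ő", "Ű", "ő", "ű"):
    raw = ch.encode("utf-8")
    print(f"{ch} U+{ord(ch):04X} -> {[hex(b) for b in raw]}")


def decode_bytes_separately(ch: str) -> list[str]:
    """Decode each UTF-8 byte of `ch` on its own (hypothetical helper).

    Bytes that are not valid text by themselves come back as U+FFFD,
    the Unicode replacement character.
    """
    return [bytes([b]).decode("utf-8", errors="replace") for b in ch.encode("utf-8")]


# Each half of the pair is unprintable alone...
print(decode_bytes_separately("Ő"))
# ...but the two bytes together decode back to the original letter.
print(bytes([0xC5, 0x90]).decode("utf-8"))
```

If this reading is right, two '?' rows in a row would represent one letter split into its two bytes rather than a lost character.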

Hi, this looks like an issue with the quantized models or with the script you are using. When tested with prompts such as "repeat this: HŰSÉG", the model (and of course the tokenizer) is capable of both understanding and outputting it. Did you try with the HF tokenizer and our mistral-common tokenizer?
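
One way to compare tokenizers independently of how a tool prints individual tokens is a round-trip check: encode the text, decode the ids, and compare with the input. The sketch below is a minimal version of that idea. The function `round_trip_ok` and the toy byte-level codec are hypothetical and exist only so the example runs without downloading anything; the commented lines show how a real tokenizer could be plugged in, assuming `transformers` is installed and you have access to the model repository.

```python
from typing import Callable, List


def round_trip_ok(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    text: str,
) -> bool:
    """Return True if decoding the encoded ids reproduces `text` exactly."""
    return decode(encode(text)) == text


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer: one id per UTF-8 byte (illustration only).
    return list(text.encode("utf-8"))


def toy_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


for word in ("VEVŐ", "HŰSÉG"):
    print(word, toy_encode(word), round_trip_ok(toy_encode, toy_decode, word))

# With a real tokenizer (not run here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
# print(round_trip_ok(lambda s: tok.encode(s, add_special_tokens=False), tok.decode, "VEVŐ"))
```

A failed round trip would point at the tokenizer or its conversion. A successful one would suggest the '?' output is a display matter and the wrong answers have another cause. Some tokenizers normalize whitespace on decode, so a mismatch is worth inspecting before drawing conclusions.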

Thanks for the quick answer!

The prompt "repeat this: HŰSÉG" returns "HŰSÉG", so that works.

Here is a more detailed test that shows the difference. Note the word "VEVŐ" (buyer) in the prompts:

Prompt with "Ő" letter:

[INST] Here is the text:

\`\`\`
ELADÓ                                                         VEVŐ

John Doe                                                      Sample Company
Budapest                                                      London
Any street 112                                                Sesame street 8
1185                                                          123444
Magyarország                                                  England
\`\`\`

Who is the sender and the recipient?
[/INST]

Wrong result (Le Chat also answers incorrectly):

Based on the provided text, which appears to be a header of an email or letter, here are the sender and recipient details:

Sender (From):
- Name: John Doe
- Company: Sample Company
- Address: Any street 112, 1185 Budapest, Magyarország

Recipient (To):
- Name: ELADÓ VEVŐ
- Company: Sample Company
- Address: Sesame street 8, 123444 London, England

Le Chat response:

Based on the provided text, which appears to be a mock invoice or letterhead, here are the sender and recipient details:

Sender (From):

    Name: John Doe
    Company: Sample Company
    Address: Sesame street 8, 123444, London, England

Recipient (To):

    Name: ELADÓ
    Company: VEVŐ
    Address: Any street 112, 1185, Budapest, Magyarország (Hungary)

Same prompt with lowercase "ő" letter:

[INST] Here is the text:

\`\`\`
eladó                                                         vevő

John Doe                                                      Sample Company
Budapest                                                      London
Any street 112                                                Sesame street 8
1185                                                          123444
Magyarország                                                  England
\`\`\`

Who is the sender and the recipient?
[/INST]

Results:

Based on the provided text, which appears to be a header for a business letter or invoice, here are the sender and recipient details:

Sender (Eladó):
- Name: John Doe
- Address: Any street 112, 1185 Budapest, Magyarország

Recipient (Vevő):
- Name: Sample Company
- Address: Sesame street 8, 123444 London, England

Le Chat response:

Based on the text provided, here's who the sender and recipient are:

    Sender (eladó): John Doe
        Address: Any street 112, 1185 Budapest, Magyarország

    Recipient (vevő): Sample Company
        Address: Sesame street 8, 123444 London, England
LouiSeHU changed discussion status to closed

I've compared how Mistral Nemo and Mistral Small tokenize the text "HŰSÉG VEVŐ". Here are the results:

  • Mistral Nemo Instruct 2407
     1 -> '<s>'
  1072 -> 'H'
  1197 -> '?'
  1176 -> '?'
  1083 -> 'S'
  7904 -> 'É'
  1071 -> 'G'
 42981 -> ' VE'
  1086 -> 'V'
  1197 -> '?'
  1144 -> '?'
  • Mistral Small Instruct 2409
     1 -> '<s>'
 29537 -> 'H'
   968 -> '?'
   947 -> '?'
 29503 -> 'S'
 29669 -> 'É'
 29545 -> 'G'
  1318 -> ' V'
  8089 -> 'EV'
 31033 -> 'Ő'
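
The '?' ids in the two lists above follow a pattern worth checking. If each one is paired with the corresponding UTF-8 byte of "Ű" or "Ő", the difference between id and byte value is constant within each model. This is only an observation drawn from the numbers posted in this thread, not a documented property of either tokenizer, and the helper `byte_token_offsets` is a made-up name.

```python
def byte_token_offsets(ids: list[int], text: str) -> set[int]:
    """Pair each id with one UTF-8 byte of `text` and return the set of (id - byte)."""
    raw = text.encode("utf-8")
    if len(ids) != len(raw):
        raise ValueError("expected exactly one id per UTF-8 byte")
    return {token_id - byte for token_id, byte in zip(ids, raw)}


# '?' ids copied from the Mistral Nemo Instruct 2407 list above.
nemo = byte_token_offsets([1197, 1176], "Ű") | byte_token_offsets([1197, 1144], "Ő")

# '?' ids copied from the Mistral Small Instruct 2409 list above.
# Small has a dedicated token for "Ő" (31033), so only "Ű" falls back to bytes.
small = byte_token_offsets([968, 947], "Ű")

print(nemo)   # {1000}
print(small)  # {771}
```

A single offset per model is consistent with the '?' rows being raw byte tokens. On that reading, Nemo would be encoding both letters losslessly as byte pairs, while Small does the same for "Ű" and has a whole-character token for "Ő".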

The HF tokenizer playground results: https://huggingface.co/spaces/Xenova/the-tokenizer-playground

  • Mistral v3
repeat this: "HŰSÉG and VEVŐ"
<s> repeat this: "H��SÉG and VEVŐ"
LouiSeHU changed discussion status to open
Mistral AI_ org

@LouiSeHU I don't understand. How is it wrong?

Based on the provided text, which appears to be a mock invoice or letterhead, here are the sender and recipient details:

Sender (From):

    Name: John Doe
    Company: Sample Company
    Address: Sesame street 8, 123444, London, England

Recipient (To):

    Name: ELADÓ
    Company: VEVŐ
    Address: Any street 112, 1185, Budapest, Magyarország (Hungary)

Doesn't the model properly tokenize the special characters you mentioned?

For the playground you linked, I cannot see Mistral v3 Tekken among the options, but I will take a look and try to reproduce this via transformers and mistral-common to understand what is going on 👍

@pandora-s Please take a look at the difference in the example above. If "VEVŐ" (which means "buyer" in Hungarian) is in uppercase, the answer is wrong. I don't know whether the tokenizer causes this, but I also checked with a smaller Mistral model (7B v0.3). In that one, the tokenizer recognizes the "Ő" character, and the response is also correct.

  1318 -> ' V'
  8089 -> 'EV'
 31033 -> 'Ő'
Based on the provided text, it appears that the sender is "John Doe" from Budapest, Hungary (Magyarország), at Any street 112, postal code 1185. The recipient is "Sample Company" in London, England, at Sesame street 8, postal code 123444.

@pandora-s Were you able to reproduce the same result in your environment?

Can we expect a new version where these tokens also work properly like in the 7B and Small models?

@pandora-s
When you updated mistralai/Mistral-Nemo-Instruct-2407 2 weeks ago, it looks like you broke it. Can you please see about fixing it for everyone? Thanks
