Add <|im_start|> as a special token to tokenizer_config.json
#4 by bartowski · opened
This fixes tokenization of the im_start token
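For reference, a minimal sketch of the kind of `tokenizer_config.json` entry involved, assuming the token id 6 reported below; the exact fields in the actual diff may differ:

```json
{
  "added_tokens_decoder": {
    "6": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  }
}
```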
Hi @bartowski, thank you very much! Could we see how you are specifically using it (for example, the inference code)? That would help us accurately reproduce your issue. Thanks again!
Ah, interesting. I actually downloaded the model, and using AutoTokenizer from transformers it does tokenize correctly.
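A minimal sketch of that transformers-side check (the model id is a placeholder, not the real repository id):

```python
# Check how the tokenizer splits the ChatML marker.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/model")  # placeholder id
ids = tok.encode("<|im_start|>", add_special_tokens=False)
print(ids)                             # a single id (e.g. [6]) means one special token
print(tok.convert_ids_to_tokens(ids))
```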
However, with GGUF, this missing entry causes <|im_start|> to tokenize as:
```
59666 -> '<'
59705 -> '|'
  622 -> 'im'
59593 -> '_'
 5858 -> 'start'
46826 -> '|>'
```
which causes degraded generation. After this change, GGUF seems happy to tokenize <|im_start|> as token 6. Not sure why it breaks llama.cpp and not transformers, but there you have it! Up to you whether you want to include it :)
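A sketch of the same check on the GGUF side, using llama-cpp-python as a stand-in for llama.cpp's tokenizer ("model.gguf" is a placeholder path):

```python
from llama_cpp import Llama

# vocab_only loads just the tokenizer metadata, so no weights are needed.
llm = Llama(model_path="model.gguf", vocab_only=True)
ids = llm.tokenize("<|im_start|>".encode("utf-8"), add_bos=False, special=True)
print(ids)  # before the fix: the six piece ids listed above; after: [6]
```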
Thank you very much for your contribution, @bartowski.
haijian06 changed pull request status to merged