Add <|im_start|> as a special token to tokenizer_config.json
#4 by bartowski · opened
This fixes tokenization of the im_start token
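For reference, a minimal sketch of the kind of `tokenizer_config.json` entry involved, assuming the token id 6 reported below; the exact fields in the actual diff may differ:

```json
{
  "added_tokens_decoder": {
    "6": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  }
}
```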
Hi @bartowski, thank you very much! Could we see how you are specifically using it (for example, the inference code)? That would help us accurately reproduce your issue. Thanks again!
Ah, interesting. I actually downloaded the model, and using AutoTokenizer from transformers it does tokenize correctly.
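A minimal sketch of that transformers-side check (the model id is a placeholder, not the real repository id):

```python
# Check how the tokenizer splits the ChatML marker.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/model")  # placeholder id
ids = tok.encode("<|im_start|>", add_special_tokens=False)
print(ids)                             # a single id (e.g. [6]) means one special token
print(tok.convert_ids_to_tokens(ids))
```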
However, with GGUF, this missing entry causes <|im_start|> to tokenize as:
```
59666 -> '<'
59705 -> '|'
  622 -> 'im'
59593 -> '_'
 5858 -> 'start'
46826 -> '|>'
```
which causes degraded generation. After this change, GGUF seems happy to tokenize <|im_start|> as token 6. Not sure why it breaks llama.cpp and not transformers, but there you have it! Up to you whether you want to include it :)
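A sketch of the same check on the GGUF side, using llama-cpp-python as a stand-in for llama.cpp's tokenizer ("model.gguf" is a placeholder path):

```python
from llama_cpp import Llama

# vocab_only loads just the tokenizer metadata, so no weights are needed.
llm = Llama(model_path="model.gguf", vocab_only=True)
ids = llm.tokenize("<|im_start|>".encode("utf-8"), add_bos=False, special=True)
print(ids)  # before the fix: the six piece ids listed above; after: [6]
```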
Thank you very much for your contribution, @bartowski.
haijian06 changed pull request status to merged