`tokenizer_config.json` has a duplicate entry for `clean_up_tokenization_spaces`

#17
by polarathene - opened

tokenizer_config.json has a duplicate entry for clean_up_tokenization_spaces. The first occurrence is at the end of the chat_template line, and the second is on the line after it:

  "chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",  "clean_up_tokenization_spaces": true,
  "clean_up_tokenization_spaces": false,
  • 1st occurrence is true
  • 2nd occurrence is false

I'm not sure which is the intended value here, but mistral.rs will refuse to load the model because of the duplicate clean_up_tokenization_spaces key. Other software that accepts the file presumably either uses the 1st occurrence and ignores the 2nd, or treats the 2nd as an override.
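
For anyone who wants to check a config for this before loading it, here is a minimal Python sketch using only the standard library. Python's built-in json module is one example of the "2nd as an override" behaviour: it silently keeps the last value for a repeated key. The helper name find_duplicate_keys is hypothetical, and the sample string is a trimmed stand-in for the real file, not its full contents.

```python
import json
from collections import Counter


def find_duplicate_keys(text):
    """Return keys that appear more than once within any single JSON object."""
    duplicates = []

    def hook(pairs):
        # pairs holds every (key, value) of one object, repeats included,
        # so duplicates can be counted before dict() collapses them.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(text, object_pairs_hook=hook)
    return duplicates


# Trimmed stand-in for the affected part of tokenizer_config.json.
sample = (
    '{"clean_up_tokenization_spaces": true,'
    ' "clean_up_tokenization_spaces": false}'
)

print(find_duplicate_keys(sample))  # ['clean_up_tokenization_spaces']
print(json.loads(sample))           # {'clean_up_tokenization_spaces': False}
```

To run it against a real file, read the file's text and pass it to find_duplicate_keys; an empty list means no object in the file repeats a key.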

Could you please correct this?

thank you for pointing it out

the second occurrence of clean_up_tokenization_spaces has been removed
https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B/commit/c9005c2d51dc3e0ff3399c59951b2353767d1d15

Thanks for getting that sorted! ❤️

polarathene changed discussion status to closed
