`tokenizer_config.json` has a duplicate entry for `clean_up_tokenization_spaces`

#17
by polarathene - opened

tokenizer_config.json has a duplicate entry for clean_up_tokenization_spaces, the first occurrence at the end of the chat_template line:

  "chat_template": "{{bos_token}}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",  "clean_up_tokenization_spaces": true,
  "clean_up_tokenization_spaces": false,
  • 1st occurrence is true
  • 2nd occurrence is false

I'm not sure which value is intended here, but mistral.rs will refuse to load the model due to the duplicate clean_up_tokenization_spaces key. Other software that accepts it presumably either uses the 1st occurrence and ignores the 2nd, or treats the 2nd as an override.
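
For anyone who wants to check a local copy, here is a minimal sketch in Python of how the duplicate can be detected. The helper name `find_duplicate_keys` and the inline sample string are my own, for illustration only; this is not how mistral.rs or any other loader does it. Python's standard `json` module does not raise on repeated names and keeps only the last name-value pair, so it is one example of the "2nd as an override" behaviour.

```python
import json
from collections import Counter


def find_duplicate_keys(text):
    """Return key names that appear more than once within the same JSON object."""
    duplicates = []

    def record_pairs(pairs):
        # `pairs` is the ordered list of (key, value) tuples for one object,
        # before the parser collapses repeated keys into a dict.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_pairs)
    return duplicates


# Illustrative stand-in for the relevant part of tokenizer_config.json.
sample = '{"clean_up_tokenization_spaces": true, "clean_up_tokenization_spaces": false}'

print(find_duplicate_keys(sample))  # ['clean_up_tokenization_spaces']
print(json.loads(sample))           # {'clean_up_tokenization_spaces': False}
```

To check the real file, pass the contents of tokenizer_config.json to `find_duplicate_keys` in place of `sample`.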

Could you please correct this?

NousResearch org
•
edited Jun 5

thank you for pointing it out

the second occurrence of clean_up_tokenization_spaces has been removed
https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B/commit/c9005c2d51dc3e0ff3399c59951b2353767d1d15

Thanks for getting that sorted! ❤️

polarathene changed discussion status to closed
