Why is "bos_token": null, in tokenizer_config.json?

#15
by 3Simplex - opened

Why is the "bos_token": null, in tokenizer_config.json?
I don't understand the reason for this line "bos_token": null, within tokenizer_config.json

Please help us understand your reasons. I would simply replace it with the BOS as expected but I do not want to assume this would be the correct choice.
@hzhwcmhf
@yangapku

I also saw a discrepancy between the bos_token and bos_token_id settings in the tokenizer_config.json and config.json files during continued pre-training, which led to an error. I resolved the issue by setting the bos_token in tokenizer_config.json like you described. I've also reported the issue in the discussion here.

Iโ€™m also interested in hearing the official response from the Qwen team. If this is a bug, it would be better to have it addressed.

I did a lot of research and tests today and nullis valid, depending on what you want to do with the model. For instance, if you quantize the model with llama.cpp, and it encounters null as eos_tokenor bos_token in tokenizer_config.json, it will automatically fall back to bos_token_id and eos_token_id in the config.json file.

The fall-back hierarchy seems to be thus: tokenizer_config.json > config.json > generation_config.json. At least for quantizing with llama.cpp.

Note, that based on https://github.com/huggingface/transformers/issues/25395#issuecomment-1671075332, it is known that "in the past, the model config held both model parameters (like number of layers) and generate parameterization (like forcing tokens at generate time). That is suboptimal, as you may e.g. wish to have several generation configurations for the same model.", unless this info is already out of date.

There is no bos_token for Qwen models. It is not necessary to prepend a control token to every input sequence. However, there are many frameworks assuming that there is a bos_token and they indeed prepend a control token to every input sequence. If that is the case, we recommend setting it to <|endoftext|> because most of time it takes no effect so it does less harm. However, if one is willing to investigate, it is better to check the data processing procedure to make sure no other assumptions are there and modify it so that a bos_token is not needed.

that's to say:

  • as a standalone model as well as a standalone tokenizer (tokenizer_config.json), the bos_token should be null or None, which is the original
  • as a part of a framework, as in transformers (generation_config.json or the legacy config.json) which requires the bos_token to function, the bos_token is recommended to set to <|endoftext|>; this is purely for compatibility

@tanliboy trl has supported models without a bos_token in this PR.

@ThiloteE there is a meta field called tokenizer.ggml.add_bos_token in the GGUF files, and when converting Qwen models, you should set it to false.

Thank you, @jklj077 ! It is great to know that this compatibility problem has been fixed in the recent release.

@jklj077 Thank you for the clarification. I suspected this might be the case.

For the record:

adding "add_bos_token": falseto the tokenizer_config.json sets tokenizer.ggml.add_bos_token to false during quantization to a GGUF file with the convert_hf_to_gguf.pyscript as provided by llama.cpp.

I have provided a GGUF with the corrected config at https://huggingface.co/GPT4All-Community/Qwen2-7B-Instruct-GGUF

Sign up or log in to comment