model._vocab_size vs len(tokenizer) Mismatch

#2
by delinqu - opened

Thank you to your team for the amazing open-source release!
I’ve encountered a situation where model._vocab_size and len(tokenizer) are inconsistent: specifically, model._vocab_size is larger than len(tokenizer) by 63 tokens. I’m wondering whether these 63 extra tokens (which appear in model._vocab_size but not in len(tokenizer)) are involved in training, and if so, what their purpose is. Are they related to special tokens, or reserved for future extensions?
I would appreciate any clarification on this discrepancy!
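For reference, here is a minimal sketch of the comparison. It assumes a transformers PaliGemma-style checkpoint; the repo id below is a placeholder, so substitute the model you are actually loading.

```python
# Minimal sketch: compare the model's configured vocab size with the tokenizer length.
# The checkpoint id is a placeholder; substitute the repo you are loading.
from transformers import AutoConfig, AutoTokenizer

ckpt = "google/paligemma2-3b-pt-224"  # placeholder checkpoint id
config = AutoConfig.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

model_vocab = config.text_config.vocab_size  # should correspond to model._vocab_size
tok_vocab = len(tokenizer)                   # base vocab + added special tokens

print(model_vocab, tok_vocab, model_vocab - tok_vocab)
```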

Google org

cc @Molbap

Google org

Hi @delinqu ,

The discrepancy between model._vocab_size and len(tokenizer) is likely due to reserved tokens, such as special-purpose image tokens (num_image_tokens) or unused tokens set aside for future extensions.

Special tokens such as bos_token_id, eos_token_id, pad_token_id, and potentially image_token_index are counted in model._vocab_size but may not all be directly accessible via the tokenizer.

These extra tokens are likely included in the model's embedding table for consistency but are not actively used during standard text-based tokenization. Whether they are involved in training depends on the task: for example, if the model processes multimodal inputs, these tokens may play a role.
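If it helps, a rough way to look at both sides is sketched below: the number of rows in the input embedding table versus the entries the tokenizer exposes. The checkpoint id is again a placeholder, and loading the full model just to read the embedding shape is only for illustration.

```python
# Sketch: embedding-table rows vs. tokenizer entries; checkpoint id is a placeholder.
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration

ckpt = "google/paligemma2-3b-pt-224"  # placeholder checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = PaliGemmaForConditionalGeneration.from_pretrained(ckpt)

# Rows in the input embedding table vs. ids the tokenizer can actually produce.
print("embedding rows :", model.get_input_embeddings().weight.shape[0])
print("tokenizer size :", len(tokenizer))

# Special and added tokens that the tokenizer does expose.
print(tokenizer.special_tokens_map)
print(len(tokenizer.get_added_vocab()))
```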

For more details, please refer to this reference.

Thank you.

Thanks for your reply. However, in Gemma 2, bos_token_id, eos_token_id, pad_token_id, and image_token_index are 2, 1, 0, and 257152, respectively.
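For reference, a quick sketch to read those ids back from the tokenizer; the checkpoint id and the "<image>" token literal are assumptions on my side.

```python
# Quick check of the ids mentioned above; the checkpoint id is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/paligemma2-3b-pt-224")  # placeholder
print(tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id)  # 2, 1, 0
print(tokenizer.convert_tokens_to_ids("<image>"))  # 257152 (assuming "<image>" is the image token)
```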

delinqu changed discussion status to closed
