mt5-base embedding size and tokenizer size don't match?
#2 opened by echau18
Hello! I'm trying to load mt5-base (encoder-only) with transformers, and I'm finding that the config and checkpoint have more input embeddings than there are items in the tokenizer's vocabulary. Specifically:
from transformers import MT5TokenizerFast, MT5Config, MT5EncoderModel
cfg = MT5Config.from_pretrained("google/mt5-base")
tok = MT5TokenizerFast.from_pretrained("google/mt5-base")
mdl = MT5EncoderModel.from_pretrained("google/mt5-base", config=cfg)
print(cfg.vocab_size == mdl.get_input_embeddings().num_embeddings)
print(cfg.vocab_size == len(tok))
print(cfg.vocab_size)
print(len(tok))
prints:
True
False
250112
250100
(this happens with both the fast and the slow tokenizers)
Is this expected? If so, what are the extra 12 tokens for?
Hey @echau18 - good question, and one that many people have had :-) See: https://github.com/huggingface/transformers/issues/4875
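Short version, as I understand it from that issue: the checkpoint's embedding matrix is padded beyond the tokenizer's vocabulary, and the tokenizer never emits ids above len(tok) - 1, so the extra rows simply go unused. If you want the two sizes to line up for your own bookkeeping, one option (just a sketch, not something the checkpoint requires) is to shrink the embedding matrix down to the tokenizer's length:

from transformers import MT5TokenizerFast, MT5EncoderModel

tok = MT5TokenizerFast.from_pretrained("google/mt5-base")
mdl = MT5EncoderModel.from_pretrained("google/mt5-base")

# Drop the trailing, never-used embedding rows so the model matches the tokenizer.
mdl.resize_token_embeddings(len(tok))
print(mdl.get_input_embeddings().num_embeddings == len(tok))  # True

Leaving the sizes as they are is also perfectly fine; nothing breaks as long as the input ids come from the tokenizer.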
Thanks @patrickvonplaten! Seems like this question resurfaces frequently. Any chance a note could be added directly to the model card as an FYI, rather than routing people through the transformers repo?