Different size between tokenizer vocab and embedding

#1
by demharters - opened

There seems to be a discrepancy between vocab length and embedding size. Any ideas why?

```
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = 'qilowoq/AbLang_heavy'
tokenizer = AutoTokenizer.from_pretrained(model_name, revision='c451857')
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True, revision='c451857')

embedding_size = model.roberta.embeddings.word_embeddings.weight.size(0)
print(f"Embedding size: {embedding_size}")

vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")
```

"Embedding size: 24
Vocabulary size: 25"

Yes. That was intentional.

The tokenizer needs an [UNK] token, but the original model had no such token, so [UNK] was added as the 25th token. It does not affect the model unless an unknown amino acid appears in the sequence.
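
For anyone who wants to verify this locally, here is a minimal sketch using the standard Transformers tokenizer API; the expected id of 24 is an assumption based on [UNK] being appended after the original 24 tokens:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy', revision='c451857')

# The original vocabulary occupies ids 0-23; [UNK] should sit just past it.
unk_id = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
print(f"[UNK] token id: {unk_id}")  # expected: 24, i.e. the 25th token
```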

It's just that I got an error when starting fine-tuning due to the discrepancy. Thanks for clarifying.
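
In case it helps others who hit the same fine-tuning error, a common workaround is to grow the embedding matrix to match the tokenizer before training. This is only a sketch, assuming the remote-code model supports the standard `resize_token_embeddings` method from `PreTrainedModel`:

```
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = 'qilowoq/AbLang_heavy'
tokenizer = AutoTokenizer.from_pretrained(model_name, revision='c451857')
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True, revision='c451857')

# Add a 25th embedding row so every id the tokenizer can emit, including
# [UNK], maps to an embedding. The new row is randomly initialized and is
# only ever used when an unknown amino acid appears in a sequence.
model.resize_token_embeddings(len(tokenizer))

print(model.roberta.embeddings.word_embeddings.weight.size(0))  # now matches len(tokenizer)
```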

qilowoq changed discussion status to closed
