Tokenizer Details
Great work on this!
I was wondering, how did you get the tokenizer to have the same vocab size as QwQ 32B Preview? I would like to do this for some other models too!
If you have a script or just a set of steps to do this, I'd appreciate if you could share it :)
The tokenizer is actually the same; you only need to change the embedding layer size:
model.resize_token_embeddings(152064)
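Roughly, the whole step looks like this (the model names below are just placeholders, not the exact ones I used):

from transformers import AutoModelForCausalLM

# "your-base-model" and "your-resized-model" are placeholders.
model = AutoModelForCausalLM.from_pretrained("your-base-model")

# The tokenizer stays exactly as-is; only the input embedding matrix
# (and the lm_head) are grown so the vocab dimension matches
# QwQ-32B-Preview's 152064 entries.
model.resize_token_embeddings(152064)

model.save_pretrained("your-resized-model")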
So, one more thing: when you do this (model.resize_token_embeddings(152064)), does it affect the model's performance in any way, or is the vocabulary just filled up with pad tokens?
Just wondering since I got this warning while running it myself:
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
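From the warning, it sounds like I could skip that initialization with something like the line below (untested on my end), though I'm not sure whether that's actually what I want:

# Per the warning text, mean_resizing=False disables the mean/covariance-based
# initialization and leaves the new rows with the default random init.
model.resize_token_embeddings(152064, mean_resizing=False)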
@qingy2024 As long as it is divisible by 64, it should not be a problem, right?
@djuna I don't know too much about this and how the model architecture uses the vocab; please enlighten me :)
@qingy2024
I can only find this
https://x.com/karpathy/status/1621578354024677377
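If it helps, I believe newer versions of transformers can do that padding for you via the pad_to_multiple_of argument (treat this as a sketch; I haven't double-checked which version added it):

# Resize to the tokenizer's vocab size, rounded up to the next multiple of 64,
# which is the efficiency trick from the tweet above.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)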
@qingy2024 If the tokenizer is not changed, resize_token_embeddings should have no effect. Here is a test for the embedding:
import torch

# Mimics what resize_token_embeddings does: copy the old rows into a larger
# table and check that lookups for the original token ids are unchanged.
with torch.no_grad():
    embed_dim = 5
    old_num_embed, new_num_embed = 5, 8
    old_embeddings = torch.nn.Embedding(old_num_embed, embed_dim)
    new_embeddings = torch.nn.Embedding(new_num_embed, embed_dim, device=old_embeddings.weight.device, dtype=old_embeddings.weight.dtype)
    # Copy the original rows; the extra rows keep their fresh random init.
    new_embeddings.weight.data[:old_num_embed, :] = old_embeddings.weight.data[:old_num_embed, :]
    token_ids = torch.arange(old_num_embed).unsqueeze(0)
    out1 = new_embeddings(token_ids)
    out2 = old_embeddings(token_ids)
    assert torch.allclose(out1, out2)
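The same thing can be checked end to end on a real model. Something like this should work (untested sketch; gpt2 is just a convenient small example), since the logits for the original vocab slice should be unchanged after the resize:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM is fine for this check; gpt2 is only an example.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Hello world", return_tensors="pt")

with torch.no_grad():
    logits_before = model(**inputs).logits
    old_vocab = model.get_input_embeddings().weight.shape[0]

    # Resize to a larger vocab without touching the tokenizer.
    model.resize_token_embeddings(old_vocab + 64)
    logits_after = model(**inputs).logits

# Only the new vocab positions differ; the original ones are untouched.
assert torch.allclose(logits_before, logits_after[..., :old_vocab], atol=1e-5)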
Oh, very interesting; thanks for clarifying!