Actual number of parameters?
Why is this model called -1M when the actual number of parameters appears to be 3745984?
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M")
sum(p.numel() for p in model.parameters())  # 3745984
Or if you exclude the 3216448-parameter token embedding matrix (by far the bulk of the total parameters), the number of other parameters is 529536. But that's more like 500k than 1M. So why isn't this named either TinyStories-4M or TinyStories-500k? What does the -1M refer to?
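For anyone who wants to reproduce that split, here is a quick sketch (using get_input_embeddings() so it doesn't depend on GPT-Neo's internal module names; note the 529536 "other" figure still includes the position embeddings):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M")

# Token embedding matrix: 50257 tokens x 64 dims = 3216448 parameters
emb = model.get_input_embeddings().weight.numel()
total = sum(p.numel() for p in model.parameters())

print(emb, total - emb)  # 3216448 529536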
Oh, I think I may have figured it out!
From the TinyStories paper:
"Our models are available on Huggingface named TinyStories-1M/3M/9M/28M/33M/1Layer/2Layer and TinyStories-Instruct-β. We
use GPT-Neo architecture with window size 256 and context length 512. We use GPT-Neo tokenizer but only keep the top 10K most
common tokens."
So they only kept the top 10K most common tokens for training. But the models here have the full 50257-token vocabulary in their embedding matrices. So I guess for distribution the trained models were padded out (with what, zeros? garbage?) to the full GPT-Neo vocabulary, so they plug straight into the standard tokenizer?
The math works out: the embedding width is 64 (3216448 / 50257), so a 10K vocabulary would give a 10000 × 64 = 640000 embedding matrix, and instead of 3216448 (embedding matrix) + 529536 = 3745984 we would have 640000 + 529536 = 1169536. That makes a lot more sense to me as a "-1M" model, so I bet this is how it was trained.
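As a back-of-envelope check on that arithmetic (the 64 below is just 3216448 / 50257, i.e. the embedding width implied by the checkpoint):

hidden_size = 64      # 3216448 / 50257 -> embedding width of TinyStories-1M
full_vocab = 50257    # GPT-Neo tokenizer vocabulary shipped with the checkpoint
kept_vocab = 10_000   # top-10K tokens the paper says were actually used
other = 529_536       # all parameters outside the token embedding matrix

print(full_vocab * hidden_size + other)  # 3745984 -- what the released checkpoint reports
print(kept_vocab * hidden_size + other)  # 1169536 -- roughly 1M, matching the name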