Missing tokens

#2 opened by pdevine

When converting the model it seems like there are some missing tokens: config.json lists the vocab size as 32,032, but tokenizer.json only has 32,000 tokens in it, and the added_tokens.json file only defines two more. Similarly, the tokenizer_config.json doesn't define the missing tokens either. That accounts for only 32,002 of the 32,032 IDs, leaving 30 tokens undefined.
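A quick way to confirm the mismatch (a minimal sketch, assuming the standard Hugging Face file layout in the current directory):

```python
import json

# Declared vocabulary size: 32,032 for this model
with open("config.json") as f:
    declared = json.load(f)["vocab_size"]

# Base vocabulary from the tokenizer: 32,000 entries
with open("tokenizer.json") as f:
    base = len(json.load(f)["model"]["vocab"])

# Extra tokens: only 2 defined here
with open("added_tokens.json") as f:
    added = len(json.load(f))

print(f"declared={declared}, defined={base + added}, missing={declared - (base + added)}")
# -> declared=32032, defined=32002, missing=30
```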

I've tried padding out the vocabulary when converting it to GGUF and running it through Ollama (a sketch of the padding is below), and I get output that looks like:

>>> tell me the cool stuff
I'd be glad to! Here are some more interesting facts:
1. The largest known<dummy00001> spider web is 2,8<dummy00002> a whopping 46<dummy00001> feet wide and was made by a<dummy00018>a goat-faced ornspider in Japan.
2. The tallest wooden structure in the world, the 568-foot-tall pagoda, is located in Tallinn, Estonia. It's called the "Hydro-Cube."
3. If you were to stack all the grains of sand on Earth end-to-end, it would reach the Sun 1,900,000 times and still have plenty left over. And if you tried to fill up a large stadium like Wembley Stadium in London with all that sand, it wouldn't even come close to how much sand is on our planet!
4. Speaking of large structures, the Great Pyramid of Giza in Egypt is so old and so well-preserved that it's still usable as a parking garage for about 102,000 cars, or 153,000 people on a crowded day.

Also, I'm not really sure how you would park 102k cars in the Great Pyramid of Giza :-D
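For reference, the padding I'm doing is roughly this (a sketch of the idea, not the actual converter code; the `pad_vocab` name is made up): append placeholder strings until the vocab length matches the declared `vocab_size`.

```python
def pad_vocab(tokens: list[str], vocab_size: int) -> list[str]:
    """Append placeholder tokens until the vocab matches the declared size."""
    n_missing = vocab_size - len(tokens)  # 32,032 - 32,002 = 30 here
    for i in range(n_missing):
        # Produces <dummy00001>, <dummy00002>, ... for the undefined IDs
        tokens.append(f"<dummy{i + 1:05d}>")
    return tokens
```

Those pad tokens presumably never appeared in training, so when the model samples their IDs anyway you get the `<dummy00001>`-style strings littered through the output above.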
