tokenizer.json size

#3
by robbiemu - opened

Why is the tokenizer.json half the size of the base model's?

Language Technologies Unit @ Barcelona Supercomputing Center org
edited Oct 15

Hi,

I believe this is because the tokenizer.json file for the instructed models was generated with a version of the tokenizers library that predates this PR: https://github.com/huggingface/tokenizers/pull/909. In contrast, the tokenizer.json files for the base models were created more recently. We believe the way merge operations are now written to the file roughly doubles its size.

For example, a merge that was previously written like this:

"▁profesor a"

is now written like this:

["▁profesor", "a"]
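To make the size difference concrete, here is a minimal sketch (hypothetical merge data, not the actual Salamandra files) comparing the pre-PR-909 serialization of merges (single space-joined strings) with the post-PR-909 one (two-element lists):

```python
import json

# Hypothetical merge pairs for illustration only.
merges = [("\u2581profesor", "a"), ("\u2581de", "l")]

# Old format: each merge is one space-joined string, e.g. "▁profesor a".
old_style = [f"{a} {b}" for a, b in merges]

# New format: each merge is a two-element list, e.g. ["▁profesor", "a"].
new_style = [[a, b] for a, b in merges]

old_json = json.dumps({"model": {"merges": old_style}}, ensure_ascii=False)
new_json = json.dumps({"model": {"merges": new_style}}, ensure_ascii=False)

# The list form adds brackets, an extra pair of quotes, and a comma per
# merge, so with hundreds of thousands of merges the file grows substantially.
print(len(old_json), len(new_json))
```

With a real vocabulary of hundreds of thousands of merges, this per-entry overhead is what makes the newer tokenizer.json files noticeably larger.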

robbiemu changed discussion status to closed

Just FYI: https://github.com/ollama/ollama/issues/7188#issuecomment-2414666523

Using llama.cpp I didn't have any issues, but Ollama users did.
