tokenizer.json size
#3
by
robbiemu
- opened
Why is the tokenizer.json half the size of the base model's?
Hi,
I believe this is because the tokenizer.json file for the instruct models was generated with a version of the tokenizers library that predates this PR: https://github.com/huggingface/tokenizers/pull/909. The tokenizer.json files for the base models, in contrast, were created more recently. The way merge operations are now written to the file roughly doubles the space they take up.
For example, a merge that was previously written like this:

```json
"▁profesor a"
```

is now written like this:

```json
[
  "▁profesor",
  "a"
]
```
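To see which serialization a given file uses, you can inspect the merges list directly. This is a minimal sketch assuming the standard tokenizer.json layout, where merges live under the `model.merges` key; the function name `merge_format` is hypothetical:

```python
import json


def merge_format(path):
    """Report whether a tokenizer.json stores merges in the legacy
    or the newer format.

    Legacy (pre-PR-909) files store each merge as one space-joined
    string, e.g. "▁profesor a"; newer files store a two-element list,
    e.g. ["▁profesor", "a"], which costs roughly twice the bytes
    once JSON punctuation and indentation are counted.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    merges = data["model"]["merges"]
    if not merges:
        return "empty"
    # Checking the first entry is enough: a file uses one format throughout.
    return "legacy string" if isinstance(merges[0], str) else "new list"
```

Running this over an instruct model's file and a base model's file should report "legacy string" for the former and "new list" for the latter, matching the size difference described above.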
robbiemu changed discussion status to closed
Just fyi: https://github.com/ollama/ollama/issues/7188#issuecomment-2414666523
Using llama.cpp I didn’t have any issues, but ollama users did.