tokenizer.json size

#3
by robbiemu - opened

Why is the tokenizer.json half the size of the base model's?

Language Technologies Unit @ Barcelona Supercomputing Center org
edited Oct 15

Hi,

I believe this is because the tokenizer.json file for the instructed models was generated with a version of the tokenizers library that predates this PR: https://github.com/huggingface/tokenizers/pull/909. In contrast, the tokenizer.json files for the base models were created more recently. We believe the way merge operations are now written to the file roughly doubles its size.

For example, a merge that was previously written like this:

"▁profesor a"

is now written like this:

["▁profesor", "a"]
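To make the size difference concrete, here is a minimal sketch (hypothetical merge data, not the actual Salamandra files) comparing the pre-PR-909 serialization of merges (single space-joined strings) with the post-PR-909 one (two-element lists):

```python
import json

# Hypothetical merge pairs for illustration only.
merges = [("\u2581profesor", "a"), ("\u2581de", "l")]

# Old format: each merge is one space-joined string, e.g. "▁profesor a".
old_style = [f"{a} {b}" for a, b in merges]

# New format: each merge is a two-element list, e.g. ["▁profesor", "a"].
new_style = [[a, b] for a, b in merges]

old_json = json.dumps({"model": {"merges": old_style}}, ensure_ascii=False)
new_json = json.dumps({"model": {"merges": new_style}}, ensure_ascii=False)

# The list form adds brackets, an extra pair of quotes, and a comma per
# merge, so with hundreds of thousands of merges the file grows substantially.
print(len(old_json), len(new_json))
```

With a real vocabulary of hundreds of thousands of merges, this per-entry overhead is what makes the newer tokenizer.json files noticeably larger.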

robbiemu changed discussion status to closed

Just FYI: https://github.com/ollama/ollama/issues/7188#issuecomment-2414666523

Using llama.cpp I didn't have any issues, but Ollama users did.
