How did you generate these? (32000 vs 32001 token issues)
I am trying to generate a merged model using the unquantized chronos-hermes model, but whenever I try to convert it to ggml format, it fails with:
Exception: Vocab size mismatch (model has 32001, but MODELDIR/tokenizer.model has 32000). Most likely you are missing added_tokens.json (should be in MODELDIR).
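For reference, this is roughly the comparison the converter seems to be making (just a sketch; MODELDIR is a placeholder for the actual model directory, and convert.py's exact logic may differ):

```python
# Rough sketch of the check: the SentencePiece vocab (plus any entries in
# added_tokens.json) has to match the model's declared vocab size.
import json
from pathlib import Path

from sentencepiece import SentencePieceProcessor

model_dir = Path("MODELDIR")  # placeholder for the actual model directory

config_vocab = json.loads((model_dir / "config.json").read_text())["vocab_size"]

sp = SentencePieceProcessor()
sp.Load(str(model_dir / "tokenizer.model"))

added_path = model_dir / "added_tokens.json"
added = json.loads(added_path.read_text()) if added_path.exists() else {}

print("config.json vocab_size :", config_vocab)     # 32001 here
print("tokenizer.model vocab  :", sp.vocab_size())  # 32000 here
print("added_tokens.json      :", len(added), "added token(s)")
# The converter expects tokenizer vocab + added tokens == model vocab size.
```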
I can get around this by adding an added_tokens.json file with e.g.
{
  "[PAD]": 32000
}
but then quantization fails:
========================= Tensor sizes 5120 x 32001 are not divisible by 256
This is required to be able to use k-quants for now!
========================================================================================
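Presumably the k-quant formats work in 256-element super-blocks (QK_K in llama.cpp, if I'm reading it right), so the padded 32,001 dimension is what trips the check. A quick bit of arithmetic to illustrate:

```python
# The k-quants pack weights into 256-element super-blocks, and my
# understanding is that the check requires the tensor dimensions to be
# multiples of that block size.
QK_K = 256  # k-quant super-block size in llama.cpp

for vocab in (32000, 32001):
    remainder = vocab % QK_K
    status = "ok" if remainder == 0 else "fails the check"
    print(f"{vocab} % {QK_K} = {remainder} -> {status}")

# 32000 % 256 = 0 -> ok
# 32001 % 256 = 1 -> fails the check
```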
I strongly suspect this model doesn't have 32001 tokens, so added_tokens.json is probably not the way to go.
That's exactly what I did to create it, and it worked at the time. But llama.cpp has recently added a check for tensors whose sizes aren't divisible by 256, which you can read about here: https://github.com/ggerganov/llama.cpp/issues/1919#issuecomment-1599484900
This does rather raise the question of whether that check is producing false positives, because this model clearly worked when I made it. Like you, I've also wondered whether models like this actually have 32,001 tokens or have simply inherited that config. However, on at least one previous model I tried editing config.json to set vocab_size to 32000, and it still failed because the tensors really were sized for 32,001 tokens.
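If you want to check this model the same way, something like the sketch below works (the shard filename is just an example; point it at whichever shard holds embed_tokens / lm_head):

```python
# Check whether the checkpoint itself has 32,001-row tensors, rather than
# just a config.json that says so. The shard filename below is illustrative.
import torch

state = torch.load("MODELDIR/pytorch_model-00001-of-00003.bin", map_location="cpu")
for name, tensor in state.items():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, tuple(tensor.shape))

# Shapes like (32001, 5120) mean the extra token is baked into the weights,
# so editing config.json back to 32000 isn't enough on its own.
```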
So I don't know. But right now you won't be able to make k-quant GGMLs with this model unless you roll back to a version of llama.cpp from about 12 days ago. And then there's the question of whether doing that would cause other issues.
You can still make the old q4_0, q4_1, q5_0, q5_1 and q8_0 quants - though if you do, you might want to use an even older version of llama.cpp, so as to get the greatest possible compatibility for anyone still using a library/client that hasn't been updated since June 6th. I make my q[458]_[01] quants with tag master-ffb06a3 (commit ffb06a345e3a9e30d39aaa5b46a23201a74be6de).
Looking back at the source model: it is a merge of Chronos and Hermes. Chronos has 32,000 tokens, but Nous Hermes 13B has 32,001. So that's the source of the problem here.
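You can see that straight from the two source repos' configs (these are the repos I believe the merge used; adjust if it pulled from somewhere else):

```python
# Compare the declared vocab sizes of the two source models.
from transformers import AutoConfig

for repo in ("elinas/chronos-13b", "NousResearch/Nous-Hermes-13b"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "vocab_size =", cfg.vocab_size)

# Expected: chronos-13b -> 32000, Nous-Hermes-13b -> 32001; a naive merge
# keeps the larger embedding, hence the 32,001-token result.
```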
@karan4d - do you recall why Nous Hermes has 32,001 tokens, and is that necessary? As you can read above, it's currently causing problems with k-quant GGMLs. Those issues will hopefully be resolved soon, but I'm thinking it would probably be a good idea to avoid 32,001-token Llama models in future, unless there's a good reason to include the extra token?
My memory is that the first model to add that PAD token as token 32,001 was GPT4All, and that it was done as a hack because they hadn't set up the special tokens correctly. My impression is that the practice has since been inherited by other models, perhaps without any real need for it.
It was never a problem before, but it is causing issues now. And even if that's resolved soon, it's probably cleaner to avoid the extra token unless it's definitely doing something.