llama.cpp for llama3 moe models

#1
by WesPro - opened

Hi mradermacher,

how do you make quants for Llama3 MoE models? I tried everything I could think of, but I always fail to make an f16. I tried convert.py, convert-hf-to-gguf.py, and imatrix for Llama3, but nothing works. Normal merges of Llama3 models work after copy-pasting the files downloaded by convert-hf-to-gguf-update.py from llama-bpe into my Llama3 merge folder, but somehow Llama3 MoE models don't work.

well, first of all, you must use convert-hf-to-gguf.py, as convert.py is officially not supported for llama 3. then, you already mastered the convert-hf-to-gguf-update technique, so that's good (I haven't :).

and... that's all I know, and all I do, basically.

what exactly does "nothing works" mean? you get crashes/errors, or a model that talks gibberish, or just a bad model? in the latter cases, do mine actually work?

At least the i1-Q6_K of this seems to work.

Yes, I know convert.py generates Llama3 quants with the wrong BPE tokenizer, because llama.cpp seems unable to recognize the architecture of the model you point convert.py at, so it just takes the tokenizer from another model that also uses a BPE tokenizer (afaik GPT-2). So convert.py "works", but gives you quants that generate significantly different and probably worse tokens; that's why even the high-precision quants show a big perplexity difference, so it should definitely be noticeable that "correct" Llama3 quants have superior quality.

It's easy to make correct ones, though: you just need to run convert-hf-to-gguf-update.py. It downloads the tokenizers for several models, including Llama3. You just copy the downloaded tokenizer files from models/tokenizers/llama-bpe into the Llama3 model folder and overwrite the old files. When you then use convert-hf-to-gguf.py on the Llama3 model with the updated tokenizer files, it creates "right" Llama3 quants.

Somehow it is different with my Llama3 MoE models, though. I can't even generate an F16.gguf with "convert.py /modelpath --vocab-type bpe", and convert-hf-to-gguf.py doesn't generate an F16.gguf either, no matter whether I changed the tokenizer files or not. The MoE Llama3 models work normally within oobabooga's transformers loader, but no matter what I try, I always get the same error:
Traceback (most recent call last):
  File "C:\Users\Chris\llama.cpp\convert-hf-to-gguf.py", line 2975, in <module>
    main()
  File "C:\Users\Chris\llama.cpp\convert-hf-to-gguf.py", line 2969, in main
    model_instance.write()
  File "C:\Users\Chris\llama.cpp\convert-hf-to-gguf.py", line 179, in write
    self.write_tensors()
  File "C:\Users\Chris\llama.cpp\convert-hf-to-gguf.py", line 1441, in write_tensors
    if len(experts) >= n_experts:
       ^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'int' and 'NoneType'

What do you enter to make an f16.gguf of a Llama3 MoE with convert-hf-to-gguf.py? I tried something like this:
"cd llama.cpp", then
"python convert-hf-to-gguf.py /modelpath"; sometimes I added --outtype f16 or f32, but nothing I tried solved the error.

I don't think it takes tokenizers from anywhere else. The issue is the pretokenizer, not the tokenizer. And the problem is that llama.cpp does not implement generic support for it, but hardcodes the most common ones and then tries to "measure" which one is used. And that measuring step is simply not implemented in convert.py. You can override the pretokenizer type for models converted with convert.py, and they seem to be fine.

Copying the tokenizer from another model is wrong unless it really matches. What convert-hf-to-gguf-update does is not copy tokenizers, but only detect the pretokenizer for a model.
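
(For reference, the "measuring" works roughly like this; a simplified sketch of what convert-hf-to-gguf-update.py does, with a placeholder probe string and model path:)

from hashlib import sha256
from transformers import AutoTokenizer

# the real script uses one long, fixed probe string designed to exercise the pretokenizer
chktxt = "Hello World!  3.14  ........  placeholder probe text"
tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-3-model")  # placeholder path
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)
# convert-hf-to-gguf.py compares this hash against a hardcoded list of known hashes
# to decide which pretokenizer name (e.g. "llama-bpe", "gpt-2") to write into the GGUF.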

Finally, the error you get: I think this is a problem with the specific model you are trying to convert, not a generic issue, and it's probably a mismatch between the metadata and the model. As in, maybe "num_local_experts" is not specified in config.json?
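
(A quick way to check, as a sketch; the path is a placeholder, and "num_local_experts" is the key I'd expect a Mixtral-style HF config to carry:)

import json

with open("/path/to/your-moe-model/config.json") as f:  # placeholder path
    cfg = json.load(f)

# convert-hf-to-gguf.py reads the expert count from the HF config; if the key is
# missing, it gets None, and the later `len(experts) >= n_experts` comparison
# fails with exactly the TypeError from the traceback above.
print(cfg.get("num_local_experts"))  # should print e.g. 4, not None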

PS: convert/convert-hf-to-gguf do not, in general, generate f16's unless you explicitly force them to quantize to f16 (which obviously should not be done if you want to quantize further). By default, they try to preserve the precision (e.g. by converting bf16 to f32).
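
(So, as a sketch, an invocation like this should be enough; the model path and output names are placeholders:)

python convert-hf-to-gguf.py /path/to/model --outfile model.gguf
# only if you really want an f16 (not needed if you quantize further anyway):
python convert-hf-to-gguf.py /path/to/model --outtype f16 --outfile model-f16.gguf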

Yes, you are right, the issue was with the pretokenizer. I just thought it was the tokenizer, because when I run convert-hf-to-gguf-update.py it downloads 3-4 files for each model into the folder "\llama.cpp\models\tokenizers". For Llama3 8B, it puts config.json, tokenizer.json and tokenizer_config.json into the subfolder called llama-bpe. I copied these files into several Llama3 models and they work perfectly, and I got errors when I tried to use convert-hf-to-gguf.py without replacing the files in my Llama3 models. This way I also got rid of the warning "GENERATION QUALITY WILL BE DEGRADED! CONSIDER REGENERATING THE MODEL" in llama.cpp or kobold.cpp when loading models converted before this fix existed.

Maybe the config.json, tokenizer.json and tokenizer_config.json of Llama3 8B just work for any finetune/merge that is exclusively based on Llama3 8B. This whole process is not something I just came up with; I read about it in Lewdiculous' model info for a Llama3 model, where he says that's the way to use the fix and make working quants. I also read about this method on Reddit, or maybe Discord, I don't remember exactly, but it is only a solution for the Llama3 8B model. I don't know why the update script also downloads the files for the other models at the same time. The instructions in the update script also say to add a model if it's not listed yet, so I guess you could even load more than the ones it downloads automatically:

[screenshot: the list of models/tokenizers downloaded by convert-hf-to-gguf-update.py]

I guess I will try to merge the Llama3 MoE model from scratch, and then I'll compare my config.json with the config.json files from other Llama3 MoEs that you already did quants for, because I think your tip might be right: at least this model here has "num_local_experts": 4 defined in its config.json and mine doesn't, so hopefully this will be the solution.

Thanks for your help so far, and thanks for doing all those GGUFs, especially since TheBloke seems to be inactive now; this is really needed. There are so many great models that I would've never found otherwise... So keep up the good work and have a nice day ;)

Well, copying files is simply wrong, even if it silences warnings. You can of course get a good result, especially if the tokenizers are compatible, but the fact remains it's wrong. The way to get rid of that warning is to use convert-hf-to-gguf.py, because convert.py doesn't set the pretokenizer, and is officially not working for llama-3.

The update script only downloads those files to measure them. The update script has nothing to do with the conversion; it exists only to generate a patch for convert-hf-to-gguf.py, so it can set the pretokenizer to the correct value. The conversion does not use the downloaded files, nor does it need them to work.
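
(The generated patch is just one more hash check inside get_vocab_base_pre() in convert-hf-to-gguf.py, roughly like this; the hash here is a placeholder for the value the update script measured:)

# inside Model.get_vocab_base_pre(), after chkhsh has been computed from the probe string as above:
if chkhsh == "<sha256 hash printed by convert-hf-to-gguf-update.py>":
    # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
    res = "llama-bpe"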

Anyway, I'm just telling you how it is; it's up to you whether to insist on doing it the wrong way or not. If you don't understand the process, I can probably walk you through it. But that is basically how I created my hfhfix script variant, which works for all llama-3-8b models I have thrown at it.

As for num_experts, I am confident that this is what causes the error. The question is why it doesn't end up in the config. But for that, I really am the wrong person :)

mradermacher changed discussion status to closed

I'm not claiming it's the right way of doing it; I have nowhere near enough knowledge or experience with llama.cpp... It just seemed to make sense to me that it downloads those files in order to replace them, but I believe you when you say that's not the right way. I'm not using convert.py for Llama3 models, though, because that wouldn't get rid of the warning. I did try convert.py, but it didn't work and the warning still appeared when loading in kobold.cpp. I actually used convert-hf-to-gguf.py, but it didn't seem to work until I started replacing config.json, tokenizer.json and tokenizer_config.json before running it. I'll try using it without replacing the files first next time.

The conversion of the MoE Llama3 model works now with a newly generated config.json, so that was definitely the issue. Thanks for helping me figure that out.

convert.py is, without doubt, not correct for llama 3 at this point, as it lacks llama-3 support. maybe support for llama-3 will be added to it, but I don't think there are definite plans for that. One can do a hack-conversion with convert.py and then override kv values, with uncertain results.

The only correct way is to get convert-hf-to-gguf.py working without replacing files.

Anyway, great that you solved your n_experts problem. You might want to hold back with imatrix generation, though: https://github.com/ggerganov/llama.cpp/pull/7099 might or might not affect llama-3 MoEs as well.
