How do you convert an MoE composed of Qwen1.5 models into GGUF?

#6 · opened by DisOOM

How did you quantize it? I am also trying to merge an MoE using Qwen1.5, but I can't convert it into GGUF. What modifications did you make to llama.cpp, or did you use any special commands? I tried convert.py but it failed, and with convert-hf-to-gguf.py I ended up with a corrupted model. Could you share your quantization details? I'm really looking forward to it, thank you!

Hi @DisOOM

I have never tried this conversion; I just assumed it worked because I tested the fp16 and it works fine. Now that I am testing the quantized models (after your comment), I am getting:

```
GGML_ASSERT: ggml.c:10860: wdata == wdata_src1_end
Aborted (core dumped)
```

I am not sure if it's Qwen1.5 (I see llama.cpp has support for it) or the MoE part, but something is not right. I am going to open an issue on llama.cpp to report this and see if I can resolve it. (If you have already opened or seen a related issue, please let me know so I can update it there.)

Thanks for your reply @MaziyarPanahi

This is the only PR I found related to this problem: https://github.com/ggerganov/llama.cpp/pull/6074
But it seems to be aimed at adding support for the new Qwen2MoE architecture, not fixing the conversion errors for existing MoE models based on Qwen1.5. Besides, I haven't seen any other issues or PRs related to MoE models composed of Qwen1.5.

So I don't get the error while converting the model. In fact, the model converts to 16-bit GGUF without any issue and works well.
Only the quantized models abort when I try to run inference. (I made a new quant from the main branch; still the same error.)
I'll ask in the issues and try this PR to see what happens.

@MaziyarPanahi Can you share your conversion method? I tried both convert.py and convert-hf-to-gguf.py, and both scripts failed.

Yes, I believe this is what I execute:

```sh
python {llamabase}/convert-hf-to-gguf.py {local_cache_dir} --outtype f16 --outfile {fp16}
```
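For the quantized versions, the follow-up step is along these lines (a rough sketch using the standard llama.cpp quantize tool; the file names and quant type here are placeholders):

```sh
# Sketch: quantize the f16 GGUF and run a quick inference check.
# Assumes llama.cpp's standard quantize/main binaries; paths and quant type are placeholders.
./quantize Qwen1.5-8x7b-v0.1.fp16.gguf Qwen1.5-8x7b-v0.1.Q4_K_M.gguf Q4_K_M
./main -m Qwen1.5-8x7b-v0.1.Q4_K_M.gguf -p "Hello, how are you?" -n 64
```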

Does your conversion fail on some Qwen-based models, or does it fail for my model as well? https://huggingface.co/MaziyarPanahi/Qwen1.5-8x7b-v0.1

It has failed with all the Qwen MoE models I've made myself, but I haven't tried your model yet; I'll give it a try.

@MaziyarPanahi The f16 version only produces garbage.

That's strange; my f16 of this model works without any issue. I also experienced a similar issue with a non-Qwen model, also an MoE made via mergekit, once it was quantized. I discussed it with the creator of llama.cpp: https://twitter.com/MaziyarPanahi/status/1770787676217569292

He thinks it should be a small issue somewhere if the f16 works without any issue.

Here is my f16 of this model:

```
llama.cpp/main -m quantized/MaziyarPanahi/Qwen1.5-8x7b-v0.1/Qwen1.5-8x7b-v0.1.fp16.gguf -p "I need to create a presisted volume on Kubernetese and attach it to my application. Give me these two yaml files:" -n 400 -e
system_info: n_threads = 40 / 80 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 0
```


I need to create a presisted volume on Kubernetese and attach it to my application. Give me these two yaml files: the `pvc.yaml` and the `deployment.yaml`. I want to use nginx as an example of the application.

Sure, here are the YAML files you requested:

1. `pvc.yaml`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-nginx
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Mi
  1. deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
```

Question:

What do you need help with? [end of text]

```
llama_print_timings: load time = 109277.43 ms
llama_print_timings: sample time = 554.24 ms / 204 runs ( 2.72 ms per token, 368.07 tokens per second)
llama_print_timings: prompt eval time = 3818.52 ms / 26 tokens ( 146.87 ms per token, 6.81 tokens per second)
llama_print_timings: eval time = 99529.71 ms / 203 runs ( 490.29 ms per token, 2.04 tokens per second)
llama_print_timings: total time = 104257.26 ms / 229 tokens
Log end
```


So I will prepare an issue and tag him about quantized MoE models made with mergekit not working properly.

@MaziyarPanahi "Thank you so much, I've spent quite a bit of time making many modifications to llama.cpp myself in an attempt to solve my problem until I finally realized it was beyond my capability. If you could file an issue and resolve the problem, I would be very grateful."

@MaziyarPanahi "Thank you so much, I've spent quite a bit of time making many modifications to llama.cpp myself in an attempt to solve my problem until I finally realized it was beyond my capability. If you could file an issue and resolve the problem, I would be very grateful."

Definitely! As he stated, if the f16 works properly, it should be something small we are missing somewhere. I'll create a proper issue and then update here.

Hi @MaziyarPanahi , I saw this new unofficial mergekit modification (https://github.com/Aratako/mergekit-qwen2) for generating Qwen1.5 MoE models (not the official latest release, haha; I'm referring to the kind we previously made with mergekit). It uses a custom architecture similar to phixtral. I have already modified my llama.cpp so that it can convert this new mergekit-produced Qwen1.5 MoE into the GGUF format, and both f16 and quantization are working fine. Once I'm sure my modifications haven't introduced any other issues, I'll submit a PR to llama.cpp.
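For reference, producing such a model should look roughly like this (a sketch only; I'm assuming the fork installs like upstream mergekit and keeps the mergekit-moe entry point, and config.yml and the output directory are placeholders):

```sh
# Sketch: install the mergekit-qwen2 fork and build a Qwen1.5 MoE from a merge config.
# Assumes the fork keeps upstream mergekit's mergekit-moe CLI; config.yml and the output path are placeholders.
pip install git+https://github.com/Aratako/mergekit-qwen2
mergekit-moe config.yml ./qwen1.5-moe-merged
```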

This is great news! Thank you so much for testing it, and I'm looking forward to the PR. As a reference, there is some work ongoing for MoE support in llama.cpp.

@MaziyarPanahi This is my modified version of llama.cpp (https://github.com/DisOOM/llama.cpp-qwenmoe) that can convert a Qwen1.5 MoE created with mergekit-qwen2 into GGUF format, and both the f16 and the quantized versions work well. However, it's not up to date, and I haven't been able to port it to the latest version of llama.cpp. In the versions after https://github.com/ggerganov/llama.cpp/pull/6122/files, I hit the assertion "every model that can must skip unused outputs", related to the newly added check in #6122, so I haven't submitted a PR yet. I don't understand the principles behind it, and my coding skills are actually quite poor. Could you help me review it and try to solve this issue?
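For anyone who wants to try it, the workflow with the fork should be roughly the following (a sketch; I'm assuming the fork keeps upstream llama.cpp's build and script layout, and all paths and file names are placeholders):

```sh
# Sketch: build the fork, convert a mergekit-qwen2 MoE to GGUF, then quantize.
# Assumes the fork follows upstream llama.cpp's layout; all paths are placeholders.
git clone https://github.com/DisOOM/llama.cpp-qwenmoe
cd llama.cpp-qwenmoe
make
pip install -r requirements.txt
python convert-hf-to-gguf.py /path/to/merged-qwen1.5-moe --outtype f16 --outfile qwen1.5-moe.fp16.gguf
./quantize qwen1.5-moe.fp16.gguf qwen1.5-moe.Q4_K_M.gguf Q4_K_M
```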

Of course! This is a good find. I am going to review what these changes are; however, would it be OK if you also opened that PR with details about the error you are encountering? There is a good community behind llama.cpp, so we can discuss it there and get the PR ready to merge.

Looking forward to the GGUF support for Qwen MoE models!
