Something wrong

#3
by wcde - opened

Are the quants broken? Something is definitely wrong; I don't observe performance anywhere near what the Qwen team claims. It thinks for 2k tokens and produces code worse than Qwen-Coder 32B. I even got a few loops with Chinese characters on Q8 without sampling.

I'm having good luck in initial testing. In a single test (refactoring a 250-line Python app) it seems at least comparable to the R1 671B UD-Q2_K_XL quant, and it is much faster. I get over 30 tok/sec on my 3090 Ti with 32k context like so:

./llama-server \
    --model "../models/bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf" \
    --n-gpu-layers 65 \
    --ctx-size 32768 \
    --parallel 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 16 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 4831 (5e43f104)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

$ uname -a
Linux bigfan 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux

Be careful to remove any <think></think> tags that end up in your code, as otherwise it will stop prematurely.
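For reference, here is a minimal sketch of stripping a leftover think block from a saved response before reusing the code (the helper name and regex are my own, not anything provided by llama.cpp or Qwen):

import re

def strip_think(text: str) -> str:
    # Drop a <think>...</think> block (or an unterminated <think> ...) from model output
    # so only the usable code/answer remains.
    text = re.sub(r"<think>.*?(</think>|\Z)", "", text, flags=re.DOTALL)
    return text.strip()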

It is also pretty good so far at translating GitHub issues from ktransformers, etc.

You have to give some details, like which inference engine, OS, etc., if you want some help.

Using Ollama, I changed my context length to 15k (for the Q6 quant); it thinks for 14 minutes, fills the entire context, then gives non-working code. I must be doing something wrong.

What inference configs are you using? I'm using litellm with llama.cpp API endpoint as such:

  API_CONFIG = APIConfig(
      base="http://127.0.0.1:8080/v1",
      key="n/a",
      model="openai/some/gguf",
      top_p=0.95,
      temperature=0.6,
      max_completion_tokens=-1,
  )

I am not providing any system prompt in the chat thread (not even a blank system: "").

It begins immediately with <think> and continues as normal, as I would expect. Check against their usage guidelines to confirm that your parameters are roughly like this.
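If it helps, here is a rough equivalent of that setup sent straight to llama-server's OpenAI-compatible endpoint (a sketch only; the model alias and prompt are placeholders, and max_tokens=-1 just mirrors the litellm config above):

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "some/gguf",  # placeholder alias, not a real model id
        "messages": [
            # no system message, matching the setup described above
            {"role": "user", "content": "Write a Python function that reverses a string."},
        ],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": -1,  # mirrors max_completion_tokens=-1; omit if your server rejects it
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])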

If you've not used R1 671B before, it will think for quite a while depending on the question; 8k+ tokens of thinking is not unexpected for complex tasks. So I presume Qwen is doing something similar.

It's so new; maybe some other folks will chime in with their results.

I have it writing Flappy Bird right now; it did about 6k tokens of thinking and is dumping out code now. Running at about 30~34 tok/sec depending on length.

Works great! Flappy Bird in one shot!

Odd... I asked it a question that R1 and even Qwen2.5 Coder 32B get right. I'm using:

ollama run hf.co/bartowski/Qwen_QwQ-32B-GGUF:Q4_K_L
/set parameter num_ctx 15000
write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically

It thinks for about 8 minutes (10K context), then spits out code that barely works:

[screenshot of the resulting output]

I am doubtful the benchmarks were correct (or maybe it's just not as good at coding). I tried both Q6 and this Q4_K_L; both should have been able to code this, right?
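For what it's worth, the num_ctx from the /set command above can also be passed per request through Ollama's REST API, which avoids the interactive session entirely. A rough sketch, assuming Ollama's default localhost:11434 endpoint (prompt shortened; the sampling options are my own additions):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/bartowski/Qwen_QwQ-32B-GGUF:Q4_K_L",
        "prompt": "write a Python program that shows a ball bouncing inside a spinning hexagon...",
        # num_ctx replaces the interactive `/set parameter num_ctx 15000` step
        "options": {"num_ctx": 15000, "temperature": 0.6, "top_p": 0.95},
        "stream": False,
    },
    timeout=3600,
)
print(resp.json()["response"])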

Seems likely that it'll be sensitive to temperature, like many recent thinking models are.

Also, I imagine that since it's not THAT big a model, it'll either be REALLY good or REALLY bad at most things; these reasoning models definitely have a chance to go off on a tangent and never really find their way back to reality.

Really impressive model no matter what, but definitely still some room to grow! I'm sure the Qwen team will continue to astound us :)

Yeah, I agree. Every new model that comes out pushes us further forward. Qwen2.5-coder:32B is still my daily driver for now.

I have my usual test questions (about 30), and this model has no problem answering them correctly. I'd say it's the best model I've tested so far. However, it's almost the first time I've tested a 32B model, because of slow hardware; perhaps that's why it appears smart to me. It's also the first time ever that I've gotten a correct answer to this question:
Tell me the name of a country whose name ends with 'lia'. Give me the capital city of that country as well.
Answer: Australia
(I bet Qwen Coder 32B and many other models won't answer this 😊)
bartowski/Qwen_QwQ-32B-Q4_K_S.gguf
