#11
by ubergarm - opened

My 3090TI 24GB VRAM is very happy. Thank you.

Wow, QwQ-32B is impressive for such a small model. I'd been relying on the R1 671B UD-Q2_K_XL quant with partial CPU/GPU offload via ktransformers, battling NUMA node issues, just to refactor my Python apps. Now I can load the entire QwQ-32B IQ4_XS with 32k context into my 3090TI's 24GB of VRAM and watch it rip at over 30 tok/sec. In my initial test it seems comparable at refactoring a ~250-line Python LLM chat app. I'll keep testing! Thanks!
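
For anyone wondering why this fits, here is a back-of-the-envelope VRAM estimate. It is only a rough sketch: it assumes QwQ-32B follows a Qwen2.5-32B-style config (64 layers, 8 KV heads, head dim 128), that q8_0 costs about 8.5 bits per cached value, and that the IQ4_XS GGUF is roughly 17.7 GB on disk; check against your actual file size.

# Rough VRAM budget for the llama-server command below (all figures assumed/approximate).
GIB = 1024**3

model_file_gib = 17.7                        # approximate size of the IQ4_XS GGUF
n_layers, n_kv_heads, head_dim = 64, 8, 128  # assumed Qwen2.5-32B-style config
ctx = 32768                                  # matches --ctx-size
bytes_per_val = 8.5 / 8                      # q8_0 KV cache, ~8.5 bits per value

kv_gib = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_val / GIB
print(f"KV cache: {kv_gib:.1f} GiB")                           # ~4.3 GiB
print(f"Total:    {model_file_gib + kv_gib:.1f} GiB + ~1 GiB compute buffers")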

./llama-server \
    --model "../models/bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf" \
    --n-gpu-layers 65 \
    --ctx-size 32768 \
    --parallel 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 16 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080
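
Once the server is running it exposes llama.cpp's OpenAI-compatible HTTP API, so you can drive it from Python. A minimal sketch; the prompt, temperature, and max_tokens below are placeholders, not settings from this post:

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # --host/--port from the command above
    json={
        "messages": [
            {"role": "user", "content": "Refactor this Python function to use pathlib: ..."},
        ],
        "temperature": 0.6,   # placeholder sampling settings
        "max_tokens": 2048,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])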

./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 4831 (5e43f104)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

uname -a
Linux bigfan 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux

Hi, did you deploy this with vLLM? I've also been wanting to deploy it on my local GPU recently. Are there any tutorials you'd recommend? Thanks!
