#11
by ubergarm - opened

My 3090TI 24GB VRAM is very happy. Thank you.

Wow, QwQ-32B is impressive for such a small model. I'd been relying on the R1 671B UD-Q2_K_XL quant with partial CPU/GPU offload via ktransformers, battling NUMA node issues, just to refactor my Python apps. Now I can load the entire QwQ-32B IQ4_XS with 32k context into my 3090TI's 24GB of VRAM and watch it rip at over 30 tok/sec. In my initial test it seems comparable at refactoring a ~250-line Python LLM chat app. I'll keep testing! Thanks!
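
For anyone wondering why this fits, here is a back-of-the-envelope VRAM estimate. It is only a rough sketch: it assumes QwQ-32B follows a Qwen2.5-32B-style config (64 layers, 8 KV heads, head dim 128), that q8_0 costs about 8.5 bits per cached value, and that the IQ4_XS GGUF is roughly 17.7 GB on disk; check against your actual file size.

# Rough VRAM budget for the llama-server command below (all figures assumed/approximate).
GIB = 1024**3

model_file_gib = 17.7                        # approximate size of the IQ4_XS GGUF
n_layers, n_kv_heads, head_dim = 64, 8, 128  # assumed Qwen2.5-32B-style config
ctx = 32768                                  # matches --ctx-size
bytes_per_val = 8.5 / 8                      # q8_0 KV cache, ~8.5 bits per value

kv_gib = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_val / GIB
print(f"KV cache: {kv_gib:.1f} GiB")                           # ~4.3 GiB
print(f"Total:    {model_file_gib + kv_gib:.1f} GiB + ~1 GiB compute buffers")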

./llama-server \
    --model "../models/bartowski/Qwen_QwQ-32B-GGUF/Qwen_QwQ-32B-IQ4_XS.gguf" \
    --n-gpu-layers 65 \
    --ctx-size 32768 \
    --parallel 1 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 16 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080
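
Once the server is running it exposes llama.cpp's OpenAI-compatible HTTP API, so you can drive it from Python. A minimal sketch; the prompt, temperature, and max_tokens below are placeholders, not settings from this post:

import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # --host/--port from the command above
    json={
        "messages": [
            {"role": "user", "content": "Refactor this Python function to use pathlib: ..."},
        ],
        "temperature": 0.6,   # placeholder sampling settings
        "max_tokens": 2048,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])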

./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 4831 (5e43f104)
built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu

uname -a
Linux bigfan 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux

Hi, did you deploy this with vLLM? I've also been wanting to deploy it on my local GPU recently. Are there any tutorials you'd recommend? Thanks!
