When I use vLLM v0.7.2 to deploy R1 AWQ, I get empty content:
curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [
         {"role": "user", "content": "你是谁"}
       ],
       "stream": true,
       "temperature": 1.2
      }'
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
Server launch command:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 --calculate-kv-scales \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}
Same errors. If you set skip_special_tokens to false in the sampling params, you'll find the output is not actually empty but a run of repeated <|begin_of_sentence|> tokens. And if you request logprobs, the server throws an error because of NaN values.
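If I remember correctly, vLLM's OpenAI-compatible server accepts skip_special_tokens as an extension field in the request body, so the repeated tokens can be surfaced straight from curl (same request as above, with streaming turned off for readability):

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [{"role": "user", "content": "你是谁"}],
       "skip_special_tokens": false
      }'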
Any help would be appreciated...
Please disable KV cache quantization.
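Concretely, that means dropping --kv-cache-dtype fp8_e5m2 and --calculate-kv-scales from the launch command, something like:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}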
Tried that, but I still hit the same bug.
Try building from source.
I use SGLang to deploy R1 AWQ on a single node with 8×A800 GPUs and get the same empty content for some questions too.
My command is below:
python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 \
  --model-path models/DeepSeek-R1-AWQ --tp 8 --enable-p2p-check \
  --trust-remote-code --dtype float16 --mem-fraction-static 0.9 \
  --served-model-name deepseek-r1-awq --disable-cuda-graph
So, has anyone deployed this successfully?
This might be related to the float16 overflow issue. Please try the moe_wna16 kernel with bfloat16.
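On the vLLM side that would mean forcing the dtype explicitly alongside the moe_wna16 quantization; a sketch based on the launch command earlier in the thread:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --dtype bfloat16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}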
I deployed successfully on vLLM 0.7.2 with --enforce-eager, but it is very slow.
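For reference, that presumably amounts to adding --enforce-eager to the launch command; the sketch below also drops the fp8 KV cache flags per the earlier suggestion:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 --enforce-eager \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}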