When I use vLLM v0.7.2 to deploy R1 AWQ, I get empty content

#10 opened by bupalinyu

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-reasoner",
    "messages": [
      {"role": "user", "content": "你是谁"}
    ],
    "stream": true,
    "temperature": 1.2
  }'
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.92 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}

Same error here. If you set "skip_special_tokens" to false when sampling, you'll find the content isn't actually empty but consists of repeated <|begin_of_sentence|> tokens. And if you request logprobs, the server throws an error because of NaN values.
Hoping someone can help...
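For anyone who wants to reproduce that observation: vLLM's OpenAI-compatible server should accept skip_special_tokens as an extra field in the request body, so a request like the one above (same port and model name, just with the flag added) ought to show the raw special tokens in the stream. A sketch:

# Same request as above, but with skip_special_tokens disabled to expose the raw tokens
curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-reasoner",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stream": true,
    "temperature": 1.2,
    "skip_special_tokens": false
  }'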

Please disable KV cache quantization.
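Concretely, that means dropping the --kv-cache-dtype fp8_e5m2 and --calculate-kv-scales flags so the KV cache stays in the default dtype. A sketch based on the launch command posted above:

# Launch without KV cache quantization (no --kv-cache-dtype / --calculate-kv-scales)
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}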

Tried that, but the bug is still the same.

Cognitive Computations org

Try building from source.

I use SGLang to deploy R1 AWQ on a single node with 8× A800 GPUs and get the same empty content for some questions too.
My command is below:
python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 --model-path models/DeepSeek-R1-AWQ --tp 8 --enable-p2p-check --trust-remote-code --dtype float16 --mem-fraction-static 0.9 --served-model-name deepseek-r1-awq --disable-cuda-graph

So, has anyone deployed it successfully?

Cognitive Computations org

This might be related to the float16 overflow issue; please try the moe_wna16 kernel with bfloat16.
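In vLLM terms that would look something like the following sketch (the moe_wna16 kernel is already selected in the command posted earlier; the relevant change is forcing bfloat16 instead of float16 activations):

# vLLM with the moe_wna16 kernel and bfloat16 activations (instead of float16)
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --dtype bfloat16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}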

I deployed it successfully in vLLM 0.7.2 with --enforce-eager, but it's very slow.
