When I use vLLM v0.7.2 to deploy R1 AWQ, I get empty content:
curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [
         {"role": "user", "content": "你是谁"}
       ],
       "stream": true,
       "temperature": 1.2
      }'
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
Server launch command:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 --calculate-kv-scales \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}
Same errors. If you set skip_special_tokens to false in the sampling params, you'll find the output is not actually empty but a run of repeated <|begin_of_sentence|> tokens. And if you request logprobs, the server throws an error because of NaN values.
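If I remember correctly, vLLM's OpenAI-compatible server accepts skip_special_tokens as an extension field in the request body, so the repeated tokens can be surfaced straight from curl (same request as above, with streaming turned off for readability):

curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-reasoner",
       "messages": [{"role": "user", "content": "你是谁"}],
       "skip_special_tokens": false
      }'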
Any help would be appreciated...
Please disable KV cache quantization.
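Concretely, that means dropping --kv-cache-dtype fp8_e5m2 and --calculate-kv-scales from the launch command, something like:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}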
Tried that, but I still hit the same bug.
Try building from source.
I use SGLang to deploy R1 AWQ on a single node with 8×A800 GPUs and get the same empty content for some questions too.
My command is below:
python3 -m sglang.launch_server --host 0.0.0.0 --port 30000 \
  --model-path models/DeepSeek-R1-AWQ --tp 8 --enable-p2p-check \
  --trust-remote-code --dtype float16 --mem-fraction-static 0.9 \
  --served-model-name deepseek-r1-awq --disable-cuda-graph
So, has anyone deployed this successfully?
This might be related to the float16 overflow issue. Please try the moe_wna16 kernel with bfloat16.
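On the vLLM side that would mean forcing the dtype explicitly alongside the moe_wna16 quantization; a sketch based on the launch command earlier in the thread:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --dtype bfloat16 --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}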
I deployed successfully on vLLM 0.7.2 with --enforce-eager, but it is very slow.
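For reference, that presumably amounts to adding --enforce-eager to the launch command; the sketch below also drops the fp8 KV cache flags per the earlier suggestion:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 --enforce-eager \
  --served-model-name deepseek-reasoner --model ${LLM_MODEL_DIR}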