vLLM OpenAI Server Request Problem

by GokhanAI - opened

Hello, I hope your work is going well.

When I serve the Qwen2-72B-Instruct-GPTQ-Int4 model with vLLM and send multiple requests, it first collects all of the requests and only then starts responding. With your Qwen2-7B-Instruct model, on the other hand, the requests are picked up and answered individually as they arrive.

I do not have this problem with other models, but with Qwen2-72B-Instruct and the other quantized models the server first receives all of the requests and only then starts responding. I would be glad if you could help.
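
To make the behaviour easier to see, this is roughly how I test it (a rough sketch, not my real client; the prompt, request count, and max_tokens are only illustrative, while the address and model path are the ones from my launch command below): a few streaming requests are sent at the same time and every line that comes back is timestamped. With Qwen2-7B-Instruct the chunks of the different requests start interleaving almost immediately, while with the 72B GPTQ-Int4 model the responses only start after all of the requests have been collected.

# send 4 streaming requests at the same time and timestamp every line that comes back
for i in 1 2 3 4; do
  (
    curl -sN http://10.12.112.160:9001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "/opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4", "messages": [{"role": "user", "content": "Count from 1 to 50."}], "stream": true, "max_tokens": 128}' \
      | while IFS= read -r line; do
          # prefix each streamed line with the time and the request it belongs to
          [ -n "$line" ] && echo "$(date +%T) request-$i: $line"
        done
  ) &
done
wait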

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /opt/GPT/MODEL/Qwen2-72B-Instruct-GPTQ-Int4 --host 10.12.112.160 --port 9001 --max-model-len 8192 --tensor-parallel-size 1
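
The 7B model that behaves normally is started in the same way; only the model path changes (the path below is just an example):

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model /opt/GPT/MODEL/Qwen2-7B-Instruct --host 10.12.112.160 --port 9001 --max-model-len 8192 --tensor-parallel-size 1  # example 7B path; flags otherwise identical to the 72B launch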
