Inference GPU RAM requirement >60GB

#1
by Ksgk-fy - opened

Hello, thanks for quantizing the model so quickly. I have an issue when running vLLM with:
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit
When I check the GPU RAM in use with nvidia-smi, the occupied memory is ~60GB. Can you please indicate the cause of this?

Neural Magic org

Hi @Ksgk-fy, this is intended behavior by vLLM for performance. After loading the model, it reserves 90% of total GPU memory by default to hold the KV cache - pre-allocating this space is optimal for throughput. The fraction is controlled directly by the gpu_memory_utilization=0.9 parameter. You can see how much memory the model weights alone take in a log line like:
INFO 05-23 14:17:22 model_runner.py:175] Loading model weights took 14.9595 GB
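If you want vLLM to reserve less of the GPU, you can lower that fraction when launching the server - a minimal sketch using the standard --gpu-memory-utilization flag, where 0.5 is just an example value, not a recommendation:
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit --gpu-memory-utilization 0.5
Note that reserving less memory leaves less room for the KV cache, which can reduce the maximum batch size and throughput.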

mgoin changed discussion status to closed
