Inference GPU RAM requirement >60GB

#1
by Ksgk-fy - opened

Hello, thanks for quantizing the model so quickly. I have an issue when running vLLM with:
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit
When I check the GPU RAM in use with nvidia-smi, the occupied memory is ~60GB. Can you please indicate the cause of this?

Neural Magic org

Hi @Ksgk-fy, this is intended behavior by vLLM for performance. After loading the model, it reserves 90% of total GPU memory by default to hold the KV cache - pre-allocating this space is optimal for throughput. The fraction is controlled directly by the gpu_memory_utilization=0.9 parameter. You can see how much memory the model weights alone take in a log line like:
INFO 05-23 14:17:22 model_runner.py:175] Loading model weights took 14.9595 GB
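If you want vLLM to reserve less of the GPU, you can lower that fraction when launching the server - a minimal sketch using the standard --gpu-memory-utilization flag, where 0.5 is just an example value, not a recommendation:
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit --gpu-memory-utilization 0.5
Note that reserving less memory leaves less room for the KV cache, which can reduce the maximum batch size and throughput.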

mgoin changed discussion status to closed
