GPU utilisation high on Gemma-2b-it

#24
by sharad07 - opened

Hi, I'm comparing latencies of Gemma-2b-it-GGUF and Phi-2-GGUF. Both models have been loaded using the llama-cpp-python library.
When fed the same prompts, I notice high GPU utilisation and PCIe bus usage with Gemma-2b-it-GGUF, whereas GPU utilisation with Phi-2-GGUF is low and there is no PCIe bus usage at all. Any reason why this might be happening?
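In case it helps reproduce the observation, here is a minimal sketch that samples both counters via the NVML bindings (assuming the nvidia-ml-py package and that the A10G is GPU index 0; not necessarily the exact tooling used for the numbers above):

```python
# Sample GPU utilisation and PCIe throughput once per second
# (requires the nvidia-ml-py package: pip install nvidia-ml-py)
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: the A10G is device 0

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # NVML reports PCIe throughput in KB/s over a short sampling window
        rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
        tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
        print(f"GPU {util.gpu}% | PCIe RX {rx} KB/s | PCIe TX {tx} KB/s")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

Running this while each model generates should show whether the Gemma run really is moving data across the bus rather than staying resident on the GPU.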

I have one NVIDIA A10G GPU with 23 GB of VRAM, and both models are fully loaded onto it. I have passed these params to llama-cpp-python (a minimal loading sketch follows the model list below):
n_gpu_layers: -1
use_mlock: False
n_ctx: 512
n_batch: 512
n_threads: null
n_threads_batch: null
offload_kqv: True

Models:
phi-2.Q4_K_M.gguf
gemma-2b-it.gguf
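
For completeness, this is roughly how both models are constructed and queried (a minimal sketch; the file paths are assumptions, and the parameters mirror the list above):

```python
# Minimal sketch of loading both GGUF models with llama-cpp-python
# using the parameters listed above (paths are assumptions)
from llama_cpp import Llama

common = dict(
    n_gpu_layers=-1,      # offload all layers to the GPU
    use_mlock=False,
    n_ctx=512,
    n_batch=512,
    n_threads=None,       # let llama.cpp choose thread counts
    n_threads_batch=None,
    offload_kqv=True,     # keep the KV cache on the GPU as well
)

gemma = Llama(model_path="./gemma-2b-it.gguf", **common)
phi2 = Llama(model_path="./phi-2.Q4_K_M.gguf", **common)

# Feed the same prompt through both models for a latency comparison
for name, llm in (("gemma-2b-it", gemma), ("phi-2", phi2)):
    out = llm("Explain PCIe in one sentence.", max_tokens=64)
    print(name, out["choices"][0]["text"])
```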

