CUDA usage is low

#28
by Max545 - opened

When I train Gemma 2, GPU utilization is low (0% most of the time), but when I use the same method (LoRA, with the PEFT library) to train LLaMA, GPU utilization is constantly around 100%. What's the reason?

Google org

Hi @Max545,

I executed both models on a single NVIDIA Tesla A100 GPU. When loading models like google/gemma-2b and meta-llama/Llama-2-7b-hf, if the device is not specified (for example via device_map="auto"), the model weights stay in system RAM and computation falls back to the CPU. If you explicitly move the model to "cuda", it runs on the GPU, utilizing its computational power for faster processing. Please refer to the following gist for more details: link to gist.
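As a minimal sketch of that difference (assuming the transformers library; google/gemma-2b is shown, but the same applies to the LLaMA checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # same idea for meta-llama/Llama-2-7b-hf

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Without a device argument, the weights load into system RAM and the
# forward pass runs on the CPU, which shows up as ~0% GPU utilization.
# device_map="auto" places the weights on the available GPU(s) instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

print(next(model.parameters()).device)  # expect: cuda:0
```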

The difference in GPU utilization between Gemma 2 and LLaMA during LoRA fine-tuning can be attributed to several factors:

  Model architecture: LLaMA is more heavily optimized for efficient GPU execution, while Gemma 2 may not be as well tuned for GPU-heavy workloads.
  Memory bottlenecks: inefficient memory management or slow data transfer between CPU and GPU can leave the GPU idle while it waits for data; the sanity check after this list shows one way to rule that out.
  Framework support: LLaMA has broader support in the PEFT library and related tooling, which can translate into better GPU utilization than Gemma 2.
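To verify that the LoRA-wrapped model actually lives on the GPU rather than partly on the CPU, a minimal sketch along these lines can help (the LoRA hyperparameters here are illustrative, not a recommendation; target_modules names the attention projections, which both Gemma and LLaMA use, but adjust per architecture):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Sanity check: confirm no parameters were left on the CPU, which would
# force slow CPU<->GPU transfers during training and show as low GPU usage.
devices = {p.device for p in model.parameters()}
print(devices)  # expect only: {device(type='cuda', index=0)}
```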

Thank you.
