Issue with GPU Utilization in Colab Notebook
Hi
I'm encountering an issue with my Google Colab notebook where it doesn't seem to utilize the GPU, despite having the GPU runtime enabled. I've been working on a project that requires GPU acceleration, and this issue is hindering my progress.
Even responding to a simple "Hi" takes about 11 minutes, so something is clearly wrong. Can someone help, please?
Here is the link to my Colab notebook for reference: https://colab.research.google.com/drive/1331XPrqg4wKvT5ymQOwG4QY3Xk_kNLtl?usp=sharing
To provide more context:
I've ensured that the notebook settings are set to use GPU.
I've tried restarting the runtime and resetting all runtimes, but the issue persists.
Model I have used: mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
I'm new to using llama-cpp and GGUF files.
Thank you in advance for your time and assistance.
@Sagar3745 I think the problem might be that n_gpu_layers is set to zero. If I understand correctly, this is the number of layers you offload from the CPU to the GPU. Try setting n_gpu_layers to a non-zero value and test from there; the higher the number, the faster inference runs, but you might run out of VRAM and crash the notebook.
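For reference, a minimal sketch of what that looks like with llama-cpp-python (the model path and layer count here are just placeholders, not the values from your notebook):
....................................................................................
from llama_cpp import Llama

# Placeholder path/values -- adjust for your own setup.
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers=20,  # layers offloaded to the GPU; 0 means CPU only
    n_ctx=2048,       # context window
)
out = llm("[INST] Hi [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
....................................................................................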
Hi
@hammad93
Thanks for the help.
I tried a few values for n_gpu_layers, but it did not help. I have run the same code on a server, where it works on the CPU; inference is slow, and it takes noticeably longer when the prompt has many tokens.
Now I'm trying to run it on the GPU, but with no result so far.
Thanks again for the help.
@Sagar3745 try adding LLAMA_CUBLAS=1 to your CMAKE_ARGS.
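If you're installing llama-cpp-python with pip, that flag is passed at build time, so the install should look something like this (exact reinstall flags may vary with your setup):
install command: CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python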
Hi @hammad93, I tried that, but it didn't help.
This is what the model is printing in my terminal.
I'm using the standard template, the same as mentioned in the repo: [INST] {prompt} [/INST]
@Sagar3745
That's weird; I got it to work on the 2x T4 GPUs on Kaggle using LLAMA_CUBLAS=1 and n_gpu_layers, and it uses both GPUs while running inference. Try pulling the llama.cpp repository and building it from source.
github repo: https://github.com/ggerganov/llama.cpp
make command: make LLAMA_CUBLAS=1
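After building, a quick GPU test looks something like this (the model path and -ngl value are placeholders):
run command: ./main -m ./models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 20 -p "[INST] Hi [/INST]"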
@hammad93 Thanks, I got it. I was able to load the model onto the GPU, but it crashed because Colab only provides about 15 GB of GPU memory. I'm also not sure how you did it on Kaggle: the model is 24 GB and the allowed storage is only 19.5 GB, so I'm not able to download the model.
Can you tell me how you downloaded the model on Kaggle?
Model: mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
For now, I have tried with a smaller model:
@Sagar3745 Great! To use the full 73 GB of disk, change directory to /kaggle. Also try playing around with the n_gpu_layers value to fit the model in VRAM, and check out the docs, as there are other options that should help with memory allocation.
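For example, at the top of the notebook (a hypothetical snippet, given that the larger 73 GB disk is under /kaggle):
....................................................................................
import os

# Work out of /kaggle so downloads land on the larger 73 GB disk.
os.chdir("/kaggle")
....................................................................................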
How did you guys do it on Kaggle? Which accelerator? Could you be kind enough to share the notebook?
@LilWonga
@ianuvrat
repo: https://github.com/mth93/mixtral_llama_cpp
You can use the main branch instead of the mixtral branch, as it has already been merged into main. I tested this notebook with many of the open-source LLMs in GGUF format and it works great, utilizing both GPUs in the 2x T4 notebook. You'll have to play around with the context size (-c) and n-gpu-layers for each LLM, since the GPU-layers setting is how you shift the model from RAM into VRAM, if I understand correctly. There are also other options for the llama.cpp server that might improve performance, which I'm still experimenting with. And the bigger the context, the better, as it lets the LLM remember larger parts of the conversation history.
This works for launching a chatbot and an OpenAI-compatible API.
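A typical launch looks something like the line below; paths and numbers are placeholders, so check the server README for the exact flags in your build.
server command: ./server -m ./models/model.gguf -c 4096 --n-gpu-layers 20 --host 0.0.0.0 --port 8080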
Also check whether the model you're going to use needs a prompt template; you'll find it on the original model's page on Hugging Face. There is currently no way to add the prompt template to the llama.cpp server itself, so you'll have to add it to the prompt in the request (if you're using the API) or include it in the chat in the chatbot.
Based on what I've tested, most LLMs will give you very weird responses if they require a prompt template and you don't use it.
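For example, wrapping an [INST] template around the prompt yourself before calling the server's completion endpoint could look roughly like this (host, port, and endpoint assume the default server settings):
....................................................................................
import requests

# The server does not apply the model's prompt template for you,
# so add it to the prompt manually (Mixtral-style [INST] tags here).
prompt = "[INST] What is GGUF? [/INST]"

resp = requests.post(
    "http://127.0.0.1:8080/completion",          # llama.cpp server endpoint
    json={"prompt": prompt, "n_predict": 256},   # n_predict = max new tokens
)
print(resp.json()["content"])
....................................................................................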
hope this helps!
Also, I'd appreciate it if anyone could explain the difference between GGUF, GGML, AWQ, etc. If I understand correctly, these are different quantization algorithms, but I have no idea how they differ or how that affects performance and model size.
An update on the GPU issue: I could not solve it on my server, so I downloaded the original mistralai/Mixtral-8x7B-Instruct-v0.1 model and loaded it as a quantized model.
GPU usage: 24 GB (without inference).
This works perfectly fine for me.
@Sagar3745, sorry, but I did not understand. What did you do, and how? I want to use this model with LangChain agents for inference.
@ianuvrat
Initially I faced issues loading the GGUF model onto the GPU on my server, although it worked in the Kaggle notebook and Google Colab, where I was able to load it onto the GPU.
So instead of the .gguf model, I downloaded the original Mixtral Instruct model from mistralai on Hugging Face and just loaded it as a quantized model, which only takes about 24 GB of GPU memory.
There is nothing new here; before GGUF models, I used to load models like Llama 2 13B and Orca 2 13B as quantized models, and I did the same for this https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1 model.
Hope this helps.
@Sagar3745 Interesting. So you downloaded the original model and loaded it as quantized, right?
Could you be kind enough to share the Colab notebook showing how you did this? I'll also try the same and see whether it works for me.
@ianuvrat
Sure.
Here is the code, which uses the same model-loading method as I did.
Loading the model:
................................................................................................................................
import torch
import transformers
from torch import bfloat16
from langchain import HuggingFacePipeline, PromptTemplate, LLMChain
import re
import time

model_path = 'microsoft/Orca-2-13b'
hf_key = "hugging_api"  # your Hugging Face access token


def load_model():
    device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"

    # 4-bit (bitsandbytes) quantization config so the model fits in GPU memory
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16,
    )
    model_config = transformers.AutoConfig.from_pretrained(
        model_path, use_auth_token=hf_key
    )
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map="auto",
        use_auth_token=hf_key,
    )
    model.eval()
    print(f"Model loaded on {device}")

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path, use_auth_token=hf_key
    )

    # Greedy text-generation pipeline wrapped around the quantized model
    generate_text = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,
        task="text-generation",
        temperature=0.0,
        max_new_tokens=400,
        repetition_penalty=1.1,
        do_sample=False,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Expose the pipeline to LangChain
    return HuggingFacePipeline(
        pipeline=generate_text, model_kwargs={"temperature": 0}
    )


llm_model = load_model()
....................................................................................................................
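To actually call the model from LangChain afterwards (which is what you asked about), a rough usage sketch using the PromptTemplate and LLMChain imports from the block above would be something like this; the template text is just illustrative:
....................................................................................
# Rough usage sketch, not from the original notebook.
template = "You are a helpful assistant.\n\nQuestion: {question}\nAnswer:"
prompt = PromptTemplate(template=template, input_variables=["question"])

chain = LLMChain(llm=llm_model, prompt=prompt)
print(chain.run(question="What does 4-bit quantization do to memory usage?"))
....................................................................................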
requirements:-
....................................................................................
pip install -U llama-cpp-python
pip install -U transformers
pip install -U accelerate
pip install -U bitsandbytes
pip install -U langchain
pip install -U sentencepiece
....................................................................................
Remember, the original model is too large to download in Colab or Kaggle unless you are a Pro user. I did it on my server, which has enough disk space to download the model.
colab link:- https://colab.research.google.com/drive/1oHRk8dHYhGc9z6Olrx4pmFymf-LwhF_F?usp=sharing
Thanks mate. Will try with this!