
Unable to load model in GPU

#1
by Federic - opened

Hi, I'm trying to play with this model, but I cannot load it on the GPU (a T4 with 16 GB provided by Colab); even when I specify device_map="cuda:0" it still loads into RAM. Any advice? I also have another question: why does the model weigh so much (~30 GB) despite having only 7B parameters?

import torch
import transformers
from transformers import AutoModelForCausalLM

# 4-bit NF4 quantization via bitsandbytes
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # bnb_4bit_compute_dtype=torch.bfloat16,
)

llm = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    quantization_config=quantization_config,
    device_map="cuda:0",
)
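
As a side note, if the load does succeed, a quick way to check where the weights actually ended up is to inspect the device placement (a minimal sketch, assuming the model above loaded without crashing):

# where did the weights land?
print(next(llm.parameters()).device)   # device of the first parameter tensor
print(getattr(llm, "hf_device_map", None))  # module-to-device mapping, set when device_map is used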

Hi, it weighs that much because the weights are stored in float32 (rather than the more common float16).
I tried loading the model on Google Colab myself, and it appears to crash due to insufficient RAM.
I will upload a float16 variant; maybe that will solve the issue.
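
For reference, here is the back-of-the-envelope arithmetic (assuming roughly 7 billion parameters and ignoring tokenizer files and other overhead):

# approximate checkpoint sizes for a 7B-parameter model
params = 7_000_000_000
print(params * 4 / 1e9)  # float32: 4 bytes per weight -> ~28 GB
print(params * 2 / 1e9)  # float16: 2 bytes per weight -> ~14 GB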

I uploaded the float16 variant, and you can load it using the following code:

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")

However, it appears that Colab does not have enough RAM to handle this. I believe the best option is to use the llama.cpp version, which I have already quantized to 4 bits.
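
For anyone who wants to try that route, below is a minimal sketch using the llama-cpp-python bindings. The repo_id and GGUF filename are placeholders (assumptions), so check the model page for the actual quantized files:

# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="galatolo/cerbero-7b",       # hypothetical location of the GGUF files
    filename="cerbero-7b.Q4_K_M.gguf",   # hypothetical 4-bit quantized file name
)

llm = Llama(model_path=gguf_path, n_ctx=2048, n_gpu_layers=-1)  # offload all layers to the GPU if available
out = llm("Ciao, come stai?", max_tokens=64)
print(out["choices"][0]["text"])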

galatolo changed discussion status to closed
