
Unable to load model in GPU

#1
by Federic - opened

Hi, I'm trying to play with this model, but I cannot load it on the GPU (a T4 with 16 GB provided by Colab); even when I specify device_map="cuda:0" it still loads into RAM. Any advice? I also have another question: why does the model weigh so much (~30 GB) despite having only 7B parameters?

import torch
import transformers
from transformers import AutoModelForCausalLM

# 4-bit NF4 quantization via bitsandbytes
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    # bnb_4bit_compute_dtype=torch.bfloat16,
)

llm = AutoModelForCausalLM.from_pretrained(
    "galatolo/cerbero-7b",
    quantization_config=quantization_config,
    device_map="cuda:0",
)
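
As a side note, if the load does succeed, a quick way to check where the weights actually ended up is to inspect the device placement (a minimal sketch, assuming the model above loaded without crashing):

# where did the weights land?
print(next(llm.parameters()).device)   # device of the first parameter tensor
print(getattr(llm, "hf_device_map", None))  # module-to-device mapping, set when device_map is used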

Hi, it weighs that much because the weights are stored in float32 (rather than the more common float16).
I tried loading the model on Google Colab myself, and it appears to crash due to insufficient RAM.
I will upload a float16 variant; maybe that will solve the issue.
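
For reference, here is the back-of-the-envelope arithmetic (assuming roughly 7 billion parameters and ignoring tokenizer files and other overhead):

# approximate checkpoint sizes for a 7B-parameter model
params = 7_000_000_000
print(params * 4 / 1e9)  # float32: 4 bytes per weight -> ~28 GB
print(params * 2 / 1e9)  # float16: 2 bytes per weight -> ~14 GB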

I uploaded the float16 variant, and you can load it using the following code:

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b", revision="float16")

However, it appears that Colab does not have enough RAM to handle this. I believe the best option is to use the llama.cpp version, which I have already quantized to 4 bits.
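
For anyone who wants to try that route, below is a minimal sketch using the llama-cpp-python bindings. The repo_id and GGUF filename are placeholders (assumptions), so check the model page for the actual quantized files:

# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="galatolo/cerbero-7b",       # hypothetical location of the GGUF files
    filename="cerbero-7b.Q4_K_M.gguf",   # hypothetical 4-bit quantized file name
)

llm = Llama(model_path=gguf_path, n_ctx=2048, n_gpu_layers=-1)  # offload all layers to the GPU if available
out = llm("Ciao, come stai?", max_tokens=64)
print(out["choices"][0]["text"])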

galatolo changed discussion status to closed
