Quantized version of Mistral 7B (4-bit or 8-bit)
Can't run it on Colab (free tier). Can anyone guide me on how to run Mistral in 8-bit?
Same issue.
After installing requirements using these two lines:
!pip install git+https://github.com/huggingface/transformers
!pip install accelerate bitsandbytes
I always use load_in_4bit=True (or load_in_8bit=True, as in the snippet below) and device_map='cuda' while loading the model:
# Load the model directly, with 8-bit quantization (bitsandbytes) on the GPU
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", load_in_8bit=True, device_map='cuda'
)
But on Colab the CPU memory goes OOM. I don't know why this is loading into CPU memory instead of GPU memory!
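If the load does complete, a quick sanity check can show whether the weights actually landed on the GPU. This is a minimal sketch, assuming the model object from the snippet above and a GPU runtime:

import torch

print(torch.cuda.is_available())   # True means the Colab runtime has a GPU attached
print(model.hf_device_map)         # per-module placement, set by accelerate when device_map is used
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated on the GPU")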
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch

model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"

# Override the maximum sequence length in the model config
config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
config.max_position_embeddings = 8096

# 4-bit NF4 quantization, allowing FP32 CPU offload for modules that don't fit on the GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    config=config,
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",            # let accelerate split layers across GPU and CPU
    offload_folder="./offload",   # spill to disk if neither fits
)

prompt = "[INST]your prompt[/INST]"
print("\n\n*** Generate:")

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)   # print tokens as they are generated

output = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.3,
    top_k=20,
    top_p=0.4,
    repetition_penalty=1.1,
)
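The TextStreamer already prints tokens as they arrive; if you also want the generated text as a string afterwards, you can decode the returned IDs. A small follow-up sketch using the output tensor above:

# generate() returns the full token sequence, including the prompt tokens
print(tokenizer.decode(output[0], skip_special_tokens=True))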
For anyone using Colab, remove device_map='cuda' and it will load the model onto the GPU correctly.
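To tie this back to the original 8-bit question, here is a minimal sketch of what loading on a free-tier Colab GPU might look like. Passing the quantization settings through BitsAndBytesConfig and letting device_map="auto" place the layers is an assumption on my part, not something I have tested on every runtime:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"

# 8-bit quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",   # let accelerate place the layers on the available GPU
)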