UPDATE: The official version is out; use it instead: https://huggingface.co/mistralai/Mistral-7B-v0.1
mistral-7B-v0.1-hf
Hugging Face-compatible version of Mistral's 7B model: https://twitter.com/MistralAI/status/1706877320844509405
Usage
Load in bfloat16 (16GB VRAM or higher)
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer
tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},  # place the whole model on GPU 0
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)
Load in bitsandbytes nf4 (6GB VRAM or higher, possibly less with bnb_4bit_use_double_quant=True)
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer, BitsAndBytesConfig
tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    device_map={"": 0},  # place the whole model on GPU 0
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,  # set to True to save more VRAM at the cost of some speed/accuracy
    ),
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)
Load in bitsandbytes int8 (8GB VRAM or higher). Quite slow; not recommended.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline, TextStreamer, BitsAndBytesConfig
tokenizer = LlamaTokenizer.from_pretrained("kittn/mistral-7B-v0.1-hf")
model = LlamaForCausalLM.from_pretrained(
    "kittn/mistral-7B-v0.1-hf",
    device_map={"": 0},  # place the whole model on GPU 0
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
    ),
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Hi, my name", streamer=TextStreamer(tokenizer), max_new_tokens=128)
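The VRAM figures quoted for the three loading modes can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below assumes a parameter count of about 7.24 billion (an approximation, not stated in this card) and counts weight storage only. Activations, the KV cache, quantization constants, and layers that bitsandbytes leaves unquantized all add to the real footprint, so treat the results as lower bounds.

```python
# Rough weight-memory estimate per loading mode.
# ASSUMPTION: ~7.24e9 parameters; this figure is approximate and not taken from the card.
N_PARAMS = 7.24e9

# Bits used to store each weight in the three modes shown above.
BITS_PER_PARAM = {"bfloat16": 16, "int8": 8, "nf4": 4}


def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Return the weight storage in GiB, ignoring all runtime overhead."""
    return n_params * bits_per_param / 8 / 2**30


estimates = {mode: weight_gib(N_PARAMS, bits) for mode, bits in BITS_PER_PARAM.items()}

for mode, gib in estimates.items():
    print(f"{mode:>8}: ~{gib:.1f} GiB for weights alone")
```

The gap between these estimates and the recommended card sizes (16GB, 8GB, 6GB) is the headroom left for everything the estimate ignores.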
Notes
- The original Hugging Face conversion script converts the model from bf16 to fp16 before saving it. This script doesn't.
- The tokenizer is created with legacy=False; more about this here.
- Saved in safetensors format.
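As a general illustration of why a bf16-to-fp16 cast is not always harmless, the sketch below compares the two formats using only the standard library. bfloat16 keeps float32's 8-bit exponent, while fp16 has a 5-bit exponent, so values that bf16 can hold may overflow or flush to zero in fp16. The helper names `to_bf16` and `fits_fp16` are hypothetical, and `to_bf16` truncates rather than rounds. Whether any weights in this particular checkpoint fall outside fp16's range is not something this card states.

```python
import struct


def to_bf16(x: float) -> float:
    """Truncate x to bfloat16 precision by keeping the top 16 bits of its float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack(">e", x)
        return True
    except OverflowError:
        return False


def fp16_roundtrip(x: float) -> float:
    """Pack x as fp16 and read it back (small values may flush to zero)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


big = to_bf16(1e5)    # representable in bf16, above fp16's maximum of 65504
tiny = to_bf16(1e-8)  # representable in bf16, below fp16's smallest subnormal

print(big, fits_fp16(big))
print(tiny, fp16_roundtrip(tiny))
```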
Conversion script [link]
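Unlike meta-llama/Llama-2-7b, this model uses grouped-query attention (GQA), so its key and value projections have fewer heads than its query projection. This breaks some assumptions in the original conversion script, requiring a few changes.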
Conversion script: link
Original conversion script: link
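To make the GQA difference concrete, here is a minimal sketch of the attention projection shapes under each configuration. It is not taken from either conversion script. The `AttnConfig` class and `proj_shapes` function are hypothetical names, and the head counts (32 query heads for both models, 32 key/value heads for Llama-2-7b, 8 for Mistral 7B) come from the models' published configs rather than from this card.

```python
from dataclasses import dataclass


@dataclass
class AttnConfig:
    hidden_size: int
    num_heads: int      # query heads
    num_kv_heads: int   # key/value heads (equal to num_heads without GQA)

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads


def proj_shapes(cfg: AttnConfig) -> dict:
    """Return (out_features, in_features) for the q/k/v projection weights."""
    q_out = cfg.num_heads * cfg.head_dim
    kv_out = cfg.num_kv_heads * cfg.head_dim
    return {
        "q_proj": (q_out, cfg.hidden_size),
        "k_proj": (kv_out, cfg.hidden_size),
        "v_proj": (kv_out, cfg.hidden_size),
    }


# ASSUMPTION: values from the published model configs, not from this card.
llama2_7b = AttnConfig(hidden_size=4096, num_heads=32, num_kv_heads=32)
mistral_7b = AttnConfig(hidden_size=4096, num_heads=32, num_kv_heads=8)

print("Llama-2-7b:", proj_shapes(llama2_7b))
print("Mistral-7B:", proj_shapes(mistral_7b))
```

A script that reshapes or permutes the key and value weights using the query head count would be working with the wrong shape for the Mistral tensors, which is the kind of assumption that plausibly needed changing.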