|
--- |
|
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1 |
|
model_creator: mistralai |
|
quantized_by: jartine |
|
license: apache-2.0 |
|
prompt_template: | |
|
[INST] {{prompt}} [/INST] |
|
tags: |
|
- llamafile |
|
language: |
|
- en |
|
--- |
|
|
|
# Mixtral 8x22B Instruct v0.1 - llamafile |
|
|
|
This repository contains executable weights (which we call |
|
[llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on |
|
Linux, MacOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64. |
|
|
|
- Model creator: [Mistral AI](https://mistral.ai/) |
|
- Original model: [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
|
|
|
## Quickstart |
|
|
|
Assuming your system has at least 128GB of RAM, you can try running the |
|
following commands, which download, concatenate, and execute the model.
|
|
|
``` |
|
( curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0 |
|
curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1 |
|
) > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile |
|
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile |
|
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile --help # view manual |
|
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile # launch web gui + oai api |
|
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p ... # cli interface (scriptable) |
|
``` |
|
|
|
Alternatively, you may download an official `llamafile` executable from |
|
Mozilla Ocho on GitHub, in which case you can use the Mixtral llamafiles |
|
as plain weights data files.
|
|
|
``` |
|
llamafile -m Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile ... |
|
``` |
|
|
|
For further information, please see the [llamafile |
|
README](https://github.com/mozilla-ocho/llamafile/). |
|
|
|
Having **trouble?** See the ["Gotchas" |
|
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting) |
|
of the README. |
|
|
|
## Prompting |
|
|
|
Prompt template: |
|
|
|
``` |
|
[INST] {{prompt}} [/INST] |
|
``` |
|
|
|
Command template: |
|
|
|
``` |
|
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]{{prompt}}[/INST]" |
|
``` |
|
|
|
## About llamafile |
|
|
|
llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. |
|
It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp |
|
binaries that run on the stock installs of six OSes for both ARM64 and |
|
AMD64. |
|
|
|
In addition to being executables, llamafiles are also zip archives. Each |
|
llamafile contains a GGUF file, which you can extract using the `unzip` |
|
command. If you want to change or add files to your llamafiles, then the |
|
`zipalign` command (distributed in the llamafile GitHub repository) should be used
|
instead of the traditional `zip` command. |
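
For example, the following commands list a llamafile's contents and extract
its GGUF weights (the exact filename inside the archive may differ):

```
unzip -l Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile        # list the zip contents
unzip Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile '*.gguf'  # extract the GGUF weights
```

(`zipalign` keeps the stored weights page-aligned so they can still be
memory mapped after you modify the archive.)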
|
|
|
## About Upload Limits |
|
|
|
Files which exceed the Hugging Face 50GB upload limit have a .cat𝑋 |
|
extension. Use the `cat` command locally to join the pieces back into a
single file, keeping them in numerical order.
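
For example, to reassemble the Q4\_0 llamafile from the two pieces in this
repository after downloading them:

```
cat Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0 \
    Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1 \
    > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
```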
|
|
|
## About Quantization Formats (General Advice) |
|
|
|
Your choice of quantization format depends on three things: |
|
|
|
1. Will it fit in RAM or VRAM? (see the rough estimate after this list)
|
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)? |
|
3. llamafiles bigger than 4.30 GB are hard to run on Windows (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)) |
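
For point 1, a back-of-the-envelope estimate is usually enough. Mixtral
8x22B has roughly 141B total parameters and Q4\_0 stores about 4.5 bits per
weight, so the weights alone come to roughly 80GB before counting the KV
cache and runtime overhead, which is why the Quickstart above assumes at
least 128GB of RAM (figures are approximate):

```
python3 -c 'print(141e9 * 4.5 / 8 / 1e9, "GB")'   # ≈ 79 GB of Q4_0 weights alone
```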
|
|
|
Good quants for writing (prediction speed) are Q5\_K\_M and Q4\_0. Text
generation is bounded by memory speed, so smaller quants help, but they
cause the LLM to hallucinate more. However, that doesn't mean they can't
|
think correctly. A highly degraded quant like `Q2_K` may not make a |
|
great encyclopedia, but it's still capable of logical reasoning and |
|
the emergent capabilities LLMs exhibit. |
|
|
|
Good quants for reading (evaluation speed) are BF16, F16, Q8\_0, and |
|
Q4\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by |
|
flop count, which means performance can be improved through software
engineering alone, e.g. BLAS algorithms. In that case quantization
starts hurting more than it helps, since it competes for CPU resources
and makes it harder for the compiler to parallelize instructions. Ideally,
you want to use the simplest, smallest floating-point format that's
natively implemented by your hardware. In most cases, that's BF16 or
FP16. However, llamafile is still able to offer respectable tinyBLAS
|
speedups for llama.cpp's simplest quants: Q8\_0 and Q4\_0. |
|
|
|
## Hardware Choices (Mixtral 8x22B Specific) |
|
|
|
This model is very large. Even at Q2 quantization, it's still well over
twice the VRAM of the highest-tier NVIDIA gaming GPUs. llamafile supports
splitting a model across multiple GPUs (currently NVIDIA only) if you
have such a system. If you don't, the easiest way to get one is to pay
a few bucks an hour to rent a 4x RTX 4090 rig from vast.ai.
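
As a sketch of what that looks like on such a rig (the flags follow the
llama.cpp conventions that llamafile inherits; run `--help` to confirm what
your llamafile version supports):

```
# offload every layer to the GPUs and split tensors evenly across four cards
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -ngl 999 -ts 1,1,1,1
```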
|
|
|
Mac Studio is a good option for running this model locally. An M2 Ultra |
|
desktop from Apple is affordable and has 128GB of unified RAM+VRAM. If |
|
you have one, then llamafile will use your Metal GPU. Try starting out |
|
with the `Q4_0` quantization level. |
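
If you want to be explicit about GPU offload, the `-ngl` flag sets how many
layers go to the Metal GPU (999 meaning all of them). A minimal invocation
looks like this:

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -ngl 999
```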
|
|
|
Another good option for running very large language models locally, fully
under your control, is plain CPU inference. We developed new
tensor multiplication kernels in the llamafile project specifically to
speed up "mixture of experts" LLMs like Mixtral. On an AMD Threadripper
Pro 7995WX with 256GB of 5200 MT/s RAM, llamafile v0.8 runs Mixtral
8x22B Q4\_0 on Linux at 98 tokens per second for prompt evaluation and
predicts 9.44 tokens per second.
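
To try something similar on your own machine, CPU inference needs no special
flags; the main knob worth tuning is the thread count, which you can pin
with `-t` (the value here is illustrative):

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -t 96 -p '[INST]Write a haiku about RAM.[/INST]'
```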
|
|
|
--- |
|
|
|
# Model Card for Mixtral-8x22B-Instruct-v0.1 |
|
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1). |
|
|
|
## Run the model |
|
```python |
|
import torch
from transformers import AutoModelForCausalLM
|
from mistral_common.protocol.instruct.messages import ( |
|
AssistantMessage, |
|
UserMessage, |
|
) |
|
from mistral_common.protocol.instruct.tool_calls import ( |
|
Tool, |
|
Function, |
|
) |
|
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer |
|
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest |
|
|
|
device = "cuda" # the device to load the model onto |
|
|
|
tokenizer_v3 = MistralTokenizer.v3() |
|
|
|
mistral_query = ChatCompletionRequest( |
|
tools=[ |
|
Tool( |
|
function=Function( |
|
name="get_current_weather", |
|
description="Get the current weather", |
|
parameters={ |
|
"type": "object", |
|
"properties": { |
|
"location": { |
|
"type": "string", |
|
"description": "The city and state, e.g. San Francisco, CA", |
|
}, |
|
"format": { |
|
"type": "string", |
|
"enum": ["celsius", "fahrenheit"], |
|
"description": "The temperature unit to use. Infer this from the users location.", |
|
}, |
|
}, |
|
"required": ["location", "format"], |
|
}, |
|
) |
|
) |
|
], |
|
messages=[ |
|
UserMessage(content="What's the weather like today in Paris"), |
|
], |
|
model="test", |
|
) |
|
|
|
encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
# .tokens is a plain Python list of token ids, so wrap it in a batched tensor first
model_inputs = torch.tensor([encodeds]).to(device)
model.to(device)
|
|
|
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True) |
|
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer |
|
decoded = sp_tokenizer.decode(generated_ids[0].tolist())  # decode expects a plain list of ids
|
print(decoded) |
|
``` |
|
|
|
# Instruct tokenizer |
|
The HuggingFace tokenizer included in this release should match our own. To compare: |
|
`pip install mistral-common` |
|
|
|
```py |
|
from mistral_common.protocol.instruct.messages import ( |
|
AssistantMessage, |
|
UserMessage, |
|
) |
|
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer |
|
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest |
|
|
|
from transformers import AutoTokenizer |
|
|
|
tokenizer_v3 = MistralTokenizer.v3() |
|
|
|
mistral_query = ChatCompletionRequest( |
|
messages=[ |
|
UserMessage(content="How many experts ?"), |
|
AssistantMessage(content="8"), |
|
UserMessage(content="How big ?"), |
|
AssistantMessage(content="22B"), |
|
UserMessage(content="Noice 🎉 !"), |
|
], |
|
model="test", |
|
) |
|
hf_messages = mistral_query.model_dump()['messages'] |
|
|
|
tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens |
|
|
|
tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1') |
|
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True) |
|
|
|
assert tokenized_hf == tokenized_mistral |
|
``` |
|
|
|
# Function calling and special tokens |
|
This tokenizer includes additional special tokens related to function calling:
|
- [TOOL_CALLS] |
|
- [AVAILABLE_TOOLS] |
|
- [/AVAILABLE_TOOLS] |
|
- [TOOL_RESULTS] |
|
- [/TOOL_RESULTS] |
|
|
|
If you want to use this model with function calling, please be sure to apply it similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299). |
|
|
|
# The Mistral AI Team |
|
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux, |
|
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault, |
|
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot, |
|
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, |
|
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, |
|
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon, |
|
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, |
|
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen, |
|
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, |
|
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang, |
|
Valera Nemychnikova, William El Sayed, William Marshall |
|
|