Update README.md

README.md

call [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This gives
you the easiest and fastest way to use the model on Linux, MacOS,
Windows, FreeBSD, OpenBSD, and NetBSD systems you control, on both
AMD64 and ARM64.

*Software last updated: 2024-11-01*

## Quickstart

To get started, you need both the LLaMA weights and the llamafile
software. Both are included in a single file, which can be downloaded
and run as follows:

```
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

The default mode of operation for these llamafiles is our new command
line chatbot interface.

![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

## Usage

By default, llamafile launches a chatbot in the terminal, and a server
in the background. The chatbot is mostly self-explanatory. You can type
`/help` for further details. See the [llamafile v0.8.15 release
notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
for documentation on our newest chatbot features.

To instruct the model to do role playing, you can customize the system
prompt as follows:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
```

To view the man page, run:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --help
```

To send a request to the OpenAI API compatible llamafile server, try:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0
  }'
```
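
Since the server speaks the standard OpenAI chat completions schema,
you can post-process the JSON response with ordinary shell tools. A
minimal sketch, assuming you have `jq` installed (it is not bundled
with llamafile) and that the response follows the usual
`choices[0].message.content` layout:

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0
  }' | jq -r '.choices[0].message.content'   # print only the reply text
```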

If you don't want the chatbot and you only want to run the server:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
```
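
To verify the server is up before sending requests, one option is the
health endpoint inherited from the llama.cpp server (an assumption
about this particular build, so treat this as a sketch):

```
curl -i http://localhost:8080/health   # expect HTTP 200 once the model is loaded
```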

An advanced CLI mode is provided that's useful for shell scripting. You
can use it by passing the `--cli` flag. For additional help on how it
may be used, pass the `--help` flag.

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
```

You then need to fill out the prompt / history template (see below).
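
As an illustration, here is a minimal sketch of a filled-out prompt in
CLI mode. The special tokens follow Meta's published Llama 3.1 chat
format; the exact invocation is an assumption for illustration, not
official llamafile documentation:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli --log-disable \
  -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Say this is a test!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
```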

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## Troubleshooting

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

On Linux, the way to avoid run-detector errors is to install the APE
interpreter:

```sh
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
```

On Windows there's a 4GB limit on executable sizes. This means you
should download the Q2\_K llamafile. For better quality, consider
instead downloading the official llamafile release binary from
<https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
have the .exe file extension, and then saying:

```
.\llamafile-0.8.16.exe -m Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

That will overcome the Windows 4GB file size limit, allowing you to
benefit from bigger, better models.

## Context Window

This model has a max context window size of 128k tokens. By default, a
context window size of 8192 tokens is used. You may set the context
window size by passing the `-c N` flag.
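
For example, to use the full 128k-token window (memory usage grows with
the window size):

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -c 131072
```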

## GPU Acceleration

On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
driver needs to be installed if you own an NVIDIA GPU. On Windows, if
you have an AMD GPU, you should install the ROCm SDK v6.1 and then pass
the flags `--recompile --gpu amd` the first time you run your llamafile.
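
For instance, combining the flags described above (a sketch; adjust to
your hardware):

```
# offload all layers to an NVIDIA or AMD GPU
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -ngl 999

# first run on Windows with an AMD GPU (requires the ROCm SDK)
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -ngl 999 --recompile --gpu amd
```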

On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
perform matrix multiplications. This is open source software, but it
doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
installed on your system, then you can pass the `--recompile` flag to
build a GGML CUDA library just for your system that uses cuBLAS. This
ensures you get maximum performance.

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## About llamafile

llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

## About Quantization Formats

This model works well with any quantization format. Q6\_K is the best
choice overall here.

## License

The llamafile software is open source and permissively licensed. However,
the weights embedded inside the llamafiles are governed by the Meta
Llama 3.1 Community License Agreement and Acceptable Use Policy. See the
[LICENSE](LICENSE) file for further details.

---

## Model Information