Update README.md

README.md

call [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This gives
you the easiest and fastest way to use the model on Linux, MacOS,
Windows, FreeBSD, OpenBSD, and NetBSD systems you control, on both
AMD64 and ARM64.

*Software last updated: 2024-11-01*

## Quickstart

To get started, you need both the LLaMA weights and the llamafile
software. Both are included in a single file, which can be downloaded
and run as follows:

```
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

The default mode of operation for these llamafiles is our new command
line chatbot interface.

![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

## Usage

By default, llamafile launches a chatbot in the terminal, and a server
in the background. The chatbot is mostly self-explanatory. You can type
`/help` for further details. See the [llamafile v0.8.15 release
notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
for documentation on our newest chatbot features.

To instruct the model to do role playing, you can customize the system
prompt as follows:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
```

To view the man page, run:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --help
```

To send a request to the OpenAI API compatible llamafile server, try:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0
  }'
```
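
Since the server speaks the standard OpenAI chat completions schema,
you can post-process the JSON response with ordinary shell tools. A
minimal sketch, assuming you have `jq` installed (it is not bundled
with llamafile) and that the response follows the usual
`choices[0].message.content` layout:

```
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0
  }' | jq -r '.choices[0].message.content'   # print only the reply text
```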

If you don't want the chatbot and you only want to run the server:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
```
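
To verify the server is up before sending requests, one option is the
health endpoint inherited from the llama.cpp server (an assumption
about this particular build, so treat this as a sketch):

```
curl -i http://localhost:8080/health   # expect HTTP 200 once the model is loaded
```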

An advanced CLI mode is provided that's useful for shell scripting. You
can use it by passing the `--cli` flag. For additional help on how it
may be used, pass the `--help` flag.

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
```

You then need to fill out the prompt / history template (see below).
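
As an illustration, here is a minimal sketch of a filled-out prompt in
CLI mode. The special tokens follow Meta's published Llama 3.1 chat
format; the exact invocation is an assumption for illustration, not
official llamafile documentation:

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli --log-disable \
  -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Say this is a test!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
```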

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## Troubleshooting

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

On Linux, the way to avoid run-detector errors is to install the APE
interpreter:

```sh
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
```

On Windows there's a 4GB limit on executable sizes. This means you
should download the Q2\_K llamafile. For better quality, consider
instead downloading the official llamafile release binary from
<https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
have the .exe file extension, and then saying:

```
.\llamafile-0.8.16.exe -m Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

That will overcome the Windows 4GB file size limit, allowing you to
benefit from bigger, better models.

## Context Window

This model has a max context window size of 128k tokens. By default, a
context window size of 8192 tokens is used. You may set the context
window size by passing the `-c N` flag.
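
For example, to use the full 128k-token window (memory usage grows with
the window size):

```
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -c 131072
```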

## GPU Acceleration

On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
driver needs to be installed if you own an NVIDIA GPU. On Windows, if
you have an AMD GPU, you should install the ROCm SDK v6.1 and then pass
the flags `--recompile --gpu amd` the first time you run your llamafile.
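
For instance, combining the flags described above (a sketch; adjust to
your hardware):

```
# offload all layers to an NVIDIA or AMD GPU
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -ngl 999

# first run on Windows with an AMD GPU (requires the ROCm SDK)
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile -ngl 999 --recompile --gpu amd
```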

On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
perform matrix multiplications. This is open source software, but it
doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
installed on your system, then you can pass the `--recompile` flag to
build a GGML CUDA library just for your system that uses cuBLAS. This
ensures you get maximum performance.

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## About llamafile

llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

## About Quantization Formats

This model works well with any quantization format. Q6\_K is the best
choice overall here.

## License

The llamafile software is open source and permissively licensed. However,
the weights embedded inside the llamafiles are governed by the Meta
Llama 3.1 Community License Agreement and Acceptable Use Policy. See the
[LICENSE](LICENSE) file for further details.

---

## Model Information