jartine committed
Commit e25cb31
1 Parent(s): b773b2d

Update README.md

Files changed (1): README.md (+115 -24)
README.md CHANGED
@@ -41,10 +41,13 @@ call [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This gives
you the easiest and fastest way to use the model on Linux, MacOS, Windows,
FreeBSD, OpenBSD and NetBSD systems you control on both AMD64 and ARM64.

## Quickstart

- Running the following on a desktop OS will launch a tab in your web
- browser with a chatbot interface.

```
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
@@ -52,51 +55,139 @@ chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

- You then need to fill out the prompt / history template (see below).

- This model has a max context window size of 128k tokens. By default, a
- context window size of 8192 tokens is used. You can use a larger context
- window by passing the `-c 131072` flag.

- On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
- the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
- driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
- or ROCm SDKs may need to be installed, in which case llamafile builds a
- native module just for your system.

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

- ## Prompting

- To have a good working chat experience when using the web GUI, you need
- to fill out the text fields with the following values.

- Prompt template:

```
- <|begin_of_text|><|start_header_id|>system<|end_header_id|>
- {{prompt}}<|eot_id|>{{history}}<|start_header_id|>{{char}}<|end_header_id|>
```

- History template:

- ```
- <|start_header_id|>{{name}}<|end_header_id|>
- {{message}}<|eot_id|>
- ```
 
## About llamafile

- llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
- It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

---

## Model Information
 
you the easiest and fastest way to use the model on Linux, MacOS, Windows,
FreeBSD, OpenBSD and NetBSD systems you control on both AMD64 and ARM64.

+ *Software last updated: 2024-11-01*
+
## Quickstart

+ To get started, you need both the LLaMA weights and the llamafile
+ software. Both of them are included in a single file, which can be
+ downloaded and run as follows:

```
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```
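
If you don't have `wget`, the same file can be fetched with `curl` (an
equivalent sketch; `-L` follows Hugging Face's redirect and `-O` keeps
the remote filename):

```
curl -LO https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```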

+ The default mode of operation for these llamafiles is our new command
+ line chatbot interface.

+ ![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)

+ Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+ of the README.
+
+ ## Usage
+
+ By default, llamafile launches a chatbot in the terminal, and a server
+ in the background. The chatbot is mostly self-explanatory. You can type
+ `/help` for further details. See the [llamafile v0.8.15 release
+ notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+ for documentation on our newest chatbot features.
+
+ To instruct Llama to do role playing, you can customize the system
+ prompt as follows:
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
+ ```
+
+ To view the man page, run:
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --help
+ ```
+
+ To send a request to the OpenAI API compatible llamafile server, try:
+
+ ```
+ curl http://localhost:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "llama",
+ "messages": [{"role": "user", "content": "Say this is a test!"}],
+ "temperature": 0.0
+ }'
+ ```
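+
+ The response is standard OpenAI-style JSON. As a convenience (a sketch
+ assuming `jq` is installed), you can extract just the reply text:
+
+ ```
+ curl -s http://localhost:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{"model": "llama", "messages": [{"role": "user", "content": "Say this is a test!"}]}' \
+ | jq -r '.choices[0].message.content'
+ ```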
+
+ If you don't want the chatbot and you only want to run the server:
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
+ ```
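+
+ Since `--host 0.0.0.0` binds every network interface, other machines on
+ your network can reach the server as well (a sketch; 192.0.2.10 is a
+ placeholder for the host's real address):
+
+ ```
+ curl http://192.0.2.10:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{"model": "llama", "messages": [{"role": "user", "content": "hello"}]}'
+ ```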
+
+ An advanced CLI mode is provided that's useful for shell scripting. You
+ can use it by passing the `--cli` flag. For additional help on how it
+ may be used, pass the `--help` flag.
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
+ ```
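+
+ Since `--cli` mode writes the completion to stdout, it composes with
+ ordinary shell plumbing. A hypothetical sketch (notes.txt stands in for
+ your own input file):
+
+ ```
+ summary=$(./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --cli --log-disable \
+ -p "Summarize the following notes: $(cat notes.txt)")
+ echo "$summary"
+ ```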
+
+ Note that in `--cli` mode the prompt is passed to the model verbatim,
+ so you need to fill out the model's prompt / history template yourself.

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

+ ## Troubleshooting
+
Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

+ On Linux, the way to avoid run-detector errors is to install the APE
+ interpreter.

+ ```sh
+ # fetch the APE loader built for this machine's architecture
+ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+ sudo chmod +x /usr/bin/ape
+ # register the loader with binfmt_misc so the kernel hands APE
+ # executables (magic MZqFpD / jartsr) to /usr/bin/ape
+ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ ```

+ On Windows there's a 4GB limit on executable sizes. This means you
+ should download the Q2\_K llamafile. For better quality, consider
+ instead downloading the official llamafile release binary from
+ <https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+ have the .exe file extension, and then saying:

```
+ .\llamafile-0.8.16.exe -m Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
```

+ That will overcome the Windows 4GB file size limit, allowing you to
+ benefit from bigger, better models.

+ ## Context Window
+
+ This model has a max context window size of 128k tokens. By default, a
+ context window size of 8192 tokens is used. You can set the context
+ window size by passing the `-c N` flag.
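+
+ For example, to use the full 128k-token context window (at the cost of
+ much higher memory use):
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -c 131072
+ ```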
+
+ ## GPU Acceleration
+
+ On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+ the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
+ driver needs to be installed if you own an NVIDIA GPU. On Windows, if
+ you have an AMD GPU, you should install the ROCm SDK v6.1 and then pass
+ the flags `--recompile --gpu amd` the first time you run your llamafile.
+
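+ For example (a sketch combining only the flags documented here;
+ `-ngl 999` offloads as many layers as possible to the GPU):
+
+ ```
+ # NVIDIA or AMD GPU with enough VRAM
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -ngl 999
+
+ # first run on Windows with an AMD GPU (requires ROCm SDK v6.1)
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -ngl 999 --recompile --gpu amd
+ ```
+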
+ On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+ perform matrix multiplications. This is open source software, but it
+ doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+ installed on your system, then you can pass the `--recompile` flag to
+ build a GGML CUDA library just for your system that uses cuBLAS. This
+ ensures you get maximum performance.
+
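+ A sketch of that one-time rebuild (assumes the CUDA SDK is installed
+ and on your PATH):
+
+ ```
+ ./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile --chat -ngl 999 --recompile
+ ```
+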
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).

## About llamafile

+ llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+ uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

+ ## About Quantization Formats
+
+ This model works well with any quantization format. Q6\_K is the best
+ choice overall here.
+
+ ## License
+
+ The llamafile software is open source and permissively licensed. However,
+ the weights embedded inside the llamafiles are governed by the Meta
+ LLaMA 3.1 Community License Agreement and Acceptable Use Policy. See the
+ [LICENSE](LICENSE) file for further details.
+
---

## Model Information