Update README.md
README.md CHANGED
The model is packaged into executable weights, which we call
llamafiles. This makes it easy to use the model on Linux, MacOS,
Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.

*Software Last Updated: 2024-10-30*

## Quickstart

To get started, you need both the Gemma weights and the llamafile
software. Both are included in a single file, which can be downloaded
and run as follows:

```
wget https://huggingface.co/jartine/gemma-2-2b-it-llamafile/resolve/main/gemma-2-2b-it.Q6_K.llamafile
chmod +x gemma-2-2b-it.Q6_K.llamafile
./gemma-2-2b-it.Q6_K.llamafile
```

The default mode of operation for these llamafiles is our new command
line chatbot interface. It looks like this:

![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

## Usage

By default, llamafile launches a chatbot in the terminal and a server
in the background. The chatbot is mostly self-explanatory. You can type
`/help` for further details. See the [llamafile v0.8.15 release
notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
for documentation on our newest chatbot features.

To instruct Gemma to do role playing, you can customize the system
prompt of the chatbot as follows:

```
./gemma-2-2b-it.Q6_K.llamafile --chat -p "you are the ghost of edgar allan poe"
```

To view the man page, run:

```
./gemma-2-2b-it.Q6_K.llamafile --help
```

To send a request to the OpenAI API compatible llamafile server, try:

```
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "gemma-2b-it",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.0
}'
```
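The same request can be sent from a script. Here is a minimal Python sketch using only the standard library; the endpoint, model name, and payload mirror the curl example above, and it assumes the llamafile server is already running locally:

```python
import json
from urllib import request

# Same chat-completions payload as the curl example above.
payload = {
    "model": "gemma-2b-it",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.0,
}

# Build the request against the local llamafile server.
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running server; the response follows the OpenAI
# chat-completions schema.
# with request.urlopen(req) as resp:
#     body = json.loads(resp.read())
#     print(body["choices"][0]["message"]["content"])
```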
If you don't want the chatbot and you only want to run the server:

```
./gemma-2-2b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
```

An advanced CLI mode is provided that's useful for shell scripting. You
can use it by passing the `--cli` flag. For additional help on how it
may be used, pass the `--help` flag.

```
./gemma-2-2b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
```

You then need to fill out the prompt / history template (see below).
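For a sense of what filling out the template involves: Gemma wraps each turn in `<start_of_turn>`/`<end_of_turn>` tags. The sketch below assembles such a prompt string in Python; it is only an illustration (the helper name is ours, and in chat mode llamafile applies the template for you):

```python
def render_gemma_prompt(history):
    """Render (role, message) pairs into Gemma's turn markup.

    Roles are 'user' or 'model'. A trailing open model turn cues
    the model to generate its reply.
    """
    parts = []
    for role, message in history:
        parts.append(f"<start_of_turn>{role}\n{message}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's reply
    return "".join(parts)

prompt = render_gemma_prompt([("user", "Say this is a test!")])
print(prompt)
```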
For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## Context Window

This model has a max context window size of 8k tokens. By default, a
context window size of 8192 tokens is used. You may limit the context
window size by passing the `-c N` flag.
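The main cost of a large context window is the KV cache, which grows linearly with context length. A rough back-of-the-envelope sketch of that relationship (the layer, head, and dimension figures below are illustrative placeholders, not something stated in this README):

```python
def kv_cache_bytes(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    """Estimate KV-cache size: two tensors (K and V) per layer, each
    holding ctx_tokens * n_kv_heads * head_dim elements."""
    return 2 * n_layers * ctx_tokens * n_kv_heads * head_dim * bytes_per_elt

# Illustrative numbers only (f16 cache, hypothetical architecture):
mib = kv_cache_bytes(8192, n_layers=26, n_kv_heads=4, head_dim=256) / (1024 ** 2)
print(f"~{mib:.0f} MiB")  # → ~832 MiB
```

Halving the context with `-c 4096` would halve this figure, which is why `-c N` is the first knob to try on memory-constrained machines.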
## GPU Acceleration

On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
driver needs to be installed if you own an NVIDIA GPU. On Windows, if
you have an AMD GPU, you should install the ROCm SDK v6.1 and then pass
the flags `--recompile --gpu amd` the first time you run your llamafile.

On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
perform matrix multiplications. This is open source software, but it
doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
installed on your system, then you can pass the `--recompile` flag to
build a GGML CUDA library just for your system that uses cuBLAS. This
ensures you get maximum performance.

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

## About llamafile

llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

The 9B and 27B models were released a month earlier than 2B, so they're
packaged with a slightly older version of the llamafile software.

## License

The llamafile software is open source and permissively licensed.
However, the weights embedded inside the llamafiles are governed by
Google's Gemma License and Gemma Prohibited Use Policy. This is not an
open source license. It's about as restrictive as it gets. There are a
great many things you're not allowed to do with Gemma. The terms of the
license and its list of unacceptable uses can be changed by Google at
any time. Therefore we wouldn't recommend using these llamafiles for
anything other than evaluating the quality of Google's engineering.

See the [LICENSE](LICENSE) file for further details.

---

# Gemma 2 model card