Glitch with Gemma model

by alexcardo

I reported this on the official GitHub page and I'm duplicating it here. So far, no one has managed to get any Gemma quant working correctly. I've tried both the step-by-step guide and a model I downloaded myself:

alex@M1 ~ % bash <(curl -sSfL 'https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/run-llm.sh')

[I] This is a helper script for deploying LlamaEdge API Server on this machine.

The following tasks will be done:
- Download GGUF model
- Install WasmEdge Runtime and the wasi-nn_ggml plugin
- Download LlamaEdge API Server

Upon the tasks done, an HTTP server will be started and it will serve the selected
model.

Please note:

- All downloaded files will be stored in the current folder
- The server will be listening on all network interfaces
- The server will run with default settings which are not always optimal
- Do not judge the quality of a model based on the results from this script
- This script is only for demonstration purposes

During the whole process, you can press Ctrl-C to abort the current process at any time.

Press Enter to continue ...

[+] Installing WasmEdge ...

1) Install the latest version of WasmEdge and wasi-nn_ggml plugin (recommended)
2) Keep the current version

[+] Select a number from the list above: 1
Using Python: /opt/homebrew/bin/python3
INFO - CUDA is only supported on Linux
INFO - CUDA is only supported on Linux
WARNING - Experimental Option Selected: plugins
WARNING - plugins option may change later
INFO - Compatible with current configuration
INFO - Running Uninstaller
WARNING - SHELL variable not found. Using zsh as SHELL
INFO - shell configuration updated
INFO - Downloading WasmEdge
|============================================================| 100.00 %
INFO - Downloaded
INFO - Installing WasmEdge
INFO - WasmEdge Successfully installed
INFO - Downloading Plugin: wasi_nn-ggml
|============================================================| 100.00 %
INFO - Downloaded
INFO - Run:
source /Users/alex/.zshenv

The WasmEdge Runtime is installed in /Users/alex/.wasmedge/bin/wasmedge.
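
At this point the install can be sanity-checked before going further. A minimal check, assuming the script's default install locations:

```bash
# Pick up the environment the installer just wrote
source /Users/alex/.zshenv

# Confirm the runtime is on PATH
wasmedge --version

# The wasi_nn-ggml plugin library should show up in the default plugin dir
ls ~/.wasmedge/plugin
```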

[+] The most popular models at https://huggingface.co/second-state:

 1) Gemma-7b-it-GGUF
 2) Gemma-2b-it-GGUF
 3) Llama-2-7B-Chat-GGUF
 4) stablelm-2-zephyr-1.6b-GGUF
 5) OpenChat-3.5-0106-GGUF
 6) Yi-34B-Chat-GGUF
 7) Yi-34Bx2-MoE-60B-GGUF
 8) Deepseek-LLM-7B-Chat-GGUF
 9) Deepseek-Coder-6.7B-Instruct-GGUF
10) Mistral-7B-Instruct-v0.2-GGUF
11) dolphin-2.6-mistral-7B-GGUF
12) Orca-2-13B-GGUF
13) TinyLlama-1.1B-Chat-v1.0-GGUF
14) SOLAR-10.7B-Instruct-v1.0-GGUF

Or choose one from: https://huggingface.co/models?sort=trending&search=gguf

[+] Please select a number from the list above or enter an URL: 1
[+] Downloading the selected model from https://huggingface.co/second-state/Gemma-7b-it-GGUF/resolve/main/gemma-7b-it-Q5_K_M.gguf
######################################################################### 100.0%
[+] Extracting prompt type: gemma-instruct
[+] No reverse prompt required
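
For context, the gemma-instruct prompt type corresponds to Gemma's chat format, which wraps each turn in turn markers. Roughly like this (shown as a shell heredoc purely for illustration; llama-chat.wasm builds this string itself):

```bash
cat <<'EOF'
<start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
EOF
```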
[+] Running mode:

 1) API Server with Chatbot web app
 2) CLI Chat

[+] Select a number from the list above: 2
[+] Selected running mode: 2 (CLI Chat)
[+] You already have llama-chat.wasm. Download the latest llama-chat.wasm? (y/n): y
[+] Downloading the latest llama-chat.wasm ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2295k  100 2295k    0     0  1732k      0  0:00:01  0:00:01 --:--:-- 9793k

[+] Will run the following command to start CLI Chat:

wasmedge --dir .:. --nn-preload default:GGML:AUTO:gemma-7b-it-Q5_K_M.gguf llama-chat.wasm --prompt-template gemma-instruct

[+] Confirm to start CLI Chat? (y/n): y

********************* LlamaEdge *********************

[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: GemmaInstruct
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
[INFO] Plugin version: b2230 (commit 89febfed)

================================== Running in interactive mode. ===================================

- Press [Ctrl+C] to interject at any time.
- Press [Return] to end the input.
- For multi-line inputs, end each line with '\' and press [Return] to get another line.

[You]:
Write me a short description of the Adidas brand

[Bot]:
^C
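
For reference, this is the launch command from above with each flag annotated (flag meanings per the WasmEdge CLI; in the manual retries below, -p is the short form of --prompt-template, and -c evidently sets the prompt context size, as the INFO lines confirm):

```bash
# --dir .:.            preopen the current directory inside the wasm sandbox
# --nn-preload         alias:backend:device:model-file (AUTO picks CPU or GPU)
# --prompt-template    format turns with Gemma's chat markers
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:gemma-7b-it-Q5_K_M.gguf \
  llama-chat.wasm \
  --prompt-template gemma-instruct
```

The retries below switch to a separately downloaded Q4_K_M file under /ai/ instead of the Q5_K_M the script fetched: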
alex@M1 ~ % wasmedge --dir .:. --nn-preload default:GGML:AUTO:/ai/gemma-7b-it-Q4_K_M.gguf llama-chat.wasm -p gemma-instruct -c 512
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: GemmaInstruct
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
[2024-02-25 13:10:22.030] [error] [WASI-NN] GGML backend: Error: unable to init model.
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Caller module passed an invalid argument"
alex@M1 ~ % wasmedge --dir .:. --nn-preload default:GGML:AUTO:/ai/gemma-7b-it-Q4_K_M.gguf llama-chat.wasm -p gemma-instruct -c 512
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: GemmaInstruct
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
[2024-02-25 13:10:57.677] [error] [WASI-NN] GGML backend: Error: unable to init model.
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Caller module passed an invalid argument"

And even when the model does load (the first run above), the response isn't displayed correctly:

[Bot]:
(attached screenshot bot.png: the response comes out garbled)
