Execution and prompting
Hi,
I tried to run this with llama.cpp and it seems to generate garbage:
./main -ngl 35 -m gemma-7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"
I also tried this with llama-cpp-python and got the same garbage output:
from llama_cpp import Llama
# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
model_path="./gemma-7b.Q4_K_M.gguf", # Download the model file first
n_ctx=32768, # The max sequence length to use - note that longer sequence lengths require much more resources
n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
n_gpu_layers=35 # The number of layers to offload to GPU, if you have GPU acceleration available
)
# Simple inference example
output = llm(
"""<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant""", # Prompt (triple-quoted so the multi-line string is valid Python)
max_tokens=512, # Generate up to 512 tokens
stop=[""], # Example stop token - not necessarily correct for this specific model! Please check before using.
echo=True # Whether to echo the prompt
)
Could it be that the prompt format is different?
Thanks
This is a base model: it has not been instruction-tuned and has no chat template, so it will not follow a ChatML-style prompt. It should be used only for fine-tuning. I think this is what you want to try for instruct: https://huggingface.co/MaziyarPanahi/gemma-7b-it-GGUF
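For the instruct model, the prompt would also need to change from the ChatML tags above to Gemma's own turn markers. Here is a minimal sketch of that format. The helper name build_gemma_prompt is hypothetical, and folding the system message into the user turn is an assumption on my part, since Gemma's template has no separate system role.

```python
def build_gemma_prompt(user_message: str, system_message: str = "") -> str:
    """Format a single-turn prompt using Gemma's instruct turn markers."""
    # Assumption: Gemma has no system role, so any system text is simply
    # prepended to the first user turn.
    if system_message:
        content = f"{system_message}\n\n{user_message}"
    else:
        content = user_message
    # The BOS token is left out on purpose; llama.cpp normally adds it
    # when it tokenizes the prompt.
    return (
        f"<start_of_turn>user\n{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


if __name__ == "__main__":
    print(build_gemma_prompt("Why is the sky blue?"))
```

The resulting string can be passed as the prompt to llm(...), probably with stop=["<end_of_turn>"] so generation ends when the model finishes its turn.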
Thanks, will try
@MaziyarPanahi
I tried https://huggingface.co/MaziyarPanahi/gemma-7b-it-GGUF with llama.cpp and got:
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5
ggml_metal_graph_compute: command buffer 0 failed with status 5