Google's Gemma-2b-it GGUF

These files are GGUF format model files for Google's Gemma-2b-it.

GGUF files are for CPU + GPU inference using llama.cpp and the libraries and UIs that support this format.

How to run in llama.cpp

I use the following command line; adjust it to your tastes and needs:

./main -t 2 -ngl 18 -m gemma-2b-it.q8_0.gguf -p '<start_of_turn>user\nWhat is love?\n<end_of_turn>\n<start_of_turn>model\n' --no-penalize-nl -e --color --temp 0.95 -c 1024 -n 512 --repeat_penalty 1.2 --top_p 0.95 --top_k 50

Change -t 2 to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use -t 8.
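If you are not sure how many physical cores you have, the following is a rough sketch for Linux. It assumes lscpu (from util-linux) is installed and counts unique core/socket pairs; when lscpu is missing it falls back to the logical CPU count, which may overcount on systems with hyper-threading. The function name physical_cores is made up for this example and is not part of llama.cpp.

```shell
# Hypothetical helper: print the number of physical CPU cores.
physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
        # Each unique "core,socket" pair is one physical core.
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ') || n=""
    fi
    if [ -z "$n" ] || [ "$n" -eq 0 ]; then
        # Fallback: logical CPUs (may be higher than the physical core count).
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
    fi
    echo "$n"
}

# Print the thread flag to pass to ./main.
echo "-t $(physical_cores)"
```

You can then write -t "$(physical_cores)" in place of -t 2 in the command above.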

Change -ngl 18 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
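For scripts that run on machines with and without a GPU, one option is to add -ngl only when a GPU is detected. This is a minimal sketch under two assumptions that are not from the original instructions: that a working nvidia-smi is a good enough sign of GPU acceleration (NVIDIA only), and that your llama.cpp build was compiled with GPU support. The function name gpu_args is hypothetical.

```shell
# Hypothetical helper: print "-ngl <layers>" if an NVIDIA GPU appears usable,
# otherwise print nothing so the flag is simply left out.
gpu_args() {
    layers=${1:-18}   # default to the 18 layers used in the example command
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "-ngl $layers"
    fi
}

# Show which GPU flags would be used on this machine (may be empty).
echo "GPU flags: $(gpu_args 18)"
```

In the command above, replace -ngl 18 with an unquoted $(gpu_args 18) so that the flag and its value are passed as two separate arguments, or dropped entirely on CPU-only systems.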

If you want a chat-style conversation, replace the -p <PROMPT> argument with -i -ins. You can also use --interactive-first to start in interactive mode:

./main -t 2 -ngl 18 -m gemma-2b-it.q8_0.gguf --in-prefix '<start_of_turn>user\n' --in-suffix '<end_of_turn>\n<start_of_turn>model\n' -i -ins --no-penalize-nl -e --color --temp 0.95 -c 1024 -n 512 --repeat_penalty 1.2 --top_p 0.95 --top_k 50
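For scripted one-shot runs, the turn markers from the -p example can be generated by a small helper instead of being typed by hand. The sketch below reproduces exactly the prompt layout used in that example, with literal \n sequences that the -e flag later turns into newlines. The function name gemma_prompt is made up for illustration.

```shell
# Hypothetical helper: wrap one user message in the turn markers
# used by the -p example. The \n sequences stay literal on purpose,
# because ./main is run with -e to process escapes.
gemma_prompt() {
    printf '<start_of_turn>user\\n%s\\n<end_of_turn>\\n<start_of_turn>model\\n' "$1"
}

# prints: <start_of_turn>user\nWhat is love?\n<end_of_turn>\n<start_of_turn>model\n
gemma_prompt 'What is love?'
echo
```

A one-shot call would then look like: ./main -t 2 -m gemma-2b-it.q8_0.gguf -e -p "$(gemma_prompt 'What is love?')"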

Compatibility

I have uploaded both the original llama.cpp quant methods (q4_0, q4_1, q5_0, q5_1, q8_0) and the k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K).

Please refer to llama.cpp and TheBloke's GGUF models for further explanation.

How to run in text-generation-webui

Further instructions here: text-generation-webui/docs/llama.cpp-models.md.

Thanks

Thanks to Google for providing checkpoints of the model.

Thanks to Georgi Gerganov and all of the awesome people in the AI community.

GGUF

Model size: 2.51B params
Architecture: gemma
Quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
