---
license: gemma
---

Gemma-2-9B-CPT-SahabatAI-Instruct GGUF

This is a GGUF-quantized version of Gemma 2 9B, fine-tuned with custom instructions by SahabatAI and optimized for CPU inference via Q4_K_M quantization.

Model Details

  • Base Model: Gemma 2 9B
  • Instruction Format: SahabatAI Instruct v1
  • Quantization: GGUF Q4_K_M (4-bit k-quant, medium variant)
  • Original Size: 9B parameters
  • Quantized Size: ~5GB
  • Context Length: 8192 tokens
  • License: Gemma Terms of Use

Description

This model is a quantized version of Gemma 2 9B, fine-tuned with a custom instruction format by SahabatAI. The Q4_K_M quantization offers a good balance of model size, speed, and quality. The instruction format is optimized for general-purpose tasks while maintaining coherence and reliability.

Usage

oobabooga's text-generation-webui Setup

  1. Install text-generation-webui:
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
  2. Download Model:
mkdir models
cd models
# Download gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf from Hugging Face
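# For example, with the huggingface-cli tool (the repo id here is a placeholder — substitute this repository's actual id):
huggingface-cli download <repo-id> gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf --local-dir .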
  3. Launch the Web UI:
python server.py --model gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf

Recommended Launch Parameters

For optimal performance on different hardware:

CPU Only:

python server.py --model gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf --cpu --n_ctx 8192

GPU (CUDA):

python server.py --model gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf --n_ctx 8192 --n-gpu-layers 35

GGUF models load through the llama.cpp backend, so GPU offload is controlled with --n-gpu-layers (the number of transformer layers placed on the GPU) rather than --gpu-memory; lower the count if you run out of VRAM.

Recommended Generation Parameters

temperature: 0.7
top_p: 0.9
top_k: 40
repetition_penalty: 1.1
max_new_tokens: 2048
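
These settings map directly onto the llama-cpp-python bindings. A minimal sketch, assuming llama-cpp-python is installed and the GGUF file is in models/ (the file path and stop sequence are assumptions, not specified by this card):

from llama_cpp import Llama

# Load the quantized model (path assumed; point this at your downloaded file)
llm = Llama(model_path="models/gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf", n_ctx=8192)

# Prompt in the instruction format described below
prompt = (
    "<|system|>You are a helpful AI assistant.</|system|>\n\n"
    "<|user|>What is the capital of Indonesia?</|user|>\n\n"
    "<|assistant|>"
)

# Sampling settings from the recommendations above (max_new_tokens maps to max_tokens)
output = llm(
    prompt,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repeat_penalty=1.1,
    max_tokens=2048,
    stop=["<|user|>"],  # assumed stop sequence; the card does not specify one
)
print(output["choices"][0]["text"])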

Instruction Format

The model responds best to this instruction format:

<|system|>You are a helpful AI assistant.</|system|>

<|user|>Your question here</|user|>

<|assistant|>
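
Programmatically, prompts in this format can be assembled with a small helper (a sketch; build_prompt is a hypothetical name, and the blank lines between turns follow the examples on this card):

def build_prompt(user_message, system_message="You are a helpful AI assistant."):
    """Assemble a single-turn prompt in the SahabatAI instruct format."""
    return (
        f"<|system|>{system_message}</|system|>\n\n"
        f"<|user|>{user_message}</|user|>\n\n"
        "<|assistant|>"
    )

print(build_prompt("What is the capital of Indonesia?"))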

Performance Benchmarks

Device                  Tokens/sec   Memory Usage
CPU (8 cores)           ~15 t/s      6GB
NVIDIA RTX 3060 (6GB)   ~40 t/s      5GB
NVIDIA RTX 4090         ~100 t/s     5GB

Example Outputs

<|system|>You are a helpful AI assistant.</|system|>

<|user|>What is the capital of Indonesia?</|user|>

<|assistant|>Jakarta is the capital city of Indonesia. It is located on the northwestern coast of Java, the most populous island in Indonesia. Jakarta serves as the country's economic, cultural, and political center.

<|user|>Write a simple Python function to calculate factorial.</|user|>

<|assistant|>Here's a simple recursive function to calculate factorial:

def factorial(n):
    if n == 0 or n == 1:
        return 1
    return n * factorial(n-1)

Known Limitations

  • Requires a minimum of 6GB RAM for CPU inference
  • Best performance with a GPU that has 6GB+ VRAM
  • May show degraded performance on very long contexts (>4096 tokens)
  • Quantization may impact some mathematical and logical reasoning tasks

Fine-tuning Details

  • Base Model: Gemma 2 9B
  • Instruction Format: Custom SahabatAI format
  • Quantization: Q4_K_M using llama.cpp
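
The exact conversion commands are not part of this card; a typical llama.cpp quantization flow looks roughly like the following (script and binary names as in recent llama.cpp checkouts, paths hypothetical):

# Convert the HF checkpoint to a full-precision GGUF, then quantize to Q4_K_M
python convert_hf_to_gguf.py /path/to/gemma2-9b-cpt-sahabatai-instruct --outfile model-f16.gguf
./llama-quantize model-f16.gguf gemma2-9B-cpt-sahabatai-instruct-v1-Q4_K_M.gguf Q4_K_M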

License

This model is subject to the Gemma Terms of Use. Please refer to Google's Gemma licensing terms for commercial usage.

Acknowledgments

  • Google for the Gemma 2 base model
  • SahabatAI for instruction fine-tuning
  • TheBloke for GGUF conversion tools
  • oobabooga for text-generation-webui

Support

For issues and questions:

  • Open an issue in this repository
  • Visit our Discord: [Your Discord Link]
  • Email: [Your Support Email]

Updates & Versions

  • v1.0 (2024-03): Initial release with Q4_K_M quantization
  • Future updates will be listed here