| Filename | Quant type | File Size | ELIZA-Tasks-100 | Nvidia 3090 | Description |
| -------- | ---------- | --------- | --------------- | ----------- | ----------- |
| [gemma-2-9b-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.f16.gguf) | f16 | 18.5GB | 3.75 | 31.9t/s | Full F16 weights. |
| [gemma-2-9b-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q8_0.gguf) | Q8_0 | 9.83GB | 3.66 | 56.1t/s | Extremely high quality, *recommended for edge devices with 16GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0.gguf) | Q4_0 | 5.44GB | 3.76 | 80.6t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.74 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.64 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.72 | 0.72t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0.gguf) | Q4_0 | 5.44GB | 3.64 | 65.1t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.64 | 0.57t/s | Good quality, but the imatrix version seems better. |
| [gemma-2-9b-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.68 | 0.61t/s | Good quality, but the imatrix version seems better. |
| [gemma-2-9b-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.63 | 0.76t/s | Good quality, but the imatrix version seems better. |

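
These files can be run directly with llama.cpp. As a quick smoke test, a minimal sketch that assumes llama.cpp is already compiled (see the sections below) and uses the imatrix Q4_0 file from the table as an example:

```
# Generate a short completion from one of the downloaded quants
./llama-cli -m gemma-2-9b-it-imatrix.Q4_0.gguf -p "こんにちは、自己紹介してください。" -n 128
```
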
## How to check i8mm and sve support for ARM devices

ARM i8mm support is necessary to take advantage of Q4_0_4_8 gguf.

ARM sve support is necessary to take advantage of Q4_0_8_8 gguf. sve is an optional feature that starts from ARMv8.2-A, but the majority of ARM chips don't implement it.

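
One generic way to check on a Linux-based ARM device (including Android via `adb shell` or Termux) is to look for the `i8mm` and `sve` flags in `/proc/cpuinfo`; this is a general sketch, not specific to this model:

```
# Prints i8mm and/or sve if the CPU advertises these features
grep -o -E 'i8mm|sve' /proc/cpuinfo | sort -u
```

If nothing is printed, the CPU reports neither feature.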

For ARM devices without both, it is recommended to use Q4_0_4_4. In practice, however, Q4_0 can perform better on some phones, so it is best to try both and see which one is faster.

With i8mm and sve supported, inference speed should be faster in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.

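
To compare quants on a particular device, llama.cpp's `llama-bench` gives a quick read on tokens per second; a minimal sketch, assuming the binary is built and the files are downloaded:

```
# Benchmark each quant separately and compare the reported t/s
./llama-bench -m gemma-2-9b-it.Q4_0.gguf
./llama-bench -m gemma-2-9b-it.Q4_0_4_4.gguf
```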

According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding an imatrix to low-bit quants can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created imatrix versions of the different Q4_0 quants.

However, based on my benchmarking results, it seems that imatrix does improve the performance of a non-Japanese-optimized model, but doesn't do much for a Japanese-optimized model like [gemma-2-2b-jpn-it](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/).

## Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:

```
python3 convert_hf_to_gguf.py gemma-2-9b-it/ --outfile gemma-2-9b-it.f16.gguf --outtype f16
```

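
If llama.cpp is not set up yet, a typical starting point is cloning the upstream repository and installing the Python dependencies used by the convert script (a general sketch, not specific to this model):

```
# Clone llama.cpp and install the requirements for convert_hf_to_gguf.py
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
```
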
## Convert f16 gguf to Q8_0 gguf without imatrix

Make sure you have llama.cpp compiled:

```
./llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q8_0.gguf q8_0
```

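
If `llama-quantize` has not been built yet, a common CPU-only CMake build looks like this (a sketch; with this layout the binaries end up under `build/bin/`, so adjust the paths in the commands accordingly):

```
# Build llama.cpp; llama-quantize, llama-imatrix, llama-cli, etc. land in build/bin/
cmake -B build
cmake --build build --config Release
```
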
## Convert f16 gguf to other ggufs with imatrix

Make sure you have llama.cpp compiled:

First, prepare the imatrix from the f16 gguf and c4_en_ja_imatrix.txt:

```
./llama-imatrix -m gemma-2-9b-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-9b-it.imatrix --chunks 32
```

Then, convert the f16 gguf with the imatrix to create an imatrix gguf:

```
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
```

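
The other imatrix quants in the table above are produced with the same pattern; only the quant type and output filename change:

```
# Same pattern for the remaining Q4_0 variants
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_4_8.gguf q4_0_4_8
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_4_4.gguf q4_0_4_4
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0.gguf q4_0
```
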
## Downloading using huggingface-cli
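
This assumes the Hugging Face CLI is installed; if not, it can be installed with pip:

```
pip install -U "huggingface_hub[cli]"
```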

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it.Q8_0.gguf" --local-dir ./
```

## Credits