| Filename | Quant type | File Size | ELIZA-Tasks-100 | Nvidia 3090 | Description |
| -------- | ---------- | --------- | --------------- | ----------- | ----------- |
| [gemma-2-9b-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.f16.gguf) | f16 | 18.5GB | 3.75 | 31.9t/s | Full F16 weights. |
| [gemma-2-9b-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q8_0.gguf) | Q8_0 | 9.83GB | 3.66 | 56.1t/s | Extremely high quality, *recommended for edge devices with 16GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0.gguf) | Q4_0 | 5.44GB | 3.76 | 80.6t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.74 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.64 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.72 | 0.72t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0.gguf) | Q4_0 | 5.44GB | 3.64 | 65.1t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-9b-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.64 | 0.57t/s | Good quality, but the imatrix version seems better. |
| [gemma-2-9b-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.68 | 0.61t/s | Good quality, but the imatrix version seems better. |
| [gemma-2-9b-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.63 | 0.76t/s | Good quality, but the imatrix version seems better. |

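
These files can be run directly with llama.cpp. As a quick smoke test, a minimal sketch that assumes llama.cpp is already compiled (see the sections below) and uses the imatrix Q4_0 file from the table as an example:

```
# Generate a short completion from one of the downloaded quants
./llama-cli -m gemma-2-9b-it-imatrix.Q4_0.gguf -p "こんにちは、自己紹介してください。" -n 128
```
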
## How to check i8mm and sve support for ARM devices

ARM i8mm support is necessary to take advantage of Q4_0_4_8 gguf.

ARM sve support is necessary to take advantage of Q4_0_8_8 gguf. sve is an optional feature that starts from ARMv8.2-A, but the majority of ARM chips don't implement it.

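
One generic way to check on a Linux-based ARM device (including Android via `adb shell` or Termux) is to look for the `i8mm` and `sve` flags in `/proc/cpuinfo`; this is a general sketch, not specific to this model:

```
# Prints i8mm and/or sve if the CPU advertises these features
grep -o -E 'i8mm|sve' /proc/cpuinfo | sort -u
```

If nothing is printed, the CPU reports neither feature.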

For ARM devices without both, it is recommended to use Q4_0_4_4. In practice, however, Q4_0 can perform better on some phones, so it is best to try both and see which one is faster.

With i8mm and sve supported, inference speed should be faster in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.

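
To compare quants on a particular device, llama.cpp's `llama-bench` gives a quick read on tokens per second; a minimal sketch, assuming the binary is built and the files are downloaded:

```
# Benchmark each quant separately and compare the reported t/s
./llama-bench -m gemma-2-9b-it.Q4_0.gguf
./llama-bench -m gemma-2-9b-it.Q4_0_4_4.gguf
```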

According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding an imatrix to low-bit quants can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created imatrix versions of the different Q4_0 quants.

However, based on my benchmarking results, it seems that imatrix does improve the performance of a non-Japanese-optimized model, but doesn't do much for a Japanese-optimized model like [gemma-2-2b-jpn-it](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/).

## Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:

```
python3 convert_hf_to_gguf.py gemma-2-9b-it/ --outfile gemma-2-9b-it.f16.gguf --outtype f16
```

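
If llama.cpp is not set up yet, a typical starting point is cloning the upstream repository and installing the Python dependencies used by the convert script (a general sketch, not specific to this model):

```
# Clone llama.cpp and install the requirements for convert_hf_to_gguf.py
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
```
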
## Convert f16 gguf to Q8_0 gguf without imatrix

Make sure you have llama.cpp compiled:

```
./llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q8_0.gguf q8_0
```

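
If `llama-quantize` has not been built yet, a common CPU-only CMake build looks like this (a sketch; with this layout the binaries end up under `build/bin/`, so adjust the paths in the commands accordingly):

```
# Build llama.cpp; llama-quantize, llama-imatrix, llama-cli, etc. land in build/bin/
cmake -B build
cmake --build build --config Release
```
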
## Convert f16 gguf to other ggufs with imatrix

Make sure you have llama.cpp compiled:

First, prepare the imatrix from the f16 gguf and c4_en_ja_imatrix.txt:

```
./llama-imatrix -m gemma-2-9b-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-9b-it.imatrix --chunks 32
```

Then, convert the f16 gguf with the imatrix to create an imatrix gguf:

```
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
```

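
The other imatrix quants in the table above are produced with the same pattern; only the quant type and output filename change:

```
# Same pattern for the remaining Q4_0 variants
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_4_8.gguf q4_0_4_8
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_4_4.gguf q4_0_4_4
./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0.gguf q4_0
```
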
## Downloading using huggingface-cli
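
This assumes the Hugging Face CLI is installed; if not, it can be installed with pip:

```
pip install -U "huggingface_hub[cli]"
```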

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it.Q8_0.gguf" --local-dir ./
```

## Credits