ymcki committed
Commit 863f044
1 Parent(s): de4600d

Upload README.md

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -50,15 +50,15 @@ The perfect score is 5.00. As a reference, bartowski's gemma-2-27b-it.Q6_K.gguf
 | Filename | Quant type | File Size | ELIZA-Tasks-100 | Nvidia 3090 | Description |
 | -------- | ---------- | --------- | --------------- | ----------- | ----------- |
 | [gemma-2-9b-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.f16.gguf) | f16 | 18.5GB | 3.75 | 31.9t/s | Full F16 weights. |
- | [gemma-2-9b-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q8_0.gguf) | Q8_0 | 9.83GB | 3.66 | 56.1t/s | Extremely high quality, *recommended for edge devices with 16GB RAM*. |
+ | [gemma-2-9b-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q8_0.gguf) | Q8_0 | 9.83GB | 3.66 | 56.1t/s | Extremely high quality, *recommended for edge devices with 16GB RAM*. |
 | [gemma-2-9b-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0.gguf) | Q4_0 | 5.44GB | 3.76 | 80.6t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
- | [gemma-2-9b-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | TBD | TBD | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-9b-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | TBD | TBD | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-9b-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.72 | 0.72t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
+ | [gemma-2-9b-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.74 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
+ | [gemma-2-9b-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.64 | 0.7t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
+ | [gemma-2-9b-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.72 | 0.72t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
 | [gemma-2-9b-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0.gguf) | Q4_0 | 5.44GB | 3.64 | 65.1t/s | Good quality, *recommended for edge devices with 8GB RAM*. |
- | [gemma-2-9b-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | TBD | TBD | Good quality, *recommended for edge device <8GB RAM* |
- | [gemma-2-9b-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | TBD | TBD | Good quality, *recommended for edge device <8GB RAM* |
- | [gemma-2-9b-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.63 | 0.76ts | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-9b-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 5.44GB | 3.64 | 0.57t/s | Good quality, but the imatrix version seems better. |
+ | [gemma-2-9b-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 5.44GB | 3.68 | 0.61t/s | Good quality, but the imatrix version seems better. |
+ | [gemma-2-9b-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 5.44GB | 3.63 | 0.76t/s | Good quality, but the imatrix version seems better. |
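
For context on the Nvidia 3090 column, tokens-per-second figures like these can be measured with llama.cpp's bundled llama-bench tool. This is only a sketch; the flags and prompt/generation lengths below are illustrative and not necessarily the exact settings used for the table above.

```
# Measure prompt processing and generation speed for a given gguf
# -p/-n set prompt and generation lengths; -ngl offloads layers to the GPU
./llama-bench -m gemma-2-9b-it.Q4_0.gguf -p 512 -n 128 -ngl 99
```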
 
 ## How to check i8mm and sve support for ARM devices

@@ -66,7 +66,7 @@ ARM i8mm support is necessary to take advantage of Q4_0_4_8 gguf. All ARM archit

 ARM sve support is necessary to take advantage of Q4_0_8_8 gguf. sve is an optional feature that starts from ARMv8.2-A, but the majority of ARM chips don't implement it.

- For ARM devices without both, it is recommended to use Q4_0_4_4.
+ For ARM devices without both, it is recommended to use Q4_0_4_4. However, in practice Q4_0 can perform better on some phones, so it is best to try both and see which one performs better.

 With these features supported, inference speed should be faster in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on the quality of the responses.
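
As a minimal sketch of how such a check can look on a Linux-based ARM device (assuming the kernel exposes /proc/cpuinfo, e.g. in Termux or via adb shell on Android):

```
# Print the CPU feature flags; look for "i8mm" and "sve" in the list
grep -m 1 Features /proc/cpuinfo
# Or filter directly; empty output means the feature is not reported
grep -o -w -e i8mm -e sve /proc/cpuinfo | sort -u
```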
 
@@ -108,20 +108,20 @@ On the other hand, Nvidia 3090 inference speed is significantly faster for Q4_0

 According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding imatrix to a low bit quant can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created the imatrix versions of different Q4_0 quants.

- However, based on my benchmarking results, it seems like imatrix does improve the performance of a non-Japanese optimized model.
+ However, based on my benchmarking results, it seems like imatrix does improve the performance of a non-Japanese optimized model, but doesn't do much for a Japanese optimized model like [gemma-2-2b-jpn-it](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/).

 ## Convert safetensors to f16 gguf

 Make sure you have llama.cpp git cloned:

 ```
- python3 convert_hf_to_gguf.py gemma-2-2b-jpn-it/ --outfile gemma-2-2b-jpn-it.f16.gguf --outtype f16
+ python3 convert_hf_to_gguf.py gemma-2-9b-it/ --outfile gemma-2-9b-it.f16.gguf --outtype f16
 ```
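
If llama.cpp is not cloned yet, a minimal setup sketch (assuming the official GitHub repository and its bundled Python requirements for the conversion script) is:

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Python dependencies needed by convert_hf_to_gguf.py
pip install -r requirements.txt
```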
 
 ## Convert f16 gguf to Q8_0 gguf without imatrix
 Make sure you have llama.cpp compiled:
 ```
- ./llama-quantize gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it.Q8_0.gguf q8_0
+ ./llama-quantize gemma-2-9b-it.f16.gguf gemma-2-9b-it.Q8_0.gguf q8_0
 ```

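If the binaries are not built yet, a typical CMake build of llama.cpp looks roughly like this (the CUDA flag is an assumption for Nvidia GPUs and can be dropped for CPU-only use):

```
# Plain CPU build; add -DGGML_CUDA=ON for CUDA support on Nvidia GPUs
cmake -B build
cmake --build build --config Release -j
# Binaries such as llama-quantize and llama-imatrix land under build/bin/
```
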
  ## Convert f16 gguf to other ggufs with imatrix
@@ -129,13 +129,13 @@ Make sure you have llama.cpp compiled:
 First, prepare imatrix from f16 gguf and c4_en_ja_imatrix.txt

 ```
- ./llama-imatrix -m gemma-2-2b-jpn-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-2b-jpn-it.imatrix --chunks 32
+ ./llama-imatrix -m gemma-2-9b-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-9b-it.imatrix --chunks 32
 ```

 Then, convert f16 gguf with imatrix to create imatrix gguf

 ```
- ./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
+ ./llama-quantize --imatrix gemma-2-9b-it.imatrix gemma-2-9b-it.f16.gguf gemma-2-9b-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
 ```

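As an optional sanity check (a sketch only; the prompt is just an example), the resulting gguf can be loaded with llama-cli to confirm it runs and produces sensible Japanese output:

```
# Load the quantized model and generate a short completion
./llama-cli -m gemma-2-9b-it-imatrix.Q4_0_8_8.gguf -p "日本の首都はどこですか?" -n 64
```
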
  ## Downloading using huggingface-cli
@@ -149,7 +149,7 @@ pip install -U "huggingface_hub[cli]"
 Then, you can target the specific file you want:

 ```
- huggingface-cli download ymcki/gemma-2-2b-jpn-it-GGUF --include "gemma-2-2b-jpn-it-Q8_0.gguf" --local-dir ./
+ huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it.Q8_0.gguf" --local-dir ./
 ```

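The --include filter also accepts glob patterns, so several quants can be pulled in one go (the pattern below is only an illustration):

```
# Download all imatrix Q4_0 variants to the current directory
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it-imatrix.Q4_0*.gguf" --local-dir ./
```
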
  ## Credits
 