Upload README.md
README.md
CHANGED
@@ -27,8 +27,9 @@ Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GG
|
|
27 |
### Assistant:
|
28 |
|
29 |
```
|
|
|
30 |
|
31 |
-
[
|
32 |
|
33 |
This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. I suppose that support can be added once there are actual models using those types of layers.
|
34 |
|
@@ -36,14 +37,21 @@ Since I am a free user, so for the time being, I only upload models that might b
|
|
36 |
|
37 |
## Download a file (not the whole branch) from below:
|
38 |
|
39 |
-
|
40 |
-
|
41 |
-
|
|
42 |
-
|
|
43 |
-
| [
|
44 |
-
| [
|
45 |
-
| [
|
46 |
-
47 |
|
48 |
## How to check i8mm support for Apple devices
|
49 |
|
@@ -74,10 +82,39 @@ python3 convert_hf_to_gguf.py Llama-3_1-Nemotron 51B-Instruct/ --outfile Llama-3
|
|
74 |
```
|
75 |
|
76 |
## Convert f16 gguf to Q4_0 gguf without imatrix
|
|
|
77 |
Make sure you have llama.cpp compiled:
|
78 |
```
|
79 |
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
|
80 |
```
|
81 |
|
82 |
## Downloading using huggingface-cli
|
83 |
|
|
|
27 |
### Assistant:
|
28 |
|
29 |
```
|
30 |
+
***Important*** for anyone who wants to do their own quantization: the tokenizer_config.json of the original model contains a typo that sets eos_token to '<|eot_id|>' when it should be '<|end_of_text|>'. Please fix it, or overwrite the file with the [tokenizer_config.json](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/tokenizer_config.json) from this repository, before doing the GGUF conversion yourself.
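One possible way to patch the file in place is sketched below. It assumes the entry is written literally as `"eos_token": "<|eot_id|>"` on a single line; `fix_eos_token` is a hypothetical helper name, not something shipped with llama.cpp or the model. Check the result by eye before converting.

```shell
# Hypothetical helper: rewrite the eos_token entry of a tokenizer_config.json.
# Assumes the entry appears as "eos_token": "<|eot_id|>" on a single line.
fix_eos_token() {
    cfg="$1"
    tmp="${cfg}.tmp"
    # Only the eos_token entry is rewritten; other uses of <|eot_id|> are left alone.
    sed 's/"eos_token": *"<|eot_id|>"/"eos_token": "<|end_of_text|>"/' "$cfg" > "$tmp" &&
        mv "$tmp" "$cfg"
}

# Usage: fix_eos_token path/to/tokenizer_config.json
```

A temporary file is used instead of `sed -i` because the in-place flag behaves differently between GNU and BSD sed.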
|
31 |
|
32 |
+
Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is supported. Please download and compile it to run the GGUFs in this repository.
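A minimal sketch of fetching and building that release follows. It assumes `curl`, `tar`, `cmake` and a C/C++ toolchain are installed; the helper names `llama_tarball_url` and `build_llama_cpp` are made up for illustration, and no GPU backend options are set here.

```shell
# Sketch: fetch the b4380 source tarball linked above and build it with CMake.
LLAMA_TAG="b4380"

# Hypothetical helper: build the archive URL for a given llama.cpp tag.
llama_tarball_url() {
    echo "https://github.com/ggerganov/llama.cpp/archive/refs/tags/$1.tar.gz"
}

# Hypothetical helper: download, unpack and compile the tagged release.
# Note: this changes the current directory to the unpacked source tree.
build_llama_cpp() {
    curl -L -o "llama.cpp-$1.tar.gz" "$(llama_tarball_url "$1")" &&
        tar -xzf "llama.cpp-$1.tar.gz" &&
        cd "llama.cpp-$1" &&
        cmake -B build &&
        cmake --build build --config Release
}

# Usage: build_llama_cpp "$LLAMA_TAG"
```

With a CMake build the tools are normally placed under `build/bin/`, so the `./llama-quantize` style commands in this README assume you run them from that directory or adjust the path.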
|
33 |
|
34 |
This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. I suppose that support can be added once there are actual models using those types of layers.
|
35 |
|
|
|
37 |
|
38 |
## Download a file (not the whole branch) from below:
|
39 |
|
40 |
+
Perplexity for f16 gguf is 6.646565 ± 0.040986.
|
41 |
+
|
42 |
+
| Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description |
|
43 |
+
| ---------- | ------- | ----------| ---------------- | ------------- | ----------- |
|
44 |
+
| [Q6_K](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q6_K.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original |
|
45 |
+
| [Q5_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
|
46 |
+
| [Q4_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost performance ratio than Q5_K_M. |
|
47 |
+
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. Performance gain does justify its use over IQ4_XS |
|
48 |
+
| [IQ4_XS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
|
49 |
+
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
|
50 |
+
| [Q4_0_4_8](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0_4_8.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 29.25GB | Same as Q4_0 assumed | Same as Q4_0 assumed | For Apple Silicon |
|
51 |
+
| [IQ3_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.5GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 4k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
|
52 |
+
| [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.7GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 8k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
|
53 |
+
| Q3_K_S | calibration_datav3 | 22.7GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 and performs well on all platforms |
|
54 |
+
| Q3_K_S | none | 22.7GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090 without imatrix |
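To put the Delta Perplexity column in perspective, it can help to express each delta as a percentage of the f16 perplexity quoted above. The snippet below is a small sketch of that arithmetic; `ppl_increase_pct` is a hypothetical helper name introduced only for this example.

```shell
# Hypothetical helper: express a delta perplexity as a percentage of the f16 baseline.
ppl_increase_pct() {
    awk -v d="$1" -v base="$2" 'BEGIN { printf "%.2f\n", 100 * d / base }'
}

F16_PPL=6.646565                          # f16 perplexity quoted above
ppl_increase_pct 0.055444 "$F16_PPL"      # Q4_K_M with imatrix
ppl_increase_pct 2.224537 "$F16_PPL"      # Q3_K_S without imatrix
```

For these two rows the result is roughly 0.83% and 33.47%, which shows how much more the Q3_K_S quant without an imatrix degrades compared with Q4_K_M.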
|
55 |
|
56 |
## How to check i8mm support for Apple devices
|
57 |
|
|
|
82 |
```
|
83 |
|
84 |
## Convert f16 gguf to Q4_0 gguf without imatrix
|
85 |
+
|
86 |
Make sure you have llama.cpp compiled:
|
87 |
```
|
88 |
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
|
89 |
```
|
90 |
+
## Convert f16 gguf to Q4_0 gguf with imatrix
|
91 |
+
|
92 |
+
Make sure you have llama.cpp compiled. Then create an imatrix with a dataset.
|
93 |
+
```
|
94 |
+
./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32
|
95 |
+
```
|
96 |
+
|
97 |
+
Then convert with the created imatrix.
|
98 |
+
```
|
99 |
+
./llama-quantize --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0
|
100 |
+
```
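The same imatrix file can be reused for other quant types. The loop below is a sketch of that, assuming the imatrix and f16 files from the previous steps sit in the current directory; `imatrix_gguf_name` and `quantize_all` are hypothetical helper names, and the output names simply follow the file naming used in this repository. Options are passed before the input file, as in llama-quantize's usage text.

```shell
# Sketch: reuse one imatrix file to produce several quant types.
MODEL="Llama-3_1-Nemotron-51B-Instruct"

# Hypothetical helper: output file name following this repository's naming pattern.
imatrix_gguf_name() {
    echo "$1.imatrix.$2.gguf"
}

# Hypothetical helper: quantize the f16 GGUF to each requested type, stopping on the first failure.
quantize_all() {
    for qtype in "$@"; do
        ./llama-quantize --imatrix "$MODEL.imatrix" "$MODEL.f16.gguf" \
            "$(imatrix_gguf_name "$MODEL" "$qtype")" "$qtype" || return 1
    done
}

# Usage: quantize_all Q6_K Q5_K_M Q4_K_M IQ4_XS
```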
|
101 |
+
|
102 |
+
## Calculate perplexity and KL divergence
|
103 |
+
|
104 |
+
First, download wikitext.
|
105 |
+
```
|
106 |
+
bash ./scripts/get-wikitext-2.sh
|
107 |
+
```
|
108 |
+
|
109 |
+
Second, compute the base values from the f16 GGUF. Be warned that the generated base value file is about 10GB. Adjust the number of GPU layers (`-ngl`) to fit your VRAM.
|
110 |
+
```
|
111 |
+
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 100
|
112 |
+
```
|
113 |
+
|
114 |
+
Finally, calculate the perplexity and KL divergence of Q4_0 gguf. Adjust GPU layers depending on your VRAM.
|
115 |
+
```
|
116 |
+
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld
|
117 |
+
```
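For reference, the quantities reported in the table are usually defined as follows. This is a sketch of the standard definitions rather than a description of llama.cpp's exact implementation; the symbols $N$ (number of evaluated tokens), $V$ (vocabulary), $P_{f16}$ and $P_Q$ (next-token distributions of the f16 and quantized models) are introduced here for illustration.

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(x_i \mid x_{<i})\right),
\qquad
\Delta\mathrm{PPL} = \mathrm{PPL}_{Q} - \mathrm{PPL}_{f16}

\mathrm{KLD} = \frac{1}{N}\sum_{i=1}^{N}\sum_{v \in V}
    P_{f16}(v \mid x_{<i}) \,\ln\frac{P_{f16}(v \mid x_{<i})}{P_{Q}(v \mid x_{<i})}
```

A KL divergence of zero means the quantized model reproduces the f16 token distributions exactly, so smaller values in both columns indicate a quant closer to the original.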
|
118 |
|
119 |
## Downloading using huggingface-cli
|
120 |
|