ymcki committed
Commit 125511b
1 Parent(s): fe097c4

Upload README.md

Files changed (1)
  1. README.md +46 -9
README.md CHANGED
@@ -27,8 +27,9 @@ Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GG
  ### Assistant:
 
  ```
+ ***Important*** for people who want to do their own quantization: there is a typo in the tokenizer_config.json of the original model that mistakenly sets eos_token to '<|eot_id|>' when it should be '<|end_of_text|>'. Please fix it, or overwrite it with the [tokenizer_config.json](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/tokenizer_config.json) in this repository, before you do the gguf conversion yourself.
 
- [Modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) to support DeciLMForCausalLM's variable Grouped Query Attention. Please download it and compile it to run the GGUFs in this repository. I am in the process of talking to llama.cpp people to see if they can merge my code to their codebase.
+ Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is now supported. Please download it and compile it to run the GGUFs in this repository.
 
  This modification should support Llama-3_1-Nemotron-51B-Instruct fully. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. I suppose such support can be added when there are actually models using those types of layers.
 
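As a reference for the tokenizer fix above, here is a minimal sketch of the edit, assuming eos_token is stored as a plain string field in tokenizer_config.json and that the original model was downloaded into Llama-3_1-Nemotron-51B-Instruct/ (check your local copy first; overwriting with the corrected tokenizer_config.json from this repository works just as well):

```
# Sketch: patch the eos_token typo before running convert_hf_to_gguf.py
# (GNU sed shown; the path and the plain-string JSON layout are assumptions, verify them locally)
sed -i 's/"eos_token": "<|eot_id|>"/"eos_token": "<|end_of_text|>"/' \
  Llama-3_1-Nemotron-51B-Instruct/tokenizer_config.json
```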
 
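Likewise, a minimal sketch of fetching and building the b4380 release with llama.cpp's standard CMake flow; the backend flag is optional and depends on your hardware (for example -DGGML_CUDA=ON for Nvidia cards), and the resulting binaries land in build/bin/:

```
# Sketch: download the b4380 source tarball and build it
wget https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz
tar xf b4380.tar.gz && cd llama.cpp-b4380
cmake -B build
cmake --build build --config Release -j
```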
@@ -36,14 +37,21 @@ Since I am a free user, so for the time being, I only upload models that might b
 
  ## Download a file (not the whole branch) from below:
 
- | Filename | Quant type | File Size | Description |
- | -------- | ---------- | --------- | ----------- |
- | [Llama-3_1-Nemotron-51B-Instruct.Q6_K.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q6_K.gguf) | Q6_K | 42.2GB | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original |
- | [Llama-3_1-Nemotron-51B-Instruct.Q5_K_M.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q5_K_M.gguf) | Q5_K_M | 36.5GB | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
- | [Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf) | Q4_K_M | 31GB | Good for A100 40GB or dual 3090. Higher cost performance ratio than Q5_K_M. |
- | [Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf) | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090. |
- | [Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 29.3GB | For Apple Silicon |
- | [Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf) | Q3_K_S | 22.7GB | Largest model that can fit a single 3090 |
+ Perplexity for the f16 gguf is 6.646565 ± 0.040986.
+ 
+ | Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description |
+ | ---------- | ------- | --------- | ---------------- | ------------- | ----------- |
+ | [Q6_K](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q6_K.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original. |
+ | [Q5_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
+ | [Q4_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost-performance ratio than Q5_K_M. |
+ | IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. The performance gain does not justify its use over IQ4_XS. |
+ | [IQ4_XS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
+ | Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
+ | [Q4_0_4_8](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0_4_8.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 29.25GB | Assumed same as Q4_0 | Assumed same as Q4_0 | For Apple Silicon |
+ | [IQ3_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.5GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 4k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
+ | [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.7GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 8k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
+ | Q3_K_S | calibration_datav3 | 22.7GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 that performs well on all platforms |
+ | Q3_K_S | none | 22.7GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090 without imatrix |
 
  ## How to check i8mm support for Apple devices
 
@@ -74,10 +82,39 @@ python3 convert_hf_to_gguf.py Llama-3_1-Nemotron 51B-Instruct/ --outfile Llama-3
  ```
 
  ## Convert f16 gguf to Q4_0 gguf without imatrix
+
  Make sure you have llama.cpp compiled:
  ```
  ./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
  ```
+ ## Convert f16 gguf to Q4_0 gguf with imatrix
+
+ Make sure you have llama.cpp compiled. Then create an imatrix with a calibration dataset.
+ ```
+ ./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32
+ ```
+
+ Then convert with the created imatrix.
+ ```
+ ./llama-quantize --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0
+ ```
+
+ ## Calculate perplexity and KL divergence
+
+ First, download wikitext.
+ ```
+ bash ./scripts/get-wikitext-2.sh
+ ```
+
+ Second, compute the base values of the f16 gguf. Be warned that the generated base value file is about 10GB. Adjust the number of GPU layers (-ngl) depending on your VRAM.
+ ```
+ ./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 100
+ ```
+
+ Finally, calculate the perplexity and KL divergence of the Q4_0 gguf. Adjust the number of GPU layers (-ngl) depending on your VRAM.
+ ```
+ ./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld
+ ```
 
  ## Downloading using huggingface-cli
 
 
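A minimal sketch of pulling a single file with huggingface-cli (the quant shown, the recommended IQ4_XS, and the target directory are just examples; any file name from the table above works):

```
# Sketch: install the CLI and download one quant into the current directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF \
  Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf --local-dir ./
```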