|
--- |
|
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF |
|
library_name: transformers |
|
language: |
|
- en |
|
tags: |
|
- nvidia |
|
- llama-3 |
|
- pytorch |
|
license: other |
|
license_name: nvidia-open-model-license |
|
license_link: >- |
|
https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf |
|
pipeline_tag: text-generation |
|
quantized_by: ymcki |
|
--- |
|
|
|
Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF |
|
|
|
## Prompt Template |
|
|
|
``` |
|
### System: |
|
{system_prompt} |
|
### User: |
|
{user_prompt} |
|
### Assistant: |
|
|
|
``` |
|
***Important*** for people who want to do their own quantization: there is a typo in the tokenizer_config.json of the original model that mistakenly sets eos_token to '<|eot_id|>' when it should be '<|end_of_text|>'. Please fix it or overwrite it with the [tokenizer_config.json](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/tokenizer_config.json) in this repository before you do the gguf conversion yourself.
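
One way to do the overwrite is to pull the fixed file from this repository with huggingface-cli before running the conversion. This is a minimal sketch; the local directory name is an assumption, so point it at your local copy of the original model:

```
# Hedged sketch: fetch the corrected tokenizer_config.json from this repo and drop it
# over the one in your local copy of the original model (directory name assumed)
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF tokenizer_config.json \
  --local-dir ./Llama-3_1-Nemotron-51B-Instruct/
```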
|
|
|
Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is supported. Please download and compile it to run the GGUFs in this repository.
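
A minimal build sketch, assuming a typical CMake toolchain (drop the CUDA flag for CPU-only or Apple builds; see the llama.cpp build docs for platform-specific options):

```
# Hedged sketch: fetch and build llama.cpp b4380 (binaries end up in build/bin/)
wget https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz
tar xf b4380.tar.gz
cd llama.cpp-b4380
cmake -B build -DGGML_CUDA=ON      # omit -DGGML_CUDA=ON for CPU-only or Metal builds
cmake --build build --config Release
```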
|
|
|
This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. I suppose such support can be added when there are actually models using those types of layers.
|
|
|
Since I am a free user, for the time being I only upload models that are likely to be of interest to most people.
|
|
|
## Download a file (not the whole branch) from below: |
|
|
|
Perplexity for f16 gguf is 6.646565 ± 0.040986. |
|
|
|
| Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description | |
|
| ---------- | ------- | ----------| ---------------- | ------------- | ----------- | |
|
| [Q6_K](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q6_K.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original | |
|
| [Q5_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q5_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. | |
|
| [Q4_K_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_K_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost performance ratio than Q5_K_M. | |
|
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. Minor performance gain doesn't justify its use over IQ4_XS | |
|
| [IQ4_XS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. | |
|
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. | |
|
| [Q4_0_4_8](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0_4_8.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 29.25GB | Same as Q4_0 assumed | Same as Q4_0 assumed | For Apple Silicon | |
|
| [IQ3_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.5GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 4k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
|
| [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.7GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 8k context. Not recommended for CPU or Apple Silicon due to high computational cost. | |
|
| Q3_K_S | calibration_datav3 | 22.7GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 and performs well on all platforms |
|
| Q3_K_S | none | 22.7GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090 without imatrix | |
|
|
|
## How to check i8mm support for Apple devices |
|
|
|
ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures >= ARMv8.6-A support i8mm. That means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.
|
|
|
For Apple devices, you can check with:
|
|
|
``` |
|
sysctl hw |
|
``` |
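
On recent macOS versions, the flag to look for in the output should be `hw.optional.arm.FEAT_I8MM` (1 means i8mm is available). If you only want that flag, a narrowed query along these lines should work; the key name is an assumption based on recent macOS releases, so fall back to searching the full `sysctl hw` output if it is missing:

```
# Hedged sketch: query the i8mm feature flag directly (key name assumed from recent macOS)
sysctl hw.optional.arm.FEAT_I8MM
```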
|
|
|
On the other hand, Nvidia 3090 inference speed is significantly faster for Q4_0 than for the other ggufs. That means for GPU inference, you are better off using Q4_0.
|
|
|
## Which Q4_0 model to use for Apple devices |
|
| Brand | Series | Model | i8mm | sve | Quant Type | |
|
| ----- | ------ | ----- | ---- | --- | -----------| |
|
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 | |
|
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 | |
|
| Apple | M | M1 | No | No | Q4_0_4_4 | |
|
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 | |
|
|
|
## Convert safetensors to f16 gguf |
|
|
|
Make sure you have llama.cpp git cloned: |
|
|
|
``` |
|
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
|
``` |
|
|
|
## Convert f16 gguf to Q4_0 gguf without imatrix |
|
|
|
Make sure you have llama.cpp compiled: |
|
``` |
|
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
|
``` |
|
## Convert f16 gguf to Q4_0 gguf with imatrix |
|
|
|
Make sure you have llama.cpp compiled. Then create an imatrix with a dataset. |
|
``` |
|
./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f calibration_datav3.txt -o Llama-3_1-Nemotron-51B-Instruct.imatrix --chunks 32 |
|
``` |
|
|
|
Then convert with the created imatrix. |
|
``` |
|
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf --imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0.gguf q4_0 |
|
``` |
|
|
|
## Calculate perplexity and KL divergence |
|
|
|
First, download wikitext. |
|
``` |
|
bash ./scripts/get-wikitext-2.sh |
|
``` |
|
|
|
Second, find the base values of F16 gguf. Please be warned that the generated base value file is about 10GB. Adjust GPU layers depending on your VRAM. |
|
``` |
|
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 100 |
|
``` |
|
|
|
Finally, calculate the perplexity and KL divergence of Q4_0 gguf. Adjust GPU layers depending on your VRAM. |
|
``` |
|
./llama-perplexity --kl-divergence-base Llama-3_1-Nemotron-51B-Instruct.f16.kld --kl-divergence -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf -ngl 100 >& Llama-3_1-Nemotron-51B-Instruct.Q4_0.kld
|
``` |
|
|
|
## Downloading using huggingface-cli |
|
|
|
First, make sure you have huggingface-cli installed:
|
|
|
``` |
|
pip install -U "huggingface_hub[cli]" |
|
``` |
|
|
|
Then, you can target the specific file you want: |
|
|
|
``` |
|
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
|
``` |
|
|
|
## Running the model using llama-cli |
|
|
|
First, download and compile my [Modified llama.cpp-b4139](https://github.com/ymcki/llama.cpp-b4139) v0.2, then run
|
``` |
|
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100 |
|
``` |
|
|
|
## Credits |
|
|
|
Thank you bartowski for providing a README.md to get me started. |
|
|
|
|