README.md · ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF at ddf5a42be17eeee18bdd066b7ea6bdd4e0b00ba3

metadata

base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct
library_name: transformers
language:
  - en
tags:
  - nvidia
  - llama-3
  - pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki

Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct

Prompt Template

### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:
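
If you build prompts in a script rather than through llama-cli, the template above can be filled in with a small helper. This is a minimal sketch: the function name `build_prompt` is hypothetical, and the trailing newline after `### Assistant:` is an assumption about where generation should start.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Fill the prompt template from this README with the given text."""
    return (
        "### System:\n"
        f"{system_prompt}\n"
        "### User:\n"
        f"{user_prompt}\n"
        "### Assistant:\n"  # assumption: the model's reply starts on the next line
    )


prompt = build_prompt("You are a helpful assistant.", "What is GGUF?")
print(prompt)
```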

I modified llama.cpp to support DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile the modified version to run the GGUFs in this repository. I am talking with the llama.cpp maintainers to see whether they can merge my code into their codebase.

This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. Support for those can be added once there are models that actually use those types of layers.

Since I am a free user, for the time being I only upload the models that are likely to interest most people.

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| Llama-3_1-Nemotron-51B-Instruct.Q6_K.gguf | Q6_K | 42.2GB | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original. |
| Llama-3_1-Nemotron-51B-Instruct.Q5_K_M.gguf | Q5_K_M | 36.5GB | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
| Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf | Q4_K_M | 31GB | Good for A100 40GB or dual 3090. Higher cost performance ratio than Q5_K_M. |
| Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090. |
| Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf | Q4_0_4_8 | 29.3GB | For Apple Silicon. |
| Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf | Q3_K_S | 22.7GB | Largest model that can fit a single 3090. |
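
As a rough way to read the table, the sketch below picks the largest quant whose file fits a given memory budget. The file sizes are copied from the table; the function name `pick_quant` and the `headroom_gb` parameter are hypothetical, and the headroom you actually need for the KV cache and runtime buffers depends on your context length and backend, so treat any value you pass as a guess rather than a measurement. Q4_0_4_8 is left out because it is the same size as Q4_0 and only differs in the target hardware.

```python
from typing import Optional

# (quant type, file size in GB), copied from the table, largest first.
QUANTS = [
    ("Q6_K", 42.2),
    ("Q5_K_M", 36.5),
    ("Q4_K_M", 31.0),
    ("Q4_0", 29.3),
    ("Q3_K_S", 22.7),
]


def pick_quant(memory_gb: float, headroom_gb: float) -> Optional[str]:
    """Return the largest quant whose file plus headroom fits in memory_gb."""
    for name, size_gb in QUANTS:
        if size_gb + headroom_gb <= memory_gb:
            return name
    return None  # nothing in this repository fits the budget


print(pick_quant(24, headroom_gb=1.0))  # prints Q3_K_S
print(pick_quant(48, headroom_gb=2.0))  # prints Q6_K
```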

How to check i8mm support for Apple devices

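ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm. That means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

For Apple devices, run the following and look for hw.optional.arm.FEAT_I8MM in the output (1 means supported):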

sysctl hw
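
The same check can be scripted. The sketch below runs `sysctl hw` and looks for the `hw.optional.arm.FEAT_I8MM` key that macOS reports on Apple Silicon; the helper names `parse_i8mm` and `has_i8mm` are hypothetical, and the code returns `None` when the key or the `sysctl` binary is not available (for example on non-Apple systems).

```python
import subprocess
from typing import Optional

I8MM_KEY = "hw.optional.arm.FEAT_I8MM"


def parse_i8mm(sysctl_output: str) -> Optional[bool]:
    """Return True/False if the i8mm key appears in `sysctl hw` output, else None."""
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == I8MM_KEY:
            return value.strip() == "1"
    return None


def has_i8mm() -> Optional[bool]:
    """Run `sysctl hw` and report i8mm support; None if it cannot be determined."""
    try:
        result = subprocess.run(["sysctl", "hw"], capture_output=True, text=True)
    except OSError:
        return None  # sysctl is not installed on this system
    return parse_i8mm(result.stdout)


if __name__ == "__main__":
    print(has_i8mm())
```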

On the other hand, inference on an Nvidia 3090 is significantly faster with Q4_0 than with the other ggufs. That means for GPU inference, you are better off using Q4_0.

Which Q4_0 model to use for Apple devices

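| Brand | Series | Model | i8mm | sve | Quant Type |
| ----- | ------ | ----- | ---- | --- | ---------- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |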
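
The table can also be expressed as a lookup. This is a minimal sketch that only covers the chips listed in the table (A4 to A18, M1 to M4); the function name `quant_for_apple_chip` is hypothetical, and anything outside that range raises an error rather than guessing.

```python
import re


def quant_for_apple_chip(chip: str) -> str:
    """Map a chip name such as "A15", "M1" or "M2 Pro" to the quant type in the table."""
    match = re.match(r"([AM])(\d+)\b", chip.strip().upper())
    if not match:
        raise ValueError(f"unrecognized chip name: {chip!r}")
    series, generation = match.group(1), int(match.group(2))
    if series == "A" and 4 <= generation <= 18:
        return "Q4_0_4_8" if generation >= 15 else "Q4_0_4_4"
    if series == "M" and 1 <= generation <= 4:
        return "Q4_0_4_8" if generation >= 2 else "Q4_0_4_4"
    raise ValueError(f"chip not covered by the table: {chip!r}")


print(quant_for_apple_chip("M1"))      # prints Q4_0_4_4
print(quant_for_apple_chip("M2 Pro"))  # prints Q4_0_4_8
```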

Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:

python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16

Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled:

./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
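
If you want to produce several quant types, the two steps above can be generated from a script. The sketch below only builds and prints the commands; the function name `conversion_commands` is hypothetical, and it assumes you run it from a directory containing both `convert_hf_to_gguf.py` and the compiled `llama-quantize` binary.

```python
import shlex
from typing import List

MODEL = "Llama-3_1-Nemotron-51B-Instruct"


def conversion_commands(model_dir: str, quant: str = "q4_0") -> List[List[str]]:
    """Build the convert and quantize commands for one quant type."""
    f16_path = f"{MODEL}.f16.gguf"
    quant_path = f"{MODEL}.{quant.upper()}.gguf"
    return [
        # Step 1: safetensors -> f16 gguf
        ["python3", "convert_hf_to_gguf.py", model_dir,
         "--outfile", f16_path, "--outtype", "f16"],
        # Step 2: f16 gguf -> quantized gguf, without imatrix
        ["./llama-quantize", f16_path, quant_path, quant],
    ]


for command in conversion_commands(f"{MODEL}/"):
    print(shlex.join(command))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(command, check=True)`.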

Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
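
The same download can be done from Python with `hf_hub_download` from the `huggingface_hub` package installed above. This is a sketch rather than a tested recipe: the helper names `gguf_filename` and `download_quant` are hypothetical, and the file naming pattern is taken from the table in this README.

```python
REPO_ID = "ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF"


def gguf_filename(quant: str) -> str:
    """Build the file name used in this repository for a quant type."""
    return f"Llama-3_1-Nemotron-51B-Instruct.{quant}.gguf"


def download_quant(quant: str, local_dir: str = "./") -> str:
    """Download a single gguf file and return its local path."""
    # Imported here so gguf_filename works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=gguf_filename(quant),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_0")` fetches only that file, which is about 29.3GB according to the table.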

Running the model using llama-cli

First, download my Modified llama.cpp-b4139 v0.2 and compile it, then run:

./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100

Credits

Thank you bartowski for providing a README.md to get me started.