base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
library_name: transformers
language:
  - en
tags:
  - nvidia
  - llama-3
  - pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki

Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF

Prompt Template

### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:
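
As a minimal sketch, the template above can be filled in from the shell. `build_prompt` is a hypothetical helper name, not part of llama.cpp, and the trailing newline after `### Assistant:` is an assumption.

```shell
# Fill the prompt template with a system prompt and a user prompt.
# build_prompt is a hypothetical helper, not something llama.cpp provides.
build_prompt() {
    # $1 = system prompt, $2 = user prompt
    printf '### System:\n%s\n### User:\n%s\n### Assistant:\n' "$1" "$2"
}

# Example:
#   build_prompt 'You are a helpful assistant.' 'Hello!'
```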

I modified llama.cpp to support DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile the modified llama.cpp to run the GGUFs in this repository. I am talking with the llama.cpp maintainers about merging my code into their codebase.

This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. Support for those can be added once there are models that actually use such layers.

Since I am a free user, for the time being I only upload models that are likely to be of interest to most people.

Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| --- | --- | --- | --- |
| Llama-3_1-Nemotron-51B-Instruct.Q6_K.gguf | Q6_K | 42.2GB | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original |
| Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf | Q4_K_M | 31GB | Good for A100 40GB or dual 3090 |
| Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090. |
| Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf | Q4_0_4_8 | 29.3GB | For Apple Silicon |
| Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf | Q3_K_S | 22.7GB | Largest model that can fit a single 3090 |

How to check i8mm support for Apple devices

ARM i8mm support is required to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm, which means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

On Apple devices, run the following and look for the FEAT_I8MM entry in the output:

sysctl hw
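
The full `sysctl hw` listing is long, so a small filter can help. The sketch below assumes the macOS key `hw.optional.arm.FEAT_I8MM` reports `1` when i8mm is available; `has_i8mm` is a hypothetical helper name.

```shell
# Report whether sysctl output (read from stdin) advertises i8mm.
# Assumes the macOS key hw.optional.arm.FEAT_I8MM; has_i8mm is a hypothetical helper.
has_i8mm() {
    if grep -q 'hw\.optional\.arm\.FEAT_I8MM: 1$'; then
        echo "i8mm: yes"
    else
        echo "i8mm: no"
    fi
}

# Usage on an Apple device:
#   sysctl hw | has_i8mm
```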

On Nvidia GPUs the picture is different: inference on a 3090 is significantly faster with Q4_0 than with the other ggufs, so for GPU inference you are better off using Q4_0.

Which Q4_0 model to use for Apple devices

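| Brand | Series | Model | i8mm | sve | Quant Type |
| --- | --- | --- | --- | --- | --- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |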
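
The same lookup can be written as a small shell function. This is only a sketch that encodes the rows of the table above; `quant_for_chip` is a hypothetical helper and knows nothing about chips outside those rows.

```shell
# Map an Apple chip name to the quant type listed in the table above.
# quant_for_chip is a hypothetical helper covering only the chips in that table.
quant_for_chip() {
    case "$1" in
        A1[5-8]|M[2-4])      echo "Q4_0_4_8" ;;  # i8mm available
        A[4-9]|A1[0-4]|M1)   echo "Q4_0_4_4" ;;  # no i8mm
        *) echo "unknown chip: $1" >&2; return 1 ;;
    esac
}

# Example:
#   quant_for_chip M2
```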

Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:

python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16

Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled:

./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
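
To see how the two steps fit together for other quant types, here is a dry-run sketch that only prints the commands. `quantize_plan` is a hypothetical helper, and the output naming (`<model>.<QUANT>.gguf`) is assumed from the filenames used in this repository.

```shell
# Print the convert and quantize commands for a model directory and quant type.
# quantize_plan is a hypothetical helper; it prints commands and runs nothing.
quantize_plan() {
    model_dir=${1%/}                     # strip a trailing slash, if any
    quant=$2                             # e.g. q4_0
    name=$(basename "$model_dir")        # model name taken from the directory
    upper=$(printf '%s' "$quant" | tr '[:lower:]' '[:upper:]')
    echo "python3 convert_hf_to_gguf.py $model_dir/ --outfile $name.f16.gguf --outtype f16"
    echo "./llama-quantize $name.f16.gguf $name.$upper.gguf $quant"
}

# Example:
#   quantize_plan Llama-3_1-Nemotron-51B-Instruct/ q4_0
```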

Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
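
After a download of this size, a quick sanity check can catch a truncated or wrong file early. GGUF files begin with the four ASCII bytes `GGUF`; the sketch below checks only that magic value, not the integrity of the whole file. `is_gguf` is a hypothetical helper name.

```shell
# Return success if the file starts with the 4-byte GGUF magic.
# is_gguf is a hypothetical helper; it does not validate the rest of the file.
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example:
#   is_gguf ./Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf && echo "looks like a GGUF file"
```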

Running the model using llama-cli

First, download my Modified llama.cpp-b4139 v0.2. Compile it, then run:

./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.'  -cnv -ngl 100

Credits

Thank you, bartowski, for providing a README.md to get me started.