---
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct
library_name: transformers
language:
- en
tags:
- nvidia
- llama-3
- pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki
---

Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct

## Prompt Template

```
### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:

```
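
As a minimal sketch, the template can be filled in programmatically. The constant and function names here are illustrative and not part of llama.cpp or any library; the single newline between sections is an assumption based on the block above.

```python
# The template string mirrors the prompt template shown above.
PROMPT_TEMPLATE = (
    "### System:\n"
    "{system_prompt}\n"
    "### User:\n"
    "{user_prompt}\n"
    "### Assistant:\n"
)


def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Fill the template with a system prompt and a single user turn."""
    return PROMPT_TEMPLATE.format(
        system_prompt=system_prompt, user_prompt=user_prompt
    )


if __name__ == "__main__":
    print(build_prompt("You are a helpful assistant.", "What is GGUF?"))
```

The model's reply is expected to continue after the final `### Assistant:` line.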

I [modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) to support DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile it to run the GGUFs in this repository. I am talking to the llama.cpp maintainers to see if they can merge my code into their codebase.

This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear FFN layers. Support for those can be added once models that actually use these layer types appear.

Since I am a free user, for the time being I only upload the quants that are likely to interest most people.

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf) | Q4_K_M | 31GB | Good for A100 40GB or dual 3090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf) | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 29.3GB | For Apple Silicon |
| [Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf) | Q3_K_S | 22.7GB | Largest model that can fit a single 3090 |
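
The sizing logic behind the table can be sketched roughly as follows. The file sizes are copied from the table; the `headroom_gb` default is a hypothetical allowance for the KV cache and runtime buffers, picked only so that the result agrees with the recommendations above. Real overhead depends on context length and backend, so treat this as a first filter rather than a guarantee.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q4_K_M": 31.0,
    "Q4_0": 29.3,
    "Q4_0_4_8": 29.3,
    "Q3_K_S": 22.7,
}


def quants_that_fit(memory_gb: float, headroom_gb: float = 1.2) -> list[str]:
    """Return the quant types whose file fits in memory_gb, largest first.

    headroom_gb is an assumed allowance, not a measured value.
    """
    budget = memory_gb - headroom_gb
    fitting = [
        (size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget
    ]
    # Sort by size, descending; ties keep the table's order.
    return [name for size, name in sorted(fitting, key=lambda item: -item[0])]


if __name__ == "__main__":
    for memory in (24, 32, 40):
        print(memory, quants_that_fit(memory))
```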

## How to check i8mm support for Apple devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 GGUF. All ARM architectures from ARMv8.6-A onward support i8mm, so Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

For Apple devices, run:

```
sysctl hw
```
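
The output of `sysctl hw` is long, so a small parser can help. This sketch assumes the feature is reported under the key `hw.optional.arm.FEAT_I8MM` in the usual `key: value` format, which is what recent macOS releases on Apple Silicon print; verify the key name on your own machine. The function names are illustrative.

```python
import subprocess


def has_i8mm(sysctl_output: str) -> bool:
    """Return True if the text contains a FEAT_I8MM entry set to 1.

    Assumes 'key: value' lines and the key hw.optional.arm.FEAT_I8MM.
    """
    for line in sysctl_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "hw.optional.arm.FEAT_I8MM":
            return value.strip() == "1"
    return False


def check_this_machine() -> bool:
    """Run `sysctl hw` (macOS only) and inspect its output."""
    result = subprocess.run(
        ["sysctl", "hw"], capture_output=True, text=True, check=True
    )
    return has_i8mm(result.stdout)
```

Calling `check_this_machine()` on a Mac returns `True` when the entry is present and set to 1; on other systems the `sysctl hw` call itself may fail.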

On the other hand, inference on an Nvidia 3090 is significantly faster with Q4_0 than with the other GGUFs, so for GPU inference you are better off using Q4_0.

## Which Q4_0 model to use for Apple devices

| Brand | Series | Model | i8mm | sve | Quant Type |
| ----- | ------ | ----- | ---- | --- | ---------- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
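
The table reduces to one rule per series, which can be sketched as a lookup. The function name is illustrative, and chip generations outside the range listed in the table are extrapolated from the same rule, which is an assumption.

```python
import re


def quant_for_apple_chip(chip: str) -> str:
    """Map a chip name such as 'M2 Pro' or 'A15' to a quant type per the table."""
    match = re.match(r"([AM])(\d+)\b", chip.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised chip name: {chip!r}")
    series, generation = match.group(1), int(match.group(2))
    # Per the table, i8mm first appears with the A15 and the M2.
    first_i8mm = {"A": 15, "M": 2}[series]
    return "Q4_0_4_8" if generation >= first_i8mm else "Q4_0_4_4"


if __name__ == "__main__":
    for chip in ("A14", "A15", "M1", "M2 Pro"):
        print(chip, quant_for_apple_chip(chip))
```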

## Convert safetensors to f16 gguf

Make sure you have cloned llama.cpp with git:

```
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
```

## Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled:

```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
```
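
The two conversion steps can be chained from a script. This is a sketch that assumes the working directory contains both `convert_hf_to_gguf.py` and the compiled `llama-quantize` binary, as the commands above imply; the function names are illustrative. Passing the arguments as a list avoids shell word splitting in file names.

```python
import subprocess
from pathlib import Path


def conversion_commands(model_dir: str, quant: str = "q4_0") -> list[list[str]]:
    """Build the convert and quantize commands for one model directory."""
    name = Path(model_dir).name  # directory name without a trailing slash
    f16_file = f"{name}.f16.gguf"
    quant_file = f"{name}.{quant.upper()}.gguf"
    return [
        ["python3", "convert_hf_to_gguf.py", model_dir,
         "--outfile", f16_file, "--outtype", "f16"],
        ["./llama-quantize", f16_file, quant_file, quant],
    ]


def run_conversion(model_dir: str, quant: str = "q4_0") -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in conversion_commands(model_dir, quant):
        subprocess.run(command, check=True)
```

For example, `run_conversion("Llama-3_1-Nemotron-51B-Instruct/")` would run the same two commands shown above.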

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
```
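
The same single-file download can also be done from Python with `hf_hub_download` from the `huggingface_hub` package installed above. This is a sketch: the helper names are illustrative, and the file name pattern is assumed from the files listed in this repository.

```python
REPO_ID = "ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF"


def gguf_filename(quant: str) -> str:
    """File name for a quant type, following this repository's naming."""
    return f"Llama-3_1-Nemotron-51B-Instruct.{quant}.gguf"


def download_quant(quant: str = "Q4_0", local_dir: str = ".") -> str:
    """Download a single GGUF file and return its local path."""
    # Imported lazily so gguf_filename works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID, filename=gguf_filename(quant), local_dir=local_dir
    )
```

Calling `download_quant("Q3_K_S")` would fetch only that file rather than the whole branch.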

## Running the model using llama-cli

First, download and compile my [modified llama.cpp-b4139](https://github.com/ymcki/llama.cpp-b4139) v0.2, then run:

```
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100
```

## Credits

Thank you, bartowski, for providing a README.md to get me started.