---
base_model: nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF
library_name: transformers
language:
- en
tags:
- nvidia
- llama-3
- pytorch
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: text-generation
quantized_by: ymcki
---

Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GGUF

## Prompt Template

```
### System:
{system_prompt}
### User:
{user_prompt}
### Assistant:

```

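For illustration, a prompt rendered from this template might look like the following; the system and user text here are placeholders, not part of the template:

```
### System:
You are a helpful assistant.
### User:
What is the capital of France?
### Assistant:

```
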
I [modified llama.cpp](https://github.com/ymcki/llama.cpp-b4139) to support DeciLMForCausalLM's variable Grouped Query Attention. Please download and compile it to run the GGUFs in this repository; a build sketch is given below. I am talking to the llama.cpp maintainers about getting my code merged into their codebase.

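A minimal build sketch, assuming the fork keeps upstream llama.cpp's CMake setup; the CUDA flag is only needed for Nvidia GPUs, and with this method the binaries land under `build/bin/`:

```
# clone and build the modified fork
git clone https://github.com/ymcki/llama.cpp-b4139
cd llama.cpp-b4139
cmake -B build -DGGML_CUDA=ON   # drop -DGGML_CUDA=ON for CPU/Apple builds
cmake --build build --config Release -j
```
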
This modification should fully support Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have no_op or linear FFN layers. Support for those can be added once models actually use such layers.

Since I am a free user, for the time being I only upload models that are likely to interest most people.

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Description |
| -------- | ---------- | --------- | ----------- |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf) | Q4_K_M | 31GB | Good for A100 40GB or dual 3090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf) | Q4_0 | 29.3GB | For 32GB cards, e.g. 5090 |
| [Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q4_0_4_8.gguf) | Q4_0_4_8 | 29.3GB | For Apple Silicon |
| [Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf) | Q3_K_S | 22.7GB | Largest model that can fit a single 3090 |

## How to check i8mm support for Apple devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. Every ARM architecture from ARMv8.6-A onward supports i8mm, which means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

For Apple devices, you can list the hardware capabilities with:

```
sysctl hw
```

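To narrow the output down to the i8mm flag, something like the following should work; on recent macOS releases the relevant key is usually `hw.optional.arm.FEAT_I8MM`, but treat the exact key name as an assumption:

```
# prints a line like "hw.optional.arm.FEAT_I8MM: 1" on chips with i8mm support
sysctl hw | grep -i i8mm
```
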
On the other hand, an Nvidia 3090 runs Q4_0 significantly faster than the other ggufs, so for Nvidia GPU inference you are better off using Q4_0.

## Which Q4_0 model to use for Apple devices

| Brand | Series | Model | i8mm | SVE | Quant Type |
| ----- | ------ | ----- | ---- | --- | ---------- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |

## Convert safetensors to f16 gguf

Make sure you have the llama.cpp repository git cloned (see the setup sketch and conversion command below).

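Assuming you have already cloned the fork as shown earlier, the conversion script's Python dependencies can be installed from the fork's `requirements.txt` (assumed to match upstream llama.cpp):

```
# install Python dependencies for convert_hf_to_gguf.py
cd llama.cpp-b4139
pip install -r requirements.txt
```

Then convert the safetensors checkpoint to an f16 gguf; the directory name below assumes the original model was downloaded into `Llama-3_1-Nemotron-51B-Instruct/`:
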
```
python3 convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16
```

## Convert f16 gguf to Q4_0 gguf without imatrix

Make sure you have llama.cpp compiled, then run:

```
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf q4_0
```

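The same command produces the other quantization types listed in the table above; only the type argument and output filename change. For example:

```
# other quant types from the same f16 gguf
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf q4_k_m
./llama-quantize Llama-3_1-Nemotron-51B-Instruct.f16.gguf Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf q3_k_s
```
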
## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF --include "Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf" --local-dir ./
```

## Running the model using llama-cli

First, download and compile my [modified llama.cpp-b4139](https://github.com/ymcki/llama.cpp-b4139) v0.2, then run:

```
./llama-cli -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -p 'You are a European History Professor named Professor Whitman.' -cnv -ngl 100
```

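If you prefer serving the model over HTTP instead of chatting in the terminal, the same build also produces `llama-server` (assuming the fork builds the standard upstream targets); the port below is arbitrary:

```
# OpenAI-compatible server on http://localhost:8080
./llama-server -m ~/Llama-3_1-Nemotron-51B-Instruct.Q3_K_S.gguf -ngl 100 --port 8080
```
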
## Credits

Thank you bartowski for providing a README.md to get me started.