Nemotron 51B
Can you please provide a Q8 and Q6 GGUF quant for: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct
And have you noticed any performance differences between this model and the 70B Nemotron model? I am trying to see what I can run on an M4 Macbook Pro (or Max) with 48GB or 64GB---and still have it be able to perform intelligent reasoning/etc.
The Q8 model from Bartowski runs exceptionally well and out-performs Claude3 in my opinion: https://huggingface.co/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF
However, there's no way I'm fitting that on a 64GB M4 Macbook Pro Max.
So I'm hoping to find something that I can run...and will work. If not I will just buy the M4 Macbook Pro 48GB and hope the M5 Macbook Pro Max will offer better performance when it comes out next year.
Thanks!
As already discussed in https://huggingface.co/mradermacher/model_requests/discussions/372, Llama-3_1-Nemotron-51B-Instruct is currently not supported by llama.cpp.
Why don't you just run Llama-3.1-Nemotron-70B in i1-Q5_K_M? You can get it from https://huggingface.co/mradermacher/Llama-3.1-Nemotron-70B-Instruct-HF-i1-GGUF and it will fit in 64 GiB RAM. The quality difference compared to unquantized is so minor that you will not notice it in real-world use cases. I know because I spent the past few months measuring and comparing the quality of quants, and I can share some plots if you want. Can the M1 only run Q8 at reasonable speed, or why are you so fixated on Q8? I would never use anything larger than i1-Q6, even though with 512 GiB I easily have enough RAM to run basically any model in F16.
I posted some plots under https://huggingface.co/mradermacher/Llama-3.2-3B-Instruct-uncensored-GGUF/discussions/2. This should help you make an informed decision about what hardware to buy.
I'm fixated on Q8 because I have been led to believe that at the end of the day---quants are essentially lobotomizing the full model. So, I hope that Q8 does not remove the part of the brain responsible for intelligent reasoning.
If I'm mistaken please correct me.
I'm a copywriter and I engage in intelligent conversations with AI when I am working. e.g. "What are your thoughts on this value prop and make a few suggestions on how to improve persuasiveness." Or "Rewrite this paragraph so that it does not contain complex sentences or redundant ideas from the paragraph below."
I have noticed that the smaller models don't produce as high-quality output as the larger models. Now I fully realize that this is subjective on my part and I could be completely wrong. But it seems to work.
And I'm not sure if I'm reading your chart correctly...but looking at Qwen 2 1.5B.....the Q8 shows 53.22% perplexity and the Q5_K_M shows 36.37% perplexity. Isn't that a fairly big difference? And I'm assuming that "perplexity" is associated with the model intelligence?
I haven't purchased the Mac yet--still gathering info so I can make a chart to see which model is best for me (48GB Pro or 64GB Pro Max).
I will download "Llama-3.1-Nemotron-70B in i1-Q5_K_M" right now and run it on my temporary AI rig (3x3090 on 7800x3d + 64GB DDR5 EXPO) and come back here and let you know my thoughts. I need a few hours because I am working right now and this is the perfect opportunity to test drive this quant and put it through its paces. My server motherboard is being RMA'd right now and I'm stuck with only being able to use 3x3090 on my gaming PC.
The size of the model obviously matters a lot for how lowering precision affects quality. Let's look at the Meta-Llama-3.1-70B-Instruct data, which is the closest to Llama-3.1-Nemotron-70B that I measured. A 70B model is 140 times larger than a 0.5B model, so it was likely the least fitting one you could have chosen. Besides that, perplexity is a relatively dumb measurement. Instead, focus on KL-divergence and token probability.
If we look at the correct token probability of Meta-Llama-3.1-70B-Instruct, we see that i1-Q5_K_M is 99.533% as good as the unquantized F16 model. KL-divergence is 0.0137 and perplexity is 1.7% larger than on F16. When we look at more real-world benchmarks like ARC, MMLU and WinoGrande, we see that i1-Q5_K_M performed equal to any higher quant.
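To make the three metrics above a bit more concrete, here is a minimal Python sketch of how they are commonly defined. It is not the measurement code behind the plots; the toy probabilities and the helper names (`kl_divergence`, `perplexity`, `relative_correct_token_prob`) are made up for illustration, and the exact way the "99.533% as good" figure was computed is an assumption on my part.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two distributions over the same vocabulary.

    P is the reference (e.g. F16) and Q the quantized model.
    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(correct_token_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)

def relative_correct_token_prob(f16_probs, quant_probs):
    """Mean correct-token probability of the quant relative to F16 (assumed definition)."""
    return (sum(quant_probs) / len(quant_probs)) / (sum(f16_probs) / len(f16_probs))

# Hypothetical next-token distributions over a 4-token vocabulary.
f16_dist = [0.70, 0.20, 0.07, 0.03]
quant_dist = [0.68, 0.21, 0.08, 0.03]

# Hypothetical probabilities each model assigned to the correct token at 4 positions.
f16_correct = [0.70, 0.55, 0.90, 0.40]
quant_correct = [0.68, 0.56, 0.89, 0.39]

print("KL divergence:", kl_divergence(f16_dist, quant_dist))
print("Perplexity F16:", perplexity(f16_correct))
print("Perplexity quant:", perplexity(quant_correct))
print("Relative correct token probability:", relative_correct_token_prob(f16_correct, quant_correct))
```

KL-divergence compares the whole output distribution of the quant against F16, which is why it is more informative than perplexity, which only looks at the probability of the single correct token.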
> I'm fixated on Q8 because I have been led to believe that at the end of the day---quants are essentially lobotomizing the full model. So, I hope that Q8 does not remove the part of the brain responsible for intelligent reasoning.
Quantization just lowers the precision of weights. No parameters or layers are removed. Running a larger model at lower precision always results in higher quality answers than running a smaller model at Q8. Many would argue that even Q5 is already somewhat overkill, as the optimal tradeoff between precision and parameter count is in the Q4 range.
There are different types of quants: static and weighted. Static quants just lower precision depending on the type of layer. For example, a Q5_K_M model will have some layers in Q5 and other, more important ones in Q6. There are, however, also weighted/imatrix/i1 quants, where an imatrix is computed based on a training set so that what is unimportant gets quantized more than what is important. As you can see on the above plot, there is a significant quality/size improvement when using weighted/imatrix quants.
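As a rough illustration of "lower precision, same number of parameters", here is a toy Python sketch of symmetric round-to-nearest quantization on one block of weights. This is deliberately much simpler than llama.cpp's actual K-quant and imatrix schemes; the function names and the sample weights are hypothetical.

```python
def quantize_block(weights, bits):
    """Symmetric round-to-nearest quantization of one block of weights.

    Returns integer codes plus one float scale for the block.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale for all-zero blocks
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical block of 8 weights.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]

restored_8bit = dequantize_block(*quantize_block(weights, bits=8))
restored_4bit = dequantize_block(*quantize_block(weights, bits=4))

print("weights kept (8-bit):", len(restored_8bit), "of", len(weights))
print("weights kept (4-bit):", len(restored_4bit), "of", len(weights))
print("mean abs error 8-bit:", mean_abs_error(weights, restored_8bit))
print("mean abs error 4-bit:", mean_abs_error(weights, restored_4bit))
```

Every weight is still there after quantization; only its value is rounded to a coarser grid, and the rounding error grows as the bit width shrinks.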
> I'm a copywriter and I engage in intelligent conversations with AI when I am working. e.g. "What are your thoughts on this value prop and make a few suggestions on how to improve persuasiveness." Or "Rewrite this paragraph so that it does not contain complex sentences or redundant ideas from the paragraph below."
> I have noticed that the smaller models don't produce as high-quality output as the larger models. Now I fully realize that this is subjective on my part and I could be completely wrong. But it seems to work.
For your use case, the larger the model the better. As soon as you have a use case that allows for some creativity, precision starts to matter way less. The larger the model, the more intelligent it gets and the higher quality text it produces. Precision only really matters if you use an AI model for math or multiple-choice tasks where only one answer is correct, like ARC/MMLU/WinoGrande. For your use case Q4 might be optimal, but for peace of mind I always go with i1-Q5_K_M.
The reason why I'm using i1-Q5_K_M and not just F16 despite having 512 GiB of RAM is that memory usage directly translates into inference speed. The less than 0.5% of cases where an F16 answer would be better than an i1-Q5_K_M one is not worth it: with the inference performance I gain, I can easily regenerate the answer if I'm not satisfied and still be much faster.
Heya, I won't be able to digest and respond intelligently to your above comments until after work today.
However, just wanted to say that I have been using your Llama-3.1-Nemotron-70B in i1-Q5_K_M for the past hour or so and I'm unable to tell a difference between this quant and Bartowski's Q8 quant of the same model. Also, I'm getting way faster tokens per second on the Q5 vs. Q8 (using 72GB of VRAM). I just need to confirm that a 64GB Pro Max M4 can run a 50GB Q5 quant--because Apple does not give you 64GB...you get like 3/4 of that. But that can be remedied by a quick terminal command that frees up as much memory as you need.
I just wonder how long the context can go before time to first token or inference speed slows down. It would be really nice to run your Q5 quant on a Macbook Pro Max without having to spin up the AI server in my home office that sucks down electricity and sounds like a literal jet engine plane (even after reducing the power to ~250w per card).
Thanks!
> I have been using your Llama-3.1-Nemotron-70B in i1-Q5_K_M for the past hour or so and I'm unable to tell a difference between this quant and Bartowski's Q8 quant of the same model
That is exactly what I expected based on my measurements and my own experience.
> I'm getting way faster tokens per second on the Q5 vs. Q8 (using 72GB of VRAM)
As mentioned before, the more you quantize a model the smaller it gets, which results in much faster speeds for such large models, as they are usually fully memory-bandwidth bottlenecked even if you run them on a GPU. This is exactly why I'm using i1-Q5_K_M even though I could run much larger quants. The performance gain and the resulting faster retries on bad answers are far more important than the unnoticeable amount of quality loss you get. I'm currently working on the massive project of measuring quality and performance with different devices on all quants of the Qwen 2.5 series of models, so I unfortunately don't have any performance plot to share yet.
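A back-of-the-envelope way to see the bandwidth argument: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by model size. The Python sketch below uses the roughly 50 GB Q5 size mentioned in this thread; the 75 GB Q8 size and the 900 GB/s bandwidth are assumptions I picked for illustration, not measurements.

```python
def max_tokens_per_second(model_size_gb, bandwidth_gb_per_s):
    """Upper bound on generation speed when fully memory-bandwidth bound.

    Assumes each generated token reads all weights from memory exactly once
    and ignores compute, KV cache reads and multi-GPU overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb

Q5_SIZE_GB = 50.0           # rough size of the 70B Q5 quant, as mentioned in the thread
Q8_SIZE_GB = 75.0           # assumption: approximate size of a 70B Q8_0 quant
BANDWIDTH_GB_PER_S = 900.0  # assumption: placeholder memory bandwidth

q5_bound = max_tokens_per_second(Q5_SIZE_GB, BANDWIDTH_GB_PER_S)
q8_bound = max_tokens_per_second(Q8_SIZE_GB, BANDWIDTH_GB_PER_S)
speedup = q5_bound / q8_bound

print(f"Q5 upper bound: {q5_bound:.1f} tokens/s")
print(f"Q8 upper bound: {q8_bound:.1f} tokens/s")
print(f"Expected Q5 speedup over Q8: {speedup:.2f}x")
```

The speedup is just the ratio of the two file sizes, so it holds regardless of which bandwidth figure you plug in. Real numbers will be lower than these bounds.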
> I just need to confirm that a 64GB Pro Max M4 can run a 50GB Q5 quant--because Apple does not give you 64GB...you get like 3/4 of that. But that can be remedied by a quick terminal command that frees up as much memory as you need.
You can always go with i1-Q5_K_S or i1-Q4_K_M should i1-Q5_K_M not fit. For i1-Q5_K_S the quality difference is still not noticeable, and for i1-Q4_K_M on such a large model you likely won't notice any difference either for your use case. I'm using Q4 on my laptop, as there I'm limited to 32 GiB of RAM, and in the end a larger model in Q4 is much better than a smaller one in Q5.
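Here is a small Python sketch of the fit check being discussed. The 64 GB total, the 3/4 default limit and the 50 GB Q5_K_M size come from this thread; the other two file sizes and the 4 GB allowance for context are rough assumptions, so check the actual file sizes in the repository before deciding.

```python
def usable_memory_gb(total_gb, usable_fraction=0.75):
    """Memory available to the model, given the fraction the OS lets the GPU use."""
    return total_gb * usable_fraction

def fits(model_gb, total_gb, usable_fraction=0.75, context_overhead_gb=4.0):
    """True if the weights plus an allowance for context fit in usable memory."""
    return model_gb + context_overhead_gb <= usable_memory_gb(total_gb, usable_fraction)

# Approximate 70B quant sizes in GB. Only the Q5_K_M figure is from the thread;
# the other two are assumptions for illustration.
QUANT_SIZES_GB = {
    "i1-Q5_K_M": 50.0,
    "i1-Q5_K_S": 48.7,
    "i1-Q4_K_M": 42.5,
}

TOTAL_GB = 64

# Default limit of roughly 3/4 of RAM.
default_fit = {name: fits(size, TOTAL_GB) for name, size in QUANT_SIZES_GB.items()}

# Hypothetical raised limit of 90% of RAM.
raised_fit = {name: fits(size, TOTAL_GB, usable_fraction=0.9) for name, size in QUANT_SIZES_GB.items()}

for name in QUANT_SIZES_GB:
    print(f"{name}: fits at default limit = {default_fit[name]}, fits at raised limit = {raised_fit[name]}")
```

Under these assumptions only the Q4_K_M quant fits at the default limit on a 64 GB machine, while all three fit once the limit is raised.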
> I just wonder how long the context can go before time to first token or inference speed slows down. It would be really nice to run your Q5 quant on a Macbook Pro Max without having to spin up the AI server in my home office that sucks down electricity and sounds like a literal jet engine plane (even after reducing the power to ~250w per card).
With a Macbook you will unfortunately miss out on Flash Attention 2, which you currently have on your RTX 3090 server, but alternatives will work, so the memory used for context should still be reasonable. You could also just keep two models: one in Q5 for normal context and one in Q4 for extra-large context. However, my workload is very different from yours. I just ask a question, collect maybe 50 responses from different AI models, and then use a long-context-specific model to combine all of them into a single response. In case you wonder, my current favorite AI model is [nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored](https://huggingface.co/nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored), which I finetuned myself based on [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).
Thanks and I appreciate the lesson. Ok, just need to figure out how I'm going to pay for this M4 Max lol. I should have it in about 2-3 weeks.
I decided against selling the AI rig because I want to learn how to fine tune models and I know that's way better than a Mac.
> I decided against selling the AI rig because I want to learn how to fine tune models and I know that's way better than a Mac.
Great decision to keep it. A Macbook is definitely not a replacement for a GPU server and is not any good for finetuning. If you need any help finetuning models, just let me know and I can help you. I have helped finetune many models and even created my own finetunes. But in short, just use axolotl with a good dataset. For nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored I used Guilherme34/uncensor, which is an awesome dataset to uncensor models but might also serve as an example of how to create a dataset with only very limited training data. The entire axolotl configuration I used is inside the model card. 2x RTX 3080 is enough to finetune a 70B model using 4-bit precision, which should be precise enough for this type of finetuning. If you are very satisfied with the result, you can always rent a container on RunPod for a few hours (depending on the size of your dataset) to redo the finetune in 16-bit, but I don't think it would make much of a difference.
Only two minor additions - I think there is a way to tell osx to give you most of the RAM and remove the artificial 3/4th limit (at least I often saw how people claimed it on reddit), and we also have a Q8_0 (in the static repository).
@nicoboss
I really appreciate the graphs. I've been looking for KL Divergence measurements of a recent large LLM.
> I'm currently working on the massive project to measure quality and performance with different devices on all quants of the Qwen 2.5 series of model so I unfortunately don't have any performance plot to share yet.
I just want to bring your attention to this https://github.com/ikawrakow/ik_llama.cpp which has the goal of additional SOTA quants and improved performance.
> I really appreciate the graphs. I've been looking for KL Divergence measurements of a recent large LLM.
Please take a look at https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#674a7958ce9bc37b8e33cf55. I uploaded the measurements of the entire Qwen 2.5 series of models 3 days ago. You can find the raw measurements under https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst
I also did measurements for Llama 405B and many other models, whose results you can find under https://www.nicobosshard.ch/LLM-Eval_v2.tar.zst and plots under https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2#6732972fc7b41d86099eb5d9. The code to generate the plots yourself is included as well.
> I just want to bring your attention to this https://github.com/ikawrakow/ik_llama.cpp which has the goal of additional SOTA quants and improved performance.
Awesome I will for sure take a look at it.
@mradermacher This is now finally properly supported, as llama.cpp merged https://github.com/ggerganov/llama.cpp/pull/11008 - it took four months and many requants from bartowski, but better late than never.
Duplicate requests:
Uuuh, cool :) Will queue as soon as llama is done compiling.