Are these quants of your own fine-tuned model or of the original model?
Hi, I am wondering whether these quants were made from the original 70B model, or whether you first fine-tuned the 70B model and then generated the quants from that. If they came from a fine-tune, I would expect them to have a shorter context length. If they have the same context length as the original (that is, they are quants of the original), they would be more useful.
Also, is the q2 version of the model usable (is it good enough)? How do you measure the quality of a quant after the model is quantized? I saw your blog post about 4-bit dynamic quants; does that apply to other quants? Did you use that method for q2, which would make it usable? Otherwise perhaps a lot will be "lost in translation".
Thanks
We do not do any fine-tuning for these GGUFs. They are just the original model in GGUF format.
Our 4bit dynamic quant is only for 4-bit bnb models. Currently the only models using this methodology are located here: https://huggingface.co/collections/unsloth/unsloth-4-bit-dynamic-quants-67503bb873f89e15276c44e7
That is great, thank you. Could you also share a link to a guide where I can learn about dynamic quants and how to quantize Llama 3.1 8B (and Llama 3.2 1B and 3B for text)? Or, if you have a Colab notebook ready for any of those models, please share the link. Thank you.
We have a blogpost explaining some of the process: https://unsloth.ai/blog/dynamic-4bit
We haven't uploaded the text models with dynamic 4bit but will soon.
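In the meantime, here is a rough illustration of the general idea of leaving some parameters unquantized: round-trip each tensor through a low-bit format, measure the error, and keep the tensors that suffer most in 16-bit. This is a toy sketch, not the pipeline used for the uploaded models. It uses plain blockwise absmax rounding rather than NF4, it looks at weights only, and the function names (`quantize_absmax`, `relative_error`, `pick_precision`) and the threshold are invented for the example.

```python
import random


def quantize_absmax(weights, bits=4, block_size=64):
    """Round-trip weights through symmetric blockwise absmax quantization."""
    if bits < 2:
        raise ValueError("this toy scheme needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1  # 7 integer steps per sign for 4-bit
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / levels if absmax > 0 else 1.0
        # Quantize to an integer grid, then immediately dequantize.
        restored.extend(round(w / scale) * scale for w in block)
    return restored


def relative_error(original, restored):
    """RMS round-trip error divided by the RMS of the original weights."""
    err = sum((a - b) ** 2 for a, b in zip(original, restored))
    ref = sum(a ** 2 for a in original)
    return (err / ref) ** 0.5 if ref else 0.0


def pick_precision(tensors, threshold, bits=4):
    """Map each tensor name to '16bit' (left alone) or '<bits>bit'."""
    plan = {}
    for name, weights in tensors.items():
        error = relative_error(weights, quantize_absmax(weights, bits=bits))
        plan[name] = "16bit" if error > threshold else f"{bits}bit"
    return plan


if __name__ == "__main__":
    random.seed(0)
    gauss = [random.gauss(0.0, 1.0) for _ in range(4096)]
    # Values that already sit on the 4-bit grid, so the round trip is lossless.
    on_grid = [float(v) for v in range(-7, 8)] * 64

    for bits in (2, 4, 8):
        err = relative_error(gauss, quantize_absmax(gauss, bits=bits))
        print(f"{bits}-bit relative error on Gaussian weights: {err:.4f}")

    print(pick_precision({"on_grid": on_grid, "gauss": gauss}, threshold=0.05))
```

The same round-trip error measurement is one simple way to compare bit widths: on the same weights, fewer bits give a larger error. Weight error alone does not tell you how usable a model is, so in practice you would also compare perplexity or task accuracy against the 16-bit model.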
Thanks, but what is this: unsloth/Llama-3.3-70B-Instruct-bnb-4bit?
It looks like a 4-bit quantization (the original model has 70B parameters, but this copy is only about 40 GB). If it is a quantized version made with dynamic quants, then it is probably as good as the full 70B model at 16-bit precision, right?
Is it possible to quantize to 1 bit or below with dynamic quants (keeping quality almost as good as the original)? Could you share the script for that as well?
The bnb-4bit version is quantized using bitsandbytes and is unrelated to the GGUF version.
Unfortunately, it is not the Unsloth dynamic 4-bit quantized version, so the accuracy might not be as good, but it is good enough.
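On the file size, a back-of-the-envelope estimate lands close to the roughly 40 GB mentioned above. The sketch below is arithmetic under stated assumptions, not a measurement of the actual files: it assumes a round 70e9 parameters, and for the 4-bit case it assumes one 32-bit scale per block of 64 weights, which adds 0.5 bits per weight. The real checkpoint may differ, for example if the scales are themselves compressed or if some layers are stored in 16-bit. The helper name `approx_weight_size_gb` is invented for the example.

```python
def approx_weight_size_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9            # assumed round parameter count
SCALE_OVERHEAD = 32 / 64   # assumed: one 32-bit scale per 64-weight block

if __name__ == "__main__":
    cases = [
        ("16-bit", 16),
        ("4-bit, no overhead", 4),
        ("4-bit + block scales", 4 + SCALE_OVERHEAD),
    ]
    for label, bits in cases:
        print(f"{label}: {approx_weight_size_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit model needs about 140 GB and the 4-bit one about 35 to 39 GB, so the smaller file reflects lower precision per weight. Its accuracy is not guaranteed to match the 16-bit model.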
It is OK if an Unsloth-quantized version of my desired model does not exist; however, it would be nice to be able to quantize the model myself. For that, can you share the script/code you used to quantize these models: https://huggingface.co/collections/unsloth/unsloth-4-bit-dynamic-quants-67503bb873f89e15276c44e7 , or better yet, the code that produces the graphs in the blog post: https://unsloth.ai/blog/dynamic-4bit ?
Thank you