Context size
"For 8GB VRAM GPUs, I recommend the Q4_K_M-imat quant for up to 12288 context sizes."
Isn't Llama 3 only trained for an 8192-token context, or did the finetuning on the additional training data effectively increase the context size? If so, how big is the context size after finetuning?
I did say "up to"; you can stay at 8192 for native context, but...
KoboldCpp will do automatic Rope scaling and handle it. Use the latest version.
From https://github.com/LostRuins/koboldcpp/releases/tag/v1.63:
Reworked the Automatic RoPE scaling calculations to support Llama3 (just specify the desired --contextsize and it will trigger automatically).
It should Rope well. I imagine it will be alright for at least 16K.
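For reference, a minimal launch sketch, assuming a setup like the one discussed here; the model filename and layer count are placeholders, while --contextsize (from the release note above), --gpulayers and --usecublas are standard KoboldCpp flags:

```python
# Hypothetical launch sketch: start KoboldCpp with an extended context so the
# automatic RoPE scaling from v1.63+ kicks in. The model path and layer count
# are placeholders, not values taken from this thread.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "Poppy_Porpoise-v0.4-L3-8B-Q4_K_M-imat.gguf",  # placeholder filename
    "--contextsize", "16384",  # anything above the native 8192 triggers automatic RoPE scaling
    "--gpulayers", "33",       # full offload for an 8B Llama, if it fits in VRAM
    "--usecublas",             # CUDA acceleration on NVIDIA GPUs
])
```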
Been running everything at 32K since the instruct release; needle-in-a-haystack tests came back good up to 34K IIRC. @Lewdiculous @WesPro
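If anyone wants to reproduce that kind of check, here is a rough needle-in-a-haystack sketch against a running KoboldCpp instance; it assumes the standard Kobold API endpoint on the default port, and the filler text, needle and context size are made up for the example:

```python
# Rough needle-in-a-haystack check against a local KoboldCpp instance.
# Assumes KoboldCpp is already running on its default port (5001) with a large
# --contextsize; the filler text, needle and depth are arbitrary.
import requests

NEEDLE = "The secret passphrase is 'violet-kumquat-42'."
filler = "The quick brown fox jumps over the lazy dog. " * 2500  # roughly 25-30k tokens of padding

# Bury the needle around the middle of the haystack.
half = len(filler) // 2
prompt = (
    filler[:half] + "\n" + NEEDLE + "\n" + filler[half:]
    + "\n\nQuestion: What is the secret passphrase?\nAnswer:"
)

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": prompt, "max_length": 32, "max_context_length": 32768},
    timeout=600,
)
print(resp.json()["results"][0]["text"])  # should contain 'violet-kumquat-42' if retrieval works
```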
Even better. @WesPro Just watch your VRAM usage; don't let it fill up and it will be good. The IQ4_XS quant will still give you good quality at good speeds if you want to push for 16K at this VRAM size.
Unfortunately, back when I bought my laptop last year I wasn't interested in LLMs yet and didn't even know running them locally was possible, so I only looked for a good CPU and lots of RAM. I figured I'd only use it for music production/audio software anyway, and 99% of music software runs on the CPU only and is pretty RAM-intensive; there are no programs or plugins (yet) that can utilize GPU acceleration. So there was no need for a high-end GPU, and that's why I only got a GeForce RTX 3050 with 4GB VRAM...
What quant would you recommend for my case (i7-12700H, 64GB DDR5 RAM, GeForce RTX 3050 with 4GB VRAM)? Is there a way to use my DDR5 RAM for the context tokens instead of VRAM so I can load more layers onto the GPU? Should I generally load as many layers as possible onto the GPU, or should I leave a certain amount of VRAM free? I don't need the RTX 3050 for general graphics tasks because the Intel CPU's integrated GPU handles the basic stuff, so there's no need to hold anything back for that. I'm just not sure what the best settings are given my limited options...
Honestly I think you'd need some IQ2 quants; they "might" fit entirely in VRAM.
But you can run it as you're used to if you don't mind the speed of running only on CPU and RAM; it will work the same, just slowly.
I'll try to make the smallest possible quants and you can test if they are coherent.
For science!
Surely...
IQ1_S : 1.56 bpw - https://huggingface.co/Lewdiculous/Poppy_Porpoise-v0.4-L3-8B-GGUF-IQ-Imatrix/blob/main/Poppy_Porpoise-v0.4-L3-8B-IQ1_S-imat.gguf (2 GB)
IQ2_XXS : 2.06 bpw - https://huggingface.co/Lewdiculous/Poppy_Porpoise-v0.4-L3-8B-GGUF-IQ-Imatrix/blob/main/Poppy_Porpoise-v0.4-L3-8B-IQ2_XXS-imat.gguf (2.34 GB)
IQ2_S : 2.5 bpw - https://huggingface.co/Lewdiculous/Poppy_Porpoise-v0.4-L3-8B-GGUF-IQ-Imatrix/blob/main/Poppy_Porpoise-v0.4-L3-8B-IQ2_S-imat.gguf (2.7 GB)
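As a sanity check on those sizes: file size roughly tracks bits-per-weight times parameter count, plus some overhead because the embedding and output tensors are stored at higher precision than the quoted bpw. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope GGUF size estimate: params * bpw / 8 bits per byte.
# Real files come out somewhat larger because token embeddings and the output
# head are kept at higher precision than the quoted bits-per-weight.
PARAMS = 8.03e9  # approximate Llama-3-8B parameter count

for name, bpw in [("IQ1_S", 1.56), ("IQ2_XXS", 2.06), ("IQ2_S", 2.5)]:
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{gb:.2f} GB from bpw alone")

# IQ1_S:   ~1.57 GB from bpw alone (actual file: 2 GB)
# IQ2_XXS: ~2.07 GB from bpw alone (actual file: 2.34 GB)
# IQ2_S:   ~2.51 GB from bpw alone (actual file: 2.7 GB)
```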
Why not offload a certain number of layers? KoboldCpp usually suggests running ~13 layers of a Q5_K_M 7B model, which puts ~2500MB on CuBLAS/VRAM and the rest on CPU/RAM, at least that's what KoboldCpp shows when loading the model. The speed is OK for 7B. Processing the first prompt can take 5 to 30 seconds when the prompt is huge, but the following answers come fast and usually generate as fast as or faster than I can read, so it's fine speed-wise. Still, if I can improve something it would be nice, especially for models bigger than 11B.
I don't split layers myself because running entirely on GPU is much faster. But yeah, you can offload. Experiment with offloading then and use the model as usual.
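If you want a starting point instead of pure trial and error, one very rough heuristic is to divide the VRAM you can spare by the per-layer share of the file size; everything below is a placeholder sketch, and KV cache plus CUDA overhead also eat into the budget, so treat the result as an upper bound:

```python
# Very rough heuristic for a starting --gpulayers value. All numbers are
# placeholders; KV cache, context size and CUDA runtime overhead also use
# VRAM, so the real sweet spot is usually a few layers lower.
def suggest_gpulayers(model_file_gb, total_layers, free_vram_gb, reserve_gb=1.0):
    per_layer_gb = model_file_gb / total_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0)
    return min(total_layers, int(usable_gb / per_layer_gb))

# Example: a ~5 GB Q5_K_M 7B file, 33 offloadable layers, a 4 GB card.
print(suggest_gpulayers(model_file_gb=5.0, total_layers=33, free_vram_gb=4.0))
# -> 19 layers; KoboldCpp's own suggestion of ~13 is more conservative.
```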
OK, I will test them now and report how usable they are and how big the difference in speed is.
The goal is to offload everything with these small quants. Your VRAM shouldn't be TOTALLY full; I'd say see what fits within about 3.7GB of dedicated VRAM usage.
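To make that 3.7GB target concrete: dedicated VRAM usage is roughly the offloaded weights plus the KV cache, which for Llama 3 8B (32 layers, 8 KV heads, head dim 128, fp16 cache) grows by about 0.13MB per context token. A rough budget sketch, assuming full offload:

```python
# Rough VRAM budget for a fully offloaded Llama-3-8B GGUF.
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # 131072 bytes, ~0.13 MB

def vram_estimate_gb(model_gb, context_tokens):
    # Ignores compute buffers and CUDA runtime overhead, which add a bit more.
    return model_gb + KV_BYTES_PER_TOKEN * context_tokens / 1e9

print(vram_estimate_gb(2.34, 8192))  # IQ2_XXS at 8K -> ~3.41 GB, under 3.7 GB
print(vram_estimate_gb(2.70, 8192))  # IQ2_S at 8K   -> ~3.77 GB, just over the target
```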
How does the IQ2_XXS do?
I only expect under-Q3 quants to be usable with much bigger models. That seems to be the practical limit.
IQ3_XXS is the next one I tested, and it is the lowest quant I would call usable, but the loss of quality is pretty significant compared to Q5_K_M. I guess you are right that everything under (I)Q3 isn't really worth trying. I also tried bigger models like 4x7B, 8x7B, 20B or 30B in IQ3_S or IQ3_M, and my experience is that at that size the quants aren't that much worse than a higher quant, but everything under (I)Q4 is still not something you should use if you can avoid it. You would probably need a certain amount of chat experience with the model at a higher quant to actually notice the quality difference over time, but IQ3/Q3 seems to be the first quant that is usable if you are desperate and want to try a model you can't run at a higher quant. Subjectively, I also feel the IQ3 is a little better than the closest equivalent Q3_K quant. Speed-wise, IQ1_S ran at about 27 tokens/s and IQ3_XXS at 12-13 tokens/s, so both are still faster than my normal reading speed. I have no idea yet how imatrix affects quants, though.
This is IQ2_XXS
Speed was around 17-18 tokens per second. It is significantly better than IQ1_S but still not usable for anything.
This is IQ2_S
IQ2_S is again significantly better than IQ2_XXS and runs at 15 tokens/s, so speed-wise there's not a big difference between IQ2_XXS and IQ3_XXS. It's now at least coherent and you understand what is meant. I guess even a little more precision reduces the amount of "wrong" predicted tokens quite noticeably, so coherence and writing style get significantly better too. In my opinion the gain in speed is not worth the loss in quality, so I will stick to quants that are at least IQ4_XS or higher.
Well, now we know for sure. The suspicions hold true. Let's wait for better quants, who knows what happens in the future. Thanks mate.