Question about choosing the right version
Hi mradermacher,
I want to choose between IQ4_NL, Q4_K_S and Q4_K_M. What I always thought is that IQ4_NL is the worst quality of the three but the fastest, Q4_K_S is better in quality but slower than IQ4_NL, and Q4_K_M is the best quality but the slowest of the three. It confuses me that you write Q4_K_M is fast and recommended but Q4_K_S is optimal too, and how do they compare to IQ4_NL? Can you explain it to me?
Things are pretty complicated. What is fastest depends on whether you use a cpu or a gpu to do the calculations, whether and how many layers you offload to the gpu, which cpu you have and even how big the model is. For example, on a 4 core cpu, the IQ-quants might be cpu-bound, so a Q4_K_S might be much faster, while on a 20 core cpu, practically everything will be memory-speed bound, so a smaller IQ quant is faster. On top of that, different software handles things differently.
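As a rough back-of-the-envelope illustration of the memory-bound case: if you assume every weight has to be read from RAM once per generated token (a simplification), the file size and the memory bandwidth give an upper bound on speed. The sketch below uses the theoretical peak of dual-channel DDR4-3200 and two made-up file sizes; the sizes and the function name are assumptions for illustration, not measurements.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed when memory-bound.

    Assumes (simplification) that all weights are read once per token.
    """
    return bandwidth_gb_s / model_size_gb


# Dual-channel DDR4-3200: 2 channels * 8 bytes * 3200 MT/s, theoretical peak.
bandwidth = 2 * 8 * 3200 / 1000  # GB/s

# Hypothetical file sizes for illustration only, not taken from a real model.
sizes_gb = {"smaller quant": 4.0, "larger quant": 4.4}

for name, size in sizes_gb.items():
    limit = max_tokens_per_second(size, bandwidth)
    print(f"{name}: at most {limit:.1f} tokens/s")
```

Real numbers will be lower than this bound, and the estimate says nothing about the cpu-bound case, where the cost of decoding the quant format matters more than its size.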
In the end, you'd have to benchmark in your configuration what is best, and be aware that this can change with model size (and other parameters - sometimes 4 threads are faster than 20, sometimes the opposite is true, for different models).
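A minimal sketch of such a benchmark, assuming your inference software can be started from the command line with a fixed prompt and a fixed number of tokens to generate. The `make_cmd` callback is hypothetical: it has to return whatever command line your software actually needs for a given quant file and thread count. Wall-clock time here includes model loading, so run each configuration more than once before trusting the numbers.

```python
import itertools
import subprocess
import time


def time_command(cmd: list[str]) -> float:
    """Run one command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def sweep(make_cmd, quants, thread_counts, runner=time_command):
    """Time every (quant, threads) combination.

    make_cmd(quant, threads) must return the argument list to execute;
    it is a placeholder for your own software's command line.
    """
    results = {}
    for quant, threads in itertools.product(quants, thread_counts):
        results[(quant, threads)] = runner(make_cmd(quant, threads))
    return results


def fastest(results):
    """Return the (quant, threads) key with the smallest elapsed time."""
    return min(results, key=results.get)
```

You would call `sweep` with your own quant files and thread counts, then pass the result to `fastest`. If your software ships its own benchmark tool, that is likely the better choice, since it can report prompt processing and generation speed separately.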
The guidance is meant for people who have absolutely no clue what these things are, so they can choose the recommended option and not be totally off.
For a while now I have been planning to do some benchmarks (cpu-only, cpu+gpu, gpu-only) with various quants to give better guidance, especially with the new quants, but I am not sure when I can get to it.
Ok thanks anyway.
I tried Q4_K_S and Q4_K_M and didn't notice a big difference while offloading 10 layers to the GPU. I have an i5-12500H CPU, 64 GB of 3200 MHz RAM and an RTX 3050 with 4 GB VRAM. After it has processed the system prompt and the first message, generation is pretty much as fast as I can read, of course after waiting a few seconds until I see the first token.
Good to hear :) The Q4 quants are pretty good for cpus, and about the lowest you can go without noticeable fidelity loss.