This is a whole bunch of conversions of qwen7b v2, made in an attempt to fix the performance loss from quantizing. The bf16 versions will NOT work on Apple GPUs, but will work on most CPUs and newer NVIDIA cards (older ones like the 1080 series don't support bf16 inference well).
Perplexity benchmarks will come later, once an automated suite is written by me or whoever. Sorry, I have just been too busy, and doing those properly for each quant takes all day.
Model names should be self-explanatory. Just pick the biggest one that your hardware can run.
Overall, in this experiment I noticed that quantizing the embedding weight had much less effect on perplexity than expected. Even at q4k it didn't harm the model much, but below q4k the damage to intelligence was drastic.
Quantizing the output weight to q8, on the other hand, was fine, or nearly so, as long as it was done directly from bf16 instead of converting to f16 first (which drops 3 exponent bits of range) and then quantizing to q8. Going any lower caused a lot of issues. Further improvements are coming in the future, as imatrix optimizations still do not work well when applied to the output weight.
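To make the bf16 vs f16 point a bit more concrete, here is a toy sketch of what the extra conversion step can do to individual values. bf16 has 8 exponent bits and 7 mantissa bits, while f16 has 5 exponent bits and 10 mantissa bits, so f16 covers a much narrower range. The helper names `to_bf16` and `to_f16` are made up for this illustration, and the sample values are arbitrary; this is not the conversion code used for these uploads and it does not measure anything about the actual model weights.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value (hypothetical helper).

    bf16 is the top 16 bits of a float32, so we round the float32
    bit pattern to nearest-even and clear the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # keep sign, 8 exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_f16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the f16 range; treat them as infinity
        return float("inf") if x > 0 else float("-inf")

# Arbitrary sample values: one tiny, one mid-range, one large.
for value in (1e-8, 0.1234, 1e5):
    as_bf16 = to_bf16(value)
    via_f16 = to_f16(as_bf16)
    print(f"{value:>10g}  bf16={as_bf16:<14g} bf16->f16={via_f16:g}")
```

In this sketch the mid-range value survives the bf16 to f16 step unchanged, while the tiny one flushes to zero and the large one overflows. How much of that matters for a given tensor depends on its actual value distribution, which this example does not look at.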
Anyway, I just uploaded what I had experimented with so far, in case anyone wants to carry on the work. Feel free to just use the biggest weight you can run. q4k with 8-bit output works very well, and iq4xs with 8-bit output is probably the best performance/intelligence ratio for any 7b model that exists right now, in my opinion.
Cheers, Nisten