README.md · Nexesenex/Meta_Llama-3.1-8b-it_iMat_Custom_Quant_Stategies-GGUF at ba7258b58dd5247b14b66efe1310ddd62e1839c1

license: llama3.1

Experimental .GGUF quants for https://huggingface.co/google/gemma-2-9b-it, made according to the LCPP PR https://github.com/ggerganov/llama.cpp/pull/8836 (based on b_3529, and on b_3565 for the newer ones).

These experimental quant strategies, which revisit Ikawrakow's work, show a slight decrease in perplexity, including per bpw (from 10%+ for the lowest quants to 0.x% for the highest ones). This is significant enough to encourage you folks to test them and to provide feedback where pertinent.

The iMatrix I use is based on Group Merged V3, enriched with a bit of French, a bit of Serbian, and a bit of Croatian.

ARC and PPL-512 DATA (get the latest data in the main post of the PR thread):


IQ3_M

Master
Size : 3.52 GiB (3.76 BPW)  
PPL 512 wikitext : 7.9263 +/- 0.04943

IQ3_M

PR
Size : 3.49 GiB (3.73 BPW)
PPL 512 wikitext : 7.8704 +/- 0.04951

IQ4_XS

Master
Size : 4.13 GiB (4.42 BPW)
Arc-C 299     49.16387960    
Arc-E 570     72.10526316     
PPL 512 wikitext : 7.5226 +/- 0.04820

IQ4_XSR

PR
Size : 4.16 GiB (4.45 BPW)
Arc-C 299    
Arc-E 570      
PPL 512 wikitext : 7.5072 +/- 0.04814

FP16

MASTER : Gemma 2 9b It F16.
Size : 14.96 GiB (16.00 BPW)
Arc-C 299     49.49832776
Arc-E 570     73.85964912
PPL 512 wikitext : 7.3224 +/- 0.04674
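
To make the Master vs PR comparison easier to read, here is a minimal Python sketch that computes the relative change in size and in PPL-512 for each pair listed above, plus the remaining gap to FP16. The numbers are copied from this page; the dictionary layout and the helper name `relative_change_pct` are only illustrative and are not part of llama.cpp or of the PR.

```python
# Figures copied from the listings above: (size in GiB, PPL-512 on wikitext).
# The labels and the layout of this table are illustrative only.
RESULTS = {
    "IQ3_M Master": (3.52, 7.9263),
    "IQ3_M PR": (3.49, 7.8704),
    "IQ4_XS Master": (4.13, 7.5226),
    "IQ4_XSR PR": (4.16, 7.5072),
    "FP16": (14.96, 7.3224),
}


def relative_change_pct(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


# Each PR quant is compared with the Master quant it is listed next to.
PAIRS = [("IQ3_M PR", "IQ3_M Master"), ("IQ4_XSR PR", "IQ4_XS Master")]

fp16_ppl = RESULTS["FP16"][1]
for pr_name, master_name in PAIRS:
    pr_size, pr_ppl = RESULTS[pr_name]
    master_size, master_ppl = RESULTS[master_name]
    size_delta = relative_change_pct(pr_size, master_size)
    ppl_delta = relative_change_pct(pr_ppl, master_ppl)
    gap_to_fp16 = relative_change_pct(pr_ppl, fp16_ppl)
    print(
        f"{pr_name} vs {master_name}: "
        f"size {size_delta:+.2f}%, PPL {ppl_delta:+.2f}%, "
        f"PPL gap to FP16 {gap_to_fp16:+.2f}%"
    )
```

A negative PPL change means the PR quant scores better than Master. With the figures above, the gain is roughly 0.7% for IQ3_M and roughly 0.2% for IQ4_XSR against IQ4_XS, both smaller than the reported +/- intervals would let a single run separate cleanly, so more test runs are welcome.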