IQ3_M Quant?
Sorry for being a bother again...
Is there somewhere I can read about IQ3_M ("Medium-low quality, new method with decent performance comparable to Q3_K_M")? What is it supposed to deliver? What is it good at, and how?
(If someone already knows, that saves a ton of time and probably makes for a simpler explanation too. Of course I'll be googling it, but I figured there's no harm in asking :p)
Edit: after reading more I'm retracting this; I'll figure out a more correct answer and update.
Okay, I think I have a better idea of how it works (it's a doozy).
The TL;DR is that, instead of storing the quantized weights themselves, IQ quants store lookup indices into a predefined grid of values, plus a couple of extra values that describe how to adjust the retrieved values to recover the original weights (+/- signs, a scale factor).
I only read up on IQ2, not IQ3, but they should work similarly, and IQ2 makes some sense. Read on for a more in-depth attempt at understanding and explaining...
The lookup grid (for IQ2) has 256 entries.
They group weights into "superblocks" of 256, and then again into subblocks of 32.
Each of those subblocks has 4 "rows" of 8 weights, and for each of those rows they use 16 bits to store information (yielding ~2 bits per weight).
To achieve this, they use 8 bits to store an index that maps to one of the 256 values in the grid, defining the "starting point".
Then, of the remaining 8 bits, 7 are used to store signs (+/-) and 1 contributes to a scaling factor that is applied to the entire subblock.
The reason 7 bits are enough for the signs is that it seems an even number of minus signs is enforced per row, even going so far as to flip an "unimportant" weight to the wrong sign to guarantee it. That means only 128 of the 256 possible sign patterns can occur (the other 128 have an odd number of minus signs), so the 8th sign can be derived from the other 7.
(btw my guess would be that IQ3 does away with this: since it's not trying to aggressively constrain down to 2 bits per weight, it can probably afford to store that information)
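To make that concrete, here's a rough sketch in plain C of decoding one 16-bit "row" the way I understand it. The grid contents, the exact bit positions, and the idea that the grid index yields a single base magnitude are all my assumptions for illustration (I believe the real grid entry actually packs magnitudes for the whole row), so don't read this as the actual llama.cpp code:

```c
#include <stdint.h>

static float iq2_grid[256];  /* the shared 256-entry lookup grid (values defined by the format) */

/* Decode 8 weights from one 16-bit row (hypothetical bit layout):
 *   bits[7:0]  -> index into the 256-entry grid (the "starting point")
 *   bits[14:8] -> sign bits for weights 0..6
 *   bit 15     -> one bit of the per-subblock scale (collected by the caller)
 * `scale` is the subblock scale the caller has already reconstructed. */
void decode_row(uint16_t bits, float scale, float out[8]) {
    float base = iq2_grid[bits & 0xFF];
    uint8_t signs = (bits >> 8) & 0x7F;

    /* Parity trick: an even number of minus signs is enforced per row, so the
     * 8th sign is just the parity (XOR) of the 7 stored sign bits. */
    int sign7 = __builtin_parity(signs);

    for (int i = 0; i < 7; ++i)
        out[i] = scale * base * (((signs >> i) & 1) ? -1.0f : 1.0f);
    out[7] = scale * base * (sign7 ? -1.0f : 1.0f);
}
```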
On top of all this, I think the entire superblock is calculated basically in parallel, even using the thread ID assigned to the function call to index into the block of weights, which I think may be part of the reason it's faster on CUDA/AVX: it can perform a lot of the memory lookups and calculations simultaneously.
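I haven't verified the kernel structure, but the layout I'm guessing at would look roughly like this in plain C, with `decode_row` being the sketch above and `sub_scales` a hypothetical array of already-reconstructed per-subblock scales; on CUDA the outer loop would disappear and `tid` would come straight from the thread index:

```c
#include <stdint.h>

void decode_row(uint16_t bits, float scale, float out[8]);  /* sketch from above */

/* 256 weights = 8 subblocks x 4 rows x 8 weights. Every subblock is
 * independent, so each could be handled by its own GPU thread / SIMD lane,
 * indexed directly by its id. */
void decode_superblock(const uint16_t rows[8][4], const float sub_scales[8],
                       float out[256]) {
    for (int tid = 0; tid < 8; ++tid)      /* one "thread" per 32-weight subblock */
        for (int r = 0; r < 4; ++r)        /* 4 rows of 8 weights each */
            decode_row(rows[tid][r], sub_scales[tid], &out[tid * 32 + r * 8]);
}
```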
I'll ping @ikawrakow since he's the genius behind the actual implementation, but wouldn't blame him if he doesn't have time to fact check this
Thanks for taking the time to read up, digest, and update!