IQ3_M Quant?
Sorry for being a bother again...
Is there somewhere I can read about IQ3_M ("Medium-low quality, new method with decent performance comparable to Q3_K_M")? What is it supposed to deliver? What is it good at, and how?
(If someone already knows, that saves a ton of time and probably makes for a simpler explanation too. Of course I'll be googling it, but I figured there's no harm in asking :p)
Edit: after reading more I'm retracting this; I'll figure out a more correct answer and update.
Okay, I think I have a better idea of how it works (it's a doozy).
The TL;DR is that, instead of storing the quantized weights themselves, IQ quants store lookup indices into a predefined grid of values, plus a couple of extra values that describe how to adjust the retrieved values to recover the original weights (+/- signs, a scale factor).
I only read up on IQ2, not IQ3, but they should work similarly, and IQ2 makes some sense. Read on for a more in-depth attempt at understanding and explaining...
The lookup grid (for IQ2) has 256 entries.
They group weights into "superblocks" of 256, and then again into subblocks of 32.
Each of those subblocks has 4 "rows" of 8 weights, and for each of those rows they use 16 bits to store information (yielding ~2 bits per weight).
To achieve this, they use 8 bits to store an index that maps to one of the 256 values in the grid, defining the "starting point".
Then, of the remaining 8 bits, 7 are used to store signs (+/-) and 1 contributes to a scaling factor that is applied to the entire subblock.
The reason 7 bits are enough for the signs is that it seems an even number of minus signs is enforced per row, even going so far as to flip an "unimportant" weight to the wrong sign to guarantee it. That means only 128 of the 256 possible sign patterns can occur (the other 128 have an odd number of minus signs), so the 8th sign can be derived from the other 7.
(btw my guess would be that IQ3 does away with this: since it's not trying to aggressively constrain down to 2 bits per weight, it can probably afford to store that information)
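To make that concrete, here's a rough sketch in plain C of decoding one 16-bit "row" the way I understand it. The grid contents, the exact bit positions, and the idea that the grid index yields a single base magnitude are all my assumptions for illustration (I believe the real grid entry actually packs magnitudes for the whole row), so don't read this as the actual llama.cpp code:

```c
#include <stdint.h>

static float iq2_grid[256];  /* the shared 256-entry lookup grid (values defined by the format) */

/* Decode 8 weights from one 16-bit row (hypothetical bit layout):
 *   bits[7:0]  -> index into the 256-entry grid (the "starting point")
 *   bits[14:8] -> sign bits for weights 0..6
 *   bit 15     -> one bit of the per-subblock scale (collected by the caller)
 * `scale` is the subblock scale the caller has already reconstructed. */
void decode_row(uint16_t bits, float scale, float out[8]) {
    float base = iq2_grid[bits & 0xFF];
    uint8_t signs = (bits >> 8) & 0x7F;

    /* Parity trick: an even number of minus signs is enforced per row, so the
     * 8th sign is just the parity (XOR) of the 7 stored sign bits. */
    int sign7 = __builtin_parity(signs);

    for (int i = 0; i < 7; ++i)
        out[i] = scale * base * (((signs >> i) & 1) ? -1.0f : 1.0f);
    out[7] = scale * base * (sign7 ? -1.0f : 1.0f);
}
```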
On top of all this, I think the entire superblock is calculated basically in parallel, even using the thread ID assigned to the function call to index into the block of weights, which I think may be part of the reason it's faster on CUDA/AVX: it can perform a lot of the memory lookups and calculations simultaneously.
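I haven't verified the kernel structure, but the layout I'm guessing at would look roughly like this in plain C, with `decode_row` being the sketch above and `sub_scales` a hypothetical array of already-reconstructed per-subblock scales; on CUDA the outer loop would disappear and `tid` would come straight from the thread index:

```c
#include <stdint.h>

void decode_row(uint16_t bits, float scale, float out[8]);  /* sketch from above */

/* 256 weights = 8 subblocks x 4 rows x 8 weights. Every subblock is
 * independent, so each could be handled by its own GPU thread / SIMD lane,
 * indexed directly by its id. */
void decode_superblock(const uint16_t rows[8][4], const float sub_scales[8],
                       float out[256]) {
    for (int tid = 0; tid < 8; ++tid)      /* one "thread" per 32-weight subblock */
        for (int r = 0; r < 4; ++r)        /* 4 rows of 8 weights each */
            decode_row(rows[tid][r], sub_scales[tid], &out[tid * 32 + r * 8]);
}
```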
I'll ping @ikawrakow since he's the genius behind the actual implementation, but wouldn't blame him if he doesn't have time to fact check this
Thanks for taking the time to read up, digest, and update!