Let's Quantize
I've added a link to this repo in the llama.cpp thread that attempts to add Grok support: llama: add Grok support #6120
Maybe the author or someone from the community can help this effort as well. Many people would like to see an 80GB-120GB Grok (2-3 bit quantized) that can run comfortably on a 2xA100 server.
I noticed the files are about 600GB. Is this a version dequantized from the 8-bit weights to 16 bits? Did x.com release higher-precision weights somewhere? At first glance through the code I don't see the original class QuantizedWeight8bit: or an analogue of it.
If we quantize these 16-bit weights back to 8-bit with llama.cpp, would the result be exactly the same as the originals? I would imagine it depends on the quantization algorithm; maybe yes, maybe no. I have not studied how llama.cpp's quantization algorithm works. Inference, I guess, would be fine if you keep these particular files at 16-bit.
IMO if the answer is no, then we should not use these particular files for any quantization work. I'd rather start from the original 8-bit weights, so that we work from weights as accurate as what was originally distributed. Or maybe they'll release a higher-precision Grok-1 at some point and we can use that instead.
Then again, maybe there's enough precision that this doesn't materially matter. Not sure.
Yeah, 600GB doesn't sound right; I thought it should still be around ~300GB in HF format at fp16? But I don't really know what I'm talking about.
Yes, the model card states: "Unofficial dequantized weight of grok-1 in HF Transformers format."
Obviously, 314B parameters at two bytes per float parameter gets you 628 GB.
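Just to spell the arithmetic out (a trivial sketch):
#include <stdio.h>

int main(void) {
    // 314 billion parameters, 2 bytes each at (b)float16
    const double params = 314e9;
    const double bytes  = params * 2.0;
    printf("%.0f GB\n", bytes / 1e9); // prints 628 GB
    return 0;
}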
The original weights are supported by some JAX code. This repository is a step in the right direction, standardizing the flow on the PyTorch and Hugging Face Transformers libraries. If llama.cpp had an option to re-quantize from 8-bit you would be right, but most models release their 16-bit HF versions, so to save the developers headaches this repository should be used as the base for llama.cpp quantization, unless someone wants to do the extra work to support 8-bit releases specifically. We already have most of the flow written for 16-bit HF releases, but Grok has a lot of nuances that no other model has and that llama.cpp doesn't support out of the box, so there is enough work even with grok-1-hf; it's not like we have a real AI developer to do all the work for us.
I don't think it really matters; in terms of output, 8-bit quantization is 99.99% similar to the 16-bit HF weights. But even if it does matter, I would say the first priority is to get it quantized somehow, and then look for better quantizations and optimizations.
Well, I have now read the Q8 reference code in llama.cpp; there's helpfully an easy-to-read reference version of it, and it's pretty short: https://github.com/ggerganov/llama.cpp/blob/6c0b287748327741b113d7d6018b68c63039b1c5/ggml-quants.c#L727-L750
Simplified into pseudocode, it seems to be:
This is the output struct for a block:
struct block_q8_0:
float16 d # delta
int8_t qs[QK8_0] # quants
(analogously called `weight` and `scale` in Grok code)
---
QUANT:
QK8_0 = 32 (block size, defined in ggml-common.h)
For every block:
amax = compute maximum *absolute* value in the block
d = amax / 127
if d is not zero: # (would only be zero if every value in the block is 0)
id = 1 / d
else:
id = 0
block.d = d # float16
For each x in the block: # (x = value before we quant it, i.e. from same source as we computed amax for above)
quanted_x = round(x * id)
block.qs[idx] = quanted_x # between -127 and 127, since d = amax / 127
write(block)
DEQUANT:
y = block.qs[idx] * block.d # gives you a float16 back to use for compute
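For reference, the packed Q8_0 block this corresponds to in llama.cpp looks roughly like the sketch below (the real definition lives in ggml-common.h and stores d in ggml's own fp16 type, stood in here by uint16_t):
#include <stdint.h>

#define QK8_0 32

// sketch of the Q8_0 block layout, not the literal llama.cpp definition
typedef struct {
    uint16_t d;          // per-block scale ("delta"), stored as fp16 in llama.cpp
    int8_t   qs[QK8_0];  // 32 quantized weights
} block_q8_0_sketch;

// Storage cost: 2 + 32 = 34 bytes per 32 weights, i.e. 8.5 bits per weight.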
The dequant code in the original Grok code is also of the form scale * weight. I stared at that for a second, and I think that if Grok's "block size" is bigger than llama.cpp's, the values might survive (Grok original) 8-bit -> (this HF repo) 16-bit -> back to 8-bit (llama.cpp) exactly, so that the values you ultimately compute with are exactly the same (down to any floating-point imprecision). I think Grok's "block size" is a lot bigger than llama.cpp's? That would suggest we'd get it exact at least for the llama.cpp Q8 quant (exactly the same for Q8, and as good as it can be for smaller quants), no matter whether you start from this HF repository's F16 or from the original Grok Q8s.
It would also tell me that if the Grok team releases 16-bit weights instead, we should requantize from those. E.g. if our block_size=4 and we saw a block like [-200.0, 1.0, 2.0, 3.0] in float16, quantized it to int8 with the algorithm above, and then went back to 16-bit for compute, the -200 would be exactly recovered but the 1, 2 and 3 would be messed up. I think.
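To make that concrete, here's a tiny sketch of that [-200, 1, 2, 3] block going through the Q8_0 round trip described above (illustrative only, not llama.cpp's actual code):
// gcc -Og block_example.c -lm -o block_example
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const float block[4] = {-200.0f, 1.0f, 2.0f, 3.0f};
    float amax = 0.0f;
    for (int i = 0; i < 4; ++i) amax = fmaxf(amax, fabsf(block[i]));
    const float d  = amax / 127.0f;        // 200/127 ~= 1.5748
    const float id = d ? 1.0f / d : 0.0f;
    for (int i = 0; i < 4; ++i) {
        const int8_t q = (int8_t) roundf(block[i] * id);         // quantize
        printf("%8.3f -> %4d -> %8.3f\n", block[i], q, q * d);   // dequantize
    }
    // -200 comes back exactly (it set amax), while 1, 2, 3 come back
    // as roughly 1.575, 1.575 and 3.150.
    return 0;
}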
My brain is not big enough to reason out whether I got that exactly right, so if it comes down to deciding whether we want to use these F16s or not, I'd do a quick and dirty test to check. But if that reasoning is correct, then choosing between the original Q8s and these F16s wouldn't matter weight-value-wise, only in "how easy is it to add support for a .gguf converter to llama.cpp", and I think the HF model would win that one easily, as a lot of the support code is already in place.
I decided I can't live a happy life, or at least not a happy March 19 evening, if I didn't at least do a quick test to check whether my intuition was right. (Note: this doesn't change the conclusion that using these F16s is fine; I just wanted to check my thinking. Technically, by dequantizing to F16 we are throwing away original block-size information that could have been exploited, so it's not 100% pedantically identical to writing code that requantizes the original 8-bit weights directly, but it's so close that I think it's totally irrelevant.)
Behold, crappy test C code ripped off from llama.cpp and modified to test my intuition from the previous comment:
quant_test.c
// gcc -Og quant_test.c -lm -o quant_test
// ./quant_test
#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <math.h>
#include <string.h>
#include <stdlib.h>
typedef struct block_q8_0 {
float d;
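// (note: real llama.cpp stores d as fp16; plain float is used here for simplicity)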
int8_t* qs;
} block_q8_0;
static void quantize_row_q8_0_reference(const float * restrict x, block_q8_0 * restrict y, int k, int block_size) {
// in llama.cpp, block_size for Q8 is QK8_0 which is 32.
// in this test code, instead of QK8_0 I turned it into an argument
// 'block_size'.
assert(k % block_size == 0);
const int nb = k / block_size;
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
for (int j = 0; j < block_size; j++) {
const float v = x[i*block_size + j];
if (amax < fabsf(v)) {
amax = fabsf(v);
}
}
const float d = amax / ((1 << 7) - 1);
const float id = d ? 1.0f/d : 0.0f;
y[i].d = d;
for (int j = 0; j < block_size; ++j) {
const float x0 = x[i*block_size + j]*id;
y[i].qs[j] = roundf(x0);
}
}
}
static void dequantize_row_q8_0(const block_q8_0 * restrict x, float * restrict y, int k, int block_size) {
const int qk = block_size;
assert(k % qk == 0);
const int nb = k / qk;
for (int i = 0; i < nb; i++) {
const float d = x[i].d;
for (int j = 0; j < qk; ++j) {
y[i*qk + j] = x[i].qs[j]*d;
}
}
}
static void random_fill(float* x, size_t x_len) {
for (size_t i = 0; i < x_len; ++i) {
// random integer in [0, 1e9)
uint32_t v = arc4random_uniform(1000000000);
// map to a float roughly uniform in [-5, 5)
x[i] = ((v / 1000000000.0f) - 0.5f) * 10.0f;
}
}
static void print_deviation(const float* x, const float* y, size_t nelems)
{
float largest_deviation = -INFINITY;
float avg_deviation = 0.0;
for (size_t i = 0; i < nelems; ++i) {
float deviation = fabsf(x[i] - y[i]);
largest_deviation = fmaxf(largest_deviation, deviation);
avg_deviation += deviation;
}
printf("Largest deviation: %f Average deviation %f\n", largest_deviation, avg_deviation / nelems);
}
int main(int argc, char** argv) {
((void) argc); // silence warnings
((void) argv);
/// TEST 1: (testing if quanting is idempotent in general)
///
/// random block of 1024 floats,
/// 1. quantize 32-bit to 8-bit
/// 2. dequantize 8-bit to 32-bit
/// 3. compare deviation (first printf) - is about ~0.02 typically
/// 4. quantize the dequantized block back to 8-bit
/// 5. dequantize 8-bit to 32-bit
/// 6. compare deviation (second printf) - should be 0.0 exactly
printf("TEST 1\n");
{
// simulate large block 1024
float src[1024];
block_q8_0 src_to_q8;
src_to_q8.qs = malloc(1024);
random_fill(src, 1024);
quantize_row_q8_0_reference(src, &src_to_q8, 1024, 1024);
float dst[1024];
dequantize_row_q8_0(&src_to_q8, dst, 1024, 1024);
// src = original 32-bit floats
// dst = 32-bit floats from 32-bit -> 8-bit -> 32-bit
printf("1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit\n");
print_deviation(src, dst, 1024); // deviation typically ~0.02
// quantize the dequantized block back to 8-bit
block_q8_0 dst_to_q8;
dst_to_q8.qs = malloc(1024);
quantize_row_q8_0_reference(dst, &dst_to_q8, 1024, 1024);
float dst2[1024];
dequantize_row_q8_0(&dst_to_q8, dst2, 1024, 1024);
printf("1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit -> 8-bit -> 32-bit\n");
print_deviation(dst, dst2, 1024); // largest deviation = 0.0 exactly
}
printf("TEST 2\n");
/// TEST 2: (simulating possible conditions with Grok, one model has
/// big block size, other one has small block size)
///
/// random block of 1024 floats,
/// 1. quantize 32-bit to 8-bit (block_size=1024)
/// 2. dequantize 8-bit to 32-bit
/// 3. quantize the dequantized block back to 8-bit, but do it in
/// block_size=32 (1024/32 = 32 total blocks)
/// 4. dequantize those 32 blocks of 8-bit to 32-bit
/// 5. compare deviation (second printf) - should be 0.0 exactly...but it's not
{
float src[1024];
block_q8_0 src_to_q8;
src_to_q8.qs = malloc(1024);
random_fill(src, 1024);
quantize_row_q8_0_reference(src, &src_to_q8, 1024, 1024);
float dst[1024];
dequantize_row_q8_0(&src_to_q8, dst, 1024, 1024);
// src = original 32-bit floats
// dst = 32-bit floats from 32-bit -> 8-bit -> 32-bit
// quantize the dequantized block back to 8-bit
block_q8_0 dst_to_q8[32];
for (size_t i = 0; i < 32; ++i) {
dst_to_q8[i].qs = malloc(32);
}
quantize_row_q8_0_reference(dst, dst_to_q8, 1024, 32);
float dst2[1024];
dequantize_row_q8_0(dst_to_q8, dst2, 1024, 32);
printf("1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit -> 8-bit -> 32-bit, but 1024 block size -> 32 block size\n");
print_deviation(dst, dst2, 1024);
}
return 0;
}
TL;DR:
If you use the same block size (in the C code I tried a large one, 1024): 32-bit -> 8-bit -> 32-bit -> 8-bit -> 32-bit. The last two 32-bit stages are exactly the same. Arrows mean quantizing to 8-bit or dequantizing back from 8-bit. So this quantization algorithm is idempotent, as long as you keep the block size the same.
If the block size is not the same between quantizations, it is not exact: 32-bit -> 8-bit (block_size=1024) -> 32-bit -> 8-bit (block_size=32) -> 32-bit. The last two 32-bit stages are not the same with the given algorithm, although the deviation is slightly smaller than between the first and second 32-bit arrays.
So this suggests that, no, it wouldn't be exactly the same. My intuition was only partly right.
Output example from the C code:
TEST 1
1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit
Largest deviation: 0.019615 Average deviation 0.009665
1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit -> 8-bit -> 32-bit
Largest deviation: 0.000000 Average deviation 0.000000
TEST 2
1024 random -5..5 uniform block from 32-bit -> 8-bit -> 32-bit -> 8-bit -> 32-bit, but 1024 block size -> 32 block size
Largest deviation: 0.019198 Average deviation 0.007479
I think some models were leaked even in 4-bit (mistral-medium, I think), converted back to FP16 and quantized differently, and they all work fine. It looks like quantization doesn't lose much information, if any, and it really doesn't matter much in the big picture. There are attempts to quantize down to 1.58 bits (-1, 0, 1) with moderate success.
Obviously working from 8-bit is preferable if we are talking about using a CPU to quantize to Q2-Q3; the loading and conversion time is measured in hours, and renting 8xA100s is about $15/h. But in general I would value developer time much more.
GGUF 2 bit version done by anyone resourceful?
I'm not really resourceful (it will maybe take 1-2 weeks), but I am working on all the quants, and imatrix quants if possible.
https://huggingface.co/mradermacher/grok-1-test-GGUF
(The repo will be renamed once tested; what's there is completely untested.) Also, there is https://huggingface.co/Arki05/Grok-1-GGUF, from the guy who wrote the actual code for all of this in llama.cpp.