Text Generation
Transformers
Safetensors
English
llama
Not-For-All-Audiences
conversational
text-generation-inference
Inference Endpoints

The "silly" test.

#2
by ZeroWw - opened

ZeroWw 'SILLY' version.

The original model has been quantized (fq8 version) and a percentage of its tensors have been modified by adding some noise.

Full colab: https://colab.research.google.com/drive/1a7seagBzu5l3k3FL4SFk0YJocl7nsDJw?usp=sharing

Fast colab: https://colab.research.google.com/drive/1SDD7ox21di_82Y9v68AUoy0PhkxwBVvN?usp=sharing

Original reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1ec0s8p/i_made_a_silly_test/

I created a program to randomize the weights of a model. The program has two parameters: the percentage of weights to modify and the percentage of the original value to randomly apply to each weight.

At the end I check the resulting GGUF file for binary differences. In this example I set it to modify 100% of the weights of Mistral 7B Instruct v0.3 by a maximum deviation of 15%.
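A minimal sketch of the idea (not the actual Colab code), assuming the F32 tensors are available as NumPy arrays; the function name and defaults are only illustrative:

```python
# Sketch only: perturb a random fraction of the weights by up to
# ±max_deviation of each weight's own value. Not the author's Colab code.
import numpy as np

def add_noise(tensor: np.ndarray,
              modify_fraction: float = 1.0,   # 1.0 = modify 100% of the weights
              max_deviation: float = 0.15) -> np.ndarray:
    flat = tensor.astype(np.float32).ravel().copy()
    # Choose which weights to touch.
    mask = np.random.rand(flat.size) < modify_fraction
    # Uniform deviation in [-max_deviation, +max_deviation], scaled by the weight itself.
    noise = np.random.uniform(-max_deviation, max_deviation, size=flat.size)
    flat[mask] += flat[mask] * noise[mask]
    return flat.reshape(tensor.shape)
```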

Since the deviation is calculated on the F32 weights, it changes once the model is quantized to Q8_0. So, in the end I got a file that, compared to the original, has:

Bytes Difference percentage: 73.04%

Average value divergence: 2.98%
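As a rough illustration (not the original script), the byte-difference figure could be obtained by comparing the two GGUF files byte by byte; the file names below are hypothetical, and the value divergence would additionally require dequantizing the tensors, which is not shown:

```python
# Sketch: percentage of bytes that differ between two GGUF files.
import numpy as np

def byte_diff_percentage(path_a: str, path_b: str) -> float:
    a = np.fromfile(path_a, dtype=np.uint8)
    b = np.fromfile(path_b, dtype=np.uint8)
    n = min(a.size, b.size)   # the two files should normally be the same size
    return 100.0 * np.count_nonzero(a[:n] != b[:n]) / n

# Hypothetical file names:
print(byte_diff_percentage("mistral-7b-instruct-q8_0.gguf",
                           "mistral-7b-instruct-silly-q8_0.gguf"))
```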

The cool thing is that, chatting with the model, I see no apparent difference and it still works as nicely as the original.

Since I am running everything on CPU, I could not run perplexity scores or anything compute-intensive.

As a small test, I asked the model a few questions (like the history of the Roman Empire) and then fact-checked its answers using a big model. No errors were detected.

Update: the whole procedure was created and tested on Colab.

Example: https://huggingface.co/ZeroWw/L3.1-8B-Celeste-V1.5-SILLY

You're posting this everywhere now? :/

I tried to tell you in the other thread that this won't really do anything interesting... When you quantise a model (or perform any type of "lossy" compression) you introduce "noise" like this anyway:

https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory

https://en.wikipedia.org/wiki/Quantization_(signal_processing)

So all you are doing is adding some kind of hybrid (uniformly distributed) distortion that ultimately has the same effect as quantisation...

In simple terms - when you use Q8_0 you have to compress the weights down to 256 separate values:

  • Imagine each weight is a (uniformly distributed) value between 0 and 1.
  • This means you now have to round each weight to the nearest 1/256th (~0.004).
  • This is then approximately equivalent to adding a (uniformly distributed) random value of between -0.002 and +0.002 (±0.004/2) to every weight.

In reality the weights tend to be Normally distributed but the same idea applies.
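As a toy check of this argument (a sketch added here, not from the post), rounding uniform "weights" to the nearest 1/256 gives a rounding error that is itself roughly uniform within about ±0.002:

```python
# Toy simulation: quantisation error for uniform weights rounded to 1/256 steps.
import numpy as np

w = np.random.rand(1_000_000).astype(np.float32)   # toy "weights" in [0, 1)
step = 1.0 / 256.0                                  # 256 representable levels
quantised = np.round(w / step) * step               # round each weight to the nearest level
err = quantised - w

print(err.min(), err.max())   # roughly -0.002 .. +0.002, i.e. ±step/2
print(err.std())              # close to step / sqrt(12), as for uniform noise
```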

This guy is the only one on the whole site who does this aggressive "promotion" of his quants. Everywhere I look, he's already there with his "My quants" and "Silly test".

Nothing is Real org

tbh I don't really mind quants being sent right to our models' discussion pages, it's pretty convenient


Have you looked at his quants though? He claims that his Q5 and Q6 are somehow capable of performing as well as fp16, with only his own words as proof and zero benchmarks.

Nothing is Real org

Have you looked at his quants though? He claims that his Q5 and Q6 are somehow capable of performing as well as fp16, with only his own words as proof and zero benchmarks.

Well, I have looked into them now, specifically the Q6. KoboldCPP reports the wrong number of tensors, 291 instead of 292, and it doesn't load. Removing them from the model card; won't be adding them anymore.
Probably I should reorder the quant list too; GGUFs should be the lowest priority (bf16 > fp8 > exl2 > gguf), especially considering that fp8 and exl2 are first-party.

AuriAetherwiing changed discussion status to closed

This guy is the only one on the whole site who does this aggressive "promotion" of his quants. Everywhere I look, he's already there with his "My quants" and "Silly test".

sorry if I bothered anyone. I don't do this for money or anything. I thought it was useful.

lesson learned.

Have you looked at his quants though? He claims that his Q5 and Q6 are somehow capable of performing as well as fp16, with only his own words as proof and zero benchmarks.

Well, I have looked into them now, specifically the Q6. KoboldCPP reports the wrong number of tensors, 291 instead of 292, and it doesn't load. Removing them from the model card; won't be adding them anymore.
Probably I should reorder the quant list too; GGUFs should be the lowest priority (bf16 > fp8 > exl2 > gguf), especially considering that fp8 and exl2 are first-party.

I checked them before posting them; they were even working inside Colab.
If you have problems, try the 8B; if that works, then I will check the 12B again (but I am sure it worked).
NOTE: You must update llama.cpp to the very latest version for these to work.
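For anyone who wants to reproduce such a check, a minimal load-and-generate test with llama-cpp-python (which bundles a recent llama.cpp) could look like the sketch below; the file name is hypothetical:

```python
# Sanity check that a GGUF loads and generates; requires a recent
# llama-cpp-python build. The model file name is hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="L3.1-8B-Celeste-V1.5-SILLY.q6_k.gguf", n_ctx=2048)
out = llm("Briefly summarise the history of the Roman Empire.", max_tokens=128)
print(out["choices"][0]["text"])
```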
