EXL2 quants
Hi,
I have created exl2 quants of this model that are currently uploading:
3.00 bits per weight
3.50 bits per weight
4.00 bits per weight
4.50 bits per weight
5.00 bits per weight
6.00 bits per weight
8.00 bits per weight
Thanks! Would you mind giving specifics on settings you used when you made the quants? Length, rows, head bits.
I am using the defaults from exllamav2/convert.py (see the example command after the list):
head bits = 6
length = 2048
dataset rows = 100
measurement rows = 16
measurement length = 2048
using wikitext-2-v1.parquet
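For reference, the full run looks roughly like the sketch below. The flag names are what I see in a recent exllamav2 checkout and the paths are placeholders, so please double-check everything against `python convert.py -h` on your side:

```bash
# Rough sketch of the conversion run (flag names assumed from a recent
# exllamav2 checkout; verify with `python convert.py -h`). Paths are placeholders.
#   -i  unquantized source model directory
#   -o  working directory for intermediate files
#   -cf output directory for the compiled quant
#   -b  target bits per weight,  -hb head bits
#   -l  calibration length,      -r  calibration rows
#   -ml measurement length,      -mr measurement rows
#   -c  calibration dataset (parquet)
python convert.py \
    -i  /models/source-model \
    -o  /tmp/exl2-work \
    -cf /models/source-model-4.0bpw-exl2 \
    -b  4.0 \
    -hb 6 \
    -l  2048 -r 100 \
    -ml 2048 -mr 16 \
    -c  wikitext-2-v1.parquet
```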
Please let me know if any of these should be changed.
Thanks! I am not too familiar with EXL2 quants, but I think people try to use 8192 length and 200 rows. From my limited testing that takes quite a long time, though.
I also believe people use a different calibration dataset when doing RP style quants.
What dataset is best to use for storywriter model quants?
I will test 8192 length and 200 rows if I can find an applicable dataset that is large enough.
I think PIPPA is what people generally use for RP!
It is better to use the default dataset, from my limited testing. PIPPA makes models quite dumb, especially at coding tasks.
Makes sense. I think most people use these models primarily to RP though, not for coding.
I will give the following datasets a go:
Standard exllama dataset
https://huggingface.co/datasets/mpasila/LimaRP-PIPPA-Mix-8K-Context
https://huggingface.co/datasets/kootszhin/PIPPA-cleaned-formatted
https://huggingface.co/datasets/giganion/pippa_roleplay_standardized
https://huggingface.co/datasets/mpasila/PIPPA-ShareGPT-formatted-named
https://huggingface.co/datasets/KaraKaraWitch/PIPPA-ShareGPT-formatted
https://huggingface.co/datasets/grimulkan/PIPPA-augmented-dedup
https://huggingface.co/datasets/SulthanTriesToCode/PIPPALlama3SFT
Do you have any thoughts on the datasets and how to test the resultant models?
I am looking to use exllama's test_inference.py, but the test dataset is still in question; a rough example of the planned run is below.
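The idea is a perplexity pass over each quant, something like this sketch. The eval flags (-ed, -er, -el) and the dataset path are my assumptions, so confirm the names against `python test_inference.py -h`:

```bash
# Hypothetical evaluation loop: the eval dataset path is a placeholder, and the
# flag names (-m model dir, -ed eval parquet, -er rows, -el length) should be
# confirmed against your copy of test_inference.py before relying on them.
for quant in 3.0bpw 4.0bpw 6.0bpw; do
    python test_inference.py \
        -m  /models/source-model-${quant}-exl2 \
        -ed /data/eval-set.parquet \
        -er 128 \
        -el 2048
done
```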
not aimed at crestfall at all (bf16 safetensor loras in a versioned repo? that's about as far from baking an awful quant as you can get. thank you!) but I see the people they're talking about a lot and it bothers me. there's a reason that people who are smart enough to make things like exllamav2 and imatrix gguf quants use non-specific datasets by default (and encourage you to tweak everything else instead!)
people think the calibration data is like finetuning or something but it's about keeping the neural network intact at all. exl2 first ranks layer quantization schemas on that layer's activations - not future layers. that's why you can push 130 GB of llama through a 24/32 GB GPU/CPU and get a 2.4 bpw that works at the end. HQQ would be an example of a completely different approach. HQQ works because mathematicians are smarter than you. EXL2 works because tensor cores and large numbers are smarter than you.
It will work for the calibration data ([because that was the criterion](https://en.wikipedia.org/wiki/Goodhart%27s_law)), but you weren't training it, just testing whether you're microwaving its brain or not! You can't find the layer quant schemes that make sense (to the decoder) 30+ layers later (L3 70B has 80!) by scoring it on data which doesn't even include every token type; you'll be trying to chisel it out of the noise floor, in the dark, with a single light source casting huge, misleading shadows.(1) It doesn't matter if the right thing happens to reach lm_head - you might as well just get an iron and smooth out all of those 3 bit brain wrinkles yourself. (2)
Quantization is destructive and exl2 is not backprop. to avoid some sort of horrific subtraction-only overfit, you need to test every angle (i think there's some mathematical sense in which this is true, and if you pretend the space has a euclidean metric you can do actual trig, but-). Otherwise you'll capture less of what's actually a feature and more noise that happened to be unusually excited by your one specific type of writing.
(or someone else's specific type of schlock. have you read pippa? if you can spell, it doesn't speak for you.)
you can preserve big (70b) multi-layer features at low (2.3) bits per weight because the features are sparse. but because features are sparse and features aren't neurons, they completely vanish when viewed/illuminated only from a limited perspective (say a set of calibration contexts which are all clustered very near each other).
(1) this is a maths thing shut up (3)
(2) Cochrane Database, 2024, me
(3) it's actually like plato's cave except we're all more like diogenes than plato
Yours Regretfully,
A. Vramlet
@kim512 I don't want to lead you down a path that is unproductive, so you should know that I don't know what I'm talking about here. I have never heard of the datasets you mentioned. I thought there was only one PIPPA dataset.
@ProphetOfBostrom Delightful read, thanks. Using a general-purpose dataset (or a good one, for starters -- pippa is garbage) sounds like the way to go. The fact that pippa is mostly garbage and people still stick to it tells you a lot about how optimized the field is at this point. Perhaps it's all placebo in the end.
Using the standard exllama one sounds like a good idea.
I will re-quant this using the default calibration dataset.
@ProphetOfBostrom Thanks for all that information. I am going to use the exllama defaults for the quants and also some of the other datasets for testing to see what the results are.
Corrected quants using info from @ProphetOfBostrom. Thanks.
3.00 bits per weight
3.50 bits per weight
4.00 bits per weight
4.50 bits per weight
5.00 bits per weight
6.00 bits per weight
8.00 bits per weight
Details:
Created using exllamav2's convert.py with the default calibration dataset and the settings below (example command after the list)
3.0bpw to 6.0bpw head bits = 6
8.0bpw head bits = 8
length = 8192
dataset rows = 200
measurement rows = 32
measurement length = 8192
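For anyone who wants to reproduce these, the invocation was along the following lines. Flag names assume a recent convert.py and the paths are placeholders; no -c is passed, so the default calibration data is used, and the 8.0bpw run swaps -hb 6 for -hb 8:

```bash
# Sketch of the settings listed above (verify flag names with `python convert.py -h`).
# Omitting -c makes convert.py fall back to its default calibration dataset.
python convert.py \
    -i  /models/source-model \
    -o  /tmp/exl2-work \
    -cf /models/source-model-6.0bpw-exl2 \
    -b  6.0 \
    -hb 6 \
    -l  8192 -r 200 \
    -ml 8192 -mr 32
# For the 8.0bpw quant: -b 8.0 -hb 8
```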