The -TEST mark was added after the initial upload, once people pointed it out :) glad it is a good label though
Bartowski PRO
bartowski's activity
Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window
Yes, I got excited when a conversion worked and released them ASAP
That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model
I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel misled and that it's not the right thing to do, please post your thoughts (civilly) below. I will seriously consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all !
This argument really doesn't make any sense to me.. surely if you're aiming for the most accurate overall representation anyone can see that gathering as many data points across a diverse area would yield the most useful results? Sure ideally your single light will probably get a reasonably close overall value.. but also it might not?
Additionally, I think his point was that you don't necessarily want to increase performance against a given corpus, but rather increase faithfulness to the original model against a given corpus
You may be able to keep PPL the same or better than the original while simultaneously veering far from what the original model would have generated, which while great for that corpus of text, is not what the intention of the quantization itself is (in fact many people worry about this a lot, fearing that the quantization will favour the text used as a reference, which I'm luckily seeing is not what happens at least for imatrix)
The fact that 2 models can have identical PPL scores yet generate completely different text should be proof enough that PPL only tells a tiny part of a story. Yes it's good to know the model is good, but when quantizing I don't need to know how good it is, I need to know how similar it is to the original.
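A toy illustration of that point, as a minimal sketch with made-up numbers (the distributions, targets and function names are hypothetical, not taken from any real model): both models give the reference token the same probability, so their PPL matches, yet greedy decoding picks a different token for each.

```python
import math

def perplexity(dists, targets):
    """exp of the mean negative log-probability assigned to the reference tokens."""
    nll = -sum(math.log(d[t]) for d, t in zip(dists, targets)) / len(targets)
    return math.exp(nll)

def greedy_tokens(dists):
    """Index of the most likely token at each position."""
    return [d.index(max(d)) for d in dists]

# Hypothetical next-token distributions over a 3-token vocabulary.
targets = [0, 0]                                  # reference token at each position
model_a = [[0.3, 0.6, 0.1], [0.3, 0.6, 0.1]]
model_b = [[0.3, 0.1, 0.6], [0.3, 0.1, 0.6]]

ppl_a = perplexity(model_a, targets)
ppl_b = perplexity(model_b, targets)
print(ppl_a, ppl_b)            # identical: both give the reference token p = 0.3
print(greedy_tokens(model_a))  # [1, 1]
print(greedy_tokens(model_b))  # [2, 2]
```

PPL only looks at the probability of the reference token, so everything the models disagree about elsewhere in the distribution is invisible to it.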
I suppose that's reasonable, I guess why I like KLD more is that it breaks it down into percentages, like mean, max, 99.99%, etc etc, where PPL is just a single all encompassing number that's more difficult to interpret
I don't know if I can put much value into IQ6 outperforming fp16 because lately we've been seeing benchmarks where Q3 beats bf16, so while useful I don't know that they can definitively tell us quant quality, but I do think it's a good proof of competency
This is why KLD to me provides at least a slightly clearer image of how well the quantization does at recreating the original model. I see what you're saying still about PPL but (at least how llama.cpp does it) KLD gives a more thorough look. That, and "same top p" is nice for seeing how often the models agree on the token
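For anyone who wants to see what those numbers mean, here is a rough sketch of per-token KLD plus top-token agreement between a base model and a quant. This is not llama.cpp's implementation; the function names and the tiny distributions are made up for illustration, and it assumes the quant never assigns zero probability where the base does not.

```python
import math
import statistics

def kl_divergence(p_base, p_quant):
    """KL(base || quant) at one token position (assumes p_quant > 0 where p_base > 0)."""
    return sum(p * math.log(p / q) for p, q in zip(p_base, p_quant) if p > 0.0)

def summarize(base_dists, quant_dists):
    """Spread of per-position KLD, plus how often both models rank the same token first."""
    klds = [kl_divergence(p, q) for p, q in zip(base_dists, quant_dists)]
    agree = [p.index(max(p)) == q.index(max(q))
             for p, q in zip(base_dists, quant_dists)]
    return {
        "mean_kld": statistics.fmean(klds),
        "median_kld": statistics.median(klds),
        "max_kld": max(klds),
        "same_top": sum(agree) / len(agree),
    }

# Hypothetical distributions over a 3-token vocabulary at 3 positions.
base = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.35, 0.25]]
quant = [[0.7, 0.2, 0.1], [0.48, 0.32, 0.2], [0.3, 0.45, 0.25]]
stats = summarize(base, quant)
print(stats)
```

The mean can look tiny while the max and the agreement rate show that a few positions diverge a lot, which is the part a single PPL number hides.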
That's not an invalid point, but also when the final goal is quantization that 0.03% is negligible compared to the rest of the losses.
If you're talking about running at full precision, yeah, bf16 > fp16 by all means
I'd also prefer to see KLD of fp16 vs bf16 since PPL is, to me, pretty meaningless. I'm sure it has value and probably more than I give it, but unless it's PPL against the dataset it was trained on I don't really find much merit to it.
I appreciate the breakdown though, and even 0.4% is not enough to worry me when again the final goal is quantization, not to run it at that DTYPE.
To that end, do you happen to know if when quantizing from BF16.. does it get converted to FP16 first? Does it even matter? BF16 -> Q8 vs BF16 -> FP16 -> Q8, I wonder how different it would be. Gut instinct says it's in the 0.01% range.
Bf16 can't be offloaded to GPUs so imatrix becomes slow to make :')
Just so you all know, I'll be on vacation for the following two weeks and away from home! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) feel free to shoot me a message if you see one I should make!
In the meantime if you need something bleeding edge make sure to check out @MaziyarPanahi or @bullerwins who both put out great work!
I suppose I should add, that this is more valuable as a pseudo comparison to bf16
Since bf16 has a wider exponent range than fp16 and can therefore represent much smaller magnitudes within (-1, 1), there is much debate as to whether it's safe to convert from bf16 to fp16, or if you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits
This test shows that fp16 is capable of representing 99.97% of the weights in an FP32 model precisely, and therefore makes a negligible difference at most
Additionally, since the weights it can't represent are between 6e-5 and -6e-5, the weights it can't represent are so small that they most likely do not contribute to the final output of the model and are relatively safe to prune
The reason for this comparison is that it should represent the same percentage of squishing as bf16 to fp16
Had claude make me a script, using the new Reflection-70B, and these are the results:
Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%
0.03%!!!!
A couple things to note, this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for rounding errors that come up by the very nature of extremely accurate numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)
This is also examining EVERY weight that was stored at F32, and for most layers I saw somewhere between 0% and 0.03% of weights being squashed, no major outliers.
Overall, I feel even safer converting to F16 for llama.cpp, the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyways.
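The actual script isn't reproduced here, so this is only a minimal stdlib sketch of the same idea, assuming the check is "round-trip through fp16, then compare with the torch.isclose formula". It uses Python's `struct` half-precision format instead of torch, and the helper names and sample weights are hypothetical.

```python
import struct

RTOL = 1e-5  # tolerances quoted in the post
ATOL = 1e-8

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def roundtrip_f16(x):
    """Round x to the nearest fp16 value and return it as a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude too large for fp16: treat it as overflowing to infinity.
        return float("inf") if x > 0 else float("-inf")

def is_close(a, b, rtol=RTOL, atol=ATOL):
    """Same formula torch.isclose documents: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

def count_squashed(weights):
    """Number of weights that do not survive F32 -> F16 -> F32 within tolerance."""
    squashed = 0
    for w in weights:
        w32 = to_f32(w)
        if not is_close(roundtrip_f16(w32), w32):
            squashed += 1
    return squashed

# Hypothetical sample: three values fp16 holds exactly, one tiny value it cannot.
sample = [0.5, -0.015625, 1.5, 1e-6]
n = count_squashed(sample)
print(f"Squashed: {n} of {len(sample)} ({100 * n / len(sample):.2f}%)")
```

Looping in pure Python is far too slow for 70B weights; the sketch is only meant to show what "squashed" is counting.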
also maybe there should be a new feature to be explicitly notified about new repositories
That would be amazing, probably for average users but especially for me, where I sometimes stumble upon a model uploaded days ago that I somehow didn't notice from a creator I enjoy
We will have to see if something like that is possible without cluttering up the profile pages too much. But we'll try.
That sounds awesome, could even consider something like a toggle in the settings for "show this model on my page" or something, and possibly as a variable when using huggingface-cli or the HF python API
I think we'll be doing a social features sprint soon and this is exactly the kind of feedback we need! Thank you so much!
Beautiful, I love this :D If you need feedback on anything specific feel free to reach out, would love to be a guinea pig or just early eyes !
Had a funny thought, would it be at all possible to rework what shows up on our personal HF page?
Picture this: I upload a model to an organization, someone who follows me now has no idea that I've uploaded a model or to where, unless they also watch those repos (which also floods them with other notifications)
What if our main Huggingface page was a collection of both models that we've uploaded specifically to our profile, as well as models we've uploaded to organizations? That way it would all be contained in one central followable location, and I wouldn't have concerns about losing followership if I wanted to upload to an organization all of a sudden.
Oh another big pain point: notifications
I would love to be able to subscribe to be notified of new models posted by people or organizations, but it's near impossible as is
I would love better filtering
First I think sort by created is broken, but haven't checked on desktop recently
Second, I would love date filtering, like show me trending models that were only posted or updated in the past 7 days and such
Especially noteworthy at a time when most AI startups wouldn’t survive a year or two without VC money. Yay!
I'm happy to hear this too, money in the bank is good, but upwards momentum makes it so much easier to justify investing in new technology and improving things!
It starts true; imatrix runs the model against a corpus of text and tracks the activation of weights to determine which are most important
However what the quantization then does with that information is where I was wrong.
I think I made the accidental connection between imatrix and exllamav2's measuring, where ExLlamaV2 decides how many bits to assign to which weight depending on the goal BPW
Instead, what llama.cpp with imatrix does is attempt to select a scale for a quantization block that most accurately returns the important weights to their original values, i.e. it minimizes the dequantization error weighted by the importance of the activations
The mildly surprising part is that it actually just does a relatively brute force search, it picks a bunch of scales and tries each and sees which one results in the minimum error for weights deemed important in the group
But yeah, turns out, the quantization scheme is always the same, it's just that the scaling has a bit more logic to it when you use imatrix
Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
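To make the brute force search concrete, here is a heavily simplified sketch. It is not llama.cpp's code: the symmetric integer grid, the candidate range of ±20% around the naive scale, and all names are assumptions for illustration only.

```python
def weighted_error(weights, quants, scale, importance):
    """Importance-weighted squared error after dequantizing with this scale."""
    return sum(imp * (w - scale * q) ** 2
               for w, q, imp in zip(weights, quants, importance))

def quantize(weights, scale, nmax):
    """Round each weight to the nearest integer level in [-nmax, nmax]."""
    return [max(-nmax, min(nmax, round(w / scale))) for w in weights]

def search_scale(weights, importance, nmax=7, steps=10):
    """Try a handful of scales for one block and keep the lowest weighted error."""
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 0.0, [0] * len(weights), 0.0
    naive = amax / nmax              # the scale you would pick without imatrix
    best = None
    for i in range(-steps, steps + 1):
        scale = naive * (1.0 + 0.02 * i)   # candidates from -20% to +20%
        quants = quantize(weights, scale, nmax)
        err = weighted_error(weights, quants, scale, importance)
        if best is None or err < best[2]:
            best = (scale, quants, err)
    return best

# Hypothetical block of 4 weights; the second and fourth matter most.
block = [0.9, -0.31, 0.05, 0.62]
imp = [1.0, 10.0, 0.1, 5.0]
scale, quants, err = search_scale(block, imp)
print(scale, quants, err)
```

The quantization grid is the same with or without the importance values; only the choice of scale changes, which matches the point that the scheme itself stays the same.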
much more difficult though if you're trying to iterate, definitely an interesting final validation
oh god dammit haha, i did not think of that possibility AT ALL 🤦
KL Divergence is almost identical - though even then upsetting that it's "almost" - but yup there's huge differences in the top p...
====== Perplexity statistics ======
Mean PPL(Q) : 6.339378 ± 0.038949
Mean PPL(base) : 6.337070 ± 0.038896
Cor(ln(PPL(Q)), ln(PPL(base))): 99.99%
Mean ln(PPL(Q)/PPL(base)) : 0.000364 ± 0.000067
Mean PPL(Q)/PPL(base) : 1.000364 ± 0.000067
Mean PPL(Q)-PPL(base) : 0.002308 ± 0.000427
====== KL divergence statistics ======
Mean KLD: 0.000005 ± 0.000001
Maximum KLD: 0.113848
99.9% KLD: 0.000346
99.0% KLD: 0.000055
99.0% KLD: 0.000055
Median KLD: 0.000001
10.0% KLD: -0.000014
5.0% KLD: -0.000021
1.0% KLD: -0.000035
Minimum KLD: -0.000120
====== Token probability statistics ======
Mean Δp: 0.002 ± 0.000 %
Maximum Δp: 19.102%
99.9% Δp: 0.417%
99.0% Δp: 0.155%
95.0% Δp: 0.067%
90.0% Δp: 0.040%
75.0% Δp: 0.010%
Median Δp: 0.000%
25.0% Δp: -0.007%
10.0% Δp: -0.034%
5.0% Δp: -0.062%
1.0% Δp: -0.154%
0.1% Δp: -0.439%
Minimum Δp: -5.820%
RMS Δp : 0.078 ± 0.016 %
Same top p: 99.927 ± 0.007 %
Either way I appreciate the insight and now question all my life decisions, especially the ones that involved me uploading fp32 files and spending 3x the time calculating imatrix on bf16 instead of fp16