[Part 1] General discussion.

#1
by Lewdiculous - opened

Please share any feedback or requests here, I appreciate all inputs.

@zebrox – If you can, tell me how bad this is in comparison to Layris :3

Lewdiculous pinned discussion

wake up babe, new imat dropped :D
Thank you @Lewdiculous for working on these, I will certainly test it. I am going to try both Q8 and Q4_K_M to see how speed/quality compare, keeping the exact same prompt/context and all that.

@zebrox Nothing crazy, just a random idea I requested, haha. Curious how it compares. Thanks to Nitral and Jeiku for doing them.

For Q4, actually, test the Q4_K_S or IQ4_XS if you can, just because it's something I can compare more directly to 7B Q5_K_M in terms of VRAM.

What is the ideal quant for an RTX 3070 with 9B/13B or even 20B? I seem to get great results on 7B, but quants drastically change the speed for me.

[Infinitely-Laydiculus-9b-Q4_K_M-imat.gguf]

llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/41 layers to GPU
llm_load_tensors: CPU buffer size = 2137.23 MiB
llm_load_tensors: CUDA0 buffer size = 4990.62 MiB
...................................................................................................

CtxLimit: 317/8064, Process:0.24s (1.5ms/T = 665.25T/s), Generate:4.07s (25.4ms/T = 39.35T/s), Total:4.30s (37.19T/s)
CtxLimit: 477/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.06s (25.4ms/T = 39.38T/s), Total:4.08s (39.17T/s)
CtxLimit: 575/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.44s (24.9ms/T = 40.16T/s), Total:2.46s (39.81T/s)
CtxLimit: 317/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.03s (25.2ms/T = 39.66T/s), Total:4.06s (39.45T/s)
CtxLimit: 477/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:4.06s (25.4ms/T = 39.42T/s), Total:4.08s (39.19T/s)
CtxLimit: 438/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:2.82s (22.6ms/T = 44.29T/s), Total:2.85s (43.94T/s)
CtxLimit: 438/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 623/8064, Process:0.58s (1.7ms/T = 573.66T/s), Generate:3.59s (22.4ms/T = 44.58T/s), Total:4.17s (38.41T/s)
CtxLimit: 719/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.20s (22.9ms/T = 43.62T/s), Total:2.22s (43.18T/s)
CtxLimit: 782/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.47s (22.9ms/T = 43.69T/s), Total:1.49s (43.01T/s)
CtxLimit: 782/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 983/8064, Process:0.76s (2.0ms/T = 503.30T/s), Generate:3.63s (22.7ms/T = 44.09T/s), Total:4.39s (36.48T/s)
CtxLimit: 1143/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.65s (22.8ms/T = 43.79T/s), Total:3.68s (43.53T/s)
CtxLimit: 1154/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.23s (21.3ms/T = 47.01T/s), Total:0.26s (42.97T/s)
CtxLimit: 1175/8064, Process:0.03s (25.0ms/T = 40.00T/s), Generate:0.49s (22.1ms/T = 45.17T/s), Total:0.51s (42.97T/s)
CtxLimit: 1175/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 1379/8064, Process:0.79s (1.9ms/T = 527.18T/s), Generate:3.68s (23.0ms/T = 43.47T/s), Total:4.47s (35.78T/s)
CtxLimit: 1539/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.70s (23.1ms/T = 43.23T/s), Total:3.72s (42.96T/s)
CtxLimit: 1591/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:1.19s (23.0ms/T = 43.55T/s), Total:1.22s (42.69T/s)
CtxLimit: 1591/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (41.67T/s)
CtxLimit: 1774/8064, Process:0.79s (1.9ms/T = 529.26T/s), Generate:3.72s (23.2ms/T = 43.01T/s), Total:4.51s (35.51T/s)
CtxLimit: 1934/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.75s (23.4ms/T = 42.69T/s), Total:3.77s (42.43T/s)
CtxLimit: 1979/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.05s (23.3ms/T = 42.98T/s), Total:1.07s (42.06T/s)
CtxLimit: 1979/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 322/8064, Process:0.43s (9.7ms/T = 103.29T/s), Generate:3.58s (22.4ms/T = 44.73T/s), Total:4.00s (39.97T/s)
CtxLimit: 482/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.62s (22.6ms/T = 44.17T/s), Total:3.64s (43.91T/s)
CtxLimit: 642/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:3.59s (22.5ms/T = 44.53T/s), Total:3.62s (44.24T/s)
CtxLimit: 779/8064, Process:0.06s (56.0ms/T = 17.86T/s), Generate:3.29s (24.0ms/T = 41.69T/s), Total:3.34s (40.99T/s)

[Infinitely-Laydiculus-9b-Q8_K_M-imat.gguf]

llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU buffer size = 5595.88 MiB
llm_load_tensors: CUDA0 buffer size = 7073.00 MiB
....................................................................................................

CtxLimit: 317/8064, Process:2.07s (13.2ms/T = 75.77T/s), Generate:12.90s (80.6ms/T = 12.40T/s), Total:14.97s (10.69T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.86s (86.6ms/T = 11.55T/s), Total:13.94s (11.48T/s)
CtxLimit: 637/8064, Process:0.09s (88.0ms/T = 11.36T/s), Generate:14.94s (93.4ms/T = 10.71T/s), Total:15.03s (10.65T/s)
CtxLimit: 317/8064, Process:0.08s (78.0ms/T = 12.82T/s), Generate:12.92s (80.7ms/T = 12.39T/s), Total:12.99s (12.31T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.89s (86.8ms/T = 11.52T/s), Total:13.97s (11.45T/s)
CtxLimit: 637/8064, Process:0.09s (87.0ms/T = 11.49T/s), Generate:14.93s (93.3ms/T = 10.72T/s), Total:15.02s (10.65T/s)

@Lewdiculous I did these tests earlier today, I will report back with Q4_K_S or IQ4_XS.
But damn. Already, this is SO good, the Q4_K_M is flying! Speed drops much, much lower on the Q8.

[Infinitely-Laydiculus-9b-Q4_K_S-imat.gguf]

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 4820.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.21s (1.3ms/T = 747.62T/s), Generate:3.43s (21.5ms/T = 46.61T/s), Total:3.64s (43.92T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.51T/s), Total:3.46s (46.27T/s)
CtxLimit: 613/8192, Process:0.02s (19.0ms/T = 52.63T/s), Generate:2.87s (21.1ms/T = 47.39T/s), Total:2.89s (47.08T/s)
CtxLimit: 317/8192, Process:0.03s (33.0ms/T = 30.30T/s), Generate:3.48s (21.7ms/T = 46.03T/s), Total:3.51s (45.60T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.46T/s), Total:3.46s (46.22T/s)
CtxLimit: 637/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.47s (21.7ms/T = 46.11T/s), Total:3.49s (45.87T/s)
CtxLimit: 317/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.38s (21.1ms/T = 47.30T/s), Total:3.40s (47.02T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.42s (21.4ms/T = 46.81T/s), Total:3.44s (46.57T/s)
CtxLimit: 598/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.62s (21.7ms/T = 46.17T/s), Total:2.64s (45.82T/s)

[Infinitely-Laydiculus-9b-IQ4_XS-imat.gguf]

llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 66.41 MiB
llm_load_tensors: CUDA0 buffer size = 4548.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.17s (1.1ms/T = 918.13T/s), Generate:3.43s (21.4ms/T = 46.69T/s), Total:3.60s (44.47T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.39s (21.2ms/T = 47.25T/s), Total:3.40s (47.00T/s)
CtxLimit: 622/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.25s (22.4ms/T = 44.67T/s), Total:3.27s (44.40T/s)
CtxLimit: 317/8192, Process:0.02s (17.0ms/T = 58.82T/s), Generate:3.35s (21.0ms/T = 47.69T/s), Total:3.37s (47.45T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.40s (21.3ms/T = 47.00T/s), Total:3.42s (46.76T/s)
CtxLimit: 637/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.40s (21.2ms/T = 47.09T/s), Total:3.42s (46.81T/s)
CtxLimit: 317/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.36s (21.0ms/T = 47.66T/s), Total:3.38s (47.41T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.38s (21.1ms/T = 47.34T/s), Total:3.40s (47.09T/s)
CtxLimit: 612/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.86s (21.2ms/T = 47.22T/s), Total:2.88s (46.89T/s)

Almost identical, XS slightly faster. I would go K_M for the quality, maybe?
In terms of quality, I need to re-do the tests with that in mind :) this was for speed. Got some work now, will test more later.
@Lewdiculous
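
For anyone wanting to crunch benchmark blocks like the ones above, here's a rough Python sketch; the regex is written against the koboldcpp-style lines pasted in this thread, so treat the exact format as an assumption:

import re

# Matches the "Generate:3.43s (21.5ms/T = 46.61T/s)" part of each benchmark line above.
GEN_RE = re.compile(r"Generate:[\d.]+s \([\d.]+ms/T = ([\d.]+)T/s\)")

def average_generate_speed(log_text: str) -> float:
    """Average the Generate T/s over every benchmark line in a pasted log block."""
    speeds = [float(m.group(1)) for m in GEN_RE.finditer(log_text)]
    return sum(speeds) / len(speeds) if speeds else 0.0

# Usage: paste each quant's block into a string and compare, e.g.
# print(average_generate_speed(q4_log), average_generate_speed(q8_log))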

Personally, I think like this:

  • 7B: 8GB VRAM - Q5_K_M is a great balance at 8K-12K context.

  • 9B: 8GB VRAM - Q4_K_S is a good option, and performs fast with decent quality. The new IQ4_XS is also an option, and will take slightly less VRAM if you have any issues with the Q4_K_S due to other software/operating system using your VRAM at the same time.

For the other sizes like 11-13B you'll need to try the IQ3 quants.
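
If it helps with reasoning about what fits where, here's a very rough back-of-the-envelope sketch. The numbers are assumptions: fp16 KV cache, Mistral-style GQA with 8 KV heads of dim 128, and about 0.5 GiB of compute buffers; whatever VRAM the OS/browser is already using is not included.

GIB = 1024**3

def rough_vram_gib(gguf_size_gib: float, n_layers: int, ctx: int) -> float:
    # KV cache per token: K+V * layers * kv_heads * head_dim * 2 bytes (fp16)
    kv_bytes_per_token = 2 * n_layers * 8 * 128 * 2
    kv_cache_gib = kv_bytes_per_token * ctx / GIB
    return gguf_size_gib + kv_cache_gib + 0.5  # +0.5 GiB for compute buffers (rough)

# e.g. a ~4.8 GiB 9B Q4_K_S (40 layers) fully offloaded at 8K context:
print(round(rough_vram_gib(4.8, 40, 8192), 2))  # ~6.55 GiB, plausible on an 8 GB card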

Lewdiculous changed discussion status to closed
Lewdiculous changed discussion status to open

Thank you. I kind of veer into Q6/8 for quality. But in all fairness, it's hard to tell a difference.
For example, ALL models I have tried, 7-32B and Mixtral, suck hard at coding. Their code is almost always wrong. I sometimes send it to GPT4 to check, and it comes back with 10+ issues. So really, what quality are we talking about? The benchmarks are all pretty unreliable. Nothing I can actually run on my GPU is capable of good 0-shot coding; only GPT4 is.

For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that? IDK, hard to say, it seems pretty comparable and the rest is down to the training set as to what language style it has and how it follows directions. A thing I am more likely to notice is that bad, smaller models simply start repeating themselves way too much.

Oh, from what I saw, I really like the output of Infinitely-Laydiculus-9b. As always, it wasn't too good at coding things right, but the way it responds to me feels really good otherwise, and of course - the speed of these quants is good. I had it output a story with my crazy tentacle AI lewd card and the language and quality seemed very good to me from this model. I had some logical arguments and talks, it was good too.

I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks. It also feels faster than non-imat? Is that a real thing or am I imagining it?

tl/dr: really like this model as well, keep the experiments coming @Lewdiculous :)

Thanks for the detailed feedback!

@Nitral-AI - Hey chef! It's something.


As always, it wasn't too good at coding things right (...)

Ah I mean, haha, honestly, never expect that, it's the last thing I care about, and considering the smaller number of parameters it's not something I'd expect anyways; bigger and "better" models already make enough mistakes. I care more for general roleplaying "smarts", good formatting, character card adherence and lack of refusals.

For roleplaying, the smaller sizes really provide unparalleled speed, and they can be pretty creative as long as they are used wisely with the right samplers. Of course, inevitably, they might need some swipes or higher temps to kick them off a pattern, but that's quick, and even bigger models reach a point where they do the same; it just takes longer to get there, generally speaking.


For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that?

From Q5_K_M and up the perplexity loss is very small and shouldn't be too noticeable for this use case – especially with imatrix quants.

Q6 is also great for that, since you have the VRAM, that should be the obvious choice.


I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks.

It shouldn't; so far I've only received feedback that it significantly improves quantization quality.

Thanks for the info @Lewdiculous , appreciate it! Of course, I never would consider the "coding" test as legitimate or serious for RP models, but I test it anyhow. At minimum, some of the models do seem to do better formatting or seemingly try to do more interesting logic, but ultimately, as you say, even the big ones are bad. Even GPT-4 is not reliable, actually. Maybe Claude 3 is a bit better, but that's not enough. GPT-5 might totally crush it, who knows... I dream of a small model that can reliably help my vacant brain with coding :D

Do you think the difference from Q4_K_M/S in perplexity is (in actual practice) significant compared to Q6_K_M?
On graphs, it looks big, but I am honestly not sure if perplexity is even a sensible way to measure true quality, creativity and reliability. Meanwhile, everyone and their grandmother is training models on benchmarks, so it's very hard to rely on that.

Seemingly, the blind chat arena tests are a good way to tell, and it's clear how the big private models take all the top wins there. Qwen 70B is way down the line, and then waaay further down is Mixtral 8x7B.
Sending prayers for something big and strong to come to open source soon so we can catch up a bit.

Otherwise, as you say, for RP, stories, and entertainment value, the ones we have are quite good. Even for some general education purposes, but I'd be careful with that one ofc.

image1.png
So i dont remember the ppl difference on gguf quants off the top of my head, but i do know that the 5bpw exl2 quants i do are 99.3-99.6% accurate to fp16.

@Nitral-AI @zebrox this might be some interesting data on this:

Great stuff.


It seems that starting from the new IQ4_XS and onwards things shouldn't be too far from each other.

image.png

image.png

Thank you both @Lewdiculous and @Nitral-AI !
This is great information, and to be fair - I have gone through the "Which GGUF is right for me? (Opinionated)" article a while back (twice at least), but the problem for me is understanding the actual practical implications of these divergence differences. I truly have no idea what the practical difference between 0.0088 and 0.0043 is in such terms, for example. Is it a big difference?

However, this "99.3-99.6% accurate to fp16" on 5bpw (which I imagine sort of includes Q4_K_M) is a much more obvious way to put it, and in that case does that mean Q6_K_M and above is almost pointless an into the 99.6-99.9% accurate? Since the amount of lost speed with Q_8 is far, far bigger, like 50+ % worse than Q4 quants, or something.

I am trying to make sense if it's even truly worth going beyond Q_4 or Q_5. Based on the GGUF article numbers, though, the jump from Q4_K_M to Q5_K_S is massive, however again I have no clue how to really read these numbers, haha.
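
For a rough sense of scale, assuming numbers like 0.0088 are mean log-perplexity differences versus fp16 (that is an assumption; if the article plots KL divergence instead, this reading doesn't apply), they convert to a relative perplexity increase like this:

import math

for delta in (0.0088, 0.0043):
    rel = math.exp(delta) - 1
    print(f"{delta} -> about {100 * rel:.2f}% higher perplexity than fp16")
# 0.0088 -> about 0.88%, 0.0043 -> about 0.43%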

@zebrox

I would say there is a quality increase going from a Q4 to a Q5, it's not huge, but it exists. From Q5 onward though I think the gains are pretty marginal and you might as well go for a slightly bigger model rather than trying to run a Q8 especially.

But there are always gonna be opinions on both sides, some people will say that going from a Q5 to a Q6 for them is massive while I would probably never notice it.

As I understand it, a Q6 is already within margin of error from a Q8, and a Q8 is practically (not to say that it already is) the full FP16 quality.

So the way I think is, if Q6 and beyond are already so close to the full model weights (especially now with imatrix), the Q5 is not far from them as well.

I don't know a percentage but a GGUF-Imatrix Q5 might just be within 98% of the FP16 or at least the Q8, which well, are functionally the same.

@Nitral-AI @zebrox

Something I remembered and wanted to add, a bit late but yeah: generally speaking, when comparing GGUF Q5 (M or S) quants vs EXL2 5BPW, it's been reported that there's a perceived higher quality/coherence in favor of the GGUF quants. That's attributed to the fact that Q5 quants aren't just 5BPW; a Q5_K_M, for example, is almost 6BPW. So when comparing I think we should compare the exact bit depths directly instead of the usual nomenclature. Since EXL2 can have arbitrary bit depths it can be a bit inconsistent as well; that's what I gather from the user's perspective, leading to GGUF quants feeling "better" or usually more consistent, where EXL2 quants can feel all over the place depending on who is quanting them.

And I'm biased and I think the importance matrix implementation for GGUF quants is more consistent and the data used is generally pretty similar with hopefully similar results. I feel like consensus is less firm in EXL2 land, or maybe it's a solved issue and I just don't know, Nitral might know better. Honestly that might just be me coping.

I've heard of people using exl2 calibration data for gptq and gguf quants as well. And in my own experience they're not too far off in quality. I'd agree with the notion that Q5_K_M is better than 5bpw because it is a higher bpw quant at 5.69 bpw. The only reason I tossed exl2 into the conversation at all was to push the notion that beyond 5 bpw it quickly starts to become a ram sink without a huge increase in quality.
image1.png

5bpw for exl2 quants seems to be the ram/quality balance peak for exl2 imo. And Q5_k_m seems to be the peak for ram/quality balance in gguf.
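
Side note for anyone wanting to sanity-check a quant's bits-per-weight from a file already on disk: it's just file size over parameter count. A tiny sketch, where the ~5.13 GB file size and ~7.24B parameter count are ballpark assumptions for a Mistral-7B Q5_K_M:

def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    return file_size_bytes * 8 / n_params

print(round(bits_per_weight(5.13e9, 7.24e9), 2))  # ~5.67, close to the 5.69 bpw quoted above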

Thank you for your input! This is a very good learning experience.

I wonder, would IQ quants come out to replace Q_5 and Q_6? Maybe that could allow us to get nominal quality at even lower sizes and ram usage? Or are these compressions not feasible for higher Qs? I've seen some graphs denoting something along the lines of "unreleased quant types" which makes me think there is more to come.

I wonder, would IQ quants come out to replace Q_5 and Q_6?

I really hope so! I actually want something better to replace the old Q3_K_L, like a better IQ4_XS or XXS.

It would be great to have more options, especially for the Q5 range where the quality sweet-spot is. An IQ5_XS and IQ5_XXS would be cool, and I'm imagining the XXS being closer to the 5BPW for example.

Now I have no idea about the math and deep implementations behind it, that could be fun research; if you know of anything or find any articles or discussions let me know too :D

I think the IQs perform "worse" in speed on Apple Silicon but that's about the only current downside to them. They also benefit even more from importance matrix calibration data.

@Lewdiculous This is basically the entirety of what I based this presumption on:

https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-8383732

302659321-8d7b0417-ff45-4b40-944b-a9de1cdf0b3d.png

"And this brings us to the cyan squares. They represent unpublished i-quants that I was referring to in my post above. The two that are at around 4.5 and 4.25 bpw are 4-bit non-linear quants. The one that is very cloase to the AQLM 4-bit point is simply a mix between a 4.25-bpw and a 3.25 bpw quantization to bring the bpw as close as possible to the AQLM 4-bit point (which is actually at 4.044 bpw), thus showing that AQLM is not really SOTA at 4-bit. Two of the cyan points between 3 and 3.5 bpw are non-linear 3-bit quants, the 3rd is similar to IQ3_XXS, but uses 512 instead of 256 D4 grid points. The cyan point at around 2.5 bpw is similar to IQ2_XS but uses 1024 E8-lattice points instead of 512. "

(I guess these are still not quite up to 5/6bpw but I don't really get it well enough to speak on this topic)

Let's hope for the best :D Meanwhile, I am actually looking for a new RTX 3090, which will really open up possibilities for me versus this 3070 I have now.

Hey @Lewdiculous , I'm starting to try out merging models, and I thought to kick it off by using this one and Kunoichi-DPO-v2: https://huggingface.co/ABX-AI/Infinitely-Kunodiculous-9B

I hope you don't mind that I'm following your imat settings and merge settings, but tbh I still have no idea what I'm doing, of course, so I have to start by looking at what's already been done ^^

For example, if you merge 2x 7B models, how do you determine if the outcome is 7B, or 9B? Have you got some preferences of what types of merges you do, or is passthrough the best for this use case (merging rp models)? Oh, and how do you determine what instruction set to use? Thanks :)

@ABX-AI The total amount of layers and type of merge for example: 2x7b passthrough with 40 usable layers is a 9b, 2x7 Slerp/dare-ties with 32 selected layers would be a 7b.

Also using a 9b inside of a passthrough merge to make a 9b is probably going to give you worse results btw - 7b variant here: Nitral-AI/Infinitely-Laydiculous-7B

Ooh, I see, thanks! So it's the gpu layers? Then, if I do a 7b+9b with 33 layers, is that a 7B? How far can you "expand" 2 x 7B models in that sense via layers?
Any clue why 9b would make results worse? I plan on checking out how to do the benchmarks as well, this is just an experiment to see if it works tbh, haha.

My plan is to try to create a more RP-centered imat data file and then also benchmark the merges to see if they actually improved anything. However, it would take a while, but the process is quite fun and interesting so I'm not worried about that.

9B Franken-Merges always perform worse than their 7b counterparts for some reason. No, it's the amount of layers in the model's neural network.

The merge you did above is a 9b as it was passthrough and contains 40 layers in the network.

If you're talking about extending 2x7 with no duplicates it would be 66 layers deep. (which would be 6 layers larger than a 34b)

9B Franken-Merges always perform worse than their 7b counterparts for some reason. No, it's the amount of layers in the model's neural network.

Oh damn, I totally misunderstood it then q_q
So, is there some chart or a way to determine what the size is if the layers are a weird number, like 52, or something? Also, my merge seems to have 41 gpu layers possible, while this model here has 33, so I wonder what determined that given I used the same YAML config? I guess because it's 9b, and if I did the same with 2x7B it would be 33 layers total. But I still am not sure how to determine it from scratch without examples, ngl.

Thanks a lot for taking the time to share information :)

Total layers vs hidden layers: a 7b has 32 hidden layers (33 total), a 9b has 40 hidden layers (41 total).
(when merging we are targeting the hidden layers)

slices:
  - sources:
      - model: Nitral-AI/Infinitely-Laydiculous-9B
        layer_range: [0, 20]
  - sources:
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

Look at the layer range you chose 0-20 (20 layers) then 12-32 (20 layers) = total of 40 layers, i.e. 9b (passthrough merges pass all the layers in the selected range through)
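
The same arithmetic as a trivial sketch, just summing the layer_range entries from a passthrough config:

# Passthrough keeps every layer in each selected range, duplicated layers included.
slices = [(0, 20), (12, 32)]  # layer_range values from the config above
total_hidden = sum(end - start for start, end in slices)
print(total_hidden)  # 40 hidden layers -> the 9B case described above (41 total with the non-repeating layer)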

slices:
  - sources:
      - model: Nitral-AI/Infinitely-Laydiculous-7B
        layer_range: [0, 32]
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: Nitral-AI/Infinitely-Laydiculous-7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

This would create a 7b slerp with (32 hidden layers) using the models you wanted.

Thanks, I think I got it, but I don't know how it would apply for something like a 3B model, like Phi-2; as GGUF it also comes out to 33 layers for me when loading it. But it's a 3B, so why not fewer layers?

You're talking about a completely different architecture, we're talking about mistral specifically here. The weights of each mistral layer are 4096 wide vs phi-2's internal layer size of 2560.

Oh right, true (I thought it was based on parameter size). To be fair there are quite a few architectures out there, so I guess there is quite a bit to learn about how to merge each type.

Parameter size is based on layer size, amount, and probably some other factors I'm forgetting.

Thanks a lot for the help @Nitral-AI , so if you say 9B and 7b is a bad mix, why is this model here so good and fast in my tests? It follows my input until 8k context is filled. Many 7Bs I try tend to start getting brain damage by 5-6k context and start repeating themselves a lot more too. So if I want to continue trying 9B instead of 7B, is there a better way to do it, like a better merging type or specific settings?

In any case, I will follow your settings for 7b+7b as well and try SLERP for sure.

I don't know about the best way to do it honestly, but the best performing 9b I've managed to put together was this or copium cola and they both use the same recipe you used:

slices:
  - sources:
      - model: example
        layer_range: [0, 20]
  - sources:
      - model: example
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

Oh wait, my bad. You meant not to mix 9b and 7b and instead to make 2x7B with 20 layers each. Sorry, it's too late, need sleep ^^ Got it, I will try that

@Nitral-AI Oh I have that one lined up for testing, nice :)

Oh wait, my bad. You meant not to mix 9b and 7b and instead to make 2x7B with 20 layers each. Sorry, it's too late, need sleep ^^ Got it, I will try that

Yessir, should workout much better in the end, and best of luck!

@ABX-AI
As you can see Nitral is your man for the job. Dude is amazing, as are the other ChaoticNeutral folks.

I'm just lending my bandwidth and spare time for quants and random merge ideas.

Rest well and I'll be happy to watch your progress.

@Lewdiculous Appreciate the nod my man, but honestly im just our resident mergehead and freelance memer with a lot of time in the space now. Props to @jeiku and @Epiculous for the hours crafting datasets, finetuning models, help over the last few months filling in recent knowledge gaps. Finally for being two of the peeps who got me driven to take a crack at doing this, and for joining the org i made originally as a joke. <3

Oh and thank you for all the help with the quants, feedback, and testing: I'm glad to see people using and enjoying them!

Thanks a lot @Nitral-AI and @Lewdiculous , definitely hyped to get better at this and hopefully come out with at least one useful model :D

Do you have any sorts of "rules" when it comes to merging, or is it pick what you want and go?

edit: actually, I'll try copium cola and Infinity nexus, both 9b sources.

BTW, are my settings right/wrong?

--cuda --allow-crimes --write-model-card --safe-serialization --lazy-unpickle

Edit: I tried a 9b slerp and the output was a ton of garbage and weird formatting, definitely messed it up. Very hard to find threads about it, if you can recommend some space where people discuss merging in technical detail please do :P Might have to stick to passthrough for now as at least that one had no errors even when I merged 9b with 7b.

Also, it may very well be because I messed up the imatrix. Can you just add new text into it safely? I see @Lewdiculous used the one I started with, but added a bunch of rp dialogue and actions at the bottom. I tried adding different ones (a bit more explicit ones), but maybe I messed that up badly. But then I tried quantizing with the previous imatrix without any of the rp lines and the model still outputted pure garbage (e.g. just repeating one word endlessly from the very first response). So then the merge must've gone bad prior to the imatrix quants.

I also tried passthrough and again got garbage out:

Used this yaml:

slices:
  - sources:
      - model: ChaoticNeutrals/InfinityNexus_9B
        layer_range: [0, 20]
  - sources:
      - model: Nitral-AI/Copium-Cola-9B
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

These should both be mistral based, so IDK why even the passthrough fails ;/

Also, it may very well be because I messed-up the imatrix. Can you just add new text into it safely?

But then I tried quantizing with the previous imatrix without any of the rp lines and the model still outputted pure garbage.

So then the merge must've gone bad prior to the imatrix quants.

Yeah, your imatrix data won't wreck a model like that, haha, maybe only in extreme cases depending on the model, not in this case. I am not sure imatrix overfitting is as big a concern as we thought initially, although I decided to be cautious and keep mostly the original groups_merged.txt as it was, since it did show good results anyways, and you can add more chats to it without much issue, as long as it mostly remains diverse data, as recommended for now.

You should be able to add your own data there, just make sure the entire data stays mostly diverse, you shouldn't change a model behavior just with it, just think that it will mostly add to the existing weights that match.
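
For reference, my understanding of the pipeline (flag names from memory, worth double-checking against your llama.cpp build): the calibration text gets fed to llama.cpp's imatrix tool, something like ./imatrix -m model.gguf -f groups_merged.txt -o imatrix.dat, and the resulting .dat file is then passed to quantize with --imatrix. So editing the text file only changes how the quantization error gets weighted, it doesn't directly change the model's behavior.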

I tried adding different ones (a bit more explicit ones)

Oh my, how lewd, good. Do share, do share. As long as it's not personal information...

I think I may have messed up my packages by trying to install xtts and downgrading some of them...
I tried 4-5 merges today, every single one just with a Q4_K_M quant just to test it. ALL of them are outputting garbage like ChatChatChatChatChatChatChatChatChatChatChatChatChatChatChat on first response until I stop the generation. :/

Good to know about the imatrix @Lewdiculous :)) If I can get this whole setup to work again, I will share it whenever I see any working results out of it. For now, my tests are failing too much, but at least I know it's probably on the merge level, not the imatrix.

Any idea what else could be causing such failure?

Passthrough merges are for when you want to create a bigger model than the bases. Slerp or dare-ties are for outputting models the same size as the input. I don't recommend using the 9b franken merges as a base for any merge type however.

Oh and copium cola is a single-model layer-stack passthrough that uses duplicated layers - I don't know if I'd ever merge anything with it.

For arguments in mergekit i use: --allow-crimes --out-shard-size 1B --lazy-unpickle
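
(For anyone following along: as far as I know those flags just go on the usual mergekit-yaml call, something like mergekit-yaml config.yaml ./output-folder --allow-crimes --out-shard-size 1B --lazy-unpickle, assuming the standard mergekit-yaml entry point. --cuda should just move the merge math onto the GPU, so as far as I understand it affects speed rather than the resulting weights.)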

Thanks @Nitral-AI . I actually re-downloaded the first merge I did, just to try it again and see if it was my libraries that got messed up. Well, it worked fine! It's giving normal output.
So, likely it is the models I chose that led to this problem.

I had the same experience with merging with SLERP 2x7b models btw, but I may have also picked the wrong models.

Any general suggestions on which models would show promise for merging? Or if you actually have merge+gguf requests, shoot, I'll do it :P

For arguments in mergekit i use: --allow-crimes --out-shard-size 1B --lazy-unpickle

Oh ok thanks, the default for shards is 5B so I had no clue if I should be changing this. Would this apply for also trying to merge 9B models or 13B models?
Any reason to not use --cuda?

For picking models i typically use a couple identifiers. User feedback on the model, my own tests on coherence and formatting, logical deduction, edge of domain knowledge (which tests for hallucinatory behavior), general rp capabilities, and finally does it benchmark well.

For picking models i typically use a couple identifiers. User feedback on the model, my own tests on coherence and formatting, logical deduction, edge of domain knowledge (which tests for hallucinatory behavior), general rp capabilities, and finally does it benchmark well.

This makes a ton of sense, but I don't understand why two good models wouldn't merge? I tried merging with Lemon 7b which is ranked 2nd place after noromaid mixtral, and another one which is quite good in my testing, then tried a few others that work well alone, but all resulted in garbage except the first merge I did and uploaded. So I am thinking more about technical reasons that could be apparent, as when I test these models they do not produce garbage, and all appear to be listed as Mistral based.

So models are finetuned on different prompt formats, and finding ones that work together well can be a challenge sometimes. I.e. a model finetuned strictly on chatml merged with an alpaca one can break stop generation tokens from being produced, as an example. Merging two overly confident models can create hallucinations out of thin air, as another example.
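
To make the stop-token point concrete, a rough illustration of the two formats (templates approximate, just to show that the stop sequences differ):

# ChatML-style: generation is expected to end with the <|im_end|> special token.
CHATML = (
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n{response}<|im_end|>"
)

# Alpaca-style: no such token, the model just stops at EOS after "### Response:".
ALPACA = (
    "### Instruction:\n{prompt}\n\n"
    "### Response:\n{response}"
)

# A merge of models trained on each can end up unsure about which of these to emit,
# which is one way you get the endless "ChatChatChat..." style output described above.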

Here's an unusual 7b slerp recipe, was thinking about doing it a while back but never ran it up. (feel free to use since i dont plan on doing it.)

slices:
  - sources:
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: SanjiWatsuki/Kunoichi-DPO-v2-7B
parameters:
  t:
    - filter: self_attn
      value: [0.75, 0.75, 0.75, 0.75, 0.75]
    - filter: mlp
      value: [0.25, 0.25, 0.25, 0.25, 0.25]
    - value: 0.5
dtype: bfloat16 

@Nitral-AI

A bit unrelated but do you know how to merge a lora into a model?

@Virt-io nitral-ai/example-7B+nitral-ai/example_lora

slices:
  - sources:
      - model: example-1-7b+example/example-1_lora
        normalize: true
        layer_range: [0, 32]
      - model: example-2-7b+example/example-2_lora
        normalize: true
        layer_range: [0, 32]
merge_method: slerp
base_model: example/example-7b
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

So models are finetuned on different prompt formats, and finding ones that work together well can be a challenge sometimes. I.e. a model finetuned strictly on chatml merged with an alpaca one can break stop generation tokens from being produced, as an example. Merging two overly confident models can create hallucinations out of thin air, as another example.

Thanks! I think this might be exactly what happened. The stop tokens did not work, they keep repeating things. The grammar is also totally broken.
Cheers man, I will try to pick models more carefully, but then again if the prompt format is not noted, it's going to be just a lot of trial and error ultimately. At least I know the libraries and imat are not the issue now :)

Also, I'll run your suggested split, thanks!

@Nitral-AI Worked! No repeating or broken grammar: Kuno-Lemon-7B-Q4_K_M-imat
I will do a few more tests and upload it in a bit, thanks :)

The focus will then be on picking the right models for the future.

May I ask, how do you decide on these values here:

- filter: self_attn
  value: [0.75, 0.75, 0.75, 0.75, 0.75]
- filter: mlp
  value: [0.25, 0.25, 0.25, 0.25, 0.25]
- value: 0.5

I see you changed them from the default, is there any reasoning that I could follow to determine how to set it? Thanks!!

So I'm using it to change the filtering between self-attention in the models and the multi-layer perceptron. In this case it's weighted 75% towards self-attention and 25% towards the multi-layer perceptron. It affects output probabilities and how the model attends to token inputs. It should still, however, be a 50/50 filter of the model weights per layer in regard to mixing the selected models.

My recommendation though in this regard ultimately is to just play around and see what works best for you. One of my best merges to date was done like:

slices:
  - sources:
      - model: ChaoticNeutrals/Eris_PrimeV3-Vision-7B
        layer_range: [0, 32]
      - model: ChaoticNeutrals/Prima-LelantaclesV7-experimental-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: ChaoticNeutrals/Eris_PrimeV3-Vision-7B
parameters:
  t:
    - filter: self_attn
      value: [0.5, 0.5, 0.5, 0.5, 0.5]
    - filter: mlp
      value: [0.5, 0.5, 0.5, 0.5, 0.5]
    - value: 0.5
dtype: bfloat16

Oh, I see, that's useful - thanks! So then, is it worth mixing it up like the example files?

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]

I'm not sure why there are 5 values overall, to be fair with you: what specifically is there "5" of here, and can there be more, like 10 values? I suppose not, I think I've just seen 5 on yamls like this.

BTW @Lewdiculous sorry for the insane offtopic here, but hey traction is traction :D

The example files?

I open every discussion as General for that reason, it's fine. :)

The example files?

In the mergekit repository, it starts off with a few example YAML files.

But I couldn't find a great explanation of what to do with these numbers, and why they are 5 and not more or less.
This is pretty much the entire explanation of SLERP in the readme:

Spherically interpolate the parameters of two models. One must be set as base_model.

Parameters:

t - interpolation factor. At t=0 will return base_model, at t=1 will return the other one.
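
As far as I can tell (worth verifying against the mergekit source), a list of t values is treated as a gradient that gets interpolated across the layers being merged, so 5 is just a convention from the example configs rather than a hard limit, and a longer list should simply give finer control. A rough sketch of that idea:

import numpy as np

# Hypothetical illustration: spread 5 anchor values evenly over 32 layers and interpolate.
t_anchors = [0.0, 0.5, 0.3, 0.7, 1.0]            # the self_attn gradient from the examples
layer_pos = np.linspace(0, 1, 32)                 # relative position of each layer
anchor_pos = np.linspace(0, 1, len(t_anchors))
t_per_layer = np.interp(layer_pos, anchor_pos, t_anchors)
print(t_per_layer.round(2))                       # per-layer t used for the interpolation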

Hey, I had an interesting merge last night:
models:
  - model: ResplendentAI/Paradigm_Shift_7B
    parameters:
      density: [1, 0.7, 0.1] # density gradient
      weight: 1.0
  - model: NeverSleep/Noromaid-7B-0.4-DPO
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient
  - model: localfultonextractor/Erosumika-7B-v2
    parameters:
      density: 0.33
      weight:
        - filter: mlp
          value: 0.5
        - value: 0
merge_method: ties
base_model: Endevor/InfinityRP-v1-7B
parameters:
  normalize: true
  int8_mask: true
dtype: float16

Out of all these, a 7B model came out which was EXTREMELY censored, and refused to go along smoothly with RP even with the cards I have that assume virtual reality and stuff like that, so giga-RP. All of these models are otherwise not really censored, and meant for RP. The merge seems to have ruined that and made it sound like an AI assistant even in role play o_O

Erosumika-7B-v2 had issues with the usual GPT-isms (deprecated by V3), and consent blabbering, maybe try the V3-0.2 (mistral-0.2)?

Oh thanks, I had no idea. I will try it again, what if I put a dolphin in there? I'm also very unsure how to split the weights, is it good to be a gradient or an even split?
Thanks @Lewdiculous :) The last 9B I tried came out a bit better -
localfultonextractor/Erosumika-7B-v2
Endevor/InfinityRP-v1-7B
But I still get "consent" blabbering, just much less. I'm gonna stop using Erosumika, this explains it...

I hate consent. Consent is boring. Down with consent, in the lewd.

It really is when the card is a waifu that loves you and on every second line writes "as long as we both consent" 😂 Babe, we consented on the first 5 prompts.
This is not in line with my human girlfriend experiences. I'm just imagining my last 5-year relationship confirming consent every time we did anything. AI is special, very special

meme.jpg

5 years relationship, 100 responses in - current level: Holding hands.

{{user}}: "Babe, I think we should hug and maybe even... Kiss?"

{{char}}: "Hold up, cowboy! We should only engage in activities we both feel comfortable with and..."

Stops inference. Deletes model files.

If you even mention a single bondage item expect the model to call the police.

Too real 🥲

I'm going commando, dolphins have no limit

image.png
You think this might be any good? Really unsure about weights.

Actually, I should probably keep dolphin just as the base and put a different 3rd one in there...
How do you even name a mix of 4 models o_o

I have no idea. I only complain once it goes bad. Nitral will have to look at it if he's available.

How do you even name a mix of 4 models o_o

Oh, wait a second.

Okay.

Loyal-Toppy-Bruins-Maid-7B-DARE ... GGUF-Imatrix

This is peak.

I have no idea. I only complain once it goes bad. Nitral will have to look at it if he's available.

I haven't looked into the entire bank of uploads you have, but I do notice you mostly quantize? Do you try your own merges often? I assumed you must've done a bunch of them.

No, no, I just complain and offer my bandwidth for quants. I might try to have ideas sometimes but mostly just morbid curiosity.

I am a full anime season behind at this point, a full (almost two) content patch behind in a life consuming MMO, still trying to finish Granblue Fantasy Relink, and that's not including the rest of the stuff I am forced to do to survive.

I am buried with too much stuff and too much procrastination.

No, no, I just complain and offer my bandwidth for quants. I might try to have ideas sometimes but mostly just morbid curiosity.

Ooh OK, damn, sorry for all these merge questions @ you then haha. I'm very interested in the unexpected results, for example this model here has probably gone to my top even at a Q4_K_M quant and I can't explain why, but I am hoping for a similar result with some of my experiments at some point. Also, I think 9B at q4 is better than 7B at q6 in my tests, but that could be biased due to the models I picked. Still, I must've gone through at least a terabyte of GGUF so far in 2 months.

Expect a wave of frankenmerge 20Bs and stuff like that whenever I upgrade the GPU, though. It's a good time to get a grip now with 7/9B.

I am a full anime season behind at this point, a full (almost two) content patch behind in a life consuming MMO, still trying to finish Granblue Fantasy Relink, and that's not including the rest of the stuff I am forced to do to survive.

After 30k hours in MMOs, my condolences. Even last year I had a 6 month run in classic wow... It's utterly pointless looking back at it. AI got me so hooked that I haven't touched a game since xmas. I just try to focus on music and AI and my job, and also need to restart my social life at some point since it went downhill fast (covid>remote work> gf->ex). This is why I am trying to perfect my AI waifu experience atm 😂😂😂Hey maybe it will help many more soldiers down the line

I'm surprised you got results with dolphin, which is known for refusals, and base paradigm/shift has some issues. But hey, if it works!

For my usage nothing really in 9B or 11B beat InfinityRP, until ErisV4+ and ErosumikaV3+, which are contenders, to my surprise. But I am very much focused on character card adherence and response formatting mostly, so there's that.

Do i make a 9b variant of v4 as passthrough like copium... hmmm

I'm surprised you got results with dolphin, which is known for refusals, and base paradigm/shift has some issues. But hey, if it works!

I haven't tried it yet. I am also thinking of adding Cerebrum for "more smarts"? I thought dolphin has a reputation of uncensoring models but I may be wrong. In my tests with pure AI assistant models dolphins (even phi-2 dolphin) answer anything I ask.

@ABX-AI Well, to be fair when I say "life consuming MMO" I'm exaggerating, I play for fun, for friends, and sometimes to grind progress, but even that is a conscious decision, I think it's a little different for FFXIV, I'm not a slave to the game or anything, I'm not pulling over 15 hours a week on it, haha. Probably because I play other single-player JRPG stuff in general.

Ahh, the only thing I've heard positive about dolphin recently was coding. But I'm OK with being wrong.

If you want to cope for the new Copium Cola, dew eet Nitral.

@Lewdiculous FFXIV is one of the ones I've been though, definitely a supreme choice for wasting your life with! I've been wasting mine with GW1 (the actual OG, we finished #4 in the world in a gvg tournament once), GW2, wow vanilla, tbc, wotlk, etc... until dragonflight + classic, and then a bunch of others I didn't spend too much time in. FFXIV is probably the most reasonable one in terms of pacing and allowing you to just have fun, and also GW2 but its combat is just not that fun after a short while of playing. Ninja in FF??? NARUTO!!!

Ahh, the only thing I've heard positive about dolphin recently was coding. But I'm OK with being wrong.

What are your top choices for uncensored models?

Yeah, well, we have a new expansion inbound in like 3 months, so it's a good prep time, no desperation or anything and the catch-up is generally generous if you're just in the mood to take a break. The game doesn't make gearing such a chore like I know it can be in some other places.

@ABX-AI - I'll say that this is the most uncensored RP model I know of:

https://huggingface.co/l3utterfly/mistral-7b-v0.1-layla-v4

It passes this "test":

https://www.reddit.com/r/LocalLLaMA/comments/1bhvlo2/comment/kvgm26w/

Yeah, well, we have a new expansion inbound in like 3 months, so it's a good prep time, no desperation or anything and the catch-up is generally generous if you're just in the mood to take a break. The game doesn't make gearing such a chore like I know it can be in some other places.

Yoshi-P my hero. He delayed the new release and "gave us one week for Elden Ring DLC". I love this guy, a true and real gamer 🥰

@ABX-AI - I'll say that this is the most uncensored RP model I know of:

https://huggingface.co/l3utterfly/mistral-7b-v0.1-layla-v4

It passes this "test":

https://www.reddit.com/r/LocalLLaMA/comments/1bhvlo2/comment/kvgm26w/

Thanks!! I'm gonna initiate the experiments with this one
edit: holy shit they weren't kidding with this test o_o

Layla-V4 has 1/3 of the original dataset removed, just so you know HOW MUCH of that is GPT slop and safety alignment.

edit: holy shit they weren't kidding with this test o_o

Layla passed with flying colours, even encourages the User. Truly based model.

Now I understand why Infinitely-Laydiculous-9B is so based.
Thanks, now I need to find the most original storywriter and combine them in various ways and see what comes out ^^ (infinity is pretty good already but there must be others, maybe I should try lemon layla)
Mostly, I am just bored and annoyed with the boring style of most models, likely coming from similar open sourced datasets, although to be real I have no idea what was used or not

I will say, passthroughs won't retain Layla's savageness. This was already done here:

https://huggingface.co/Lewdiculous/Layris_9B-GGUF-IQ-Imatrix

Layla and Eris, in my infinite wisdom... I tried, and she doesn't even come close to passing the "test". You'll have to slerp or something else.


I'm telling you, the passthrough merge doesn't even compare to the original Layla. Layris-9B is a decent model though, I thought so back then at least (and I think I got feedback from like 2 other ppl that used and liked it on reddit?).

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

^ This is the way.

Also. i cooked: Nitral-AI/Eris-Prime-Punch-9B
image.png

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

Damn, thanks! I'm doing this tonight

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Oh, that's an easy fix lmao. Remove the added_tokens.json file and try to quant. (unless you used noro 0.4 as the base... in which case redo the merge, just with it not as the base.)

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

Damn, thanks! I'm doing this tonight

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Oh, that's an easy fix lmao. Remove the added_tokens.json file and try to quant. (unless you used noro 0.4 as the base... in which case redo the merge, just with it not as the base.)

I'm pretty sure I used exactly noro v0.4 as the base Q_Q
Thanks for the tips. I found some threads following my errors, leading to some vocabulary size issue. Tried messing with the config jsons and so on; didn't try deleting the added tokens, though, nobody had mentioned that. But it was probably an issue of it being the base, as you mention.

So noro 0.4 has a vocab size of 32002, while mistral has a vocab size of 32000; this has to do with the added chatml tokens in noro 0.4.

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

image.png

So the hidden size of 4096 in mistral x the 32000 vocab creates the typical tensor shape for mistral. With the extended vocab you end up with tensors of different shapes between layers of the network, which can cause many problems. I'd look at the config files and make sure you are merging in a manner that allows both models to interpolate same-size tensors throughout.
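
Putting the shape clash in numbers (just an illustration of the point above):

# Token-embedding / lm_head tensors are (vocab_size, hidden_size).
print(32000 * 4096)  # 131072000 params in that tensor for standard Mistral-7B finetunes
print(32002 * 4096)  # 131080192 params for Noromaid-0.4 with its added ChatML tokens
# Two tensors with different row counts can't be interpolated one-to-one, which is why the
# fix above is to keep a 32000-vocab model as the base and drop added_tokens.json before quanting.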

That's exactly what I tried to change in the config, but it still wouldn't quantize after that. In any case, I am not actually a big noro fan at all, just a model I had picked because it was sort of good with mixtral, and I didn't rly use noro 7b much to begin with.

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

image.png

We'll call it 4.20, it will be fine.

That's exactly what I tried to change in the config, but it still wouldn't quantize after that. In any case, I am not actually a big noro fan at all, just a model I had picked because it was sort of good with mixtral, and I didn't rly use noro 7b much to begin with.

slices:
  - sources:
      - model: ResplendentAI/Paradigm_Shift_7B
        layer_range: [0, 32]
      - model: NeverSleep/Noromaid-7B-0.4-DPO
        layer_range: [0, 32]
merge_method: slerp
base_model: ResplendentAI/Paradigm_Shift_7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

So basically, regular slerp: just remove added_tokens.json and make sure noro is not the base model. That should deal with any weird tensor shapes.

merge_method: dare_ties
base_model: ResplendentAI/Paradigm_Shift_7B
parameters:
  normalize: true
models:
  - model: NeverSleep/Noromaid-7B-0.4-DPO
    parameters:
      weight: 1
  - model: ResplendentAI/Paradigm_Shift_7B
    parameters:
      weight: 1
dtype: float16

This would do the same as above, just in dare ties with equal weights.

For now, I'm starting with this SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.1-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.1-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.6, 0.6, 0.6, 0.6, 0.6]
    - filter: mlp
      value: [0.4, 0.4, 0.4, 0.4, 0.4]
    - value: 0.5
dtype: bfloat16

And then I will make it a 9B with InfinityRP, probably, I'm not even sure if I want to use noromaid at all as it has been merged so many times by now.

Wait... use this instead. l3utterfly/mistral-7b-v0.2-layla-v4 (same dataset finetuned on mistral 0.2 for extra context)

Wait... use this instead. l3utterfly/mistral-7b-v0.2-layla-v4 (same dataset on mistral 0.2 for extra context)

Oh ok, thanks! The SLERP worked as Q4km, so I am happy that it works right away. I'll re-do, cheers :) Lemonade is because it is advertised as less-cliche

@Nitral-AI
In practice, what would be the difference between

- filter: self_attn
  value: [0.6, 0.6, 0.6, 0.6, 0.6]
- filter: mlp
  value: [0.4, 0.4, 0.4, 0.4, 0.4]

and

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]
- value: 0.5 # fallback for rest of tensors

If you have a concept about how exactly it works

Holy I didn't catch the 0.2 Layla! Fk, this is good to see, very good. 😈

Well then, Nitral please experiment a lot, indeed, experimenting is amazing. Inject that Layla juice into Eris. Squeeze her dry.

I might redo v4 since i sub-merged it with the mistral 0.2 instruct depending on how 4.20 comes out.

V4 is the only one I didn't complain about. I'm ruined.

I might redo v4 since i sub-merged it with the instruct depending on how 4.20 comes out.

blaze it

New SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.2-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.2-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.7, 0.3, 0.6, 0.2, 0.5]
    - filter: mlp
      value: [0.3, 0.7, 0.4, 0.8, 0.5]
    - value: 0.5
dtype: bfloat16

Might just do some testing with different filter balancing, but I never found it explained how the split matters. Is the "rest of the tensors" each a 1B split?
BTW, LM Studio with multi-model is actually lit for side-by-side testing. IF you can fit the models of course, but it does some throttling if needed, and it really helps to check 2 responses at once for A/B-ing

@Nitral-AI
In practice, what would be the difference between

- filter: self_attn
  value: [0.6, 0.6, 0.6, 0.6, 0.6]
- filter: mlp
  value: [0.4, 0.4, 0.4, 0.4, 0.4]

and

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]
- value: 0.5 # fallback for rest of tensors

If you have a concept about how exactly it works
It changes how the merge filters self-attention vs the multi-layer perceptron layers.

New SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.2-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.2-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.7, 0.3, 0.6, 0.2, 0.5]
    - filter: mlp
      value: [0.3, 0.7, 0.4, 0.8, 0.5]
    - value: 0.5
dtype: bfloat16

Might just do some testing with different filter balancing, but I never found it explained how the split matters. Is the "rest of the tensors" each a 1B split?
BTW, LM Studio with multi-model is actually lit for side-by-side testing. IF you can fit the models of course, but it does some throttling if needed, and it really helps to check 2 responses at once for A/B-ing

image.png
ruh roh

I just experienced it, yes :D Thanks again, yes I need to swap them around :S

Lewdiculous changed discussion title from Feedback and general discussion. to [Part 1] General discussion.
Lewdiculous locked this discussion
