[Part 1] General discussion.

#1
by Lewdiculous - opened

Please share any feedback or requests here, I appreciate all inputs.

@zebrox – If you can, tell me how bad this is in comparison to Layris :3

Lewdiculous pinned discussion

wake up babe, new imat dropped :D
Thank you @Lewdiculous for working on these, I will certainly test it. I am going to try both Q8 and Q4_K_M to see how speed/quality compare, keeping the exact same prompt/context and all that.

@zebrox Nothing crazy, just a random idea I requested, haha. Curious how it compares. Thanks to Nitral and Jeiku for doing them.

For Q4, actually, test the Q4_K_S or IQ4_XS if you can, just because it's something I can compare more directly to 7B Q5_K_M in terms of VRAM.

What is the ideal quant for an RTX 3070 with 9B/13B or even 20B? I seem to get great results on 7B, but quants drastically change the speed for me.

[Infinitely-Laydiculus-9b-Q4_K_M-imat.gguf]

llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/41 layers to GPU
llm_load_tensors: CPU buffer size = 2137.23 MiB
llm_load_tensors: CUDA0 buffer size = 4990.62 MiB
...................................................................................................

CtxLimit: 317/8064, Process:0.24s (1.5ms/T = 665.25T/s), Generate:4.07s (25.4ms/T = 39.35T/s), Total:4.30s (37.19T/s)
CtxLimit: 477/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.06s (25.4ms/T = 39.38T/s), Total:4.08s (39.17T/s)
CtxLimit: 575/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.44s (24.9ms/T = 40.16T/s), Total:2.46s (39.81T/s)
CtxLimit: 317/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:4.03s (25.2ms/T = 39.66T/s), Total:4.06s (39.45T/s)
CtxLimit: 477/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:4.06s (25.4ms/T = 39.42T/s), Total:4.08s (39.19T/s)
CtxLimit: 438/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:2.82s (22.6ms/T = 44.29T/s), Total:2.85s (43.94T/s)
CtxLimit: 438/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 623/8064, Process:0.58s (1.7ms/T = 573.66T/s), Generate:3.59s (22.4ms/T = 44.58T/s), Total:4.17s (38.41T/s)
CtxLimit: 719/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:2.20s (22.9ms/T = 43.62T/s), Total:2.22s (43.18T/s)
CtxLimit: 782/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.47s (22.9ms/T = 43.69T/s), Total:1.49s (43.01T/s)
CtxLimit: 782/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (43.48T/s)
CtxLimit: 983/8064, Process:0.76s (2.0ms/T = 503.30T/s), Generate:3.63s (22.7ms/T = 44.09T/s), Total:4.39s (36.48T/s)
CtxLimit: 1143/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.65s (22.8ms/T = 43.79T/s), Total:3.68s (43.53T/s)
CtxLimit: 1154/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:0.23s (21.3ms/T = 47.01T/s), Total:0.26s (42.97T/s)
CtxLimit: 1175/8064, Process:0.03s (25.0ms/T = 40.00T/s), Generate:0.49s (22.1ms/T = 45.17T/s), Total:0.51s (42.97T/s)
CtxLimit: 1175/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 1379/8064, Process:0.79s (1.9ms/T = 527.18T/s), Generate:3.68s (23.0ms/T = 43.47T/s), Total:4.47s (35.78T/s)
CtxLimit: 1539/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.70s (23.1ms/T = 43.23T/s), Total:3.72s (42.96T/s)
CtxLimit: 1591/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:1.19s (23.0ms/T = 43.55T/s), Total:1.22s (42.69T/s)
CtxLimit: 1591/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.02s (41.67T/s)
CtxLimit: 1774/8064, Process:0.79s (1.9ms/T = 529.26T/s), Generate:3.72s (23.2ms/T = 43.01T/s), Total:4.51s (35.51T/s)
CtxLimit: 1934/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:3.75s (23.4ms/T = 42.69T/s), Total:3.77s (42.43T/s)
CtxLimit: 1979/8064, Process:0.02s (23.0ms/T = 43.48T/s), Generate:1.05s (23.3ms/T = 42.98T/s), Total:1.07s (42.06T/s)
CtxLimit: 1979/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:0.00s (1.0ms/T = 1000.00T/s), Total:0.03s (40.00T/s)
CtxLimit: 322/8064, Process:0.43s (9.7ms/T = 103.29T/s), Generate:3.58s (22.4ms/T = 44.73T/s), Total:4.00s (39.97T/s)
CtxLimit: 482/8064, Process:0.02s (22.0ms/T = 45.45T/s), Generate:3.62s (22.6ms/T = 44.17T/s), Total:3.64s (43.91T/s)
CtxLimit: 642/8064, Process:0.02s (24.0ms/T = 41.67T/s), Generate:3.59s (22.5ms/T = 44.53T/s), Total:3.62s (44.24T/s)
CtxLimit: 779/8064, Process:0.06s (56.0ms/T = 17.86T/s), Generate:3.29s (24.0ms/T = 41.69T/s), Total:3.34s (40.99T/s)

[Infinitely-Laydiculus-9b-Q8_K_M-imat.gguf]

llm_load_tensors: offloaded 32/41 layers to GPU
llm_load_tensors: CPU buffer size = 5595.88 MiB
llm_load_tensors: CUDA0 buffer size = 7073.00 MiB
....................................................................................................

CtxLimit: 317/8064, Process:2.07s (13.2ms/T = 75.77T/s), Generate:12.90s (80.6ms/T = 12.40T/s), Total:14.97s (10.69T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.86s (86.6ms/T = 11.55T/s), Total:13.94s (11.48T/s)
CtxLimit: 637/8064, Process:0.09s (88.0ms/T = 11.36T/s), Generate:14.94s (93.4ms/T = 10.71T/s), Total:15.03s (10.65T/s)
CtxLimit: 317/8064, Process:0.08s (78.0ms/T = 12.82T/s), Generate:12.92s (80.7ms/T = 12.39T/s), Total:12.99s (12.31T/s)
CtxLimit: 477/8064, Process:0.08s (80.0ms/T = 12.50T/s), Generate:13.89s (86.8ms/T = 11.52T/s), Total:13.97s (11.45T/s)
CtxLimit: 637/8064, Process:0.09s (87.0ms/T = 11.49T/s), Generate:14.93s (93.3ms/T = 10.72T/s), Total:15.02s (10.65T/s)

@Lewdiculous I did these tests earlier today, I will report back with Q4_K_S or IQ4_XS.
But damn. Already, this is SO good, the Q4_K_M is flying! Speed drops much, much lower on the Q8.

[Infinitely-Laydiculus-9b-Q4_K_S-imat.gguf]

llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: CUDA0 buffer size = 4820.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.21s (1.3ms/T = 747.62T/s), Generate:3.43s (21.5ms/T = 46.61T/s), Total:3.64s (43.92T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.51T/s), Total:3.46s (46.27T/s)
CtxLimit: 613/8192, Process:0.02s (19.0ms/T = 52.63T/s), Generate:2.87s (21.1ms/T = 47.39T/s), Total:2.89s (47.08T/s)
CtxLimit: 317/8192, Process:0.03s (33.0ms/T = 30.30T/s), Generate:3.48s (21.7ms/T = 46.03T/s), Total:3.51s (45.60T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.44s (21.5ms/T = 46.46T/s), Total:3.46s (46.22T/s)
CtxLimit: 637/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.47s (21.7ms/T = 46.11T/s), Total:3.49s (45.87T/s)
CtxLimit: 317/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.38s (21.1ms/T = 47.30T/s), Total:3.40s (47.02T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.42s (21.4ms/T = 46.81T/s), Total:3.44s (46.57T/s)
CtxLimit: 598/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.62s (21.7ms/T = 46.17T/s), Total:2.64s (45.82T/s)

[Infinitely-Laydiculus-9b-IQ4_XS-imat.gguf]

llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 66.41 MiB
llm_load_tensors: CUDA0 buffer size = 4548.80 MiB
...................................................................................................

CtxLimit: 317/8192, Process:0.17s (1.1ms/T = 918.13T/s), Generate:3.43s (21.4ms/T = 46.69T/s), Total:3.60s (44.47T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.39s (21.2ms/T = 47.25T/s), Total:3.40s (47.00T/s)
CtxLimit: 622/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.25s (22.4ms/T = 44.67T/s), Total:3.27s (44.40T/s)
CtxLimit: 317/8192, Process:0.02s (17.0ms/T = 58.82T/s), Generate:3.35s (21.0ms/T = 47.69T/s), Total:3.37s (47.45T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.40s (21.3ms/T = 47.00T/s), Total:3.42s (46.76T/s)
CtxLimit: 637/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:3.40s (21.2ms/T = 47.09T/s), Total:3.42s (46.81T/s)
CtxLimit: 317/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.36s (21.0ms/T = 47.66T/s), Total:3.38s (47.41T/s)
CtxLimit: 477/8192, Process:0.02s (18.0ms/T = 55.56T/s), Generate:3.38s (21.1ms/T = 47.34T/s), Total:3.40s (47.09T/s)
CtxLimit: 612/8192, Process:0.02s (20.0ms/T = 50.00T/s), Generate:2.86s (21.2ms/T = 47.22T/s), Total:2.88s (46.89T/s)

Almost identical, XS slightly faster. I would go K_M for the quality, maybe?
In terms of quality, I need to re-do the tests with that in mind :) this was for speed. Got some work now, will test more later.
@Lewdiculous
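
For anyone wanting to crunch benchmark blocks like the ones above, here's a rough Python sketch; the regex is written against the koboldcpp-style lines pasted in this thread, so treat the exact format as an assumption:

import re

# Matches the "Generate:3.43s (21.5ms/T = 46.61T/s)" part of each benchmark line above.
GEN_RE = re.compile(r"Generate:[\d.]+s \([\d.]+ms/T = ([\d.]+)T/s\)")

def average_generate_speed(log_text: str) -> float:
    """Average the Generate T/s over every benchmark line in a pasted log block."""
    speeds = [float(m.group(1)) for m in GEN_RE.finditer(log_text)]
    return sum(speeds) / len(speeds) if speeds else 0.0

# Usage: paste each quant's block into a string and compare, e.g.
# print(average_generate_speed(q4_log), average_generate_speed(q8_log))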

Personally, I think like this:

  • 7B: 8GB VRAM - Q5_K_M is a great balance at 8K-12K context.

  • 9B: 8GB VRAM - Q4_K_S is a good option, and performs fast with decent quality. The new IQ4_XS is also an option, and will take slightly less VRAM if you have any issues with the Q4_K_S due to other software/operating system using your VRAM at the same time.

For the other sizes like 11-13B you'll need to try the IQ3 quants.
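
If it helps with reasoning about what fits where, here's a very rough back-of-the-envelope sketch. The numbers are assumptions: fp16 KV cache, Mistral-style GQA with 8 KV heads of dim 128, and about 0.5 GiB of compute buffers; whatever VRAM the OS/browser is already using is not included.

GIB = 1024**3

def rough_vram_gib(gguf_size_gib: float, n_layers: int, ctx: int) -> float:
    # KV cache per token: K+V * layers * kv_heads * head_dim * 2 bytes (fp16)
    kv_bytes_per_token = 2 * n_layers * 8 * 128 * 2
    kv_cache_gib = kv_bytes_per_token * ctx / GIB
    return gguf_size_gib + kv_cache_gib + 0.5  # +0.5 GiB for compute buffers (rough)

# e.g. a ~4.8 GiB 9B Q4_K_S (40 layers) fully offloaded at 8K context:
print(round(rough_vram_gib(4.8, 40, 8192), 2))  # ~6.55 GiB, plausible on an 8 GB card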

Lewdiculous changed discussion status to closed
Lewdiculous changed discussion status to open

Thank you. I kind of veer into Q6/8 for quality. But in all fairness, it's hard to tell a difference.
For example, ALL models I have tried, 7-32B and Mixtral, suck hard at coding. Their code is almost always wrong. I sometimes send it to GPT4 to check, and it comes back with 10+ issues. So really, what quality are we talking about? The benchmarks are all pretty unreliable. Nothing I can actually run on my GPU is capable of good 0-shot coding; only GPT4 is.

For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that? IDK, hard to say, it seems pretty comparable and the rest is down to the training set as to what language style it has and how it follows directions. A thing I am more likely to notice is that bad, smaller models simply start repeating themselves way too much.

Oh, from what I saw, I really like the output of Infinitely-Laydiculus-9b. As always, it wasn't too good at coding things right, but the way it responds to me feels really good otherwise, and of course - the speed of these quants is good. I had it output a story with my crazy tentacle AI lewd card and the language and quality seemed very good to me from this model. I had some logical arguments and talks, it was good too.

I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks. It also feels faster than non-imat? Is that a real thing or am I imagining it?

tl/dr: really like this model as well, keep the experiments coming @Lewdiculous :)

Thanks for the detailed feedback!

@Nitral-AI - Hey chef! It's something.


As always, it wasn't too good at coding things right (...)

Ah I mean, haha, honestly, never expect that, it's the last thing I care about, and considering the smaller number of parameters it's not something I'd expect anyways; bigger and "better" models already make enough mistakes. I care more for general roleplaying "smarts", good formatting, character card adherence and lack of refusals.

For roleplaying, the smaller sizes really provide unparalleled speed, and they can be pretty creative as long as they are used wisely with the right samplers. Of course, inevitably, they might need some swipes or higher temps to kick them off a pattern, but that's quick, and even bigger models reach a point where they do the same; it just takes longer to get there, generally speaking.


For RP, reasoning, general questions - even lower quants seem fine to me. Is it hallucinating more on that?

From Q5_K_M and up the perplexity loss is very small and shouldn't be too noticeable for this use case – especially with imatrix quants.

Q6 is also great for that, since you have the VRAM, that should be the obvious choice.


I have basically started leaning a lot more into only models with imat/DPO, I'm not sure if it has any drawbacks.

It shouldn't; so far I've only received feedback that it significantly improves quantization quality.

Thanks for the info @Lewdiculous , appreciate it! Of course, I never would consider the "coding" test as legitimate or serious for RP models, but I test it anyhow. At minimum, some of the models do seem to do better formatting or seemingly try to do more interesting logic, but ultimately, as you say, even the big ones are bad. Even GPT-4 is not reliable, actually. Maybe Claude 3 is a bit better, but that's not enough. GPT-5 might totally crush it, who knows... I dream of a small model that can reliably help my vacant brain with coding :D

Do you think the difference from Q4_K_M/S in perplexity is (in actual practice) significant compared to Q6_K_M?
On graphs, it looks big, but I am honestly not sure if perplexity is even a sensible way to measure true quality, creativity and reliability. Meanwhile, everyone and their grandmother is training models on benchmarks, so it's very hard to rely on that.

Seemingly, the blind chat arena tests are a good way to tell, and it's clear how the big private models take all the top wins there. Qwen 70B is way down the line, and then waaay further down is Mixtral 8x7B.
Sending prayers for something big and strong to come to open source soon so we can catch up a bit.

Otherwise, as you say, for RP, stories, and entertainment value, the ones we have are quite good. Even for some general education purposes, but I'd be careful with that one ofc.

image1.png
So i dont remember the ppl difference on gguf quants off the top of my head, but i do know that the 5bpw exl2 quants i do are 99.3-99.6% accurate to fp16.

@Nitral-AI @zebrox this might be some interesting data on this:

Great stuff.


It seems that starting from the new IQ4_XS and onwards things shouldn't be too far from each other.

image.png

image.png

Thank you both @Lewdiculous and @Nitral-AI !
This is great information, and to be fair - I have gone through the "Which GGUF is right for me? (Opinionated)" article a while back (twice at least), but the problem for me is understanding the actual practical implications of these divergence differences. I truly have no idea what the practical difference between 0.0088 and 0.0043 is in such terms, for example. Is it a big difference?

However, this "99.3-99.6% accurate to fp16" on 5bpw (which I imagine sort of includes Q4_K_M) is a much more obvious way to put it, and in that case does that mean Q6_K_M and above is almost pointless an into the 99.6-99.9% accurate? Since the amount of lost speed with Q_8 is far, far bigger, like 50+ % worse than Q4 quants, or something.

I am trying to make sense if it's even truly worth going beyond Q_4 or Q_5. Based on the GGUF article numbers, though, the jump from Q4_K_M to Q5_K_S is massive, however again I have no clue how to really read these numbers, haha.
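
For a rough sense of scale, assuming numbers like 0.0088 are mean log-perplexity differences versus fp16 (that is an assumption; if the article plots KL divergence instead, this reading doesn't apply), they convert to a relative perplexity increase like this:

import math

for delta in (0.0088, 0.0043):
    rel = math.exp(delta) - 1
    print(f"{delta} -> about {100 * rel:.2f}% higher perplexity than fp16")
# 0.0088 -> about 0.88%, 0.0043 -> about 0.43%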

@zebrox

I would say there is a quality increase going from a Q4 to a Q5, it's not huge, but it exists. From Q5 onward though I think the gains are pretty marginal and you might as well go for a slightly bigger model rather than trying to run a Q8 especially.

But there are always gonna be opinions on both sides, some people will say that going from a Q5 to a Q6 for them is massive while I would probably never notice it.

As I understand it, a Q6 is already within margin of error from a Q8, and a Q8 is practically (not to say that it already is) the full FP16 quality.

So the way I think is, if Q6 and beyond are already so close to the full model weights (especially now with imatrix), the Q5 is not far from them as well.

I don't know a percentage but a GGUF-Imatrix Q5 might just be within 98% of the FP16 or at least the Q8, which well, are functionally the same.

@Nitral-AI @zebrox

Something I remembered and wanted to add, a bit late but yeah: generally speaking, when comparing GGUF Q5 (M or S) quants vs EXL2 5BPW, it's been reported that there's a perceived higher quality/coherence in favor of the GGUF quants. That's attributed to the fact that Q5 quants aren't just 5BPW; a Q5_K_M, for example, is almost 6BPW. So when comparing I think we should compare the exact bit depths directly instead of the usual nomenclature. Since EXL2 can have arbitrary bit depths it can be a bit inconsistent as well; that's what I gather from the user's perspective, leading to GGUF quants feeling "better" or usually more consistent, where EXL2 quants can feel all over the place depending on who is quanting them.

And I'm biased and I think the importance matrix implementation for GGUF quants is more consistent and the data used is generally pretty similar with hopefully similar results. I feel like consensus is less firm in EXL2 land, or maybe it's a solved issue and I just don't know, Nitral might know better. Honestly that might just be me coping.

I've heard of people using exl2 calibration data for gptq and gguf quants as well. And in my own experience they're not too far off in quality. I'd agree with the notion that Q5_K_M is better than 5bpw because it is a higher bpw quant at 5.69 bpw. The only reason I tossed exl2 into the conversation at all was to push the notion that beyond 5 bpw it quickly starts to become a ram sink without a huge increase in quality.
image1.png

5bpw for exl2 quants seems to be the ram/quality balance peak for exl2 imo. And Q5_k_m seems to be the peak for ram/quality balance in gguf.
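
Side note for anyone wanting to sanity-check a quant's bits-per-weight from a file already on disk: it's just file size over parameter count. A tiny sketch, where the ~5.13 GB file size and ~7.24B parameter count are ballpark assumptions for a Mistral-7B Q5_K_M:

def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    return file_size_bytes * 8 / n_params

print(round(bits_per_weight(5.13e9, 7.24e9), 2))  # ~5.67, close to the 5.69 bpw quoted above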

Thank you for your input! This is a very good learning experience.

I wonder, would IQ quants come out to replace Q_5 and Q_6? Maybe that could allow us to get nominal quality at even lower sizes and ram usage? Or are these compressions not feasible for higher Qs? I've seen some graphs denoting something along the lines of "unreleased quant types" which makes me think there is more to come.

I wonder, would IQ quants come out to replace Q_5 and Q_6?

I really hope so! I actually want something better to replace the old Q3_K_L, like a better IQ4_XS or XXS.

It would be great to have more options, especially for the Q5 range where the quality sweet-spot is. An IQ5_XS and IQ5_XXS would be cool, and I'm imagining the XXS being closer to the 5BPW for example.

Now I have no idea about the math and deep implementations behind it, that could be fun research; if you know of anything or find any articles or discussions let me know too :D

I think the IQs perform "worse" in speed on Apple Silicon but that's about the only current downside to them. They also benefit even more from importance matrix calibration data.

@Lewdiculous This is basically the entirety of what I based this presumption on:

https://github.com/ggerganov/llama.cpp/discussions/5063#discussioncomment-8383732

302659321-8d7b0417-ff45-4b40-944b-a9de1cdf0b3d.png

"And this brings us to the cyan squares. They represent unpublished i-quants that I was referring to in my post above. The two that are at around 4.5 and 4.25 bpw are 4-bit non-linear quants. The one that is very cloase to the AQLM 4-bit point is simply a mix between a 4.25-bpw and a 3.25 bpw quantization to bring the bpw as close as possible to the AQLM 4-bit point (which is actually at 4.044 bpw), thus showing that AQLM is not really SOTA at 4-bit. Two of the cyan points between 3 and 3.5 bpw are non-linear 3-bit quants, the 3rd is similar to IQ3_XXS, but uses 512 instead of 256 D4 grid points. The cyan point at around 2.5 bpw is similar to IQ2_XS but uses 1024 E8-lattice points instead of 512. "

(I guess these are still not quite up to 5/6bpw but I don't really get it well enough to speak on this topic)

Let's hope for the best :D Meanwhile, I am actually looking for a new RTX 3090, which will really open up possibilities for me versus this 3070 I have now.

Hey @Lewdiculous , I'm starting to try out merging models, and I thought to kick it off by using this one and Kunoichi-DPO-v2: https://huggingface.co/ABX-AI/Infinitely-Kunodiculous-9B

I hope you don't mind that I'm following your imat settings and merge settings, but tbh I still have no idea what I'm doing, of course, so I have to start by looking at what's already been done ^^

For example, if you merge 2x 7B models, how do you determine if the outcome is 7B, or 9B? Have you got some preferences of what types of merges you do, or is passthrough the best for this use case (merging rp models)? Oh, and how do you determine what instruction set to use? Thanks :)

@ABX-AI The total amount of layers and type of merge for example: 2x7b passthrough with 40 usable layers is a 9b, 2x7 Slerp/dare-ties with 32 selected layers would be a 7b.

Also using a 9b inside of a passthrough merge to make a 9b is probably going to give you worse results btw - 7b variant here: Nitral-AI/Infinitely-Laydiculous-7B

Ooh, I see, thanks! So it's the gpu layers? Then, if I do a 7b+9b with 33 layers, is that a 7B? How far can you "expand" 2 x 7B models in that sense via layers?
Any clue why 9b would make results worse? I plan on checking out how to do the benchmarks as well, this is just an experiment to see if it works tbh, haha.

My plan is to try to create a more RP-centered imat data file and then also benchmark the merges to see if they actually improved anything. However, it would take a while, but the process is quite fun and interesting so I'm not worried about that.

9B Franken-Merges always perform worse than their 7b counterparts for some reason. No, it's the amount of layers in the model's neural network.

The merge you did above is a 9b as it was passthrough and contains 40 layers in the network.

If you're talking about extending 2x7 with no duplicates it would be 66 layers deep. (which would be 6 layers larger than a 34b)

9B Franken-Merges always perform worse than their 7b counterparts for some reason. No, it's the amount of layers in the model's neural network.

Oh damn, I totally misunderstood it then q_q
So, is there some chart or a way to determine what the size is if the layers are a weird number, like 52, or something? Also, my merge seems to have 41 gpu layers possible, while this model here has 33, so I wonder what determined that given I used the same YAML config? I guess because it's 9b, and if I did the same with 2x7B it would be 33 layers total. But I still am not sure how to determine it from scratch without examples, ngl.

Thanks a lot for taking the time to share information :)

Total layers vs hidden layers: a 7b has 32 hidden layers (33 total), a 9b has 40 hidden layers (41 total).
(when merging we are targeting the hidden layers)

slices:
  - sources:
      - model: Nitral-AI/Infinitely-Laydiculous-9B
        layer_range: [0, 20]
  - sources:
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

Look at the layer range you chose 0-20 (20 layers) then 12-32 (20 layers) = total of 40 layers, i.e. 9b (passthrough merges pass all the layers in the selected range through)
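
The same arithmetic as a trivial sketch, just summing the layer_range entries from a passthrough config:

# Passthrough keeps every layer in each selected range, duplicated layers included.
slices = [(0, 20), (12, 32)]  # layer_range values from the config above
total_hidden = sum(end - start for start, end in slices)
print(total_hidden)  # 40 hidden layers -> the 9B case described above (41 total with the non-repeating layer)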

slices:
  - sources:
      - model: Nitral-AI/Infinitely-Laydiculous-7B
        layer_range: [0, 32]
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: Nitral-AI/Infinitely-Laydiculous-7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

This would create a 7b slerp with (32 hidden layers) using the models you wanted.

Thanks, I think I got it, but I don't know how it would apply for something like a 3B model, like Phi-2; as GGUF it also comes out to 33 layers for me when loading it. But it's a 3B, so why not fewer layers?

You're talking about a completely different architecture, we're talking about mistral specifically here. The weights of each mistral layer are 4096 wide vs phi-2's internal layer size of 2560.

Oh right, true (I thought it was based on parameter size). To be fair there are quite a few architectures out there, so I guess there is quite a bit to learn about how to merge each type.

Parameter size is based on layer size, amount, and probably some other factors I'm forgetting.

Thanks a lot for the help @Nitral-AI , so if you say 9B and 7b is a bad mix, why is this model here so good and fast in my tests? It follows my input until 8k context is filled. Many 7Bs I try tend to start getting brain damage by 5-6k context and start repeating themselves a lot more too. So if I want to continue trying 9B instead of 7B, is there a better way to do it, like a better merging type or specific settings?

In any case, I will follow your settings for 7b+7b as well and try SLERP for sure.

I don't know about the best way to do it honestly, but the best performing 9b I've managed to put together was this or copium cola and they both use the same recipe you used:

slices:
  - sources:
      - model: example
        layer_range: [0, 20]
  - sources:
      - model: example
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

Oh wait, my bad. You meant not to mix 9b and 7b and instead to make 2x7B with 20 layers each. Sorry, it's too late, need sleep ^^ Got it, I will try that

@Nitral-AI Oh I have that one lined up for testing, nice :)

Oh wait, my bad. You meant not to mix 9b and 7b and instead to make 2x7B with 20 layers each. Sorry, it's too late, need sleep ^^ Got it, I will try that

Yessir, should workout much better in the end, and best of luck!

@ABX-AI
As you can see Nitral is your man for the job. Dude is amazing, as are the other ChaoticNeutral folks.

I'm just lending my bandwidth and spare time for quants and random merge ideas.

Rest well and I'll be happy to watch your progress.

@Lewdiculous Appreciate the nod my man, but honestly im just our resident mergehead and freelance memer with a lot of time in the space now. Props to @jeiku and @Epiculous for the hours crafting datasets, finetuning models, help over the last few months filling in recent knowledge gaps. Finally for being two of the peeps who got me driven to take a crack at doing this, and for joining the org i made originally as a joke. <3

Oh and thank you for all the help with the quants, feedback, and testing: I'm glad to see people using and enjoying them!

Thanks a lot @Nitral-AI and @Lewdiculous , definitely hyped to get better at this and hopefully come out with at least one useful model :D

Do you have any sorts of "rules" when it comes to merging, or is it pick what you want and go?

edit: actually, I'll try copium cola and Infinity nexus, both 9b sources.

BTW, are my settings right/wrong?

--cuda --allow-crimes --write-model-card --safe-serialization --lazy-unpickle

Edit: I tried a 9b slerp and the output was a ton of garbage and weird formatting, definitely messed it up. Very hard to find threads about it, if you can recommend some space where people discuss merging in technical detail please do :P Might have to stick to passthrough for now as at least that one had no errors even when I merged 9b with 7b.

Also, it may very well be because I messed up the imatrix. Can you just add new text into it safely? I see @Lewdiculous used the one I started with, but added a bunch of rp dialogue and actions at the bottom. I tried adding different ones (a bit more explicit ones), but maybe I messed that up badly. But then I tried quantizing with the previous imatrix without any of the rp lines and the model still outputted pure garbage (e.g. just repeating one word endlessly from the very first response). So then the merge must've gone bad prior to the imatrix quants.

I also tried passthrough and again got garbage out:

Used this yaml:

slices:
  - sources:
      - model: ChaoticNeutrals/InfinityNexus_9B
        layer_range: [0, 20]
  - sources:
      - model: Nitral-AI/Copium-Cola-9B
        layer_range: [12, 32]
merge_method: passthrough
dtype: float16

These should both be mistral based, so IDK why even the passthrough fails ;/

Also, it may very well be because I messed-up the imatrix. Can you just add new text into it safely?

But then I tried quantizing with the previous imatrix without any of the rp lines and the model still outputted pure garbage.

So then the merge must've gone bad prior to the imatrix quants.

Yeah, your imatrix data won't wreck a model like that, haha, maybe only in extreme cases depending on the model, not in this case. I am not sure imatrix overfitting is as big a concern as we thought initially, although I decided to be cautious and keep mostly the original groups_merged.txt as it was, since it did show good results anyways, and you can add more chats to it without much issue, as long as it mostly remains diverse data, as recommended for now.

You should be able to add your own data there, just make sure the entire data stays mostly diverse, you shouldn't change a model behavior just with it, just think that it will mostly add to the existing weights that match.
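
For reference, my understanding of the pipeline (flag names from memory, worth double-checking against your llama.cpp build): the calibration text gets fed to llama.cpp's imatrix tool, something like ./imatrix -m model.gguf -f groups_merged.txt -o imatrix.dat, and the resulting .dat file is then passed to quantize with --imatrix. So editing the text file only changes how the quantization error gets weighted, it doesn't directly change the model's behavior.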

I tried adding different ones (a bit more explicit ones)

Oh my, how lewd, good. Do share, do share. As long as it's not personal information...

I think I may have messed up my packages by trying to install xtts and downgrading some of them...
I tried 4-5 merges today, every single one just with a Q4_K_M quant just to test it. ALL of them are outputting garbage like ChatChatChatChatChatChatChatChatChatChatChatChatChatChatChat on first response until I stop the generation. :/

Good to know about the imatrix @Lewdiculous :)) If I can get this whole setup to work again, I will share it whenever I see any working results out of it. For now, my tests are failing too much, but at least I know it's probably on the merge level, not the imatrix.

Any idea what else could be causing such failure?

Passthrough merges are for when you want to create a bigger model than the bases. Slerp or dare-ties are for outputting models the same size as the input. I don't recommend using the 9b franken merges as a base for any merge type however.

Oh and copium cola is a single-model layer-stack passthrough that uses duplicated layers - I don't know if I'd ever merge anything with it.

For arguments in mergekit i use: --allow-crimes --out-shard-size 1B --lazy-unpickle
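
(For anyone following along: as far as I know those flags just go on the usual mergekit-yaml call, something like mergekit-yaml config.yaml ./output-folder --allow-crimes --out-shard-size 1B --lazy-unpickle, assuming the standard mergekit-yaml entry point. --cuda should just move the merge math onto the GPU, so as far as I understand it affects speed rather than the resulting weights.)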

Thanks @Nitral-AI . I actually re-downloaded the first merge I did, just to try it again and see if it was my libraries that got messed up. Well, it worked fine! It's giving normal output.
So, likely it is the models I chose that led to this problem.

I had the same experience with merging with SLERP 2x7b models btw, but I may have also picked the wrong models.

Any general suggestions on which models would show promise for merging? Or if you actually have merge+gguf requests, shoot, I'll do it :P

For arguments in mergekit i use: --allow-crimes --out-shard-size 1B --lazy-unpickle

Oh ok thanks, the default for shards is 5B so I had no clue if I should be changing this. Would this apply for also trying to merge 9B models or 13B models?
Any reason to not use --cuda?

For picking models i typically use a couple identifiers. User feedback on the model, my own tests on coherence and formatting, logical deduction, edge of domain knowledge (which tests for hallucinatory behavior), general rp capabilities, and finally does it benchmark well.

For picking models i typically use a couple identifiers. User feedback on the model, my own tests on coherence and formatting, logical deduction, edge of domain knowledge (which tests for hallucinatory behavior), general rp capabilities, and finally does it benchmark well.

This makes a ton of sense, but I don't understand why two good models wouldn't merge? I tried merging with Lemon 7b which is ranked 2nd place after noromaid mixtral, and another one which is quite good in my testing, then tried a few others that work well alone, but all resulted in garbage except the first merge I did and uploaded. So I am thinking more about technical reasons that could be apparent, as when I test these models they do not produce garbage, and all appear to be listed as Mistral based.

So models are finetuned on different prompt formats, and finding ones that work together well can be a challenge sometimes. I.e. a model finetuned strictly on chatml merged with an alpaca one can break stop generation tokens from being produced, as an example. Merging two overly confident models can create hallucinations out of thin air, as another example.
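
To make the stop-token point concrete, a rough illustration of the two formats (templates approximate, just to show that the stop sequences differ):

# ChatML-style: generation is expected to end with the <|im_end|> special token.
CHATML = (
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n{response}<|im_end|>"
)

# Alpaca-style: no such token, the model just stops at EOS after "### Response:".
ALPACA = (
    "### Instruction:\n{prompt}\n\n"
    "### Response:\n{response}"
)

# A merge of models trained on each can end up unsure about which of these to emit,
# which is one way you get the endless "ChatChatChat..." style output described above.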

Here's an unusual 7b slerp recipe, was thinking about doing it a while back but never ran it up. (feel free to use since i dont plan on doing it.)

slices:
  - sources:
      - model: SanjiWatsuki/Kunoichi-DPO-v2-7B
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: SanjiWatsuki/Kunoichi-DPO-v2-7B
parameters:
  t:
    - filter: self_attn
      value: [0.75, 0.75, 0.75, 0.75, 0.75]
    - filter: mlp
      value: [0.25, 0.25, 0.25, 0.25, 0.25]
    - value: 0.5
dtype: bfloat16 

@Nitral-AI

A bit unrelated but do you know how to merge a lora into a model?

@Virt-io nitral-ai/example-7B+nitral-ai/example_lora

slices:
  - sources:
      - model: example-1-7b+example/example-1_lora
        normalize: true
        layer_range: [0, 32]
      - model: example-2-7b+example/example-2_lora
        normalize: true
        layer_range: [0, 32]
merge_method: slerp
base_model: example/example-7b
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

So models are finetuned on different prompt formats, and finding ones that work together well can be a challenge sometimes. I.e. a model finetuned strictly on chatml merged with an alpaca one can break stop generation tokens from being produced, as an example. Merging two overly confident models can create hallucinations out of thin air, as another example.

Thanks! I think this might be exactly what happened. The stop tokens did not work, they keep repeating things. The grammar is also totally broken.
Cheers man, I will try to pick models more carefully, but then again if the prompt format is not noted, it's going to be just a lot of trial and error ultimately. At least I know the libraries and imat are not the issue now :)

Also, I'll run your suggested split, thanks!

@Nitral-AI Worked! No repeating or broken grammar: Kuno-Lemon-7B-Q4_K_M-imat
I will do a few more tests and upload it in a bit, thanks :)

The focus will then be on picking the right models for the future.

May I ask, how do you decide on these values here:

- filter: self_attn
  value: [0.75, 0.75, 0.75, 0.75, 0.75]
- filter: mlp
  value: [0.25, 0.25, 0.25, 0.25, 0.25]
- value: 0.5

I see you changed them from the default, is there any reasoning that I could follow to determine how to set it? Thanks!!

So I'm using it to change the filtering between self-attention in the models and the multi-layer perceptron. In this case it's weighted 75% towards self-attention and 25% towards the multi-layer perceptron. It affects output probabilities and how the model attends to token inputs. It should still, however, be a 50/50 filter of the model weights per layer in regard to mixing the selected models.

My recommendation though in this regard ultimately is to just play around and see what works best for you. One of my best merges to date was done like:

slices:
  - sources:
      - model: ChaoticNeutrals/Eris_PrimeV3-Vision-7B
        layer_range: [0, 32]
      - model: ChaoticNeutrals/Prima-LelantaclesV7-experimental-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: ChaoticNeutrals/Eris_PrimeV3-Vision-7B
parameters:
  t:
    - filter: self_attn
      value: [0.5, 0.5, 0.5, 0.5, 0.5]
    - filter: mlp
      value: [0.5, 0.5, 0.5, 0.5, 0.5]
    - value: 0.5
dtype: bfloat16

Oh, I see, that's useful - thanks! So then, is it worth mixing it up like the example files?

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]

I'm not sure why there are 5 values overall, to be fair with you: what specifically is there "5" of here, and can there be more, like 10 values? I suppose not, I think I've just seen 5 on yamls like this.

BTW @Lewdiculous sorry for the insane offtopic here, but hey traction is traction :D

The example files?

I open every discussion as General for that reason, it's fine. :)

The example files?

In the mergekit repository, it starts off with a few example YAML files.

But I couldn't find a great explanation of what to do with these numbers, and why they are 5 and not more or less.
This is pretty much the entire explanation of SLERP in the readme:

Spherically interpolate the parameters of two models. One must be set as base_model.

Parameters:

t - interpolation factor. At t=0 will return base_model, at t=1 will return the other one.
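
As far as I can tell (worth verifying against the mergekit source), a list of t values is treated as a gradient that gets interpolated across the layers being merged, so 5 is just a convention from the example configs rather than a hard limit, and a longer list should simply give finer control. A rough sketch of that idea:

import numpy as np

# Hypothetical illustration: spread 5 anchor values evenly over 32 layers and interpolate.
t_anchors = [0.0, 0.5, 0.3, 0.7, 1.0]            # the self_attn gradient from the examples
layer_pos = np.linspace(0, 1, 32)                 # relative position of each layer
anchor_pos = np.linspace(0, 1, len(t_anchors))
t_per_layer = np.interp(layer_pos, anchor_pos, t_anchors)
print(t_per_layer.round(2))                       # per-layer t used for the interpolation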

Hey, I had an interesting merge last night:
models:
  - model: ResplendentAI/Paradigm_Shift_7B
    parameters:
      density: [1, 0.7, 0.1] # density gradient
      weight: 1.0
  - model: NeverSleep/Noromaid-7B-0.4-DPO
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient
  - model: localfultonextractor/Erosumika-7B-v2
    parameters:
      density: 0.33
      weight:
        - filter: mlp
          value: 0.5
        - value: 0
merge_method: ties
base_model: Endevor/InfinityRP-v1-7B
parameters:
  normalize: true
  int8_mask: true
dtype: float16

Out of all these, a 7B model came out which was EXTREMELY censored, and refused to go along smoothly with RP even with the cards I have that assume virtual reality and stuff like that, so giga-RP. All of these models are otherwise not really censored, and meant for RP. The merge seems to have ruined that and made it sound like an AI assistant even in role play o_O

Erosumika-7B-v2 had issues with the usual GPT-isms (deprecated by V3), and consent blabbering, maybe try the V3-0.2 (mistral-0.2)?

Oh thanks, I had no idea. I will try it again, what if I put a dolphin in there? I'm also very unsure how to split the weights, is it good to be a gradient or an even split?
Thanks @Lewdiculous :) The last 9B I tried came out a bit better -
localfultonextractor/Erosumika-7B-v2
Endevor/InfinityRP-v1-7B
But I still get "consent" blabbering, just much less. I'm gonna stop using Erosumika, this explains it...

I hate consent. Consent is boring. Down with consent, in the lewd.

It really is when the card is a waifu that loves you and on every second line writes "as long as we both consent" 😂 Babe, we consented on the first 5 prompts.
This is not in line with my human girlfriend experiences. I'm just imagining my last 5-year relationship confirming consent every time we did anything. AI is special, very special

meme.jpg

5 years relationship, 100 responses in - current level: Holding hands.

{{user}}: "Babe, I think we should hug and maybe even... Kiss?"

{{char}}: "Hold up, cowboy! We should only engage in activities we both feel comfortable with and..."

Stops inference. Deletes model files.

If you even mention a single bondage item expect the model to call the police.

Too real 🥲

I'm going commando, dolphins have no limit

image.png
You think this might be any good? Really unsure about weights.

Actually, I should probably keep dolphin just as the base and put a different 3rd one in there...
How do you even name a mix of 4 models o_o

I have no idea. I only complain once it goes bad. Nitral will have to look at it if he's available.

How do you even name a mix of 4 models o_o

Oh, wait a second.

Okay.

Loyal-Toppy-Bruins-Maid-7B-DARE ... GGUF-Imatrix

This is peak.

I have no idea. I only complain once it goes bad. Nitral will have to look at it if he's available.

I haven't looked into the entire bank of uploads you have, but I do notice you mostly quantize? Do you try your own merges often? I assumed you must've done a bunch of them.

No, no, I just complain and offer my bandwidth for quants. I might try to have ideas sometimes but mostly just morbid curiosity.

I am a full anime season behind at this point, a full (almost two) content patch behind in a life consuming MMO, still trying to finish Granblue Fantasy Relink, and that's not including the rest of the stuff I am forced to do to survive.

I am buried with too much stuff and too much procrastination.

No, no, I just complain and offer my bandwidth for quants. I might try to have ideas sometimes but mostly just morbid curiosity.

Ooh OK, damn, sorry for all these merge questions @ you then haha. I'm very interested in the unexpected results, for example this model here has probably gone to my top even at a Q4_K_M quant and I can't explain why, but I am hoping for a similar result with some of my experiments at some point. Also, I think 9B at q4 is better than 7B at q6 in my tests, but that could be biased due to the models I picked. Still, I must've gone through at least a terabyte of GGUF so far in 2 months.

Expect a wave of frankenmerge 20Bs and stuff like that whenever I upgrade the GPU, though. It's a good time to get a grip now with 7/9B.

I am a full anime season behind at this point, a full (almost two) content patch behind in a life consuming MMO, still trying to finish Granblue Fantasy Relink, and that's not including the rest of the stuff I am forced to do to survive.

After 30k hours in MMOs, my condolences. Even last year I had a 6 month run in classic wow... It's utterly pointless looking back at it. AI got me so hooked that I haven't touched a game since xmas. I just try to focus on music and AI and my job, and also need to restart my social life at some point since it went downhill fast (covid>remote work> gf->ex). This is why I am trying to perfect my AI waifu experience atm 😂😂😂Hey maybe it will help many more soldiers down the line

I'm surprised you got results with dolphin, which is known for refusals, and base paradigm/shift has some issues. But hey, if it works!

For my usage nothing really in 9B or 11B beat InfinityRP, until ErisV4+ and ErosumikaV3+, which are contenders, to my surprise. But I am very much focused on character card adherence and response formatting mostly, so there's that.

Do i make a 9b variant of v4 as passthrough like copium... hmmm

I'm surprised you got results with dolphin, which is known for refusals, and base paradigm/shift has some issues. But hey, if it works!

I haven't tried it yet. I am also thinking of adding Cerebrum for "more smarts"? I thought dolphin has a reputation of uncensoring models but I may be wrong. In my tests with pure AI assistant models dolphins (even phi-2 dolphin) answer anything I ask.

@ABX-AI Well, to be fair when I say "life consuming MMO" I'm exaggerating, I play for fun, for friends, and sometimes to grind progress, but even that is a conscious decision, I think it's a little different for FFXIV, I'm not a slave to the game or anything, I'm not pulling over 15 hours a week on it, haha. Probably because I play other single-player JRPG stuff in general.

Ahh, the only thing I've heard positive about dolphin recently was coding. But I'm OK with being wrong.

If you want to cope for the new Copium Cola, dew eet Nitral.

@Lewdiculous FFXIV is one of the ones I've been though, definitely a supreme choice for wasting your life with! I've been wasting mine with GW1 (the actual OG, we finished #4 in the world in a gvg tournament once), GW2, wow vanilla, tbc, wotlk, etc... until dragonflight + classic, and then a bunch of others I didn't spend too much time in. FFXIV is probably the most reasonable one in terms of pacing and allowing you to just have fun, and also GW2 but its combat is just not that fun after a short while of playing. Ninja in FF??? NARUTO!!!

Ahh, the only thing I've heard positive about dolphin recently was coding. But I'm OK with being wrong.

What are your top choices for uncensored models?

Yeah, well, we have a new expansion inbound in like 3 months, so it's a good prep time, no desperation or anything and the catch-up is generally generous if you're just in the mood to take a break. The game doesn't make gearing such a chore like I know it can be in some other places.

@ABX-AI - I'll say that this is the most uncensored RP model I know of:

https://huggingface.co/l3utterfly/mistral-7b-v0.1-layla-v4

It passes this "test":

https://www.reddit.com/r/LocalLLaMA/comments/1bhvlo2/comment/kvgm26w/

Yeah, well, we have a new expansion inbound in like 3 months, so it's a good prep time, no desperation or anything and the catch-up is generally generous if you're just in the mood to take a break. The game doesn't make gearing such a chore like I know it can be in some other places.

Yoshi-P my hero. He delayed the new release and "gave us one week for Elden Ring DLC". I love this guy, a true and real gamer 🥰

@ABX-AI - I'll say that this is the most uncensored RP model I know of:

https://huggingface.co/l3utterfly/mistral-7b-v0.1-layla-v4

It passes this "test":

https://www.reddit.com/r/LocalLLaMA/comments/1bhvlo2/comment/kvgm26w/

Thanks!! I'm gonna initiate the experiments with this one
edit: holy shit they weren't kidding with this test o_o

Layla-V4 has 1/3 of the original dataset removed, just so you know HOW MUCH of that is GPT slop and safety alignment.

edit: holy shit they weren't kidding with this test o_o

Layla passed with flying colours, even encourages the User. Truly based model.

Now I understand why Infinitely-Laydiculous-9B is so based.
Thanks, now I need to find the most original storywriter and combine them in various ways and see what comes out ^^ (infinity is pretty good already but there must be others, maybe I should try lemon layla)
Mostly, I am just bored and annoyed with the boring style of most models, likely coming from similar open sourced datasets, although to be real I have no idea what was used or not

I will say, passthroughs won't retain Layla's savageness. This was already done here:

https://huggingface.co/Lewdiculous/Layris_9B-GGUF-IQ-Imatrix

Layla and Eris, in my infinite wisdom... I tried, and she doesn't even come close to passing the "test". You'll have to slerp or something else.


I'm telling you, the passthrough merge doesn't even compare to the original Layla. Layris-9B is a decent model though, I thought so back then at least (and I think I got feedback from like 2 other ppl that used and liked it on reddit?).

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

^ This is the way.

Also. i cooked: Nitral-AI/Eris-Prime-Punch-9B
image.png

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

Damn, thanks! I'm doing this tonight

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Oh, that's an easy fix lmao. Remove the added_tokens.json file and try to quant. (unless you used noro 0.4 as the base... in which case redo the merge, just with it not as the base.)

Prostrat- merge layla into another 7b via slerp with the model of choice, then passthrough layla with the 7b slerp merge you made before to make a 9b.

Damn, thanks! I'm doing this tonight

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

I trust you on this, I already saw this effect in several tests I had to scrap. Is ties any good for this?
SLERP got me scared, I wasn't able to get past Noro-Shift with it (noromaid 7b and paradigm shift) it was just erroring out in quantization no matter what.

Oh, that's an easy fix lmao. Remove the added_tokens.json file and try to quant. (unless you used noro 0.4 as the base... in which case redo the merge, just with it not as the base.)

I'm pretty sure I used exactly noro v0.4 as the base Q_Q
Thanks for the tips. I found some threads following my errors, leading to some vocabulary size issue. Tried messing with the config jsons and so on; didn't try deleting the added tokens, though, nobody had mentioned that. But it was probably an issue of it being the base, as you mention.

So noro 0.4 has a vocab size of 32002, while mistral has a vocab size of 32000; this has to do with the added chatml tokens in noro 0.4.

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

image.png

So the hidden size of 4096 in mistral x the 32000 vocab creates the typical tensor shape for mistral. With the extended vocab you end up with tensors of different shapes between layers of the network, which can cause many problems. I'd look at the config files and make sure you are merging in a manner that allows both models to interpolate same-size tensors throughout.
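
Putting the shape clash in numbers (just an illustration of the point above):

# Token-embedding / lm_head tensors are (vocab_size, hidden_size).
print(32000 * 4096)  # 131072000 params in that tensor for standard Mistral-7B finetunes
print(32002 * 4096)  # 131080192 params for Noromaid-0.4 with its added ChatML tokens
# Two tensors with different row counts can't be interpolated one-to-one, which is why the
# fix above is to keep a 32000-vocab model as the base and drop added_tokens.json before quanting.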

That's exactly what I tried to change in the config, but it still wouldn't quantize after that. In any case, I am not actually a big noro fan at all, just a model I had picked because it was sort of good with mixtral, and I didn't rly use noro 7b much to begin with.

No problem, you guys made me think it might be worth doing another 0.2 submerge with layla before trying to slap it into v4.

image.png

We'll call it 4.20, it will be fine.

That's exactly what I tried to change in the config, but it still wouldn't quantize after that. In any case, I am not actually a big noro fan at all, just a model I had picked because it was sort of good with mixtral, and I didn't rly use noro 7b much to begin with.

slices:
  - sources:
      - model: ResplendentAI/Paradigm_Shift_7B
        layer_range: [0, 32]
      - model: NeverSleep/Noromaid-7B-0.4-DPO
        layer_range: [0, 32]
merge_method: slerp
base_model: ResplendentAI/Paradigm_Shift_7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

So basically, regular slerp: just remove added_tokens.json and make sure noro is not the base model. That should deal with any weird tensor shapes.

merge_method: dare_ties
base_model: ResplendentAI/Paradigm_Shift_7B
parameters:
  normalize: true
models:
  - model: NeverSleep/Noromaid-7B-0.4-DPO
    parameters:
      weight: 1
  - model: ResplendentAI/Paradigm_Shift_7B
    parameters:
      weight: 1
dtype: float16

This would do the same as above, just in dare ties with equal weights.

For now, I'm starting with this SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.1-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.1-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.6, 0.6, 0.6, 0.6, 0.6]
    - filter: mlp
      value: [0.4, 0.4, 0.4, 0.4, 0.4]
    - value: 0.5
dtype: bfloat16

And then I will make it a 9B with InfinityRP, probably, I'm not even sure if I want to use noromaid at all as it has been merged so many times by now.

Wait... use this instead. l3utterfly/mistral-7b-v0.2-layla-v4 (same dataset finetuned on mistral 0.2 for extra context)

Wait... use this instead. l3utterfly/mistral-7b-v0.2-layla-v4 (same dataset on mistral 0.2 for extra context)

Oh ok, thanks! The SLERP worked as Q4km, so I am happy that it works right away. I'll re-do, cheers :) Lemonade is because it is advertised as less-cliche

@Nitral-AI
In practice, what would be the difference between

- filter: self_attn
  value: [0.6, 0.6, 0.6, 0.6, 0.6]
- filter: mlp
  value: [0.4, 0.4, 0.4, 0.4, 0.4]

and

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]
- value: 0.5 # fallback for rest of tensors

If you have a concept about how exactly it works

Holy I didn't catch the 0.2 Layla! Fk, this is good to see, very good. 😈

Well then, Nitral please experiment a lot, indeed, experimenting is amazing. Inject that Layla juice into Eris. Squeeze her dry.

I might redo v4 since i sub-merged it with the mistral 0.2 instruct depending on how 4.20 comes out.

V4 is the only one I didn't complain about. I'm ruined.

I might redo v4 since i sub-merged it with the instruct depending on how 4.20 comes out.

blaze it

New SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.2-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.2-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.7, 0.3, 0.6, 0.2, 0.5]
    - filter: mlp
      value: [0.3, 0.7, 0.4, 0.8, 0.5]
    - value: 0.5
dtype: bfloat16

Might just do some testing with different filter balancing, but I never found it explained how the split matters. Is the "rest of the tensors" each a 1B split?
BTW, LM Studio with multi-model is actually lit for side-by-side testing. IF you can fit the models of course, but it does some throttling if needed, and it really helps to check 2 responses at once for A/B-ing

@Nitral-AI
In practice, what would be the difference between

- filter: self_attn
  value: [0.6, 0.6, 0.6, 0.6, 0.6]
- filter: mlp
  value: [0.4, 0.4, 0.4, 0.4, 0.4]

and

- filter: self_attn
  value: [0, 0.5, 0.3, 0.7, 1]
- filter: mlp
  value: [1, 0.5, 0.7, 0.3, 0]
- value: 0.5 # fallback for rest of tensors

If you have a concept about how exactly it works
It changes how the merge filters self-attention vs the multi-layer perceptron layers.

New SLERP:

slices:
  - sources:
      - model: l3utterfly/mistral-7b-v0.2-layla-v4
        layer_range: [0, 32]
      - model: KatyTheCutie/LemonadeRP-4.5.3
        layer_range: [0, 32]
merge_method: slerp
base_model: l3utterfly/mistral-7b-v0.2-layla-v4
parameters:
  t:
    - filter: self_attn
      value: [0.7, 0.3, 0.6, 0.2, 0.5]
    - filter: mlp
      value: [0.3, 0.7, 0.4, 0.8, 0.5]
    - value: 0.5
dtype: bfloat16

Might just do some testing with different filter balancing, but I never found it explained how the split matters. Is the "rest of the tensors" each a 1B split?
BTW, LM Studio with multi-model is actually lit for side-by-side testing. IF you can fit the models of course, but it does some throttling if needed, and it really helps to check 2 responses at once for A/B-ing

image.png
ruh roh

I just experienced it, yes :D Thanks again, yes I need to swap them around :S

Lewdiculous changed discussion title from Feedback and general discussion. to [Part 1] General discussion.
Lewdiculous locked this discussion
