General discussion.

#1
by Lewdiculous - opened
    quantization_options = [
        "Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M", 
        "Q5_K_S", "Q6_K", "Q8_0", "IQ3_M", "IQ3_S", "IQ3_XS", "IQ3_XXS"
    ]

Open feedback and discussion thread.
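For anyone curious how a list like the quantization_options above typically gets used, here is a minimal sketch, assuming llama.cpp's quantize tool (later renamed llama-quantize) and placeholder file names; the actual script may differ.

    # Minimal sketch (not the exact script): how a list like the one above
    # might drive llama.cpp's quantize tool. Binary and file names are
    # placeholders/assumptions.
    import subprocess

    quantization_options = [
        "Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M",
        "Q5_K_S", "Q6_K", "Q8_0", "IQ3_M", "IQ3_S", "IQ3_XS", "IQ3_XXS"
    ]

    model_f16 = "model-f16.gguf"   # converted FP16 GGUF (placeholder name)
    imatrix_file = "imatrix.dat"   # importance matrix (see the imatrix sketch further down)

    for quant in quantization_options:
        out_file = f"model-{quant}-imat.gguf"
        # quantize accepts: [--imatrix file] input.gguf output.gguf TYPE
        subprocess.run(
            ["./quantize", "--imatrix", imatrix_file, model_f16, out_file, quant],
            check=True,
        )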

Lewdiculous pinned discussion

@Lewdiculous

I was wondering if you could quantize my merge https://huggingface.co/Virt-io/FuseChat-Kunoichi-10.7B

My upload speed is painfully slow.

@Virt-io – I can do that in the evening.

:3

It will be under https://huggingface.co/Lewdiculous/FuseChat-Kunoichi-10.7B-GGUF-IQ-Imatrix

Also, cool merge!

@Virt-io – By the way, can you please add a waifu image to your model page?

It is tradition after all.

Doesn't have to be anything in particular; just a generated image, like on this page for example, is enough.

@Lewdiculous Sure

Edit: Image has been added

@Virt-io Q8_0 should be there in a few minutes, if any other quant is needed let me know.

@jeiku – I wasn't able to test for intelligence, but anecdotally it's been writing well, and the idea of merging in Layla for her absolute lack of refusals and disregard for alignment correctives seems to have worked: Layris still retained a high level of compliance with any kind of user request.

Who knows, maybe the numbers aren't even that much worse. :'3

What context was it trained on?
I tried this model today, and it's been really fun. I'm surprised by the quality of the responses given the size. I also went into an [ooc] talk with my character card and had an interesting AI-assistant discussion too - this is at 8k context.

@zebrox - I'm not the author, but I'd say 8192 is, as always, a safe context size to use.

@l3utterfly - Speaking of Layla-V4, how well does your model handle larger contexts, in your experience? -- And if I may say so: based unhinged model, more of this of course.

But it is a merge of Layla-V4 and Eris_Remix (links below), so maybe that gives some insight.

@Lewdiculous - Thanks, I normally try to dig into the merges as well, but in many cases it isn't mentioned. In any case, this model seems to do well with my "broken code" character, which is instructed to replace letters and do weird things with text overall. The same card with a normal 7B doesn't really work that well at all.

@zebrox - This is good to hear. The idea behind it was slightly better "intelligence" from the slightly higher parameter count and the high benchmark performance from the Eris merge used, which seems to have worked, while also retaining the intended lack of refusals from the base Layla-V4 model.

Depending on the model you tested, improvements to quant PPL (perplexity) could also be behind it. It's still a work in progress, but the diverse Imatrix calibration data seems to make a measurable improvement, potentially almost a full "tier" up in quantization quality with the right data.
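For context on what the Imatrix data means in practice, here is a rough sketch of the importance-matrix step that precedes quantization, assuming llama.cpp's imatrix example and placeholder file names; the flags mirror that tool's usage rather than the exact workflow used for these uploads.

    # Hedged sketch of generating an importance matrix with llama.cpp's
    # imatrix tool; file names are placeholders.
    import subprocess

    subprocess.run(
        [
            "./imatrix",
            "-m", "model-f16.gguf",        # FP16 GGUF to profile
            "-f", "calibration-data.txt",  # the "diverse Imatrix data" mentioned above
            "-o", "imatrix.dat",           # output importance matrix
        ],
        check=True,
    )

The resulting imatrix.dat is then passed to quantize via --imatrix, which is where the smaller IQ quants get their PPL improvement, per the discussion above.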

7B Layla is here with the same imatrix calibration data if you want to test head to head:

https://huggingface.co/Lewdiculous/mistral-7b-v0.1-layla-v4-GGUF-IQ-Imatrix

7B Eris_Remix:

https://huggingface.co/Lewdiculous/Eris_Remix_7B-GGUF-IQ-Imatrix


There were way too many merges involved in arriving at Eris_Remix, I'll say that at least! On that front, it is not a long-context model. xD

@Lewdiculous Thanks!!! I will certainly try this one out as well, downloading it now :)
BTW, I'm not sure if this is a dumb question, but could a model like this be improved via similar techniques: https://huggingface.co/Severian/Nexus-4x7B-IKM-GGUF ? I tried it out of curiosity and it follows my fake code instructions really well. Maybe it could be a good intelligent lewd base, or something like that.

https://huggingface.co/Lewdiculous/mistral-7b-v0.1-layla-v4-GGUF-IQ-Imatrix - Tried this; so far I think it's a bit worse than Layris 9B (it starts OK but later on sort of drops the writing style), and I would probably stick with Layris if I had to choose. However, Mistral Layla runs lightning fast, and it's also not the best comparison unless I spend time tweaking the instruction sets and possibly the samplers.

but could a model like this be improved via similar techniques

Of course, you can make new quants using Imatrix data, and they are likely to be an improvement compared to regular GGUF quants. I just don't do it because of my hardware access limitations, but you can check the GGUF-Imatrix-Quantization script on @FantasiaFoundry 's profile if you have an NVIDIA GPU.

Tried this, so far I think it's a bit worse than Layris9B

Try Eris_Remix_7B, this should be a much better comparison to Layris.

@Lewdiculous - thank you, downloading it now ^^
Is it feasible to do these merges without serious hardware, btw? So far, I've only been trying out models for a month. I haven't done any quants or used any imatrices, just because I assumed I would need some serious HW (I'm on an RTX 3070 and a Ryzen 5900X).

Is it feasible to do these merges without serious hardware, btw?

I think @jeiku and @Test157t might know better on this front. For quants your GPU is good, as long as you're not doing big models...

I do all my merges for free in Colab, but most merges can be done in RAM in under 16GB. You can also split the load in mergekit between devices and offload to the GPU, which should work in 12GB of VRAM. @Lewdiculous @zebrox
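A minimal sketch of what that looks like with mergekit's CLI, assuming a hypothetical config file; the --cuda flag is what offloads the tensor math to the GPU, otherwise everything stays in system RAM.

    # Hypothetical mergekit invocation; config contents and names are placeholders.
    import subprocess

    config = "merge-config.yml"   # merge recipe: models, merge_method, base_model, etc.
    output_dir = "merged-model"
    use_gpu = False               # False: merge entirely in system RAM

    cmd = ["mergekit-yaml", config, output_dir]
    if use_gpu:
        cmd.append("--cuda")      # offload tensor arithmetic to VRAM (~12GB, per the note above)
    subprocess.run(cmd, check=True)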

I'm a local bastard.

I prefer local as well, but I have bad internet. I can upload merges in FP16 from Colab to HF in about a minute. It's not like I'm using Colab for inference.

(For reference, it takes several hours to upload a 5bpw/GPTQ quant from my PC.)

@Test157t

Do you think we can make a Colab for my dirty script?

https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script

I could look into it for sure, @Lewdiculous . Might not be able to do everything from the free-tier Colab. (I forget the HW requirements for llama.cpp quants.)

That would be pretty handy.

Is it feasible to do these merges without serious hardware, btw?

The only thing that really consumes VRAM is LoRA training/DPO finetuning, but I do all of mine via Colab Pro, using a T4 for LoRA training and an A100 40GB for DPO tuning. Merges are relatively lightweight by comparison and can run in RAM. I do mine in Colab with a High-RAM instance since I'm often running 2-6 models through the grinder.
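Not the actual setup described above, but for a sense of why merging is light while LoRA/DPO is the VRAM-heavy part: only the small adapter weights need gradients, while the base model still has to sit in memory for the forward pass. A generic PEFT sketch with illustrative hyperparameters (assumes transformers, peft, and bitsandbytes are installed):

    # Generic QLoRA-style adapter setup (illustrative, not the author's config).
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",   # placeholder base model
        load_in_4bit=True,             # 4-bit loading to fit a T4-class GPU
        device_map="auto",
    )

    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # only the adapters train; the rest stays frozen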

@Test157t

(I forget the HW requirements for llama.cpp quants.)

Hardware-wise it's not too demanding, just enough to load the model really. Disk space is what I'm worried about; for quants I'm usually left with about 50-70GB of data to upload myself. It's not a big issue since I'm storing locally, but I don't know how that works on Colab.
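As a rough sanity check on that 50-70GB figure, summing approximate bits-per-weight for the quant types listed at the top of the thread (the bpw values are ballpark assumptions, not measured sizes):

    # Back-of-envelope disk usage for a full set of quants of a ~10.7B model.
    params_b = 10.7e9
    bpw = {  # approximate effective bits per weight (rough assumptions)
        "IQ3_XXS": 3.1, "IQ3_XS": 3.3, "IQ3_S": 3.4, "IQ3_M": 3.7,
        "IQ4_XS": 4.3, "Q4_K_S": 4.6, "Q4_K_M": 4.8,
        "Q5_K_S": 5.5, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
    }
    total_gb = sum(params_b * bits / 8 / 1e9 for bits in bpw.values())
    print(f"~{total_gb:.0f} GB across {len(bpw)} quants")  # ~72GB at 10.7B, ~47GB at 7B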
