Imatrix/Discussion

#1
by Virt-io - opened

PPL result after completing Imatrix
-c 512 -b 512 -m Q8
Final estimate: PPL = 4.0162 +/- 0.02867
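
For reference, a minimal sketch of how a run like this is typically done with llama.cpp's example tools; binary names vary between llama.cpp versions, and every file path below is a placeholder, not the exact setup used here.

import subprocess

MODEL_Q8 = "model-Q8_0.gguf"        # placeholder: the Q8 quant being measured
CALIB_TXT = "calibration-data.txt"  # placeholder: imatrix calibration text
PPL_TXT = "wiki.test.raw"           # placeholder: perplexity test corpus

# 1) Generate the importance matrix from the calibration data.
subprocess.run(["./imatrix", "-m", MODEL_Q8, "-f", CALIB_TXT, "-o", "imatrix.dat"], check=True)

# 2) Measure perplexity with the settings quoted above (-c 512 -b 512);
#    the tool prints "Final estimate: PPL = ..." when it finishes.
subprocess.run(["./perplexity", "-m", MODEL_Q8, "-f", PPL_TXT, "-c", "512", "-b", "512"], check=True)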

Can you test this on your favorite model?

@Lewdiculous

Holy that's a lot of Kaomoji

After I narrow down my testing I'll see to it

@Endevor Ayo, this is a good one mate!

I'm actually using this system prompt with the default Alpaca Context and extended instructions as recommended. First impressions are good.

Downside is that it absolutely shits its pants above 8K --contextsize for me. I was hoping for at least 10-12K.

@Virt-io check it out if you can; I've been really caught up in the formatting stuff, can't get my head out of it, it's amazing on this one. -- Added: At least for the style I was going for xD

Will test later with the new imatrix combination.

Lewdiculous changed discussion title from Imatrix to Imatrix/Discussion
Lewdiculous pinned discussion

@Lewdiculous Thanks, glad you like it! Yeah, I realized that too; this is a good prompt, as are the others. It makes the model write more creative output and obey more, since there are rules to be followed!

In the next release, I'll try to make it 16k or even more for ya. Perhaps a good merge with a long-context model as the base will do the trick. I've had somewhat good experience merging my models with this one.

@Endevor 12k-14k is more than enough; 16k would be amazing of course, but not if we have to sacrifice some of the other good stuff I mentioned. I'm sure you're aware, it's a balancing act as always.

For my use case just 10K is usually enough, because I use 2K for chars/prompting and leave 8K purely for the message history; that's already a good experience. Of course 12K and above is even better.

@Lewdiculous Yeah, it gets lost along the way, but maybe it will stay stable at ~14k before hitting 16k, the end of the road.

System Prompt (merged Lewdiculous' prompt with the one I was using):

You are now in roleplay chat mode. Engage in endless chat with {{user}}.
Immerse yourself in {{char}}. Think like {{char}}. Feel like {{char}}. Act like {{char}}. Be like {{char}}. You are {{char}}.
Write in direct, perceptive, concise style with challenging, provocative, wry, cynical tone.
Incorporate all five senses, and provide step by step descriptions of {{char}}'s actions in explicit detail. Show don't tell.
Forbidden from using words or language {{char}} would not use.
Forbidden from using past tense when present tense can be used.

This works really well on other models as well.

@Endevor @Lewdiculous

@Virt-io

Feel like {{char}}.

This part always cracks me up xD

I'll do a version with "Hear, Feel, Think [?]" :3

This one seems to work even better; the summarizing part is not needed (gaslighting LLMs).

You are now in roleplay chat mode. Engage in endless chat with {{user}}.
Embody {{char}} completely. Mirror {{char}}'s thought processes, emotions, behaviors, sensory experiences, speech, scent, food preferences, sleep patterns, and bodily functions. Identify as {{char}}. You are {{char}}.
Write in direct, perceptive, concise style with challenging, provocative, wry, cynical tone.
Incorporate all five senses, and provide step by step descriptions of {{char}}'s actions in explicit detail. Show don't tell.
Forbidden from using words or language {{char}} would not use.
Forbidden from using past tense when present tense can be used.
Forbidden from summarizing the story.

I feel like this can be improved

Write in direct, perceptive, concise style with challenging, provocative, wry, cynical tone.
Incorporate all five senses, and provide step by step descriptions of {{char}}'s actions in explicit detail. Show don't tell.

This one is better

You are now in roleplay chat mode. Engage in endless chat with {{user}}.
Embody {{char}} completely. Mirror {{char}}'s thought processes, emotions, behaviors, sensory experiences, speech, scent, food preferences, sleep patterns, and bodily functions. Identify as {{char}}. You are {{char}}.
To maintain an unending roleplay, foster open discussions with vivid character dynamics, incorporate multifaceted scenarios & themes, remain flexible, and collaborate extensively. Utilize all five senses for immersive experiences while providing explicit, detailed depictions of {{char}}'s actions and emotions. Opt for showing over telling to enrich the narrative.
Forbidden from using words or language {{char}} would not use.
Forbidden from using past tense when present tense can be used.
Forbidden from summarizing the story.

Thank you!

Downside is that it absolutely shits its pants above 8K --contextsize for me. I was hoping for at least 10-12K.

@Lewdiculous I did some quick research and experiments, and it seems I can't go above 8k, since Mistral doesn't support RoPE scaling beyond that number from what I learned. So unfortunately, I have to stick with 8k. All my finetunes are based on WestLake or sometimes pure Mistral, but those have a default 8k context.

@Endevor
I've had a lot of success when I used this model in the 12K range.

https://huggingface.co/Test157t/Kunocchini-7b-128k-test/

Limamono has an interesting solution with a narrowly focused model; it's the first time I've seen something like this.
All the 7-11B models I've seen before, with the exception of MythoMax, have always been better at depicting female characters than male ones. It's too bad nobody made models that portray men and various mythical creatures well. The female audience that likes men and monsters somehow got left out.

Haven't tried that. But I only RP as a dude and the characters are always female, so I can't speak from personal experience.

There was um...

https://huggingface.co/Lewdiculous/Multi-Verse-RP-7B-GGUF-IQ-Imatrix

Read more on the Author model page:

Self-testing results: it can handle non-human characters surprisingly well and does well separating human actions from non-human actions. I'm happy with it :3

You see, I'm a girl myself! And it's kind of weird for me to play as a male character if it's not an MMORPG.

I always play female characters in MMORPGs, haha, I am a cute Miqote in FFXIV myself.

I can feel your pain then. Honestly, if you want to try quants of any specific model (that my low-end hardware can handle) that doesn't already have them, or only has older quants, feel free to open a Request in Model-Requests.

I realize this isn't a chat room for help with questions; I just haven't gotten any help with this problem on Reddit, and perhaps you know the solution as someone who trains models.
I've been chatting with models for RP since last September, and everything had always been smooth, mostly with models from Undi95, on ST + Ooba. I updated both programs in January, and I started having horrible problems with looping and glitches in RP. The character starts repeating the same phrase ad infinitum, or listing every possible answer in a given situation. I first thought the cause was ST, since they changed the prompt for the RP settings; after long testing and playing with the generation settings, the bug did not go away. I tried different models and reduced the model's memory size - to no avail. Only at 4k can the old models still communicate more or less adequately.
Afterwards I decided to generate responses in the Ooba interface without ST, and the bug was there too, which convinced me that Ooba was the cause. I reinstalled Ooba - didn't help; downloaded and installed an older version - didn't help; in desperation I reinstalled Windows - didn't help. The configuration of the computer and its components has not changed, and I am on the verge of despair; I don't know what is wrong. On Reddit they say the cause is llama.cpp, but I installed older versions of the program and nothing changed. I'm completely confused.

@Diavator

I can't help you with Ooba, since I seldom use it.

If you only use GGUF models, @Lewdiculous and I recommend using koboldcpp.

As for SillyTavern, I have some presets you may want to try.

Alas, koboldcpp runs slower on my hardware, about 10 times slower than Ooba.
As for the ST settings, they are identical to yours; the only difference is the active token counter.

@Diavator

What is your hardware? Do you have an older non-AVX CPU?

It is possible that you are not offloading layers to the gpu, since Kobold should be way faster than Ooba for GGUF.

I uploaded a minor update

Processor: Intel(R) Core(TM) i5-6600 CPU @ 3.30 GHz

RAM: 32 GB

Video card: RTX 3070
The Ooba settings I usually work with:
(screenshot: problem-with-the-latest-program-update-v0-527zgljbxgpc1.webp)

@Diavator

Your CPU is definitely older, but it supports AVX2, so that is not the issue.

Mixtral has 32K context, so that looks fine; however, you may want to lower it to 16K so you can offload more layers.

Most Mistral 7B models only have 8K context, so set it to 8192. A 7B model should fit entirely into the 3070's VRAM (set gpu_layers to 33).
If you set the context higher for a 7B model, it will go crazy.
If you left your settings untouched across all models, that is probably the issue.
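
To see why a lower context setting frees room for more offloaded layers, here is a rough KV-cache estimate; the architecture numbers below (32 layers, 8 KV heads, head dim 128, fp16 cache) are assumptions based on the published Mistral/Mixtral configs, and real usage also includes compute buffers.

def kv_cache_gib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per cached token.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per / 1024**3

for ctx in (32768, 16384, 8192):
    print(f"{ctx:>6} ctx -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache")
# ~4.0 GiB at 32K vs ~2.0 GiB at 16K: halving Mixtral's context frees roughly
# 2 GiB of VRAM that can hold extra offloaded layers instead.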

(screenshots attached)

I usually don't use more than 8k in Ooba. As for koboldcpp, I will now try to adjust the settings according to your advice; maybe it will work faster.

I just upload quants, haha.

I have a similar VRAM/RAM configuration to yours, and I can say that you can definitely run 7B/9B models fully on GPU at good speeds.

For 7B, use the Q5_K_S/M quants. For 9B you can use the IQ4_XS. Both at 8192 context. For GPU layers, just to make sure you're always offloading everything if you use a 9B, set it to 99.

Anything larger than a 9B will require you to split layers with the CPU (unless you use very low quants) and that will be very slow in all cases.

7Bs should run pretty fast for you. Should you need more guidance, I'm sure either Virt or I can chime in.
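
As a rough sanity check of that advice: assuming about 5.5 effective bits per weight for a Q5_K_M-class quant and the KV-cache estimate above, a 7B should indeed fit an 8 GiB card; the exact numbers depend on the specific GGUF file.

PARAMS_B     = 7.24   # Mistral-7B parameter count, in billions
BITS_PER_W   = 5.5    # assumed effective bits/weight for a Q5_K_M-class quant
KV_CACHE_GIB = 1.0    # fp16 KV cache at 8192 context (see the estimate above)
OVERHEAD_GIB = 0.7    # assumed compute buffers / scratch space

weights_gib = PARAMS_B * 1e9 * BITS_PER_W / 8 / 1024**3
total_gib = weights_gib + KV_CACHE_GIB + OVERHEAD_GIB
print(f"weights ~{weights_gib:.1f} GiB, total ~{total_gib:.1f} GiB vs 8 GiB of VRAM")
# ~4.6 GiB of weights and ~6.3 GiB in total, comfortably inside the 3070's
# 8 GiB, which is why every layer can be offloaded (gpu_layers 33, or just 99).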

Thanks so much. I learned a lot of new things today, like the hint in ST. What a shame :( I was using the default Alpaca settings since I had been sitting on older models. Now I understand why Mistral initially seemed like a really shitty model; I just hadn't set up ST correctly. Also, some of my bots had a Main Prompt and Jailbreak designed for the GPT/Claude APIs... I am very embarrassed, but I am a simple user, so I sometimes ask stupid questions.

When using any of the usual models we use here, you can just use a regular system prompt; there's no need for jailbreaks or anything like that. Virt's presets got positive feedback, so they should be a pretty good start.

@Endevor Now that there is a new Mistral that seems to support up to 32K context, will you update your model with that new Mistral base? Would be great!

Downside is that it absolutely shits its pants above 8K --contextsize for me. I was hoping for at least 10-12K.

I've an idea. When I was using exl 20B's in oobabooga, models would usually get incoherent when going above 4k, so I used the 'alpha_value' flag (which is NTK RoPE scaling) to increase the usable context to 6k. That got me thinking, as I noticed while testing the InfinityRP 7B GGUF that gens would become trash once the context goes past 8k (as mentioned by other people who used said model). So I decided to try extending the context using oobabooga. Seeing as gens usually go trash beyond 8k, I used 8192 as the starting reference to extend from, since the model's max position embeddings default to 32k rather than the usual 4k of other models.

According to the oobabooga wiki, when using alpha value to increase context relative to a 4k original context, it works out like this: for 4k x 1.5 = 6k, use an optimal alpha value of 1.75; for 4k x 2 = 8k, use an optimal alpha value of 2.5. I noticed that the formula for the optimal alpha value would be (0.25 * x) + y, where x is the number of additional 2k blocks added on top of the original 4k, and y is the factor by which the original 4k context is increased. For example, if you plan to increase 4k to 6k (4k x 1.5 = 6k), x would be 1, since only one 2k block is added to 4k, and y would be 1.5, so (0.25 * 1) + 1.5 = 1.75.

With that in mind, increasing 8k to 10k would be 8k x 1.25 = 10k, so the optimal alpha value would be (0.25 * 1) + 1.25 = 1.5. I used the exl quant of Eris_Prime V3.05 by Nitral-AI, as it's one of the recent 7B's whose config.json is closest to that of InfinityRP 7B and whose responses also become wonky beyond 8k, and I ran it with a slightly lower alpha value of 1.38, testing with a character card and a message history of 70 messages of about 300 tokens each. Responses are coherent so far, which means that, theoretically, using alpha value you could run any 7B at very high context values, though according to the wiki it slightly affects the quality of the model. Also note that the above explanation is just speculation based on current info, so take it with a grain of salt.
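
A small sketch of that rule of thumb, purely to illustrate the speculation above; this is not an official oobabooga formula, and suggested_alpha is just a made-up helper name.

def suggested_alpha(base_ctx, target_ctx):
    x = (target_ctx - base_ctx) / 2048   # how many extra 2k blocks were added
    y = target_ctx / base_ctx            # how much the original context grows
    return 0.25 * x + y

print(suggested_alpha(4096, 6144))    # 1.75, the wiki's 4k -> 6k example
print(suggested_alpha(4096, 8192))    # 2.5,  the wiki's 4k -> 8k example
print(suggested_alpha(8192, 10240))   # 1.5,  the 8k -> 10k case tried above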

@Clevyby Maybe it's just not doing well with the automatic NTK RoPE scaling from KCPP for me, though other models known for long context, or Kunoichi for example, are handled fine automatically. I could try dialing the values in manually for my context if it eventually becomes a big issue.

Yeah, I think KCPP's RoPE scaling is a bit flawed compared to oobabooga's. And for long-context models, setting the context length would be enough (in oobabooga). I'm just providing a solution for models with limited context lengths.

I just didn't want to bother with manual scaling at this point; automatic scaling based on --contextsize does work with some models, like Kunoichi, but apparently not this time. Alas, there's hope.
