Visibly Degraded Quality, at Least on Q8 GGUF
As you already mentioned in the model card, the reasoning and text formatting took a hit after the dataset was modified. It's visible almost 1/3 of the time: the model starts using brackets for narration. Another con is that it really wants to reply as the user, at times answering on the user's behalf without the user actually inputting anything, whenever it makes sense in context. It's also become more stubborn with author's notes; it no longer obeys them like 3.2 did. For example, [SYSTEM: Use {{random:1,2,2,3}} paragraphs for the next reply.] inserted at depth 0 as user is no longer followed. Stheno v3.3 also got confused when using an alichat + lorebook combination character card, suddenly assuming the role of some random character that's barely relevant to the card. I hope to God it's because of the quant somehow being faulty.
- Using LLaMA3 instruct format
- 1.1 temp, 0.075 min-P, 1.07 rep pen, and top k 50
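(For clarity, by LLaMA3 instruct format I mean the standard Llama 3 prompt layout, roughly as sketched below; the actual system prompt and names come from my SillyTavern preset, so treat the placeholders as just that:)

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{model reply}<|eot_id|>
```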
Hmm, honestly I got brackets for narration a few times, but it's always fixed after a manual edit the first time it happens, with no issues after that. I don't think this is a dataset issue.
I also never had issues with the model wanting to write as {{user}}? Seems fine on my end. I did test it with multiple cards at multiple context lengths.
I never use lorebooks with alichat so I can't help you there? But on my end it's able to handle group / multi-char cards well.
Instruction following did take a small hit, probably due to the extended context, but a swipe is still fast since it's only 8B.
Seems fine on my end.
Running the unquantized BF16 helps, I suppose. Will test further.
Huh, never mind. Faulty quant. Good job! You cooked! See further down.
So that's it, I was wondering why the model's answers are so incoherent and random. I used the Q8 quant from mradermacher; I'll look at other quants.
upd:
meh, tested it on my group chat with 300+ messages. At 32k context it responds completely incoherently, even with the temperature reduced to 0.4. As the context is reduced down to 16k (the default value for me), the LLM becomes more coherent with respect to at least the last dozen messages, without trying to make them up again.
I have a strong feeling that 8B models scale very poorly past 16k context. I've tried many different Llama 3 variants, and every time I see roughly the same situation: either the model repeats words in a loop, or it completely loses the context of what is happening and the output reads more like nonsense.
settings:
virt-IO/SillyTavern-Presets Prompts/LLAMA-3 2.0
0.6 temp, 0.075 min-P, 1.1 rep pen, and top k 50
@Sao10K Sorry for reopening the thread, but after two evenings of testing, I still think something's wrong. Using a temp of 1 with 0.1 min-P mostly fixed the formatting and bracketing issues, as well as the model's tendency to speak as {{user}}. So that's neat. Yet there's still that bad reasoning problem I mentioned.
I tried an 8.0 bpw quant, three Q8 static GGUF quants (one self-made), and one imatrix GGUF, and it still confused details multiple times, especially about the character's family tree in this specific chat. It even contradicted concise, precise information. It's worse than 3.1 when it comes to reasoning, but seems less slopped. I also noticed that it leans more toward storytelling than conversational RP; it really seems to love writing short stories. This time I used a well-written, well-paced plaintext card that also had 3 nice example dialogues set to never get pushed out of context. 3.2 is my go-to for now.
Have you tried exporting to an FP32 GGUF and then quantizing that FP32 GGUF to Q8? This has worked for me in the past; not sure if it will help here.
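Roughly what I mean, using the llama.cpp tooling (a minimal sketch; the exact script/binary names differ between llama.cpp versions, and the model directory and file names here are just placeholders):

```bash
# 1. Export the HF model to an unquantized FP32 GGUF
python convert_hf_to_gguf.py ./Stheno-v3.3-32K --outtype f32 --outfile stheno-3.3-f32.gguf

# 2. Quantize the FP32 GGUF down to Q8_0
./llama-quantize stheno-3.3-f32.gguf stheno-3.3-Q8_0.gguf Q8_0
```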
It's surprising to hear this has helped in the past; I wouldn't have imagined it would be beneficial. But interesting.
Both 3.2 and 3.3 are now on the Open LLM Leaderboard, and the latter scores significantly worse, so that confirms it.
Hey guys, can you help me with something? How can I run GGUF models on Ollama? Recently I installed Open WebUI, which is an excellent frontend for Ollama: it gives you a great ChatGPT-like interface and has text-to-speech and speech-to-text, so you can talk hands-free with Llama 3, Gemma 2, etc. Excellent 👌. But I'd like to be able to use HF GGUF files with it. Is there any way, an easy way? Thanks 👍
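In case it helps, the usual approach is to point an Ollama Modelfile at a downloaded GGUF and create a local model from it (a rough sketch; the file and model names below are placeholders, and you may also want a TEMPLATE directive so the Llama 3 prompt format is applied):

```bash
# Download the GGUF from HF, then reference it in a Modelfile
echo 'FROM ./L3-8B-Stheno-v3.3-32K.Q8_0.gguf' > Modelfile

# Register it with Ollama under a local name, then run it
ollama create stheno-v3.3 -f Modelfile
ollama run stheno-v3.3
```

Once created this way, the model shows up in Open WebUI's model list like any other Ollama model.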