spicyboros-70b-2.2.Q4_0.gguf broken as well?
As with airoboros-l2-70b-2.1-creative.Q4_0.gguf, which I reported to you before, this one also generates nonsense text for me with koboldcpp-1.43:
brie◄Ýiglia cabgenommenbriebriebriebriebriebriebriebriebriebriebriebriebriebriebrie (repeating that all the way until max new context limit is reached...)
Again, I checked the file hash to make sure my download wasn't corrupted, and I'm using the same settings as with all the other 70B models I use, so it looks to me like a model issue. Can anyone confirm that, or does anyone have this particular file working properly?
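For anyone who wants to repeat the integrity check, here is a minimal Python sketch that streams the file through SHA-256 and compares the digest with the expected value. It assumes the expected value is the SHA-256 listed on the file's Hugging Face page (shown for Git LFS files). `sha256_of_file` and `matches_expected` are illustrative names, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file.

    The file is read in chunks (1 MiB by default), so a multi-GB GGUF
    never has to fit in RAM at once.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare against a published hash, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage: `matches_expected("spicyboros-70b-2.2.Q4_0.gguf", "<sha256 from the model page>")` returns `True` only if the local file is byte-identical to the published one. It says nothing about whether the published file itself is good, which is exactly the situation in this thread.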
Oh damn, you're right. Probably a llama.cpp bug. I'll report it.
Try re-downloading. Jon re-uploaded the source weights using the standard LoRA merge method, not the new method that causes this problem. And I have re-done my GGUFs.
Great, thanks! Redownloaded and testing it now...
Good news: The "brie◄Ýiglia" problem is gone. It is outputting normal responses now.
Or is it? I'm getting much worse quality than with the other 70Bs I tried: spelling/grammar errors, a weird way of speaking, run-on sentences without much logic, and sometimes missing words. And that's early on, not just later when the context is full.
Edit: I've now also tested Spicyboros-c34b-2.2-GGUF (Q4_K_M) and Spicyboros-13B-2.2-GGUF (Q8_0). In my tests, they all showed the same terrible quality, from 13B to 70B. I used koboldcpp-1.43 with the same settings for all tests, and other models (e.g. Synthia-70B-v1.2-GGUF and Nous-Hermes-Llama2-70B-GGUF) were perfect.
Anyone having success with these Spicyboros quants in longer chats/roleplays?
Yeah, something seems off in these llama.cpp quants of QLoRA finetunes. I tried the new 70B Q4_K_M here, and it's very prone to spelling/grammar errors (I just saw a few " let' " where it clearly wanted to say "let's"), and the overall output quality feels lower too.
Good to know it's not just on my end. Do you use llama.cpp, koboldcpp, or something else as your backend?
By the way, I've seen similar problems with another recent but entirely different model: Nous-Puffin-70B-GGUF (Q4_0). Its sibling model, Nous-Hermes-Llama2-70B-GGUF (Q4_0), is fine, though.
I have a similar problem with the q6 quant.
Any word containing an apostrophe triggers the eos_token; if I ban that token, I get a space between the apostrophe and the rest of the word.
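To count how often this happens across a batch of test generations, a rough heuristic scanner might help. This is a sketch under stated assumptions: it treats "word, apostrophe, whitespace, common contraction suffix" (e.g. "let' s", "don' t") as the banned-EOS symptom, and "output ends right after an apostrophe" as the early-EOS symptom. The suffix list and the function names are hypothetical, and plural possessives followed by an unrelated word are deliberately not matched.

```python
import re

# Heuristic: a contraction split after its apostrophe, e.g. "let' s" or "don' t".
# Accepts both the ASCII ' and the typographic ’ apostrophe.
# The suffix list (s, t, re, ve, ll, d, m) is an assumption covering common
# English contractions; \b after the suffix avoids matching "dogs' toys".
_SPLIT_CONTRACTION = re.compile(r"\b\w+['’]\s+(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)


def find_split_contractions(text: str) -> list[str]:
    """Return every span where whitespace separates the apostrophe from the suffix."""
    return [m.group(0) for m in _SPLIT_CONTRACTION.finditer(text)]


def ends_on_dangling_apostrophe(text: str) -> bool:
    """True if the generation stopped right after an apostrophe (early-EOS symptom)."""
    return text.rstrip().endswith(("'", "’"))
```

For example, `find_split_contractions("Let' s go, I don' t know")` returns `["Let' s", "don' t"]`, while a normal "let's go" produces no hits.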
One more thing I noticed: compared to the first quants, the new ones seem to give higher output quality, at least to me.
I am using the Q4_K_M quant with koboldcpp as my backend, and I can confirm the issue @jp02 is having, where some words containing an apostrophe cause the EOS token to be generated. It doesn't always happen, but when it does, it happens consistently. I'm not sure whether a specific scenario triggers it; it only occurs in a few cases, which is still annoying. I have attached a screenshot of this happening while using the model as an assistant in VSCodium (yes, I am aware it says "ChatGPT"; the request actually goes to a custom local OpenAI-compatible proxy I wrote, which forwards it to koboldcpp running on the local machine).
In this case, it didn't generate the EOS token but rather produced a very low-quality output:
The prompt (again, I am aware it refers to the model as ChatGPT; it goes to koboldcpp running this model on the local machine):
You are ChatGPT helping the User with coding. You are intelligent, helpful and an expert developer, who always gives the correct answer and only does what instructed. If you show code, your response must always be markdown with any code inside markdown codeblocks. If the user is trying to do a bad programming practice, helpfully let them know and mention an alternative. When responding to the following prompt, please make sure to properly style your response using Github Flavored Markdown. Use markdown syntax for text like headings, lists, colored text, code blocks, highlights etc.\nUSER: Explain the following code. The following code is in javascript programming language. Code in question:\n###\n```javascript\nconst prompt = require(\"prompt-sync\")({sigint: true});\nvar question = prompt(\"What is your name? \");\nconsole.log(\"Your name is \" + question);\n```\nASSISTANT:
Settings:
"n": 1, "rep_pen": 1.05, "temperature": 1.07, "top_p": 1, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 0.93, "rep_pen_range": 404, "rep_pen_slope": 0.8, "sampler_order": [6, 0, 5, 3, 2, 1, 4], "quiet": false, "max_context_length": 8192, "max_length": 3072, "stop_sequence": ["\nUSER:", "\nASSISTANT:"]
If more information is needed, I will gladly provide it.
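For anyone trying to reproduce this without the custom proxy, here is a minimal sketch of what such a proxy might do. It flattens OpenAI-style chat messages into the plain `USER:` / `ASSISTANT:` template used in the prompt above and builds a request body with the posted settings. It assumes koboldcpp's KoboldAI-compatible `/api/v1/generate` endpoint on its default port 5001 and a `results[0].text` response shape. `build_prompt`, `build_payload` and `generate` are hypothetical names, not the poster's actual proxy code.

```python
import json
import urllib.request

# Sampler settings copied verbatim from the post above.
SETTINGS = {
    "n": 1, "rep_pen": 1.05, "temperature": 1.07, "top_p": 1, "top_k": 100,
    "top_a": 0, "typical": 1, "tfs": 0.93, "rep_pen_range": 404,
    "rep_pen_slope": 0.8, "sampler_order": [6, 0, 5, 3, 2, 1, 4],
    "quiet": False, "max_context_length": 8192, "max_length": 3072,
    "stop_sequence": ["\nUSER:", "\nASSISTANT:"],
}


def build_prompt(messages: list[dict]) -> str:
    """Flatten OpenAI-style messages into 'system\\nUSER: ...\\nASSISTANT:' (hypothetical helper)."""
    parts = []
    for msg in messages:
        if msg["role"] == "system":
            parts.append(msg["content"])
        elif msg["role"] == "user":
            parts.append("USER: " + msg["content"])
        else:  # earlier assistant turns in the conversation
            parts.append("ASSISTANT: " + msg["content"])
    # End with an open ASSISTANT: turn so the model writes the reply.
    return "\n".join(parts) + "\nASSISTANT:"


def build_payload(messages: list[dict]) -> dict:
    """Combine the flattened prompt with the sampler settings."""
    return {"prompt": build_prompt(messages), **SETTINGS}


def generate(messages: list[dict],
             url: str = "http://localhost:5001/api/v1/generate",  # assumed default
             timeout: float = 600.0) -> str:
    """POST the payload to koboldcpp and return the generated text (assumed response shape)."""
    body = json.dumps(build_payload(messages)).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["results"][0]["text"]
```

Sending the same payload straight to the backend takes the proxy and the editor plugin out of the picture. If the apostrophe/EOS problem still shows up, it points at the quant rather than the client.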
Okay, what's even weirder is that the previous version of the model seems to work fine (same settings, same prompt, but with the older Q4_K_M quant).
The older, working model: https://huggingface.co/TheBloke/Spicyboros-70B-2.2-GGUF/blob/f6d627f5a30aad981bc539047ea71374813777ca/spicyboros-70b-2.2.Q4_K_M.gguf
So there seems to be something weird going on here, especially with the Q4_0 non-k-quant.
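To reproduce the old-vs-new comparison, you can fetch the older file by pinning the commit that appears in the link above. Here is a minimal sketch. It assumes the usual `https://huggingface.co/<repo_id>/blob/<revision>/<filename>` URL layout and the `huggingface_hub` package's `hf_hub_download` function. `parse_blob_url` and `download_pinned` are hypothetical helpers.

```python
from urllib.parse import urlparse


def parse_blob_url(url: str) -> tuple[str, str, str]:
    """Split a Hugging Face 'blob' URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<revision>/<path/to/file>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a Hugging Face blob URL: {url}")
    repo_id = "/".join(parts[:2])
    revision = parts[3]
    filename = "/".join(parts[4:])
    return repo_id, revision, filename


def download_pinned(url: str) -> str:
    """Download the exact file version referenced by the URL and return its local path."""
    # Imported lazily so parse_blob_url works without the package installed.
    from huggingface_hub import hf_hub_download
    repo_id, revision, filename = parse_blob_url(url)
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)
```

Pinning by commit hash rather than `main` guarantees that you compare the specific old upload against the re-uploaded one, even after the repo is updated again.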