Broken output when context is larger than 4k
Adjusting the parameter 'llama.context_length' in the GGUF from 8192 to 4096 resolves the issue.
llama.cpp doesn't support sliding window attention, so the model's effective native context length is only 4096.
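If you want to double-check what the file actually reports before patching it, something like this should work. It's a rough sketch using the gguf Python package from llama.cpp's gguf-py (the path is just a placeholder, and the field-access pattern assumes scalar metadata is stored the way gguf-dump reads it):

```python
# Sketch: print the context length currently stored in a GGUF file.
# Assumes the `gguf` package (llama.cpp's gguf-py) is installed.
from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")        # placeholder path
field = reader.fields["llama.context_length"]    # the metadata key in question
# Scalar fields keep their value in one of the parts arrays.
print(int(field.parts[field.data[0]][0]))        # expected: 8192 before the fix
```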
Automatic RoPE scaling with --contextsize 8192 in koboldcpp:
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 8272
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
When it should be:
Automatic RoPE Scaling: Using (scale:1.000, base:32000.0).
llama_new_context_with_model: n_ctx = 8272
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
For the fix, use gguf-set-metadata.py from https://github.com/ggerganov/llama.cpp/tree/master/gguf-py/scripts:
python gguf-set-metadata.py path/to/model.gguf llama.context_length 4096
This actually affects my imatrix quants too. I'll reupload them.
Could this be caused by using the wrong tokenizer when quantizing, or just a mismatched model type?
Does this render my GGUFs broken? If so, I may just direct people to your imatrix versions and take this repo down, as I'm away from my PC for a few days.
This results in backends that do not support Mistral's sliding window attention (such as llama.cpp/koboldcpp) not being able to load 8192 context correctly. With --contextsize 8192, koboldcpp sees that the model claims a native 8192 context (when it can really only do 4096, since SWA is unsupported), so it decides it does not need to "extend" the max context via RoPE scaling. As a result, any context greater than 4096 is incoherent.
llama.context_length should be changed to 4096.
Edit: I might be wrong in assuming that SWA is used when training/finetuning SOLAR, or that it is used beyond 4096 context, since the model is retrained with a 4096 max context only.
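For what it's worth, here's a rough sketch of why the reported context length matters for the automatic scaling. It uses the generic NTK-aware formula with an assumed head dimension of 128, not koboldcpp's exact heuristic (which lands on ~32000 in the log above), so treat the numbers as illustrative only:

```python
# Generic NTK-aware RoPE base scaling (illustration only; koboldcpp's automatic
# heuristic uses its own formula and picks a somewhat larger base).
def ntk_rope_base(train_ctx: int, target_ctx: int,
                  head_dim: int = 128, base: float = 10000.0) -> float:
    # No scaling is applied when the target fits in the reported trained context.
    scale = max(target_ctx / train_ctx, 1.0)
    return base * scale ** (head_dim / (head_dim - 2))

# If the GGUF reports 8192, scale = 1.0, freq_base stays 10000, and output past 4k breaks.
print(ntk_rope_base(8192, 8192))   # 10000.0
# With the metadata fixed to 4096, scale = 2.0 and the base gets raised.
print(ntk_rope_base(4096, 8192))   # ~20221
```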
The first test was run with 8k; I ran another with 16k, and that's when I noticed it was changing the config.
The line under the file name in the screenshot says MaxCtx: 8192.
If you benchmark with --contextsize 8192 first, it should output Coherent: False.
I've uploaded the resulting CSV files to the repo; I believe they're correct.
The commands used were:
koboldcpp.exe --contextsize 8192 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf
koboldcpp.exe --contextsize 16384 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf
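If anyone wants to sanity-check the CSVs without opening them by hand, a quick script like this should do. The column names are assumed to match the labels koboldcpp prints (e.g. MaxCtx and Coherent); adjust them if the file uses different headers:

```python
# Sketch: print the max context and coherence result from each benchmark row.
# Column names are assumed, not confirmed against koboldcpp's CSV format.
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row.get("MaxCtx"), row.get("Coherent"))
```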
Can confirm it is broken: in a multi-turn conversation it goes insane past 4k, but when feeding in one large chunk of context larger than 4k it remains coherent.
It's cursed
I was initially confused too as to why the same prompt (1000 tokens) * 5 = 5k would produce coherent text, but multi-turn with unique replies would break past 4k.
Thank you for all the effort, I wouldn't have had a clue where to start.