Broken output when context is larger than 4k
Adjusting the parameter 'llama.context_length' in the GGUF from 8192 to 4096 resolves the issue.
llama.cpp doesn't support sliding window attention, so the model's effective native context length is only 4096.
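If you want to double-check what the file actually reports before patching it, something like this should work. It's a rough sketch using the gguf Python package from llama.cpp's gguf-py (the path is just a placeholder, and the field-access pattern assumes scalar metadata is stored the way gguf-dump reads it):

```python
# Sketch: print the context length currently stored in a GGUF file.
# Assumes the `gguf` package (llama.cpp's gguf-py) is installed.
from gguf import GGUFReader

reader = GGUFReader("path/to/model.gguf")        # placeholder path
field = reader.fields["llama.context_length"]    # the metadata key in question
# Scalar fields keep their value in one of the parts arrays.
print(int(field.parts[field.data[0]][0]))        # expected: 8192 before the fix
```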
Automatic RoPE scaling with --contextsize 8192 in koboldcpp:
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx = 8272
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
When it should be:
Automatic RoPE Scaling: Using (scale:1.000, base:32000.0).
llama_new_context_with_model: n_ctx = 8272
llama_new_context_with_model: freq_base = 32000.0
llama_new_context_with_model: freq_scale = 1
For the fix, use gguf-set-metadata.py from https://github.com/ggerganov/llama.cpp/tree/master/gguf-py/scripts:
python gguf-set-metadata.py path/to/model.gguf llama.context_length 4096
This actually affects my imatrix quants too. I'll reupload them.
Could this be caused by using the wrong tokenizer when quantizing, or just a mismatched model type?
Does this render my GGUFs broken? If so, I may just direct people to your imatrix versions and take this repo down, as I'm away from my PC for a few days.
This results in backends that do not support Mistral's sliding window attention (such as llama.cpp/koboldcpp) not being able to load 8192 context correctly. With --contextsize 8192, koboldcpp sees that the model claims a native 8192 context (when it can really only do 4096, since SWA is unsupported), so it decides it does not need to "extend" the max context via RoPE scaling. As a result, any context greater than 4096 is incoherent.
llama.context_length should be changed to 4096.
Edit: I might be wrong in assuming that SWA is used when training/finetuning SOLAR, or that it is used beyond 4096 context, since the model is retrained with a 4096 max context only.
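For what it's worth, here's a rough sketch of why the reported context length matters for the automatic scaling. It uses the generic NTK-aware formula with an assumed head dimension of 128, not koboldcpp's exact heuristic (which lands on ~32000 in the log above), so treat the numbers as illustrative only:

```python
# Generic NTK-aware RoPE base scaling (illustration only; koboldcpp's automatic
# heuristic uses its own formula and picks a somewhat larger base).
def ntk_rope_base(train_ctx: int, target_ctx: int,
                  head_dim: int = 128, base: float = 10000.0) -> float:
    # No scaling is applied when the target fits in the reported trained context.
    scale = max(target_ctx / train_ctx, 1.0)
    return base * scale ** (head_dim / (head_dim - 2))

# If the GGUF reports 8192, scale = 1.0, freq_base stays 10000, and output past 4k breaks.
print(ntk_rope_base(8192, 8192))   # 10000.0
# With the metadata fixed to 4096, scale = 2.0 and the base gets raised.
print(ntk_rope_base(4096, 8192))   # ~20221
```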
The first test was run with 8k; I ran another with 16k, and that's when I noticed it was changing the config.
The line under the file name in the screenshot says MaxCtx: 8192.
If you benchmark with --contextsize 8192 first, it should output Coherent: False.
I've uploaded the resulting CSV files to the repo; I believe they're correct.
The commands used were:
koboldcpp.exe --contextsize 8192 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf
koboldcpp.exe --contextsize 16384 --benchmark results.csv --model C:\Users\sai\Downloads\Fimbulvetr-Kuro-Lotus-10.7B-Q3_K_M.gguf
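If anyone wants to sanity-check the CSVs without opening them by hand, a quick script like this should do. The column names are assumed to match the labels koboldcpp prints (e.g. MaxCtx and Coherent); adjust them if the file uses different headers:

```python
# Sketch: print the max context and coherence result from each benchmark row.
# Column names are assumed, not confirmed against koboldcpp's CSV format.
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row.get("MaxCtx"), row.get("Coherent"))
```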
Can confirm it is broken: in a multi-turn conversation it goes insane past 4k, but when feeding in one large chunk of context larger than 4k it remains coherent.
It's cursed
I was initially confused too as to why the same prompt (1000 tokens) * 5 = 5k would produce coherent text, but multi-turn with unique replies would break past 4k.
Thank you for all the effort, I wouldn't have had a clue where to start.