AssertionError: Total sequence length exceeds cache size in model.forward
I'm getting this error when running past 2k context, despite having the model loaded for 32k on RunPod on an A6000.
I believe it is related to this: https://github.com/oobabooga/text-generation-webui/issues/5750#issuecomment-2024442282
But I am not knowledgeable enough to be sure.
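From what I can gather from that issue, the assertion fires when the KV cache is allocated smaller than the context you actually feed the model. A rough sketch of the mismatch, assuming the exllamav2 Python API (the model path is a placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config("/path/to/model")  # placeholder path
config.max_seq_len = 32768                  # model configured for 32k context

model = ExLlamaV2(config)

# If the loader allocates the KV cache smaller than the requested context...
cache = ExLlamaV2Cache(model, max_seq_len=2048)
model.load_autosplit(cache)

# ...then any forward pass whose total length passes cache.max_seq_len raises:
# AssertionError: Total sequence length exceeds cache size in model.forward
```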
I use text-generation-webui from May 19 and do not have this issue, with the 4-bit cache. What are your settings, and what version are you using?
BTW, I made a small update in config.json and tokenizer_config.json - I believe it is unrelated to your problem, but please update those files.
Max length is set to 32k, alpha value to 1, and compress_pos_emb to 1. I have tried both the 8-bit and 4-bit cache, and neither worked. I can get successful generations up to about 2k context, then it simply fails. This is also on text-generation-webui.
This is my pod template: text-generation-webui-oneclick-UI-and-API
ID: vmg0ubbuwtesbw
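For what it's worth, my understanding is that the cache is supposed to be allocated to match the configured max length. A sketch of what that looks like directly in exllamav2 with the 4-bit cache (path and sizes are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config("/path/to/model")  # placeholder path
config.max_seq_len = 32768                  # 32k context

model = ExLlamaV2(config)

# Size the quantized (4-bit) cache to the full context so long prompts
# don't overrun it.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=config.max_seq_len)
model.load_autosplit(cache)
```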
Maybe you need to update ExLlama or text-generation-webui? I'm not sure how else to help you.