Doesn't work for me with the ExLlama loader, only with transformers

by kriss

Great work!

Getting the following error when trying to load it with the new ExLlama loader via the WebUI:

```
text-generation-webui/repositories/exllama/model.py", line 554, in __init__
    with safe_open(self.config.model_path, framework="pt", device="cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
```
Is it just a VRAM issue on my end?
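
For what it's worth, a quick way to sanity-check whether the `.safetensors` file is what `safe_open` expects (the path here is just an example):

```python
import struct
from pathlib import Path

# A safetensors file starts with an 8-byte little-endian header length,
# followed by that many bytes of JSON. "HeaderTooLarge" usually means those
# first 8 bytes decode to an implausibly huge number, i.e. the file isn't
# really safetensors (wrong file, or a git-lfs pointer that was never pulled).
path = Path("models/airoboros-7b/model.safetensors")  # hypothetical path
with path.open("rb") as f:
    (header_len,) = struct.unpack("<Q", f.read(8))
print(f"declared header length: {header_len} bytes "
      f"(file is {path.stat().st_size} bytes)")
```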

It works well with transformers, though.
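
For reference, this is roughly what the transformers loader is doing under the hood; the model id is my guess at this repo's name, so adjust it to whatever you actually downloaded:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jondurbin/airoboros-7b-gpt4-1.2"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # full fp16 weights, so 7b needs roughly 14 GB of VRAM
    device_map="auto",          # requires the accelerate package
)
```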

I haven't used exllama, but taking a quick peek at the repo, it looks like it's intended for the 4-bit GPTQ versions of the models. TheBloke has kindly made GPTQ (and GGML) quantized versions of all of these (7b through 65b). The 7b version is here:
https://huggingface.co/TheBloke/airoboros-7B-gpt4-1.2-GPTQ
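
If it helps, pulling those quantized weights down locally is a one-liner (assuming you have huggingface_hub installed):

```python
from huggingface_hub import snapshot_download

# Downloads TheBloke's 4-bit GPTQ weights into the local HF cache and
# returns the directory; point the webui's models folder at the result.
local_dir = snapshot_download(repo_id="TheBloke/airoboros-7B-gpt4-1.2-GPTQ")
print(local_dir)
```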

I used qlora for all versions this time, rather than a full fine-tune, so the smaller 7b/13b models may be a bit worse than the 1.1 versions for some prompts, but I don't have any direct evidence of that.

The 33b and 65b versions perform quite well with qlora tuning, however.
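
For anyone curious what that looks like in code, here's a minimal sketch of a qlora setup with peft + bitsandbytes; the base model id, ranks, and target modules below are illustrative stand-ins, not my exact training config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base model (the "q" in qlora).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # hypothetical base model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters on the attention projections; ranks and
# target modules are illustrative, not the exact values used here.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters train; the 4-bit base stays frozen
```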
