.pt version uses 2gb less VRAM for me than the non-groupsized .safetensors

#10
by Monero - opened

I'm using KoboldAI with an RX 6800 XT and a Vega 64 combined for 24 GB of VRAM on Linux Mint.

I've noticed the safetensors versions use significantly more VRAM than the .pt version.

For comparison, with the same prompt and context tokens, the 30b-int4.pt model totals 22,061 MB, while the no-groupsize safetensors version uses 24,146 MB.
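
(For anyone trying to reproduce numbers like these, here is a minimal sketch of reading per-GPU usage out of PyTorch itself; the `torch.cuda` calls also work on ROCm builds of PyTorch, and the helper name is just illustrative:)

```python
import torch

def report_vram() -> None:
    """Illustrative helper: print VRAM usage per visible GPU after loading a model."""
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)     # driver-level free/total bytes
        allocated = torch.cuda.memory_allocated(i)   # bytes held by live tensors
        reserved = torch.cuda.memory_reserved(i)     # bytes kept by the caching allocator
        print(f"GPU {i}: {allocated / 2**20:,.0f} MiB allocated, "
              f"{reserved / 2**20:,.0f} MiB reserved, "
              f"{(total - free) / 2**20:,.0f} MiB used in total")

report_vram()
```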

Is there any way to quantize the new one so that it doesn't use as much VRAM as it does now?
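
(Not the exact procedure used for these files, but as an illustration of how a model can be re-quantized with a chosen group size, here is a rough sketch using the AutoGPTQ library; paths are placeholders, the calibration text is a stand-in for a real calibration set, and argument names may differ between versions:)

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "path/to/llama-30b-hf"        # placeholder: full-precision HF checkpoint
out_dir = "path/to/llama-30b-4bit-128g"   # placeholder: output directory

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
# A real run would use a proper calibration set (several hundred samples).
examples = [tokenizer("This is placeholder calibration text.")]

# group_size=128 trades a bit of extra VRAM/file size for accuracy;
# group_size=-1 would produce a no-groupsize quantization.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)
```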

I'm not sure how that's the case, as I was maxing out my VRAM with the original version at max context. I haven't used KoboldAI in a while now, since I hadn't heard that they supported 4-bit (or were they working on implementing it?). Maybe some of the model is being offloaded to swap; honestly, I have no clue.

Hello elinas, and thank you very much for your work. When you say you were maxing out your VRAM at max context, do you mean on a 24 GiB card? I'm having issues and need to limit tokens to about 1k for it to fit on my 3090. Can you explain how you are using the model? I am using the johnsmit0031 repo.

Thank you very much.
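
(A possible explanation for the ~1k-token limit: the KV cache grows linearly with context length regardless of how the weights are quantized. A back-of-the-envelope estimate, assuming a LLaMA-30B-class model with 60 layers, hidden size 6656, and an fp16 cache:)

```python
# Rough KV-cache estimate for a LLaMA-30B-class model.
# Assumed figures: 60 layers, hidden size 6656, fp16 (2-byte) cache entries.
layers, hidden, bytes_per_elem = 60, 6656, 2
per_token = 2 * layers * hidden * bytes_per_elem   # keys + values across all layers

for context in (1024, 2048):
    print(f"{context} tokens -> ~{context * per_token / 2**30:.1f} GiB of KV cache")
# ~1.5 GiB at 1024 tokens vs. ~3.0 GiB at 2048, on top of the quantized weights
# themselves and activation/fragmentation overhead, which is why trimming context
# can make the difference between fitting and not fitting on a 24 GB card.
```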

Can you explain how you are using the model? I am using the johnsmit0031 repo.

I'm not really familiar with that repo (other than that it promises 4-bit training and LoRAs), and I only have 2 official options, 3 if you use the KoboldAI 4-bit version.
