What's the VRAM usage?

I'd like to know what quant I can run, if any, on 24GB, and if so at what context.

I've heard people say the Command-R cache eats up a lot of memory on GGUF, but I don't know if that applies to exl2?
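For a rough sense of scale, here's a quick back-of-the-envelope sketch of the KV cache. The architecture numbers are assumed from memory (40 layers, 64 heads, head dim 128, and no GQA in v01, which is the main reason the cache is so heavy), so double-check them against the model's config.json:

```python
# Rough KV-cache estimate for Command-R v01 (assumed: 40 layers, 64 KV heads,
# head_dim 128, no GQA). Treat the numbers as ballpark only.

def kv_cache_bytes(ctx_len, n_layers=40, n_kv_heads=64, head_dim=128, bytes_per_elem=2):
    # K and V tensors, per layer, per token, at FP16 (2 bytes per element)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

for ctx in (8192, 16384, 32768):
    fp16_gib = kv_cache_bytes(ctx) / 2**30
    q4_gib = fp16_gib / 4          # very rough: Q4 cache is ~1/4 of FP16
    print(f"{ctx:>6} ctx: ~{fp16_gib:.1f} GiB FP16 cache, ~{q4_gib:.1f} GiB Q4 cache")
```

If those assumptions hold, the FP16 cache alone is on the order of 10 GiB at 8K context, which applies to GGUF and exl2 alike; the Q4 cache option in exllamav2 cuts that to roughly a quarter.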

3.75 bpw proved to be too large for me on a 3090, but I'm running Windows and my GPU also drives my monitor. I was able to get it to load with a 500-token context, so it DOES technically fit. A bit more headroom, in the form of Linux etc., might make it usable at that bpw.

That said, I'm giving a 3.5 bpw quant available elsewhere a try, hoping that will be the sweet spot for someone like me.

Since I favor context length, I'm using the 2.6 bpw quant with 10k tokens. I couldn't go further than that on a 3090.

Isn't that too quantized? To the point where maybe an 8x7B with 32k tokens would be much more coherent? And are you using 4-bit cache?

Yes, 4-bit. I believe you're right. I've been trying this model since yesterday and the outputs are not very consistent, to the point where older models become better options.
I'm kinda new to this world and still don't understand everything.

On 3x3060:

  • 4.0 bpw: 24576 ctx, 7-7-12 GB split, 4-bit cache, peak VRAM usage 11.3 / 11.3 / 11.5 GB
  • 3.75 bpw: 27000 ctx, 6.4-6.7-12 GB split, 4-bit cache, peak VRAM usage 11.5 / 11.5 / 11.5 GB
  • 3.0 bpw: 32768 ctx, 5.5-5.5-12 GB split, 4-bit cache, peak VRAM usage 11.3 / 11.2 / 11.4 GB

The ExLlamav2 loader works better for me than ExLlamav2_HF: it's faster and more accurate.
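For anyone loading outside the webui, here's a minimal sketch of roughly the same setup through the exllamav2 Python API. The model path is a placeholder, the split values are just the 3.0 bpw numbers from the list above, and the class/method names are from memory for the 0.1.x releases, so treat it as a starting point rather than a recipe:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder path to a local exl2 quant directory
config = ExLlamaV2Config("/models/command-r-v01-exl2-3.0bpw")
config.max_seq_len = 32768

model = ExLlamaV2(config)
model.load(gpu_split=[5.5, 5.5, 12])        # manual per-GPU split in GB, as above

cache = ExLlamaV2Cache_Q4(model)            # quantized (Q4) KV cache
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()      # default sampling settings
print(generator.generate_simple("The VRAM usage of Command-R is", settings, 64))
```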

I know this is an old thread, but in case anyone sees this: I was able to fit 10752 context (Q4 cache) with the 3.0bpw.
Windows 11, 3090, display running on the GPU, and it was only using 22.4 GB of VRAM. No reason to use the 2.6bpw quant on a 3090, imo.

In case anyone is still interested, updating exllamav2 to 0.1.5 allowed me to squeeze the much smarter 3.5bpw with 14K context into a single 24GB 4090 (it should probably work with a 3090 too). I'll download this one just to see how far I can go in terms of context, but it seems that for Command-R-v01 the sweet spot is 3.5bpw (according to benchmarks it's even better than the higher quants, though the real-world situation can differ, of course).

True, thanks for the tip. I just loaded Bartowski's 3.5bpw with 14,080 context on my 3090. It is squeezed tight, but it generates responses at a nice speed. 13k is faster, though, and keeps my usage around 23 GB.

I tested it, and with exllamav2 0.1.5 I can run this one (3.0bpw) with 20K context. But 3.5bpw is much more intelligent, and 14K context is just enough for my uses (while 9K is painfully short).
I don't use that card for anything other than inference, so 14K context at about 23.5 GB of VRAM usage is acceptable in my case.
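To sanity-check why those numbers land where they do, here is a very rough budget sketch: weights at the average bpw, plus the Q4 cache, plus a fudge factor for overhead. It reuses the same assumed architecture numbers as the cache sketch earlier in the thread, and real usage runs a couple of GB above what it prints (activation buffers, cache scales, CUDA context, fragmentation):

```python
# Very rough 24 GB budget estimate for Command-R v01 exl2 quants.
# Same assumed architecture as the earlier sketch: 40 layers, 64 KV heads,
# head_dim 128, no GQA. Real-world usage is a couple of GB higher than this.

N_PARAMS = 35e9                                   # ~35B parameters
KV_BYTES_PER_TOKEN_FP16 = 2 * 40 * 64 * 128 * 2   # K+V, all layers, FP16

def vram_estimate_gib(bpw, ctx_len, q4_cache=True, overhead_gib=1.5):
    weights_gib = N_PARAMS * bpw / 8 / 2**30      # average bits per weight
    cache_gib = KV_BYTES_PER_TOKEN_FP16 * ctx_len / 2**30
    if q4_cache:
        cache_gib /= 4                            # rough: Q4 cache ~1/4 of FP16
    return weights_gib + cache_gib + overhead_gib

print(f"3.5 bpw @ 14080 ctx: ~{vram_estimate_gib(3.5, 14080):.1f} GiB + slack")
print(f"3.0 bpw @ 20480 ctx: ~{vram_estimate_gib(3.0, 20480):.1f} GiB + slack")
```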
