3 x 4090 - won't load

#4
by bdambrosio - opened

Just pulled the latest exllamav2; the 5-bit quant won't load.
At the very end of the load it suddenly starts adding more to gpu1.
I've tried a GPU split all the way down to {16, 19, 23}. The initial load and allocations look OK, then it blows up gpu1's VRAM.
Plenty of room left on gpu3...
Tried setting the cache max_seq_len down to 8192, same behaviour.
I'll try a smaller quant and post what happens here.
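
For anyone on the same path: a quick way to see where the allocation jumps during the load is to print per-GPU usage before and after (a minimal sketch using plain PyTorch, nothing exllamav2-specific):

```python
import torch

def report_vram(note: str = "") -> None:
    """Print used/total VRAM for every visible GPU."""
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # bytes
        print(f"{note} gpu{i}: {(total - free) / 1024**3:.1f} / "
              f"{total / 1024**3:.1f} GiB used")

# Call once before model.load() and again right after to see which
# device picks up the extra allocation at the end of the load.
report_vram("before load")
```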

Same behavior with 3-bit. Guess I'll re-install exllamav2...

8k context is still a lot for this model. You'd need 20 GB of VRAM just for the cache at 8k (no GQA), plus 47 GB for the weights at 5 bpw, and the large vocabulary means the implementation has to reserve 2.4 GB of temp buffer on the last device to accommodate the output layer. So with activations and Torch/CUDA overhead, 3x24 GB isn't going to cut it unless you drop the bitrate or the context some more.
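
A rough back-of-the-envelope check of those numbers (a sketch; the layer count and hidden size below are assumptions picked for illustration, not this model's actual config):

```python
# Without GQA, every layer stores a full K and a full V vector per token,
# so the FP16 cache grows as 2 * layers * hidden_size * 2 bytes per token.
n_layers    = 80        # assumption for illustration
hidden_size = 8192      # assumption for illustration
seq_len     = 8192
fp16_bytes  = 2

cache_gb = 2 * n_layers * hidden_size * fp16_bytes * seq_len / 1024**3  # ~20 GB

weights_gb     = 47.0    # from the reply above, 5 bpw
output_temp_gb = 2.4     # temp buffer for the output layer on the last device
total_vram_gb  = 3 * 24  # three 24 GB cards

print(f"cache    ~{cache_gb:.1f} GB at {seq_len} tokens")
print(f"subtotal ~{cache_gb + weights_gb + output_temp_gb:.1f} GB "
      f"of {total_vram_gb} GB, before activations and Torch/CUDA overhead")
```

That subtotal is already around 69 GB of the 72 GB available, which is why the last few gigabytes of allocation push one card over the limit.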

Update for those on the same path:
With context size 8192 you can load the 4-bit quant with a split of {13, 15, 23}.
turboderp says context is really expensive, so presumably you can also try a smaller context if you can live with it.
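
For reference, a minimal loading sketch along the lines of the exllamav2 examples, assuming the usual Python API (the model directory is a placeholder, and exact method names/arguments may differ between versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/4bpw-quant"   # placeholder path
config.prepare()
config.max_seq_len = 8192                  # smaller context -> smaller cache

model = ExLlamaV2(config)
model.load(gpu_split=[13, 15, 23])         # GB per device, per the update above

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)              # sized from config.max_seq_len
```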

bdambrosio changed discussion status to closed
