3 x 4090 - won't load

#4
by bdambrosio - opened

Just pulled the latest exllamav2; the 5-bit quant won't load.
At the very end of the load it suddenly starts adding more to gpu1.
I've tried a GPU split all the way down to {16, 19, 23}. The initial load and allocations look OK, then it blows up gpu1's VRAM.
Plenty of room left on gpu3...
Tried setting the cache max_seq_len down to 8192, same behaviour.
I'll try a smaller quant and post what happens here.
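
For anyone on the same path: a quick way to see where the allocation jumps during the load is to print per-GPU usage before and after (a minimal sketch using plain PyTorch, nothing exllamav2-specific):

```python
import torch

def report_vram(note: str = "") -> None:
    """Print used/total VRAM for every visible GPU."""
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # bytes
        print(f"{note} gpu{i}: {(total - free) / 1024**3:.1f} / "
              f"{total / 1024**3:.1f} GiB used")

# Call once before model.load() and again right after to see which
# device picks up the extra allocation at the end of the load.
report_vram("before load")
```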

Same behavior with 3-bit. Guess I'll re-install exllamav2...

8k context is still a lot for this model. You'd need 20 GB of VRAM just for the cache at 8k (no GQA), plus 47 GB for the weights at 5 bpw, and the large vocabulary means the implementation has to reserve 2.4 GB of temp buffer on the last device to accommodate the output layer. So with activations and Torch/CUDA overhead, 3x24 GB isn't going to cut it unless you drop the bitrate or the context some more.
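
A rough back-of-the-envelope check of those numbers (a sketch; the layer count and hidden size below are assumptions picked for illustration, not this model's actual config):

```python
# Without GQA, every layer stores a full K and a full V vector per token,
# so the FP16 cache grows as 2 * layers * hidden_size * 2 bytes per token.
n_layers    = 80        # assumption for illustration
hidden_size = 8192      # assumption for illustration
seq_len     = 8192
fp16_bytes  = 2

cache_gb = 2 * n_layers * hidden_size * fp16_bytes * seq_len / 1024**3  # ~20 GB

weights_gb     = 47.0    # from the reply above, 5 bpw
output_temp_gb = 2.4     # temp buffer for the output layer on the last device
total_vram_gb  = 3 * 24  # three 24 GB cards

print(f"cache    ~{cache_gb:.1f} GB at {seq_len} tokens")
print(f"subtotal ~{cache_gb + weights_gb + output_temp_gb:.1f} GB "
      f"of {total_vram_gb} GB, before activations and Torch/CUDA overhead")
```

That subtotal is already around 69 GB of the 72 GB available, which is why the last few gigabytes of allocation push one card over the limit.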

Update for those on the same path:
With context size 8192 you can load the 4-bit quant with a split of {13, 15, 23}.
turboderp says context is really expensive, so presumably you can also try a smaller context if you can live with it.
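
For reference, a minimal loading sketch along the lines of the exllamav2 examples, assuming the usual Python API (the model directory is a placeholder, and exact method names/arguments may differ between versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/4bpw-quant"   # placeholder path
config.prepare()
config.max_seq_len = 8192                  # smaller context -> smaller cache

model = ExLlamaV2(config)
model.load(gpu_split=[13, 15, 23])         # GB per device, per the update above

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)              # sized from config.max_seq_len
```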

bdambrosio changed discussion status to closed
