What GPU split gives the best performance?

#2
by nmitchko - opened

Hi,

Thank you for the quant; this is a very cool model. I am currently on 2x A40. Is there an optimal VRAM split that will give the best performance?

Cheers,

The standard 70B miqu model takes over 48 GB VRAM for the model + 32K context length. If this model is similar to that one, you will probably want to keep as much of the model + 32K context as possible on the first GPU and the rest on the second GPU. If you are using exllamav2 + exui or tabbyAPI with this model, you should use speculative decoding to speed up inference. Grab the TinyLLaMA 32K 3.0bpw model from my models list to use as the draft model there. If you are using ooba, you won't be able to do speculative decoding.
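For anyone unfamiliar with speculative decoding, here is a toy sketch of the idea (purely illustrative stand-in functions, not how exllamav2 actually implements it): a small draft model cheaply proposes a few tokens per step, and the big model only has to verify them, so every accepted draft token saves a full-size forward pass.

```python
# Toy illustration of the speculative decoding loop. The "models" below are
# hypothetical stand-ins, not real LLMs; the point is the accept/reject logic.
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def target_next_token(context):
    # Stand-in for the big 70B model: deterministic "ground truth" next token.
    return VOCAB[len(context) % len(VOCAB)]

def draft_next_token(context):
    # Stand-in for TinyLlama: usually agrees with the target, sometimes not.
    if random.random() < 0.8:
        return target_next_token(context)
    return random.choice(VOCAB)

def speculative_generate(prompt, num_tokens, k=4):
    context = list(prompt)
    target_calls = 0
    while len(context) - len(prompt) < num_tokens:
        # 1) Draft proposes k tokens cheaply.
        proposal, draft_ctx = [], list(context)
        for _ in range(k):
            tok = draft_next_token(draft_ctx)
            proposal.append(tok)
            draft_ctx.append(tok)
        # 2) Target verifies the proposals: keep the agreeing prefix and emit
        #    its own token at the first disagreement. In a real implementation
        #    this verification is a single batched forward pass.
        target_calls += 1
        for tok in proposal:
            expected = target_next_token(context)
            if tok == expected:
                context.append(tok)
            else:
                context.append(expected)
                break
        else:
            # All k drafts accepted: the target pass also yields a bonus token.
            context.append(target_next_token(context))
    return context[len(prompt):], target_calls

tokens, calls = speculative_generate(["the"], num_tokens=40)
print(f"generated {len(tokens)} tokens with {calls} target passes "
      f"(vs {len(tokens)} passes without speculative decoding)")
```

The speedup comes entirely from how often the draft guesses the "easy" tokens correctly, which is why a tiny but compatible draft model works well.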

@LoneStriker Thanks for the advice! I'm using text-generation-webui, so speculative decoding is unavailable, but I will certainly try exui.

Most online advice recommends leaving 20% of the GPU:0 VRAM empty for the cache and context, so I'll have to try a few configurations to see what is most performant.

exl2 with exui will pre-allocate the space needed when loading the model. I believe ooba allocates it when needed, so you can go OOM during inference.
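If you want to check how much headroom each GPU actually has before picking a split, here is a minimal sketch using the NVML Python bindings (assumes the nvidia-ml-py / pynvml package is installed); it reports the same per-GPU memory numbers that nvidia-smi shows:

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i} ({name}): "
              f"{mem.used / 1024**3:.1f} GiB used / "
              f"{mem.total / 1024**3:.1f} GiB total")
finally:
    pynvml.nvmlShutdown()
```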

Maybe I'm just spoiled coming from Mixtral; my generation speeds are okay at best:

Exui with speculative decoding is anywhere from 10-14 tk/s.


Without speculative decoding, exui and ooba are anywhere from 5-8 tk/s.


With exui on my 4090 + 3090 Ti + 3x 3090 box, I get between 17-19 t/s with this model using SD:
prompt: 84 tokens, 115.70 tokens/s ⁄ response: 796 tokens, 19.68 tokens/s
prompt: 933 tokens, 1429.78 tokens/s ⁄ response: 1067 tokens, 17.38 tokens/s

It's definitely a very heavy model and a lot slower than Mixtral. But this model also seems to be very good. 70B and 110/120B models had fallen way behind Mixtral for quality, but the miqu-based models and merges are definitely some of the top models currently.

Totally agree; the speed/quality tradeoff is worth it. I could be running into other issues: my whole box is 3x A40/RTX A6000, and GPU:2 is running production Mixtral for various use cases. I might be missing something simple here.

Do you have an nvidia-smi or nvitop output you can share?

[screenshot: nvtop output]

[screenshot: nvidia-smi output]

I'm thinking that the miqu model is simply much better than TinyLLaMA, and that the mismatch between them means the draft tokens aren't giving much of a boost.

I will try miqu-120b + SD with Mixtral as the draft model and share my results.

Generally, you want the smallest, fastest compatible model as the draft model for SD. If it can guess the "easy" tokens, you'll already get a big boost. I'm not sure you can use Mixtral as a draft model for miqu-70B, but I'd be interested to see your results.

Mixtral as the draft model gave me about 9 tk/s.
TinyLLaMA as the draft model gave me up to 19 tk/s.

The key factor was the model split settings:

Auto split: 15 tk/s (42 GB on GPU 0, 30 GB on GPU 1)
Manual split: 19 tk/s (36 GB on GPU 0, 36 GB on GPU 1)
Manual split: 6 tk/s (30 GB on GPU 0, 42 GB on GPU 1)
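For reference, here is roughly how a manual 36/36 split looks when loading an EXL2 model with the exllamav2 Python API. This is a minimal sketch based on the library's example scripts; the model path is a placeholder and the exact API surface can differ between exllamav2 versions, so check the repo's examples before relying on it.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/your-exl2-quant"  # placeholder path
config.prepare()
config.max_seq_len = 32768  # reserve the full 32K context up front

model = ExLlamaV2(config)
# Manual split in GB per GPU: roughly 36/36 across the two A40s, instead of
# letting the auto split stack most of the weights on GPU 0.
model.load([36, 36])

cache = ExLlamaV2Cache(model)          # pre-allocates the KV cache
tokenizer = ExLlamaV2Tokenizer(config)
```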
