What GPU split gives the best performance?
Hi,
Thank you for the quant, this is a very cool model. I am currently on 2x A40. Is there an optimal VRAM split that will give the best performance?
Cheers,
The standard 70B miqu model takes over 48 GB of VRAM for the weights plus 32K of context. If this model is similar to that one, you will probably want to keep as much of the model + 32K context as possible on the first GPU and the rest on the second GPU. If you are using exllamav2 + exui or tabbyAPI with this model, you should use speculative decoding to speed up inference. Grab the TinyLlama 32K 3.0bpw model from my models list to use as the draft model there. If you are using ooba, you won't be able to do speculative decoding.
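In exllamav2's Python API it looks roughly like this; a minimal sketch, not drop-in code, and the model paths and split values are placeholders you'd adjust for your own setup:

```python
# Rough sketch (exllamav2 Python API): manual GPU split for the main model
# plus a small draft model for speculative decoding.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Small, fast draft model -- it fits easily alongside everything else.
draft_config = ExLlamaV2Config()
draft_config.model_dir = "/models/TinyLlama-32k-3.0bpw-exl2"  # placeholder path
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

# Main model with a manual split in GB per GPU, weighted toward GPU:0.
config = ExLlamaV2Config()
config.model_dir = "/models/miqu-70b-exl2"  # placeholder path
config.prepare()
model = ExLlamaV2(config)
model.load(gpu_split=[40, 32])  # e.g. ~40 GB on GPU:0, the rest on GPU:1
cache = ExLlamaV2Cache(model)

tokenizer = ExLlamaV2Tokenizer(config)

# Passing the draft model/cache to the streaming generator is what turns on
# speculative decoding.
generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)
```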
@LoneStriker Thanks for the advice! I'm using Text-gen-webui, so speculative decoding is unavailable, but I will certainly try exui.
Most online advice recommends leaving 20% of the GPU:0 VRAM empty for the cache and context, so I'll have to try a few configurations to see what is most performant.
exl2 with exui will pre-allocate the space needed when loading the model. I believe ooba allocates it when needed, so you can go OOM during inference.
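You can see the pre-allocation in the API: the cache is sized for the full context while the model loads. A sketch, again assuming exllamav2's Python API, with the path as a placeholder:

```python
# The KV cache is sized for the full max_seq_len and allocated during model
# load, so an OOM shows up immediately at load time rather than partway
# through a long generation.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/miqu-70b-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 32768  # reserve the whole 32K context up front

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated layer by layer during load
model.load_autosplit(cache)               # fills GPU:0 first, cache included
```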
With exui on my 4090 + 3090 Ti + 3x 3090 box, I get between 17 and 19 t/s with this model using SD:

prompt: 84 tokens, 115.70 tokens/s / response: 796 tokens, 19.68 tokens/s
prompt: 933 tokens, 1429.78 tokens/s / response: 1067 tokens, 17.38 tokens/s
It's definitely a very heavy model and a lot slower than Mixtral, but it also seems to be very good. 70B and 110/120B models had fallen well behind Mixtral in quality, but the miqu-based models and merges are definitely among the top models currently.
I'm thinking that the miqu model is simply much better than TinyLlama, and that the mismatch between the two means the draft tokens aren't giving much of a boost.
Will try miqu-120b + SD with Mixtral as the draft and share my results.
Generally you want the smallest, fastest compatible model as the draft model for SD. If it can guess the "easy" tokens, you'll already get a big boost. I'm not sure you can use Mixtral as a draft model for miqu-70B, but I'd be interested to see your results.
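By "compatible" I mean the draft model has to share the target model's tokenizer/vocabulary, or its guesses can't map onto the target's token ids at all. A quick way to sanity-check a candidate (a sketch; the paths are placeholders, and `get_id_to_piece_list` is the tokenizer accessor as I recall it):

```python
# Sanity check: a draft model for SD needs the same vocabulary as the
# target model.
from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer

def load_tokenizer(model_dir):
    cfg = ExLlamaV2Config()
    cfg.model_dir = model_dir
    cfg.prepare()
    return ExLlamaV2Tokenizer(cfg)

main_tok = load_tokenizer("/models/miqu-70b-exl2")             # placeholder
draft_tok = load_tokenizer("/models/TinyLlama-32k-3.0bpw-exl2")  # placeholder

# Identical id -> piece tables means the draft's tokens line up 1:1 with
# the target's, which is what speculative decoding needs.
compatible = main_tok.get_id_to_piece_list() == draft_tok.get_id_to_piece_list()
print("draft model compatible:", compatible)
```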
Mixtral as the draft model gave me about 9 tk/s.
TinyLlama as the draft model gave me up to 19 tk/s.
The key factor was the model split settings:
Auto: 15 tk/s (42 GB on GPU:0, 30 GB on GPU:1)
Manual: 19 tk/s (36 GB on GPU:0, 36 GB on GPU:1)
Manual: 6 tk/s (30 GB on GPU:0, 42 GB on GPU:1)